COBRA Tutorial

This tutorial includes examples of a fairly basic use of Cobra. More details on specific commands can be found in the online manual pages and in the Cobra Reference Manual. Lots of examples of how one can write Cobra queries can also be found in the $COBRA/rules/* subdirectories from the installation.

A Sample Cobra Session

A basic use of Cobra is to quickly locate specific keywords or function calls in a C program. We can do this with a simple sequence of mark and display commands. In the listings below, user input follows a prompt and all other lines are system responses. The colon is the Cobra system prompt, and the initial $ symbol before a cobra command is the shell prompt from Linux or Cygwin (any Unix-like shell will do).
	$ cobra -cpp cobra_lib.c
	1 core, 3 files, 21244 tokens
	: F
When invoked as shown here, with option -cpp, Cobra preprocesses the C source files that are specified on the command line. In this case that is just one .c file, but that file includes two header files as well, which the preprocessor will pull in. The tool starts by reporting the number of cores it used, the number of files it processed, and the number of lexical tokens it processed from those files. By default only one core is used. This is to make sure that the broadest possible range of queries can be handled. Many queries can be processed in parallel, using multiple cores, and in that case it is safe to start the tool with more cores, for instance by using the command-line flag -N8. Inline programs generally require a bit more caution to make sure they are multi-core safe, but we'll cover those details elsewhere and stick to single-core use for now.

The F command shows the list of files that were seen, which includes the two header files that were expanded by the preprocessor. Any system header files that the preprocessor may also expand are not included by default. When needed, this can be overridden with the command-line option -allheaderfiles. The startup typically takes under a second per million lines of code scanned.

We can now issue interactive query commands:

	: mark unsigned
	33 matches
	: list
	  1:         global:        68  'unsigned'
	  2:         global:        54  'unsigned'
	  3:         global:        55  'unsigned'
	  4:         global:       257  'unsigned'
	  5:         global:       258  'unsigned'
	  6:      nextarg():       370  'unsigned'
We're omitting matches 7-33 here and below, to keep the list a little shorter.
The mark command places a mark at every token that matches the pattern that follows it. After each command, the tool reports the number of matches it made.

The list command produces a numbered listing of all matches, providing the filenames, the sequence number of the match, followed by either the word global or the name of the function in which the match was found, and the source line number for each match. In this case it is clear which token was matched, but in general this may not be obvious, so the command also provides the token string for each match.

	: display
	  1:     68  typedef unsigned char      uchar;
	  2:     54     unsigned short  tag;
	  3:     55     unsigned short  visit;
	  4:    257  typedef unsigned int       uint;
	  5:    258  typedef long unsigned int  ulong;
	  6:    370  	while (!isspace((uchar) *in) && *in != '\0')
The display command can be used to show the complete (unpreprocessed) source code line for each line matched. The reason for the match is clear in the first five cases, but may look more mysterious for match number 6. To see why that match is reported, we can use the pre command which works just like the display command but shows the preprocessed version of the code, as it was tokenized. That reveals the reason for the match:
	: pre
	  1:    68  typedef unsigned char uchar ;
	  1:                ^^^^^^^^
	  2:    54  unsigned short tag ;
	  2:        ^^^^^^^^
	  3:    55  unsigned short visit ;
	  3:        ^^^^^^^^
	  4:   257  typedef unsigned int uint ;
	  4:                ^^^^^^^^
	  5:   258  typedef long unsigned int ulong ;
	  5:                     ^^^^^^^^
	  6:   370  while ( ! ( ( * __ctype_b_loc ( ) ) \
		[ ( int ) ( ( ( uchar ) * in ) ) ] & \
		( unsigned short int ) _ISspace ) && * in != '\0' ) 
	 33:      ^^^^^^^^                          
	: q
Here we see that uchar was treated as a type, as defined on source line 68 in file cobra_fe.h, and that isspace was expanded by the preprocessor as a macro, which includes the use of a cast to type unsigned short int.

The session is now terminated with a quit command, which we abbreviated to q here.

Most commands can be abbreviated, often to just a single-letter prefix if that suffices to unambiguously identify the command. We can also use semi-colons instead of newlines to separate commands, so that the entire Cobra session above can be written in one line, as follows.

	$ cobra -cpp cobra_lib.c
	1 core, 3 files, 21244 tokens
	: F; m unsigned; l; d; p; q
which results in the same output as for the longer version above. We can also do this from the command-line, as a quoted command sequence:
	$ cobra -cpp -c 'F; m unsigned; l; d; p' cobra_lib.c
which again produces the same output as before, while also allowing us to omit the final q, for still fewer keystrokes.

Each of the commands we've used so far supports different variants, to increase their usefulness. And there are of course quite a few additional commands that can help us do more complex types of pattern matching. We discuss some of the main variations below.

MISRA Rule 16.4

Let's say we want to check that every switch statement has a default clause. This is MISRA Rule 16.4 in the 2012 guidelines for embedded software that we will use for a couple of sample queries here.
We can do a rough check with grep, for instance as follows:
	$ cat *.[ch] | grep -c -e switch
	$ cat *.[ch] | grep -c -e default
and then check if the two counts match. Clearly in this example they do not.

If the numbers printed by grep aren't the same, the challenge is to find out which switch statements do not have a default clause. Even if the numbers are equal, there is still a chance that the rule is violated, because grep will of course also report occurrences of the keywords that appear inside strings or comments, or as part of a name (which also explains why the count for default above is larger than the count for switch). We can use Cobra to perform a more accurate check.

	$ cobra -cpp *.[ch]
	1 core, 15 files, 91209 tokens
	: mark switch
	57 matches
	: reset
	: mark default
	47 matches
This already looks very different, mostly because we now accurately match the keywords in the preprocessed code, and not in comments or strings.
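The gap between the grep counts and the token counts can be illustrated with a small Python sketch. It is only a rough model: the tokenizer below is a toy written for this example, not Cobra's lexer, and the C fragment is made up. The first function counts lines that contain a word anywhere, the way grep -c does; the second counts whole tokens after comments and string literals have been removed.

```python
import re

# Toy C fragment: 'default' also appears in a comment, a string, and an identifier.
SOURCE = '''
switch (c) {            /* no default here */
case 1: puts("default"); break;
}
int default_mode = 0;
switch (d) { default: break; }
'''

def grep_count(word, text):
    """Count lines that contain the word anywhere, as 'grep -c -e word' does."""
    return sum(1 for line in text.splitlines() if word in line)

def token_count(word, text):
    """Count whole tokens equal to word, ignoring comments and string literals."""
    text = re.sub(r'/\*.*?\*/', ' ', text, flags=re.S)   # drop /* ... */ comments
    text = re.sub(r'"(\\.|[^"\\])*"', ' ', text)          # drop string literals
    tokens = re.findall(r'[A-Za-z_]\w*|\S', text)         # identifiers or single characters
    return sum(1 for tok in tokens if tok == word)

print(grep_count("switch", SOURCE), grep_count("default", SOURCE))    # 2 4
print(token_count("switch", SOURCE), token_count("default", SOURCE))  # 2 1
```

In this fragment the line-based count reports more default matches than switch matches, while the token-based count shows that one of the two switch statements has no default clause.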

The numbers still do not match, so how do we find the switch statements that do not have a default clause? A few additional commands will do it:

	: reset
	: mark switch
	57 matches
	: next {
	57 matches
	: contains top no default
	10 matches
  • The reset command clears all current marks, so that we get a fresh set of matches on new patterns. Without this, if we issued another mark command, the new matches would be added to those already in place.
  • The mark command marks all keywords that match the string switch.
  • The next { command moves the mark point for each of those matches to the next occurrence of a curly brace token {, which should follow any well-formed switch statement. If it isn't there, the match would disappear from the list, but that's not the case here.
  • Now the remaining task is to identify those bodies of switch statements that do not contain a keyword named default at the same (top) level of nesting. That is: we don't want to match on default keywords that may appear in another switch statement that may be nested somewhere inside this one, but only in the current one at the same level of nesting as the curly brace that is marked. This is achieved by giving the two qualifiers no and top to the contains command. The contains command works over ranges that by default include any matching pairs of braces, parentheses, or brackets. Since our current mark points have now been positioned on an opening curly brace, the contains command will limit its search for matches to the default range for that symbol.
So 10 of the 57 switch statements in this code do not contain a default clause. At this point we can display the line numbers for all the matches with a list command, or show the source text for each of those lines with a display command.
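The idea behind the mark, next, and contains sequence can be sketched in Python as follows. This is an illustration of the nesting logic only, under the assumption of a toy tokenizer and a made-up code fragment; it says nothing about how Cobra implements these commands internally.

```python
import re

def tokenize(text):
    """Very rough C tokenizer: identifiers, or single non-space characters."""
    return re.findall(r'[A-Za-z_]\w*|\S', text)

def switches_without_default(tokens):
    """Return indices of 'switch' tokens whose own body has no 'default'."""
    missing = []
    for i, tok in enumerate(tokens):
        if tok != "switch":
            continue
        try:
            j = tokens.index("{", i)      # like 'next {': advance to the body's brace
        except ValueError:
            continue                      # no brace follows: the match is dropped
        depth, found = 0, False
        for t in tokens[j:]:
            if t == "{":
                depth += 1
            elif t == "}":
                depth -= 1
                if depth == 0:
                    break                 # reached the matching closing brace
            elif t == "default" and depth == 1:
                found = True              # 'top': same nesting level as the marked brace
        if not found:
            missing.append(i)
    return missing

code = """
switch (a) { case 1: switch (b) { default: break; } break; }
switch (c) { case 2: break; default: break; }
"""
toks = tokenize(code)
# print the switch expression of each offending statement
print([toks[i + 2] for i in switches_without_default(toks)])   # ['a']
```

Only the outer switch (a) is reported: the default clause it contains belongs to the nested switch (b), one nesting level deeper.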

The display command takes parameters, for instance to restrict the display to a single match, and optionally to display more than a single line of the code around the match. For instance, to see five lines around the last (10th) match, we can say:

	: d 10 5
	 10:   1181  }
	 10:   1182  
	 10:   1183  static int
	 10:   1184  check_level(char *p)
	 10:   1185  {
	 10: > 1186  	switch (*p) {
	 10:   1187  	case '}':
	 10:   1188  		if (c_lft)
	 10:   1189  		{	if (q->curly != c_lft->curly)
	 10:   1190  			{	return 1;
	 10:   1191  			}
or to see just the ten lines from the matched source forward:
	: d 10 +10
	 10: > 1186  	switch (*p) {
	 10:   1187  	case '}':
	 10:   1188  		if (c_lft)
	 10:   1189  		{	if (q->curly != c_lft->curly)
	 10:   1190  			{	return 1;
	 10:   1191  			}
	 10:   1192  			c_lft = &matched;
	 10:   1193  			return 2;
	 10:   1194  		}
	 10:   1195  		break;
	 10:   1196  	case ')':
As a convenience, if we have the Tcl/Tk wish command installed on our system, Cobra can also pop up a window with the source text of the file where the match is found, for individual display commands like this. (To enable these popups, execute the Cobra window command, and to disable them again execute nowindow. By default the popups are disabled.)

If we make use of the shorthands that Cobra provides for most commands, we can write the original query in a single line of text again, including the display of the matching lines at the end:

	: r; m switch; n {; c top no default; d
We can also define this sequence as a named script and store it in a file, say foo.cobra:
	$ cat foo.cobra
	def badswitch
	   r; m switch; n {; c top no default; d
	end
We can then read in that script file with a dot command, which then allows us to call the script by name:
	: . foo.cobra
	read script 'badswitch'
	: badswitch
	   r; m switch; n {; c top no default; d
	...display of matched lines follows...
If we don't want to see the commands repeated as they are being executed, we can change the verbosity settings:
	$ cat foo.cobra
	def badswitch
	   r; m switch; n {; c top no default; d
	end
It is often convenient to run the script immediately when it is loaded, by adding the call at the end, and then only show the number of matches at the end of a script, leaving it to the user later to display the details:
	$ cat foo.cobra
	def badswitch
	   r; m switch; n {; c top no default
	   = "Switch without default:"
	end
	badswitch
Which could then be used as:
	: . foo.cobra
	Switch without default: 10 matches
	: d 1	# display the first match
Finally, if the script is defined in a file like this, we can also call it from the command line with the -f option:
	$ cobra -cpp -f foo.cobra *.[ch]
Alternatively the script can also be invoked as a command sequence, as follows.
	$ cobra -cpp -c '. foo.cobra; d' *.[ch]

Quick Checks

Let's open up the Cobra session again and do some simple checks. From here on we'll use the shorthands for most commands. When in doubt, we can use the '?' help command from Cobra to see a list of all possible queries and commands.
	$ cobra *.[ch]	# by default without preprocessing
	1 core, 15 files, 90063 tokens
	: m /.			# how many tokens are there?
	90063 matches		# oh, we already knew this
We used a mark command with a regular expression that basically checks if there is at least one character in the string that corresponds to each token. The . in the regular expression matches any character. Cobra uses the standard Unix/Linux regular expression matching routines (i.e., regcomp and regexec), so there are no surprises there.

What if we want to see how many statements there are?

	: r; m \;
	6993 matches
This check counts everything that is terminated by a semi-colon, which is a slight over-estimation of the number of statements, since it will also count semi-colons that appear in declarations. We have to escape the semi-colon character here, to prevent it from being interpreted as a command separator. Note also that this gives a more accurate count of semi-colons than if we were to execute, for instance:
	$ cat *.[ch] | grep -c -e ";"
since the grep command counts lines that contain one or more semi-colons, rather than the semi-colons themselves. A for statement, for instance, typically has at least two semi-colons on the same line. How many for and while statements are there in the code?
	$ cobra -c 'm for; r; m while' *.[ch]
	244 matches
	127 matches
How many of those while statements are do...while() and how many are while()... ? One quick way is to count 'do' keywords of course:
	$ cobra -c 'm do' *.[ch]
	30 matches
But that's too easy. How about we check what the token is after the closing parenthesis of the condition that follows the while keyword? If it is a semi-colon or a comma, then we also know that it was a do...while() construct. We'll use an interactive session for this:
	$ cobra *.[ch]
	1 core, 15 files, 90063 tokens
	: m while
	127 matches
	: n \(; j; n
	127 matches
	127 matches
	127 matches
	: m & /[\;,]
	30 matches
In the second set, we used a jump command (j) to jump from one side of the range that is defined by default for an opening parenthesis to the opposite end: the matching closing parenthesis. The next token after that is either a semi-colon, a comma, or something else. We then use a regular expression and an 'and' mark command (mark &) to find the subset of the marks that does have either a semi-colon or a comma following the condition.

This means that indeed 30 of the 127 while statements were of the type do...while().
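The same check can be modeled in a few lines of Python, shown here as a sketch over a toy tokenizer and a made-up fragment (neither is taken from Cobra). For each while keyword it advances to the opening parenthesis, jumps to the matching closing parenthesis, and inspects the token that follows.

```python
import re

def tokenize(text):
    """Very rough C tokenizer: identifiers, or single non-space characters."""
    return re.findall(r'[A-Za-z_]\w*|\S', text)

def matching_paren(tokens, open_idx):
    """Return the index of the ')' that closes the '(' at open_idx."""
    depth = 0
    for k in range(open_idx, len(tokens)):
        if tokens[k] == "(":
            depth += 1
        elif tokens[k] == ")":
            depth -= 1
            if depth == 0:
                return k
    raise ValueError("unbalanced parentheses")

def count_do_while(tokens):
    """Return (number of while keywords, number followed by ';' or ',' after the condition)."""
    total = tails = 0
    for i, tok in enumerate(tokens):
        if tok != "while":
            continue
        total += 1
        open_idx = tokens.index("(", i)                 # n \(
        close_idx = matching_paren(tokens, open_idx)    # j
        nxt = close_idx + 1                             # n
        follower = tokens[nxt] if nxt < len(tokens) else ""
        if follower in (";", ","):                      # m & /[\;,]
            tails += 1
    return total, tails

code = "while (x) { x--; } do { y++; } while (f(y) < 3); while (z) z--;"
print(count_do_while(tokenize(code)))   # (3, 1)
```

Like the interactive query, this heuristic would also count a plain while loop with an empty body, such as while (x);, so the result is best read as an approximation.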

If we enable or disable preprocessing and repeat some of these commands, we may get different numbers, and the number of tokens and files scanned will change as well. In our case, some while statements are defined in macros, and the macro expansions are visible only when preprocessing is enabled.

If you're mystified by the difference between matches in preprocessed and unpreprocessed code, here's a way to figure out why the matches are different, using a few more Cobra commands:

	$ cobra *.[ch]	# no preprocessing by default
	1 core, 15 files, 90063 tokens
	: m while
	127 matches
	: track start file1 # redirect output
	: list
	: track stop        # end redirection
	: cpp on
	preprocessing enabled
	1 core, 15 files, 92249 tokens
	: r; m while
	: track start file2
	: list
	: track stop
	: !diff file1 file2	# a shell escape
	: q

MISRA Rule 15.2

It gets more interesting if we want to check something that would be impossible with straight grep commands. How, for instance, could we find goto statements that jump to a label that precedes the jump? Some coding standards allow only forward jumps (including the MISRA coding guidelines, Rule 15.2), so this may be a useful thing to check. Here's how we can do this. First we find the goto statements:
	: r; m goto
	72 matches
Now we advance the mark point from the goto to the label name. That's easy, since it must follow the keyword directly:
	: next
	72 matches
The number of matches shouldn't change of course. If you're curious you can now list the precise names that were matched with a 'list' command. Doing so can be useful as we are developing new scripts and navigating the code to find the right things to match and check. But let's move forward. We are going to save the current set of matches in a set numbered 1 with a redirection command:
	: >1
Set 1 now contains the full set of all goto jumps in the code, marked by the name of the labels that are targeted by the jump. Some of these label names will appear in the code before the goto, and some will appear after it, and our task is now to distinguish these two cases. We will do so here by searching forwards from each label name to a possible point in the code where the same name appears again followed by a colon. That will be true only for forward jumps. We use the stretch command for this.
	: stretch $$ :
	61 matches
What the stretch command actually tries to do is to define a range for the matched tokens that starts at the current mark and extends to the pattern that is specified in the argument(s). In this case that argument is the currently matched token, represented by $$, followed by a colon. Note that we cannot simply put a specific label name here, because each current mark point may carry a different label name. We don't actually need the ranges that are created, we just want to identify those jumps for which it is possible to create the range.
We see that the number of matches went down to 61, which means that for 11 jumps the range could not be created. Those are the backward jumps that we are interested in. But we now have only the forward jumps matched. How do we get the others? We store the current set of matches in a second set:
	: >2
We have now defined two sets: one with all jumps, and a second with only the forward jumps. We'd like to take the difference of the two sets. This is done as follows. First we clear the current marks and reload the marks from set 1.
	: r; <1
	72 matches
Now we subtract set 2 to find only those marks that are not shared between the two sets.
	: <^2
	11 matches
And yes, there are the 11 backward jumps we were hoping to catch. We can now use a list or display command to see them all. Note that this check will identify all backward jumps even if a given label name is used in both forward and backward jumps.
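The two-set approach can be sketched in Python as follows. The sketch assumes a toy tokenizer and a made-up function body, and it models only the set logic: set 1 holds the position of every label name that follows a goto, set 2 holds the subset for which the same name followed by a colon appears later in the token stream, and the difference is the set of backward jumps.

```python
import re

def tokenize(text):
    """Very rough C tokenizer: identifiers, or single non-space characters."""
    return re.findall(r'[A-Za-z_]\w*|\S', text)

def backward_gotos(tokens):
    """Return positions of label names in goto statements that jump backward."""
    # set 1: the token directly after each goto (like 'm goto; n; >1')
    set1 = {i + 1 for i, tok in enumerate(tokens[:-1]) if tok == "goto"}
    # set 2: marks for which 'name :' can be found further on (like 's $$ :; >2')
    set2 = set()
    for pos in set1:
        name = tokens[pos]
        for k in range(pos + 1, len(tokens) - 1):
            if tokens[k] == name and tokens[k + 1] == ":":
                set2.add(pos)
                break
    # difference of the two sets (like 'r; <1; <^2')
    return sorted(set1 - set2)

code = "void f(void) { again: if (a) goto done; if (b) goto again; done: return; }"
toks = tokenize(code)
print([toks[p] for p in backward_gotos(toks)])   # ['again']
```

Because the sketch searches the whole token stream, it can mistake a label of the same name in a later function for a forward target, just as a plain forward search would.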

We can summarize the entire check in one command line, using shorthands for the commands:

	$ cobra -c 'm goto; n; >1; s $$ :; >2; r; <1; <^2; d' *.[ch]
One final point should be noted here. When we do the forward search for label names with the stretch command, we ask the tool to just search forward in the text from the location of each goto statement we marked. It is possible of course that a pseudo-match is found on a label of the right name that appears outside the current function. This can happen if the same label-name is used in different functions. This means that we could overestimate the number of forward jumps. We can fix that by making the search more precise, but to do so we'd need to use Cobra's inline scripting language, which we'll discuss elsewhere.

Cscope-like Queries

If you've used tools like cscope, it's not hard to find ways to do the same types of queries with Cobra, as the following table illustrates. Each Cobra query in the table could be prefixed with a reset command (r) to clear earlier matches, and followed with a display command (d), to show what the new matches are.

Query Type                                      Long Form               Short Form
----------------------------------------------  ----------------------  ----------------------------
find a C keyword or symbol                      mark name               m name
find a global definition                        mark name               m name; m & (!.curly)
                                                mark & (.curly == 0)
find functions called by or calling a function  context name            context name
find a word in a string                         mark @str               m @str; m & /word
                                                mark & /word
find an egrep pattern                           mark /pattern           m /pattern
find a file                                     mark (.fnm == file)     m (.fnm==file)
                                                or: B file
find a filename matching a regular expression   mark (.fnm == /re)      m (.fnm==/re)
find files that #include file                   cpp off; mark @cpp      cpp off; m @cpp; m & /file
                                                mark & /file
find assignments to a symbol                    mark symbol =           m symbol =
                                                                        or more complex:
                                                                        m =; b \;; s =; m ir /symbol

Finding Patterns

Suppose we are looking for a coding pattern that spans more than a sequence of just two or three tokens. We can match those patterns with a few successive queries. If, for instance, we want to find the sequence:
	* identifier ( ... ) = ... 
indicating a function call, where the result is immediately dereferenced and assigned a value.
We can use a pattern-based search, either on the command line or interactively, for instance as follows (pe = pattern expression):
	$ cobra -pe '* @ident ( .* ) =' *.[ch]
In an interactive session, that would look like this (note that no quotes are needed around the pattern in this case, but they are not forbidden either):
	: pe * @ident ( .* ) =
We can also proceed more slowly, step by step as follows:
	: m @ident \(	# find all identifiers followed by (
	: b		# move back one token
	: m & *		# match only if this is a *
	: n \(		# forward to the (
	: j		# skip to the matching )
	: e =		# match only if the next token is =
The last match can be made more general by also including any operation that implies an assignment, for instance:
	: >1		# save results from the last operation
	: u		# undo the effect of the last mark operation
	: m & /^[-+*/%]=$
	: <&1		# add the earlier results
	: >1		# save the new larger set of marks
	: u		# undo the last mark operation
	: m & /^[-+][-+]$	# mark also postfix -- and ++
	: <&1		# add back the earlier set
	: d		# display the results
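The step-by-step search can be mirrored in Python, as in the sketch below. It follows the same order of steps (a * token, an identifier, an opening parenthesis, a jump to the matching closing parenthesis, and a check of the next token), and it accepts the same wider set of followers: =, the compound assignments, and ++ or --. The tokenizer, the regular expressions, and the sample code are assumptions made for this example only.

```python
import re

# identifiers, ++ and --, two-character operators ending in '=', or single characters
TOKEN = re.compile(r'[A-Za-z_]\w*|\+\+|--|[-+*/%=!<>]=|\S')
IDENT = re.compile(r'[A-Za-z_]\w*')
ASSIGN = re.compile(r'[-+*/%]?=|\+\+|--')

def find_deref_call_assign(text):
    """Return the names of functions used as in: * name ( ... ) <assignment>."""
    toks = TOKEN.findall(text)
    hits = []
    for i in range(len(toks) - 2):
        if toks[i] != "*" or not IDENT.fullmatch(toks[i + 1]) or toks[i + 2] != "(":
            continue
        depth = 0
        for k in range(i + 2, len(toks)):
            if toks[k] == "(":
                depth += 1
            elif toks[k] == ")":
                depth -= 1
                if depth == 0:                       # the matching ')'
                    if k + 1 < len(toks) and ASSIGN.fullmatch(toks[k + 1]):
                        hits.append(toks[i + 1])
                    break
    return hits

code = "*next(p) = 0; x = *peek(q); *slot(a, (b)) += 2; if (*get(r) == 1) *cnt(s)++;"
print(find_deref_call_assign(code))   # ['next', 'slot', 'cnt']
```

The calls to peek and get are not reported, because the token after the closing parenthesis is a semi-colon in the first case and the comparison operator == in the second.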

Other MISRA Rules

Here are a few other example Cobra queries that are based on rules from the MISRA 2012 guidelines and that do not have a ready equivalent in other query tools. Each example is given in shorthand form.

Simple Pattern Rules

Some of the rules are trivial to check for, for instance if they require only a match of a particular type of token. Some examples of this, where none of the mark commands should produce matches:
	: m pragma		 # Rule  1.2 language extensions should not be used
	: m malloc		 # Dir   4.12 dynamic memory allocation shall not be used
	: m realloc		 # Dir   4.12
	: m calloc		 # Dir   4.12
	: m alloca		 # Dir   4.12
	: m /alloc		 # catch all 4 of the above with a regular expression
	: m ,			 # Rule 12.3 the comma operator should not be used
	: m goto		 # Rule 15.1 gotos should not be used
	: m int; m short; m long # Dir   4.6 typedefs indicating size and signedness
				 # should be used in place of basic numerical types

Dir 4.4: sections of code should not be commented out

This rule says that there should not be any code in comments. It's of course hard to come up with a general pattern for catching any type of commented-out code, but we can get close by checking for the use of semi-colons, which are less common in standard prose. To scan comments we must leave preprocessing disabled, and preserve the comments.

To find all standard C comment delimiters we can use a regular expression. The first forward slash in the following mark command identifies the pattern as a regular expression. A sample match is shown below.

	$ cobra -comments *.[ch] # no preprocessing (the default) and include comments
	: m //\*		# find comments, also multi-line
	4203 matches		# there are 4,203 /* ... */ comments
	: l 1			# list the details of the first match
	1: main.c:1337	'/* register as a slice criterion */'
Note that the entire comment is treated as one single token, even if it spans multiple lines in the source text.

We can extend the pattern to find statement separators within the comment, for instance as follows:

	: r			# clear the earlier marks
	: m //\*.*\;		# statement separator in comment
	119 matches
	: l 20
	  20: main.c:1167	'/* add_ltl[4] = fm; */'
In a similar way we can find nested comments (MISRA Rule 3.1):
	: r; m //\*.*/\*	# C   style nested comments
	: r; m //\*.*//		# C++ style nested comments
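Since each comment is a single token, these checks amount to running a regular expression over the comment text. The Python sketch below applies comparable patterns to a few sample comment strings. The samples are made up, Python's re module is not the same engine as regcomp and regexec, and the re.S flag is an assumption made here so that . also crosses line breaks in multi-line comments.

```python
import re

comments = [
    "/* register as a slice criterion */",
    "/* add_ltl[4] = fm; */",
    "/* outer /* inner */",
    "/* see http://example.org */",
]

has_separator = re.compile(r'/\*.*;', re.S)      # compare: m //\*.*\;
nested_c      = re.compile(r'/\*.*/\*', re.S)    # compare: m //\*.*/\*
nested_cpp    = re.compile(r'/\*.*//', re.S)     # compare: m //\*.*//

def classify(comment):
    """Return (has ';', contains a nested '/*', contains '//') for one comment token."""
    return (bool(has_separator.search(comment)),
            bool(nested_c.search(comment)),
            bool(nested_cpp.search(comment)))

for c in comments:
    print(classify(c), c)
```

In this sketch the last sample is flagged by the // pattern only because it contains a URL, which suggests that matches of this kind deserve a manual look before they are reported as violations.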

Rule 17.2: no recursive function calls

	: m @ident (; n; j; e {; j; b	# mark all fct definitions
	: s top }			# set range to fct body (stretch command)
	: m ir $$			# match of fct name inside the range
	: e (				# make sure it's a fct call (extend command)

Rule 17.4: all exit paths from a function with non-void return type shall have an explicit return statement with an expression

A starting point for developing a checker can be:
	: m @ident \(; n; j	# fct definitions
	: extend {; j; b; b	# move to typename before fct name
	: unmark void		# no need to check if void
	: n {			# move to start of fct body
	: c no return		# should not match anything
	: u			# undo the c command
	: m ir return		# match all return statements inside range
	: n			# move to the token that follows, which should not be a semicolon
	: m no \;		# should not match anything

Rule 13.6: the operand of the sizeof operator shall not contain any expression with potential side effects

	: m sizeof; n; c /[-=\+\(]

Rule 3.2: line-splicing shall not be used in // comments

	: m \//.*\

Rule 20.1: there should be no code before an #include directive

	: cpp off		# no preprocessing (the default)
	: m /^\#.*include	# mark all include directives
	: b \;			# move back to a preceding semi-colon, indicating code
Of course, this works only for a single file, but not when a larger set of files is scanned which are tokenized in sequence. In that case, we want to make sure that the semi-colon matched appears in the same file as the include directive, which can be done with the scripting language. An example of such a query is given in the cobra/rules/play directory in the file code_before_include.cobra.

Rule 20.3: all #includes should be either <...> or "..."

To find violations of this rule we can again use a regular expression, as follows:
	: m /^\#.*include.*[^"<]*$
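When a pattern like this is tried out in Python's re module, it also matches well-formed directives, because the second .* can absorb the rest of the line and the bracket expression may then match nothing. The sketch below compares the pattern as written with a stricter variant. The stricter variant, the sample lines, and the name HEADER_NAME are assumptions made for this illustration; they are not taken from the Cobra distribution, and the behavior of Cobra's own matcher on include tokens is not shown here.

```python
import re

# the pattern from the text, without the escape before '#'
as_written = re.compile(r'^#.*include.*[^"<]*$')
# hypothetical stricter variant: the operand must not start with '<' or '"'
stricter = re.compile(r'^#\s*include\s*[^"<\s]')

lines = [
    '#include <stdio.h>',
    '#include "cobra.h"',
    '#include HEADER_NAME',
]

def check(line):
    """Return (matches the pattern as written, matches the stricter variant)."""
    return bool(as_written.search(line)), bool(stricter.search(line))

for line in lines:
    print(check(line), line)
```

Only the third line is flagged by the stricter variant, while the pattern as written matches all three lines in this Python model.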


Cobra supports interactive use by building a fast internal data structure for the source code that is explored. This tradeoff does of course also bring limitations. More powerful analysis engines are needed to solve queries that rely on complex data-flow or control-flow analysis. In the cobra/rules directory there are examples of how we can still do reasonable approximations of such checks, using the Cobra scripting language. These approximations are useful for fast query checking but may not catch all possible cases, especially cases that would require accurate pointer alias analyses.

It is also possible to build standalone checkers that are linked to the Cobra front-end to construct more detailed data structures for resolving more complex static analysis queries. A range of examples of such queries is provided in the cobra/src_app directory of the distribution.

Last Updated: 19 March 2021