Pattern Searches with Cobra


Why not use grep?

	$ grep -e x *.c | wc -l
	1136
	# Sample match: prefix = s;
A pattern search with Cobra, matching on lexical tokens instead of strings:
	$ cobra -pe x *.c | wc -l
	96
	# Sample match: strcmp(x->txt, "x"))
Note that the pattern search matched neither the word prefix nor the string literal "x": only tokens named exactly x are reported.
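
The difference between the two counts comes down to what is compared: substrings of a line versus whole lexical tokens. The following Python sketch illustrates the idea on the two sample matches above. The tokenize helper and its regular expression are illustrative assumptions, far cruder than Cobra's own lexer.

```python
import re

# Crude C-like tokenizer (an assumption for illustration, not Cobra's lexer):
# string literals, identifiers, numbers, the "->" operator, and finally
# any single non-space character.
TOKEN_RE = re.compile(r'"(?:\\.|[^"\\])*"|[A-Za-z_]\w*|\d+|->|\S')

def tokenize(line):
    return TOKEN_RE.findall(line)

def string_match(line, text):
    # grep-style: the text may occur anywhere in the line
    return text in line

def token_match(line, name):
    # token-style: some whole token must be exactly the name
    return name in tokenize(line)

for line in ['prefix = s;', 'strcmp(x->txt, "x"))']:
    print(line, '| string:', string_match(line, 'x'),
          '| token:', token_match(line, 'x'))
```

In this sketch the first line matches only as a string (the x inside prefix), while the second matches as a token because of the identifier x; the string literal "x" is a separate token that includes its quotes.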

Pattern expressions are a simplified form of standard regular expressions, which are used in many tools for matching (parts of) text strings.

Examples include Unix tools such as grep, sed, awk, lex, ed, and sam. Google search patterns can also contain regular expressions.

Regular expressions define finite state automata, but you don't need to know much about regular expressions or automata to use pattern searches on code.

Some Examples

Find all tokens named x:

	$ cobra -pe x *.[ch]
Find blocks that contain a call to function malloc that is not followed by a call to free within the same block:
	$ cobra -pe '{ .* malloc ^free* }' *.c
The combination ".*" (without a space in between) matches an arbitrary sequence of tokens: basically a don't care. The hat sign ^ is used for negation, that is, the absence of a specific token. The star * indicates zero or more repetitions.
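
To make the intent of the malloc pattern concrete, here is a minimal Python sketch of the property it looks for, working on an already tokenized input. It is only one reading of the pattern, not Cobra's matching algorithm, and the helper names match_close and blocks_with_unfreed_malloc are hypothetical.

```python
def match_close(tokens, open_idx):
    """Index of the '}' that closes the '{' at open_idx, or -1 if unbalanced."""
    depth = 0
    for i in range(open_idx, len(tokens)):
        if tokens[i] == '{':
            depth += 1
        elif tokens[i] == '}':
            depth -= 1
            if depth == 0:
                return i
    return -1

def blocks_with_unfreed_malloc(tokens):
    """(open, close) index pairs of blocks that contain a malloc token
    with no free token between it and the block's closing brace."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != '{':
            continue
        close = match_close(tokens, i)
        if close < 0:
            continue
        body = tokens[i + 1:close]
        if 'malloc' not in body:
            continue
        # Only the last malloc matters: any free after it would break '^free*'.
        last = len(body) - 1 - body[::-1].index('malloc')
        if 'free' not in body[last + 1:]:
            hits.append((i, close))
    return hits

leak = "{ p = malloc ( n ) ; use ( p ) ; }".split()
ok = "{ p = malloc ( n ) ; free ( p ) ; }".split()
print(blocks_with_unfreed_malloc(leak))
print(blocks_with_unfreed_malloc(ok))
```

The sketch reports the first block and not the second. Like the pattern itself, it looks only at token names, so it knows nothing about which pointer is being freed.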

Next, suppose we want to find declarations of variables that are not used within the scope in which the variables are declared. We can catch simple cases by using name binding, and matching on types, as follows:

	$ cobra -pe '{ .* @type x:@ident ^:x* }' *.c
Here we bind the name x to the identifier that follows a type-name (like int or float, etc.). Then we can refer back to later occurrences of the same bound variable using the notation ":x".
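
The effect of name binding can be sketched in Python as follows. This is an illustration under stated assumptions rather than Cobra's implementation: the TYPES set is a small stand-in for the @type token class, str.isidentifier() is a rough stand-in for @ident, and the function names are hypothetical.

```python
# Assumed stand-in for Cobra's @type token class.
TYPES = {'int', 'float', 'char', 'double', 'long', 'short'}

def match_close(tokens, open_idx):
    """Index of the '}' that closes the '{' at open_idx, or -1 if unbalanced."""
    depth = 0
    for i in range(open_idx, len(tokens)):
        if tokens[i] == '{':
            depth += 1
        elif tokens[i] == '}':
            depth -= 1
            if depth == 0:
                return i
    return -1

def unused_declarations(tokens):
    """Names declared as '<type> <ident>' inside a block and never
    mentioned again before that block's closing brace."""
    unused = set()
    for i, tok in enumerate(tokens):
        if tok != '{':
            continue
        close = match_close(tokens, i)
        if close < 0:
            continue
        for k in range(i + 1, close - 1):
            name = tokens[k + 1]          # candidate for the bound name x
            if tokens[k] in TYPES and name.isidentifier() and name not in TYPES:
                # '^:x*' : no later token up to the '}' may equal the bound name
                if name not in tokens[k + 2:close]:
                    unused.add(name)
    return sorted(unused)

src = "{ int a ; int b ; a = 1 ; }".split()
print(unused_declarations(src))
```

For the sample input the sketch reports b, which is declared but never referenced again, and not a.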

In the following example we look for blocks of code that contain either the name static or the uppercase variant STATIC, using a bracketed sequence of token names to indicate choice:

	$ cobra -pe '{ .* [static STATIC] .* }' *.c
Cobra guarantees that in all patterns the nesting level of curly brace pairs matches, so that we indeed match on a valid block, and not just on the first occurrence of a closing curly brace.

The next example uses an embedded regular expression to match on token names within the pattern. The forward slash / is used to introduce such a regular expression, which in this case matches any token that contains an equals sign, such as >=, ==, <=, !=, etc.

	$ cobra -pe '{ .* x:@ident -> .* if ( :x /= NULL ) .* }' *.c
To match on a forward slash / as a token itself, we would have to use an escape character (a backward slash) in front of the forward slash: \/.
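
As a rough illustration of what an embedded expression such as /= selects, the Python sketch below applies a regular expression to the text of each token and keeps the tokens in which it is found anywhere. The function name tokens_matching is hypothetical, and Python's re module is used here only as a stand-in for Cobra's own matcher.

```python
import re

def tokens_matching(regex, tokens):
    """Tokens whose text contains a match for regex (a search, not a full match)."""
    return [t for t in tokens if re.search(regex, t)]

ops = ['>=', '==', '<=', '!=', '=', '<', '>', '->', '/']
print(tokens_matching('=', ops))   # every token that contains an equals sign
print(tokens_matching('/', ops))   # only the forward slash token
```

Note that under this reading a plain assignment = is selected as well, since it also contains an equals sign.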

For the last example we'll use a slightly more complex query, which is to find all for statements in a code base that are not followed by a compound statement (i.e., a block enclosed in curly braces), as is required in many coding standards. We can express this as follows:

	$ cobra -pe 'for ( .* ) ^[{ @cmnt for switch if]' *.c
Note that after the closing round brace of the control-part of the loop, this pattern will match anything that is not equal to either an open curly brace, a comment, or any of the keywords for, switch, or if. (The @cmnt field is of course only needed if comment tokens were in fact enabled with command-line argument -comments.)
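
The check that this pattern performs can be sketched in Python on a tokenized input. The sketch assumes that comment tokens are not enabled (so @cmnt is left out), and the helper names match_paren and for_without_block are hypothetical; it is not how Cobra evaluates the pattern internally.

```python
def match_paren(tokens, open_idx):
    """Index of the ')' that closes the '(' at open_idx, or -1 if unbalanced."""
    depth = 0
    for i in range(open_idx, len(tokens)):
        if tokens[i] == '(':
            depth += 1
        elif tokens[i] == ')':
            depth -= 1
            if depth == 0:
                return i
    return -1

def for_without_block(tokens, allowed=('{', 'for', 'switch', 'if')):
    """Indices of 'for' tokens whose control part is followed by a token
    outside the allowed set (the negated range in the pattern)."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == 'for' and i + 1 < len(tokens) and tokens[i + 1] == '(':
            close = match_paren(tokens, i + 1)
            if 0 <= close < len(tokens) - 1 and tokens[close + 1] not in allowed:
                hits.append(i)
    return hits

bad = "for ( i = 0 ; i < n ; i ++ ) x ++ ;".split()
good = "for ( i = 0 ; i < n ; i ++ ) { x ++ ; }".split()
print(for_without_block(bad), for_without_block(good))
```

Only the first loop is reported, because its body is a bare statement rather than a block.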

Standard Regular Expressions

In some cases it helps to be able to match on traditional regular expressions, rather than the simplified version used in pattern expressions. Cobra supports this with the command-line flag -regex. For instance, in this expression:
	$ cobra -regex 'switch \( . \) { ( case . : .* break ; )* }' *.c
we need to use an escape character to match on a literal open and close round brace, to prevent these symbols from being interpreted as the default meta-symbols for grouping sub-expressions. The dot matches any token, as before, and the star is again used for zero or more repetitions. Note the spaces between the individual tokens (e.g., in the sequence ". : .* break"). This particular regular expression matches any switch statement in the code.

Meta-Symbols

Here is a summary of the meta-symbols that are used. This first set defines five characters that are meta-symbols in regular expressions, but that have no special meaning in pattern expressions:
	( and ) for grouping
	| choice, e.g. '(a | b)' matches a or b
	+ one or more repetitions
	? zero or one repetition
The symbols in the next group have, in most cases, the same meaning in both regular and pattern expressions (but see below about the treatment of the . dot symbol when used in ranges):
	* zero or more repetitions
	. matches any token
	@type match a particular token class, e.g., @ident
	x:@type bind the variable-name x to a specific token name
	:x refer to a previously bound name

	[ and ] define a set of options, e.g., [a b c] matches one of a, b, or c;
		a range can of course also be negated, as in ^[a b];
		note that there should be no space after the [, nor before the ]
		in a range (see below)
The following rules apply to pattern expressions:
	* and ] are regular symbols when preceded by a space
	[ is a regular symbol when followed by a space
	/re matches a token if the token-text matches the regular expression re
	^ negates most token matches when placed in front of them, e.g., ^/foo*
		matches any sequence of tokens except token names that contain 'foo'

Brace Pairs and the Meaning of . (dot) in Pattern Searches

When specifying patterns like the following, which combine explicit braces with either a . (dot) meta-character or a negated pattern:
	( .* )
	( ^@ident )
Cobra tries to guarantee that the closing brace and the opening brace match in the source text, i.e., that they appear at the same level of nesting and would be interpreted by a compiler as matching. (Supported in Cobra version 3.3 from January '21 and later.)
Technically, in standard regular expressions, the dot symbol could match any symbol, including a closing brace. So this deviates from the theory to provide more intuitive behavior when matching source code patterns.
A matching brace pair is treated by Cobra as an interval. These intervals will always be matched explicitly in a pattern expression if both characters (the opening brace and the closing brace) appear explicitly in the expression.

In cases where an opening brace is matched implicitly with a negation or a dot meta-character, the closing brace must be matched in the same manner, i.e., it cannot be matched by an explicit closing brace in the pattern.

To see how this works, consider the following example:

	source text: ( a ( b ) c ) ( d ) ( e )
	token pattern: ( .* )
Cobra will report 4 possible matches of the pattern here, as follows:
	( a ( b ) c )
	    ^^^^^	1
	^^^^^^^^^^^^^	2
	( d ) ( e )
	^^^^^		3
	      ^^^^^	4
Note that in match 2, the outer braces are matched explicitly and the inner braces around b are matched implicitly (by the .* part of the pattern). The key is that both open and close braces must be matched in the same way: either explicitly or implicitly, but not in some other combination.
Similarly:
	source: ( a b )
	token pattern: .* )
finds three matches:
	( a b )
	  ^^^^^		1
	    ^^^		2
	      ^		3
never including the opening brace in the match.
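
Both examples can be reproduced with a short Python sketch of the pairing rule. It encodes one reading of the rule described above: an explicit open and close brace in the pattern must form a matching pair, and whatever .* consumes must be balanced on its own. The function names are hypothetical, and this is not Cobra's actual matching engine.

```python
def paired_spans(tokens):
    """Spans for '( .* )': the explicit '(' and ')' must be a matching pair."""
    stack, spans = [], []
    for i, tok in enumerate(tokens):
        if tok == '(':
            stack.append(i)
        elif tok == ')' and stack:
            spans.append((stack.pop(), i))
    return spans

def is_balanced(tokens):
    """True if every brace in tokens is closed, and opened, within tokens."""
    depth = 0
    for tok in tokens:
        if tok == '(':
            depth += 1
        elif tok == ')':
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def dot_star_close_spans(tokens):
    """Spans for '.* )': the part consumed by '.*' must be balanced, so it
    can never swallow the '(' that pairs with the explicit ')'."""
    spans = []
    for end, tok in enumerate(tokens):
        if tok != ')':
            continue
        for start in range(end + 1):
            if is_balanced(tokens[start:end]):
                spans.append((start, end))
    return spans

def show(tokens, spans):
    return [' '.join(tokens[s:e + 1]) for s, e in spans]

first = "( a ( b ) c ) ( d ) ( e )".split()
second = "( a b )".split()
print(show(first, paired_spans(first)))
print(show(second, dot_star_close_spans(second)))
```

The first call lists the four bracketed intervals in the order shown above, and the second lists the three suffixes that end in the closing brace without the opening one.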

The easiest way to see clearly how the matches work is to split an input source over multiple lines (so that there is one token per line) and to use the -json command-line option to print matches in JSON format. Another way, when working interactively, is to display the matches with the p or pre display option.

	$ cobra -json -pe '( .* )' *.c
or
	$ cobra *.c
	: pe ( .* )
	: p
or
	: json ( .* )

Interactive Use as Queries

Pattern expressions (and regular expressions) can also be used online in interactive queries. An example is:
	$ cobra -cpp -N8 *.c	# use preprocessing and 8 cores
	8 cores 39133 files 84,111,645 tokens
	: # find if/else/if chains that do not end with else
	: pe else if ( .* ) { .* } ^else
	: # find non-void functions without explicit return
	: pe ^void @ident ( .* ) { ^return* }
	: q
	$
See also the manual page, or the online tutorial that covers these and related topics.