Cobra Reference - inline programs

Cobra Interactive Query Language inline programs

NAME

inline programs

SYNTAX

	%{ ... %}

DESCRIPTION

Cobra inline programs are enclosed in the two character delimiters:

	%{
		...
	%}

An inline program can be part of a script. Like any other sequence of commands, it can also be stored in a file and called from the command line, e.g.:

	$ cobra -f file *.[ch]

If called in this way, Cobra executes the commands listed in the file, returns any results, and stops.
If not defined in a script (i.e., between def ... end markers), the inline program is executed immediately when the closing delimiter is seen.

Inline programs can be used to write more complex queries than interactive commands allow. This includes queries that require the use of conditional selection, iteration, and commands scanning larger parts of the code to identify patterns of interest.

By default an inline program executes once for each token in the input stream, but once the program is in control it can arbitrarily change that rule, for instance by moving the current token location, reversing the direction of the search, or aborting it altogether at any user-defined point.
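
As a minimal sketch of a program that takes control of the default scan, the following prints tokens until it reaches one whose sequence number exceeds 10, and then aborts the scan with the predefined Stop. The limit of 10 is an arbitrary value chosen for illustration; the .seq field holds the sequence number of a token (see tokens).

```
	%{
		if (.seq > 10)
		{	Stop;	# abort the scan at a user-defined point
		}
		print .seq " " .txt "\n";
	%}
```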

Note that the standard syntax for interactive query commands cannot be used within inline cobra programs: Cobra programs have their own syntax that allows for the definition of more powerful queries.

If the broader capabilities of inline programs are still insufficient to express a query, the user can write queries in C, and link a customized query engine with the Cobra front-end to access Cobra's data-structures. We will discuss this capability in the final section of the manual.

The following table gives an overview of all language features, with links to individual manual pages providing the details.

tokens
	tokens, token fields, and token references: Begin, End, first_t, last_t

variables
	assignments, types, pre and post increment

expressions
	unary and binary operators, string matching, and regular expressions

arrays
	associative arrays, maps

predefined values
	integer values: cpu, core, ncpu, ncore
	booleans: terse, verbose, true, false

predefined functions
	token processing: Next, Stop, reset(), fcts(), marks(), newtok()
	string functions: hash(), itostr(), match(), split(), strlen(), and substr()
	utility functions: assert(), print, restore(), save(), src_ln()
	associative arrays: retrieve(), size(), unset
	lists, queues, stacks: list_append(), list_push(), list_pop(), etc.
	accessing pattern sets: add_pattern(), del_pattern(), is_pattern(), pset()
	multicore support: a_unify(), sum(), set_ranges(), lock(), unlock()

user-defined functions
	the use of function parameters and recursion

keywords
	if, else, elif, while, break, continue, goto, for, foreach, in, function, return

comments
	the comment delimiter is a sequence of one or more # characters followed by a space or a tab. A # followed directly by an alphanumeric character sequence can be used as a token text reference (see Unary Operators under expressions)

Grammar

As a simple example of an inline program, consider this definition:

	def prog1
	%{
		print .fnm ":" .lnr ": " .txt "\n";
	%}
	end
	prog1

This program contains a single print statement (see also fct_index) that is executed once for each token in the data structure that was built for the program being analyzed. The print command takes any number of arguments, and will print output that corresponds with the type of each argument. Numbers are printed as numbers, and strings as strings, with the special characters \n and \t interpreted as newline and tab respectively. (No other special characters are recognized though.) A sample line of output from the above program can be:

	cobra_lib.c:1237: if

A single line of source code will typically hold multiple tokens, so the filename:linenumber combination will in most cases not uniquely identify all tokens. To identify individual tokens better you can also print the token's sequence number. So a program that prints the number and text of each token could be:

	%{
		print .seq " " .txt "\n";
	%}

See also tokens.

A Cobra program is a sequence of statements, with all basic statements terminated with a semi-colon.
In the following description, to avoid confusion we avoid the use of meta-symbols, so when the symbols [, ], *, or + appear the literal symbol is meant. The only exception is the + in stmnt+ in the first rule, which stands for one or more statements.

	prog:
		stmnt+
		fct_def

	stmnt:
		basic_stmnt ;
		compound_stmnt

	basic_stmnt:
		var = expr		-- assignment
		var ++			-- post-increment
		var --			-- post-decrement
		print params		-- print statement
		fct ( )			-- function call
		fct ( params )		-- function call
		goto label		-- unconditional
		label: stmnt		-- labeled statement
		return			-- in function definitions
		break			-- in while or for loops
		continue		-- in while and for loops

	compound_stmnt:
		while ( expr ) { prog }
		for ( var in array ) { prog }
		foreach ( var in name ) { prog }
		if ( expr ) { prog }
		if ( expr ) { prog } else { prog }
		if ( expr ) { prog } elif ( expr ) { prog }
		function name ( params ) { prog }
		function name ( ) { prog }

where:

	params:
		expr
		expr , params		-- one or more

	expr:
		( expr )		-- parentheses
		expr bin_op expr	-- binary operators
		prefix expr		-- eg, !@ident, see below
		number			-- integers only
		true			-- 1
		false			-- 0
		token_ref
		string
		variable
		function_call

	token_ref:
		.			-- the current token
		name			-- a token variable name
		Begin			-- first token for this core
		End			-- last token for this core
		first_t			-- first token of complete input sequence
		last_t			-- last token of complete input sequence

	string:
		"..."			-- any user-defined text string

	variable:
		. name			-- reference to a token field
		name . name		-- reference to a token field
		name			-- variable
		name [ string ]		-- associative array

	function_call:
		name ( params )		-- predefined or user-defined functions

the binary operators are:

	bin_op:
		+, -, *, /, %			-- arithmetic
		>, >=, <, <=, ==, !=, ||, &&	-- boolean

The + operator can also be used for string concatenation, for example:

	print "foo" + .txt + "goo" "\n";

Similarly, the boolean equals and unequals operators can also be used on strings:

	if (.fnm == "cobra_prep.c") { ... }

The unary prefix operators are:

	prefix:
		!			-- logical negation
		-			-- unary minus
		~			-- true if .txt contains pattern, eg ~yy
		^			-- true if .txt starts with pattern, eg ^yy
		#			-- true if .txt equals pattern, eg #yy
		@			-- true if .typ matches type, eg @ident

Note that the # symbol among the unary prefix operators requires some caution, because it also doubles as the Cobra comment delimiter. The rule is that if the # symbol is followed by a space, a tab, or another # symbol, then it is interpreted as a comment. If it is immediately followed by text, it is interpreted as the prefix operator.

For more detail, see expressions.
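
The following sketch, which is an illustration and not taken from the Cobra distribution, shows both uses of the # symbol side by side. It assumes that static, like switch, is not a cobra keyword, so that it can be matched with the # operator.

```
	%{
		# this is a comment, the symbol is followed by a space
		## this is a comment too, the symbol is followed by a second one
		if (#static)
		{	# the condition above uses the prefix operator on the text static
			print .fnm ":" .lnr ": " .txt "\n";
		}
	%}
```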

Assignments
An assignment statement is written in the conventional way, with a single equals sign:

	lhs = expr;

The left-hand side (lhs) can be a reference to a token (e.g., .) or a token field (e.g., .mark), a variable, or an element of an associative array.
Some examples of each use are:

	.mark = 5;	# mark is the only integer token field that can be assigned to
	.mark--;	# post decrement and increment are defined
	.mark++;	# as you suspected, this is a comment
	.txt = "Foo";	# .txt is one of two text fields that can be assigned to
	.typ = q.txt + .typ;	# .typ is the other; use + for string catenation
	. = .nxt;
	. = .prv;
	. = .jmp;
	q = .;
	q = .nxt;
	. = q;
	. = q.jmp;
	A[.txt] = .len;
	val = .lnr;

Variables do not need to be declared before they are used. The type of a variable or associative array element is inferred from context to be a value, a string, or a token reference.
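
A small sketch of the three kinds of variables, using only assignment forms listed above. The names n, s, and q are arbitrary, and the Stop at the end is there only to limit the output to a single line.

```
	%{
		n = .lnr;		# n holds a value
		s = .fnm + ":";		# s holds a string
		q = .;			# q holds a token reference
		print s n " " q.txt "\n";
		Stop;			# one line of output is enough here
	%}
```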

An associative array is identified by a basename and an index in square brackets. Associative arrays can store any type of result, a value, a string, or a token reference, and they can be indexed with multiple indices separated by commas. Some examples:

	basename[index] = value;
	basename[.txt , .len] = .;

	X[.txt] = .mark;
	X[.txt]++;
	Y[.mark] = .fnm;
	Z[.fnm] = .;
	Z[.fnm , 0, "foo"] = 42;

A new value (and type) may overwrite an old one at any time.

Associative array elements that store a token reference cannot be indexed directly with a token field. To do so the element must first be assigned to a regular token reference variable, for instance as follows:

	Z[.fnm] = .;
	Z[.fnm].mark;	# gives a syntax error
	q = Z[.fnm];	# is okay
	print q.mark ":" q.txt "\n";

The number of elements in an associative array can be determined with the size function:

	v = size(Z);

Normally the elements of an associative array are retrieved simply by reference, e.g., as in:

	x = Z["foo"];

If the array element evaluated does not exist, the result will be zero (or depending on context the empty string).
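
This default can be used to test whether something was recorded before. As a sketch, the following program reports only the first occurrence of each identifier; the array name Seen is an arbitrary choice, and the test mirrors the comparison with zero used in Example 3 below.

```
	%{
		if (@ident)
		{	if (Seen[.txt] > 0)
			{	Next;	# seen before, move on to the next token
			}
			Seen[.txt] = 1;
			print .fnm ":" .lnr ": first occurrence of '" .txt "'\n";
		}
	%}
```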

We can (only) iterate over the elements of an associative array with a for statement, as follows:

	for (i in Z)
	{	print i.mark ": " i.txt " = " Z[i.txt] "\n";
	}

The loop variable i is assigned as a token reference, which allows us to refer to different parts of the array elements that are returned. The index of the associative array is converted to a string and is available in the text field of the loop variable: i.txt. The .mark field gives a numeric index of the array element. For technical reasons the number in the .mark field is one higher than the actual index value, which starts at zero. An array element can also be retrieved directly by numeric index, using a predefined function. For instance,

	v = retrieve(Z, 0);

retrieves the first element of Z. The ordering of the elements in an associative array depends on internal implementation details, and is not related to the order in which the elements were added to the associative array.
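
To illustrate the for statement in context, here is a sketch that counts how often each identifier occurs and then reports the counts. It relies on two points made elsewhere on this page: an array element that does not exist yet evaluates to zero, and arrays are preserved from one inline program to the next (see NOTES). The array name Count is an arbitrary choice.

```
	%{
		if (@ident)
		{	Count[.txt] = Count[.txt] + 1;
		}
	%}
	%{
		for (i in Count)
		{	print i.txt ": " Count[i.txt] "\n";
		}
		Stop;	# report once, not once for each token
	%}
```

The order in which the identifiers are reported should not be relied on, for the reason given above.
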
We can also iterate over the elements of a pattern set, or the tokens of a given pattern, with the foreach statement. For example:

	: pe S: while ( .* )
	: %{
		cnt = 1;
		foreach (p in S)
		{	print cnt "\n===\n";
			foreach (t in p)
			{	print t.txt " ";
			}
			print "\n";
			cnt++;	# number the next pattern
		}
	  %}

This reproduces all patterns matched, preceded by a count. The names S and p have to be either the name of an existing pattern set or the name of the starting token of a pattern from such a set, as in the example shown.
If there is not enough information at compile time to determine if the target S is a set or a token reference, Cobra will assume the latter by default unless the qualifier 'pattern' is added directly following the 'foreach' keyword, as in:

	foreach pattern (p in S) { ... }

EXAMPLES

Example 1
The following example shows how we can match on a text string that is specified as a command-line argument to cobra itself.

	$ cat play/igrep.cobra
	def xmustbeinascript
	%{
		if (@ident && .txt == "x")
		{	print .fnm ":" .lnr ": " .txt "\n";
		}
	%}
	end
	xmustbeinascript
	$ 
	$ cobra -f play/igrep -var x=j *.c
	cobra_lib.c:1824: j 
	cobra_lib.c:1830: j 
	cobra_lib.c:1832: j 
	cobra_lib.c:1835: j 
	cobra_lib.c:1838: j 
	cobra_lib.c:2024: j 
	cobra_lib.c:2041: j 
	cobra_lib.c:2041: j 
	cobra_lib.c:2041: j
	$

Note that there can be more than one match of the token text per line of code. Line cobra_lib.c:2041 above, for instance, has three matches of a token named j:

	cobra_lib.c:2041:	for (j = 0; x && j < span; j++)

Example 2
The following example illustrates the use of a while loop and of token reference variables. The program counts the number of cases in a C switch statement, taking into account that switch statements may be nested.

	$ num stats/nr_cases.cobra
	    1	def nr_cases
	    2	%{
	    3		if (.curly > 0 && #switch)
	    4		{      	# introduce a token variable q:
	    5			q = .;
	    6			. = .nxt;
	    7			if (.txt != "(" )
	    8			{       . = q;
	    9				Next;
	   10			}
	   11			. = .jmp;
	   12			. = .nxt;
	   13			if (.txt != "{")
	   14			{       . = q;
	   15				Next;
	   16			}
	   17	
	   18			q.mark = 0;
	   19			while (.curly >= q.curly)
	   20			{	if (.curly == q.curly + 1
	   21				&&  (#case || #default))
	   22				{	q.mark++;
	   23				}
	   24				. = .nxt;
	   25			}
	   26			print q.mark " " .fnm ":" q.lnr "\n";
	   27			. = q;
	   28		}
	   29	%}
	   30	end
	   31	nr_cases

Running it produces output like this, reporting the number of cases in all switch statements, including the default cases:

	$ cobra -f stats/nr_cases cobra_lib.c | sort -n
	3 cobra_lib.c:1129 
	3 cobra_lib.c:160 
	3 cobra_lib.c:500 
	4 cobra_lib.c:2142 
	5 cobra_lib.c:993 
	6 cobra_lib.c:2109 
	10 cobra_lib.c:963 
	22 cobra_lib.c:920

A line by line explanation of this program is as follows.

Line 1 defines the code as a script, so that we can call it by name (as is done on line 31).
Line 2 and line 29 are the start and end delimiters for the cobra program.
Line 3 defines the start of a conditional execution, matching only tokens that have a nesting level for curly braces greater than zero (meaning that it only looks inside function definitions), and whose text is the literal switch. Because switch is not a cobra keyword (it is a C keyword) we can match on it with a # operator. If we needed to match on text that happens to be a cobra keyword (e.g., while) then we have to write the condition differently and use: .txt == "while", where the text is parsed as a string to avoid the confusion.
Line 4 contains a comment, which starts with a # symbol. To avoid confusion with the text matching operator (e.g., used on line 3), when # is used as a text matching operator it may not be followed by a white space character (space or tab) nor by another # symbol. A comment must start with a space, a tab, or a second # symbol.
Line 5 contains an assignment to q, thus introducing a new variable name q to point to the current token location, which is represented by a single dot. Since q is a general token variable, we can also use it to refer to the other attributes of a token, even when it is not associated with a specific token location (e.g., q.mark). The only integer field of a token structure that can be modified though is its .mark field, as is done on line 22 (the text fields .txt and .typ can also be assigned to, see Assignments). We can read any part of the token structure though, as for instance done on line 26 where we refer to q.lnr.
Line 6 advances the token position forward one step. Note that this means that the implicit outer loop over all token values will also advance. The dot . in this case works just like the index variable in a for loop which can be modified within the body of the loop.
Be warned also that if you modify the token position arbitrarily, you could accidentally create an infinite loop. For instance, the cobra program:
```
		%{
			. = .prv;	# creates an infinite loop
		%}
```
is a sure way to create an infinite loop, because the program body merely reverses the token position advance that is performed in the implied loop over all tokens that cobra performs when it executes this program once for each available token in the input files.
Line 7 checks if the current token is an open round brace. If it is not, the code jumps back on Line 8 to the location we stored in the variable q (line 5), and yields control back to cobra on line 9, so that it can advance to the next token position and repeat the program. Note that a failure to match on the symbol ( would be unusual, since in a valid C program the keyword switch is always followed by an expression in round braces.
If we reach line 11, we know that the current token is ( so we can move forward to the matching closing brace. This is done with the . = .jmp; statement. At this point we will be at the token matching ), and on line 12 we move past it to the next token position.
Line 13 checks if the token we see now is an opening curly brace. If not, we probably have a syntactically invalid C program, so we abandon the processing again, jumping back to the original position remembered in variable q before yielding control back to cobra.
Line 18 now (re)initializes the value of q.mark to zero.
The while loop on lines 19-25 now counts all tokens matching the string case or default within the body of the switch statement, not counting any such tokens that may appear in lower nesting levels, e.g., belonging to nested switch statements. We can do this by looking at the nesting level as recorded in the field .curly from the current token, and comparing it to the value that we remembered in variable q for the switch statement we are currently processing. Note that q points to a location just outside the body of the switch statement, so we have to add one to the value of q.curly to get the nesting level inside the switch body.
Line 24 advances us through the code of the switch statement until the condition of the loop on line 19 no longer holds, because we have left the body of the statement.
Line 26 prints the number of case clauses seen, together with the filename and line number of the switch statement itself.
Line 27, finally, restores the token position to the one recorded in variable q, so that when cobra advances the token position for the next round of processing it will point to the token following the switch token that we just processed. In this way we will also be able to count switch statements that are nested inside the body of the switch statement we just completed, and get a count for these as well.

Example 3
The only main language feature that we have not discussed yet is the associative array, which can be used to associate a value, string, or token reference with a text string or a value in a named array. The following, somewhat naive, example illustrates the basic concept:

	%{
		if (#float)
		{	. = .nxt;
			if (@ident)
			{	X[.txt] = 1;
				print .fnm ":" .lnr ": declaration of '" .txt "'\n";
			}
			Next;
		}
		if (@ident && X[.txt] > 0)
		{	print .fnm ":" .lnr ": use of float '" .txt "'\n";
		}
	%}

This example uses an associative array named X to remember that we have seen the string .txt. The array associates the array element in this case with a non-zero integer value. Although in this case the right-hand side of the assignment is a value, it can also be a string, or a token reference. The value stored in X is retrieved in the condition of the second if-statement. If there turns out to be no value stored for the string specified, the value returned will be zero.
The second if-statement checks for every identifier whether the corresponding text string from .txt was recorded before. If so, we know that this identifier first appeared following the C keyword float, and must therefore be a floating point variable. For simplicity here, this version ignores that variable declarations can include multiple names separated by commas, as well as initializers.

	%{
		if (#float)
		{	. = .nxt;
			if (@ident)
			{	Store[.txt] = .;	# store the current location
				print .fnm ":" .lnr ": declaration of '" .txt "'\n";
			}
			Next;
		}
		if (@ident)
		{	q = Store[.txt];
			if (q.lnr != 0)
			{	print .fnm ":" .lnr ": use of float '" .txt "' ";
				print "declared at " q.fnm ":" q.lnr "\n";
		}	}
	%}

In this version we store and retrieve a token reference, but then need to check that the retrieved value corresponds to an actual location that was set earlier. We do so by checking the line number field, which is never zero for an actual token.

NOTES

Token variable and array references are preserved across runs of inline Cobra programs, which helps to make the following example work:

	%{
		# check the identifier length for all tokens
		# and remember the longest in q

		if (@ident && .len > q.len)
		{	q = .;
		}
	%}
	%{
		print "longest identifier: " q.txt " = " q.len " chars\n";
		Stop;	# stops the second run after the line is printed
	%}