Cobra Interactive Query Language inline programs

NAME

inline programs

SYNTAX

	%{ ... %}

DESCRIPTION

Cobra inline programs are enclosed in the two character delimiters:
	%{
		...
	%}
An inline program can be part of a script. Like any other sequence of commands, it can also be stored in a file and called from the command line, e.g.:
	$ cobra -f file *.[ch]
If called in this way, Cobra executes the commands listed in the file, returns any results, and stops.
If not defined in a script (i.e., between def ... end markers), the inline program is executed immediately when the closing delimiter is seen.

Inline programs can be used to write more complex queries than interactive commands allow. This includes queries that require the use of conditional selection, iteration, and commands scanning larger parts of the code to identify patterns of interest.

By default an inline program executes once for each token in the input stream, but once the program is in control it can arbitrarily change that rule, for instance by moving the current token location, reversing the direction of the search, or aborting it altogether at any user-defined point.

Note that the standard syntax for interactive query commands cannot be used within inline cobra programs: Cobra programs have their own syntax that allows for the definition of more powerful queries.

If the broader capabilities of inline programs are still insufficient to express a query, the user can also write queries in C, and link a customized query engine with the Cobra front-end to access Cobra's data structures. This capability is discussed in the final section of the manual.

As a first example of an inline program, consider this definition:
	def prog1
	%{
		print .fnm ":" .lnr ": " .txt "\n";
	%}
	end
	prog1
This program contains a single print statement that is executed once for each token in the data structure that was built for the program being analyzed. The print command takes any number of arguments, and prints output that corresponds to the type of each argument. Numbers are printed as numbers, and strings as strings, with the special characters \n and \t interpreted as newline and tab respectively. (No other special characters are recognized.) A sample line of output from the above program can be:
	cobra_lib.c:1237: if
A single line of source code will typically hold multiple tokens, so the filename:linenumber combination will in most cases not uniquely identify all tokens. To identify individual tokens better you can also print the token's sequence number. So a (rather unhelpful) program that prints the number and text of each token could be:
	%{
		print .seq " " .txt "\n";
	%}
Statements in Cobra programs must always be terminated by a semi-colon.

Three types of fields of tokens can be referred to (not just by the print command but anywhere in a cobra program), depending on the type of value that they return:

	strings:
		.fct		# name of containing function, or "global"
		.fnm		# file name
		.txt		# token text
		.typ		# token type

	numbers:
		.round		# nesting level of ()
		.bracket	# nesting level of []
		.curly		# nesting level of {}
		.len		# length of token text
		.lnr		# linenumber
		.mark		# user-definable integer value
		.seq		# token sequence number
		.range		# the number of lines in the associated range

	tokens:
		.nxt		# the immediately following token
		.prv		# the immediately preceding token
		.jmp		# move to other end of range, eg from { to } or back
		.bound		# link to bound symbol reference
The dot symbol that is used here refers to the current token being processed. A token position can also be assigned to a variable, and then the token fields can be referred to using that variable name. For instance:
	if (.txt == "{")
	{	q = .jmp;	# q points to the matching } token
		r = q.jmp;	# r should now point back at the {
		assert(r == .);	# aborts the program if false
	}
Note though that only one level of dereferencing is supported, so:
	r = .nxt.nxt;		# gives syntax error
	r = .nxt; r = r.nxt;	# is okay
The following keywords can be used for defining the control-flow of an inline program, with the usual semantics.
	if
	else
	while
	break
	continue
	goto
	for
	in
	function
	return
The for keyword can be used to loop over the indices of an associative array (discussed shortly), as in:
	for ( varname in arrayname ) { ... }
The function keyword can be used to define new functions, e.g. as in:
	function name(par1, par2) {
		. = .nxt;
		if (.txt == par1)
		{	return par2;
		}
		return 0;
	}
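As a hedged sketch of how such a function might be called (the name next_is and its use here are illustrative only, not taken from the Cobra distribution), a user-defined function can be used inside a condition:
	%{
		function next_is(s, v) {	# hypothetical helper
			q = .nxt;
			if (q.txt == s)
			{	return v;
			}
			return 0;
		}
		if (@ident && next_is("(", 1) > 0)
		{	.mark = 1;	# identifier directly followed by (
		}
	%}
Unlike the function shown above, this sketch inspects the next token through a variable and leaves the current token position unchanged.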
The remaining keywords in the language are:
	Begin		-- the first token in the range (e.g., for a given cpu core)
	End		-- the last  token in the range (for a cpu core)
	first_t		-- the first token in the entire token sequence
	last_t		-- the last  token in the entire token sequence
Normally, in single-core executions, the values of Begin and first_t will be the same, as will the values of End and last_t. This is different in multi-core runs, where the token sequence is split between the cores. The following example illustrates how in a multi-core run we can still force a single cpu to traverse all tokens from the original sequence:
	%{
		if (cpu != 0)		# assuming ncore > 1
		{	Stop;		# stop looping over tokens
		}
		. = first_t;		# override the default setting
		while (. != last_t)	# and define our own explicit loop
		{	# do something
			. = .nxt;	# advance to the next token
		}
		Stop;			# we're done
	%}
There are two meta-commands to modify the default token-processing:
	Next		-- end the processing of the current token and proceed to the next
	Stop		-- abort the processing of tokens and end program execution
We already used the Stop command above. The Next command can be used to cut the program short for the current token and make Cobra advance to the next token in the original input sequence, where the program repeats. For instance:
	%{
		if (!@ident)		# if the current token is not an identifier
		{	Next;		# move on
		}			# else
		.mark = 1;		# mark it
	%}
	= "Number of identifiers:"
Predefined functions:
	assert(expr)	-- to check the truth of an expression
	print args	-- the print statement discussed earlier
	newtok()	-- to create a new lexical token, for scratch values
The newtok function is useful to create a placeholder token, without modifying any existing token.
For instance:
	%{
		if (. == Begin)
		{	q = newtok();
			q.txt = "foobar";
		}
		...
	%}
String functions:
	match(s1, s2)	-- true if string s1 matches s2, where s2 can be a regular expression
	strlen(s)	-- returns the length of string s
	substr(s, n, m)	-- returns the m-character substring of s starting at n
For instance:
	%{
		if (match(.txt, "/[Yy][Yy]"))	# regex
		{	# matches if .txt contains YY yy Yy or yY
			print .fnm ":" .lnr " " .txt "\n";
		}
		if (match(.fnm, "//usr"))	# regex
		{	# matches if the filename contains /usr
			.mark++;
		}
		if (match(.fnm, "\/usr"))	# not a regex
		{	# matches if the filename equals /usr
			cnt++;
		}
		if (.txt == "/usr")		# not a regex
		{	# matches if the token text equals /usr
			first_t.mark++;
		}
	%}
Support for variables and associative arrays:
	retrieve(A, n)	-- retrieves the nth element of associative array A
	size(A)		-- returns the number of elements stored in array A
	unset A[el]	-- remove associative array element A[el]
	unset A		-- remove variable or array A
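A minimal sketch of size and unset in use, assuming a single-core run and an array named Seen that is introduced only for this illustration:
	%{
		if (@ident)
		{	Seen[.txt]++;	# count each identifier name
		}
	%}
	%{
		print size(Seen) " distinct identifiers\n";
		unset Seen;		# remove the whole array
		Stop;
	%}
The second program runs once and stops. The array is still available to it because variables and arrays are preserved across runs of inline programs.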
Booleans:
	terse		-- the (read-only) value of the externally set display mode
A typical use of the latter could be:
	%{
		if (@ident)
		{	.mark++;
			if (!terse && ncore == 1)
			{	print .fnm ":" .lnr " " .txt "\n";
		}	}
	%}
Concurrency related:
	cpu or core	-- the id of the current cpu core (0..Ncore-1)
	ncpu or ncore	-- the number of available cpu cores (Ncore)
	a_unify(...)	-- see section Multi-Core
	sum(A[el])	-- add up the values of integer array elements across cores
	set_ranges(from, to) -- see section Multi-Core
	lock()		-- set a mutual exclusion lock between cores
	unlock()	-- release the lock
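The following sketch shows one plausible use of lock and unlock, assuming they behave like a conventional mutex shared by all cores: serializing print statements so that output lines from different cores are less likely to interleave.
	%{
		if (@ident)
		{	lock();
			print cpu ": " .fnm ":" .lnr ": " .txt "\n";
			unlock();
		}
	%}
Whether this is needed for print depends on the implementation. The same pattern applies to any update that must not be performed by two cores at once.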
The number of cores used for processing Cobra programs is by default set to be 4 on systems with at least that many cores. It can also be set explicitly with a command-line argument. For instance, to use 16 cores:
	$ cobra -N16 -f prog.cobra *.[ch]
And it can be modified interactively with a query command, e.g.:
	: ncore=16
It cannot be modified inside an inline program though.

For more detail on the multi-core language see section Multi-Core.

Statements

This section gives a little more detail about the structure of Cobra programs. A Cobra program is a sequence of statements, with all basic statements terminated with a semi-colon.
In the following description, to avoid confusion we avoid the use of any meta-symbols, so when [, ], *, or + symbols appear the literal symbol is always meant.

	prog:
		stmnt+
		fct_def

	stmnt:
		basic_stmnt ;
		compound_stmnt

	basic_stmnt:
		var = expr		-- assignment
		var ++			-- post-increment
		var --			-- post-decrement
		print params		-- print statement
		fct ( )			-- function call
		fct ( params )		-- function call
		goto label		-- unconditional
		label: stmnt		-- labeled statement
		return			-- in function definitions
		break			-- in while or for loops
		continue		-- in while and for loops

	compound_stmnt:
		while ( expr ) { prog }
		for ( var in array ) { prog }
		if ( expr ) { prog }
		if ( expr ) { prog } else { prog }
		function name ( params ) { prog }
		function name ( ) { prog }
where:
	params:
		expr
		expr , params		-- one or more

	expr:
		( expr )		-- parentheses
		expr bin_op expr	-- binary operators
		prefix expr		-- eg, !@ident, see below
		number			-- integers only
		true			-- 1
		false			-- 0
		token_ref
		string
		variable
		function_call

	token_ref:
		.			-- the current token
		name			-- a token variable name
		Begin			-- first token
		End			-- last token

	string:
		"..."			-- any user-defined text string

	variable:
		. name			-- reference to a token field
		name . name		-- reference to a token field
		name			-- variable
		name [ string ]		-- associative array

	function_call:
		name ( params )		-- predefined or user-defined functions
the binary operators are:
	bin_op:
		+, -, *, /, %			-- arithmetic
		>, >=, <, <=, ==, !=, ||, &&	-- boolean
The + operator can also be used for string concatenation, for example:
	print "foo" + .txt + "goo" "\n";
Similarly, the boolean equals and unequals operators can also be used on strings:
	if (.fnm == "cobra_prep.c") { ... }
The unary prefix operators are:
	prefix:
		!			-- logical negation
		-			-- unary minus
		~			-- true if .txt contains pattern, eg ~yy
		^			-- true if .txt starts with pattern, eg ^yy
		#			-- true if .txt equals pattern, eg #yy
		@			-- true if .typ matches type, eg @ident
Note that the # symbol among the unary prefix operators requires some caution, because it also doubles as the Cobra comment delimiter. The rule is that if the # symbol is followed by a space or another # symbol, then it is interpreted as a comment. If it is immediately followed by text, it is interpreted as the prefix operator.
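The difference is easiest to see side by side. In this small sketch the # in #float is the prefix operator, while the # followed by a space and the doubled ## both start comments:
	%{
		if (#float)	# operator: no space before float
		{	.mark++;
		}
		## a second symbol directly after the first also starts a comment
	%}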

Assignments

An assignment statement is written in the conventional way, with a single equals sign:

	lhs = expr;
The left-hand side (lhs) can be a reference to a token (e.g., .) or a token field (e.g., .mark), a variable, or an element of an associative array.
Some examples of each use are:
	.mark = 5;	# mark is the only integer token field that can be assigned to
	.mark--;	# post decrement and increment are defined
	.mark++;	# as you suspected, this is a comment
	.txt = "Foo";	# .txt is one of two text fields that can be assigned to
	.typ = q.txt + .typ;	# .typ is the other; use + for string catenation
	. = .nxt;
	. = .prv;
	. = .jmp;
	q = .;
	q = .nxt;
	. = q;
	. = q.jmp;
	A[.txt] = .len;
	val = .lnr;
Variables do not need to be declared before they are used. The type of a variable or associative array element is inferred from context to be a value, a string, or a token reference.

An associative array is identified by a basename and an index in square brackets. Associative arrays can store any type of result, a value, a string, or a token reference, and they can be indexed with multiple indices separated by commas. Some examples:

	basename[index] = value;
	basename[.txt , .len] = .;

	X[.txt] = .mark;
	X[.txt]++;
	Y[.mark] = .fnm;
	Z[.fnm] = .;
	Z[.fnm , 0, "foo"] = 42;
A new value (and type) may overwrite an old one at any time.

Associative array elements that store a token reference cannot be followed directly by a token field name. To access a field, the element must first be assigned to a regular token reference variable, for instance as follows:

	Z[.fnm] = .;
	Z[.fnm].mark;	# gives a syntax error
	q = Z[.fnm];	# is okay
	print q.mark ":" q.txt "\n";
The number of elements in an associative array can be determined with the size function:
	v = size(Z);
Normally the elements of an associative array are retrieved simply by reference, e.g., as in:
	x = Z["foo"];
If the array element evaluated does not exist, the result will be zero (or depending on context the empty string).

We can iterate over the elements of an associative array with a for statement, as follows:

	for (i in Z)
	{	print i.mark ": " i.txt " = " Z[i.txt] "\n";
	}
The loop variable i is assigned as a token reference, which allows us to refer to different parts of the array elements that are returned. The index of the associative array is converted to a string and is available in the text field of the loop variable: i.txt. The .mark field gives a numeric index of the array element. For technical reasons the number in the .mark field is one higher than the actual index value, which starts at zero. An array element can also be retrieved directly with a numeric index with a predefined function. For instance,
	v = retrieve(Z, 0);
retrieves the first element of Z. The ordering of the elements in an associative array depends on internal implementation details, and is not related to the order in which the elements were added to the associative array.
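Combining size and retrieve gives an index-based loop. This is a sketch only: it assumes that Z holds numbers or strings, and that valid indices run from 0 to size(Z)-1, as the description above suggests.
	%{
		n = size(Z);
		i = 0;
		while (i < n)
		{	v = retrieve(Z, i);
			print i ": " v "\n";
			i++;
		}
		Stop;
	%}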

Multi-Core

Some of the less obvious language features for use in multi-core runs of Cobra include:

	a_unify(...)	-- make the array indices added by all cores visible to one core
	sum(varname)	-- add up all the values of an integer variable across cores
	sum(A[el])	-- add up the values of integer array elements across cores
	set_ranges(from, to) -- redefine the token range to be processed
A simple example of the working of the sum function for a simple variable, when running with four cores:
	: %{ if (@ident) { x++; } %}
	: %{ y = sum(x); print x " " y "\n"; Stop; %}
	5652 23020 
	5690 23020 
	5213 23020 
	6465 23020 
	: 
This will of course fail if the variable holds anything other than integer values.

Another example that illustrates the use of the a_unify and sum functions is the following Cobra program that can be used to find the 10 most frequently occurring trigrams of token types in the input.

	%{
		q = .nxt;
		r = q.nxt;
		if (.typ != "" && q.typ != "" && r.typ != "")
		{	Trigram[.typ, q.typ, r.typ]++;
		}
	%}
	track start _tmp_
	%{
		if (cpu != 0)
		{	Stop;
		}
		a_unify(0);
		for (i in Trigram)
		{	print i.txt "\t" sum(Trigram[i.txt]) "\n";
		}
		Stop;
	%}
	track stop
	!sort -k2 -n < _tmp_ | tail -10; rm -f _tmp_
At the time of writing, applying this Cobra program to the Cobra sources produces the following output:
	$ cobra -f play/trigram *.[ch]
	ident,oper,chr 		209 
	const_int,oper,ident 	231 
	oper,oper,ident 	232 
	storage,type,oper 	239 
	key,const_int,oper 	250 
	storage,type,ident 	702 
	ident,oper,const_int 	1000 
	type,oper,ident 	1298 
	oper,ident,oper 	3541 
	ident,oper,ident 	5695 
The processing uses two inline program fragments. The first collects the data, which is assumed to be performed in a multi-core run. It will also work correctly if only a single cpu is used of course, but in that case we could use a simpler version of the code as well.

Before the second program fragment is executed we divert the output to a file called _tmp_ with a track command. We conclude the diversion after the fragment has executed, and then use regular external sort and tail commands to process the data, after which the temporary tracking file is deleted.

The second program fragment is used to process the data that was collected by the multiple cpus before -- each cpu working on a different portion of the input sequence. The cores store the data they collect in private copies of the associative array, to avoid race conditions or the chance of data corruption with simultaneous access to the same fields of the array. This means that we must unify the contents of the associative array to make all elements visible to a given cpu. We select the cpu numbered 0 for this, and halt all other cpus at the start of the second fragment.

Cpu 0 then calls a_unify, passing it its cpu number 0. Once this call completes, cpu 0 has access to all the array indices of Trigram, no matter which core added that array index. Importantly though, the values stored at the indices are not collected. Note that the value stored could be a token reference, a string, or a number, so it isn't possible to define a uniform way to unify all that data. In this case, we know the array elements are used to store integer counts, and we want to add up all those counts. That is done with the call to the predefined function sum.

Finally, the use of the function set_ranges(from, to) in combination with newtok() allows us to write Cobra programs that can completely (re)define all or some of the input sequence. For instance:

	%{
		a = newtok(); a.txt = "2";
		b = newtok(); b.txt = "+"; b.typ = "oper";
		c = newtok(); c.txt = "2";
		a.nxt = b;
		b.nxt = c;
		set_ranges(a, c);
		Stop;
	%}
	%{
		print .txt "\n";
	%}
If we run this program on arbitrary other input, the result printed would be:
	2
	+
	2
Note that to be able to execute the first part of this program the input sequence must contain at least one token. The program then executes for that token, replaces the input token sequence with its own, and stops. If there is no token, the program cannot execute.

We conclude this overview of the Cobra inline query language with some examples.

Example 1

The following example shows how we can match on a text string that is specified in a command-line argument to cobra itself.
	$ cat play/igrep.cobra
	%{
		if (@ident && .txt == Name)
		{	print .fnm ":" .lnr ": " .txt "\n";
		}
	%}
	$ cobra -f play/igrep -var Name=j *.c
	cobra_lib.c:1824: j 
	cobra_lib.c:1830: j 
	cobra_lib.c:1832: j 
	cobra_lib.c:1835: j 
	cobra_lib.c:1838: j 
	cobra_lib.c:2024: j 
	cobra_lib.c:2041: j 
	cobra_lib.c:2041: j 
	cobra_lib.c:2041: j
	$
Note that there can be more than one match of the token text per line of code. Line cobra_lib.c:2041 above, for instance, has three matches of a token named j:
	cobra_lib.c:2041:	for (j = 0; x && j < span; j++)

Example 2

The following example illustrates the use of a while loop and of token reference variables. The program counts the number of cases in a C switch statement, taking into account that switch statements may be nested.
	$ num stats/nr_cases.cobra
	    1	def nr_cases
	    2	%{
	    3		if (.curly > 0 && #switch)
	    4		{      	# introduce a token variable q:
	    5			q = .;
	    6			. = .nxt;
	    7			if (.txt != "(" )
	    8			{       . = q;
	    9				Next;
	   10			}
	   11			. = .jmp;
	   12			. = .nxt;
	   13			if (.txt != "{")
	   14			{       . = q;
	   15				Next;
	   16			}
	   17	
	   18			q.mark = 0;
	   19			while (.curly >= q.curly)
	   20			{	if (.curly == q.curly + 1
	   21				&&  (#case || #default))
	   22				{	q.mark++;
	   23				}
	   24				. = .nxt;
	   25			}
	   26			print q.mark " " .fnm ":" q.lnr "\n";
	   27			. = q;
	   28		}
	   29	%}
	   30	end
	   31	nr_cases
Running it produces output like this, reporting the number of cases in all switch statements, including the default cases:
	$ cobra -f stats/nr_cases cobra_lib.c | sort -n
	3 cobra_lib.c:1129 
	3 cobra_lib.c:160 
	3 cobra_lib.c:500 
	4 cobra_lib.c:2142 
	5 cobra_lib.c:993 
	6 cobra_lib.c:2109 
	10 cobra_lib.c:963 
	22 cobra_lib.c:920 
A line by line explanation of this program is as follows.
  • Line 1 defines the code as a script, so that we can call it by name (as is done on line 31).
  • Line 2 and line 29 are the start and end delimiters for the cobra program.
  • Line 3 defines the start of a conditional execution, matching only tokens that have a nesting level for curly braces greater than zero (meaning that it only looks inside function definitions), and whose text is the literal switch. Because switch is not a cobra keyword (it is a C keyword) we can match on it with a # operator. If we needed to match on text that happens to be a cobra keyword (e.g., while) then we have to write the condition differently and use: .txt == "while", where the text is parsed as a string to avoid the confusion.
  • Line 4 contains a comment, which starts with a # symbol. To avoid confusion with the text matching operator (e.g., used on line 3), when # is used as a text matching operator it may not be followed by a white space character (space or tab) nor by another # symbol. A comment must start with a space, a tab, or a second # symbol.
  • Line 5 contains an assignment to q, thus introducing a new variable name q to point to the current token location, which is represented by a single dot. Since q is a general token variable, we can also use it to refer to the other attributes of a token, even when it is not associated with a specific token location (e.g., q.mark). The only integer field of a token structure that can be modified, though, is its .mark field, as is done on line 22. We can read any part of the token structure, as is done for instance on line 26 where we refer to q.lnr.
  • Line 6 advances the token position forward one step. Note that this means that the implicit outer loop over all token values will also advance. The dot . in this case works just like the index variable in a for loop which can be modified within the body of the loop.
    Be warned also that if you modify the token position arbitrarily, you could accidentally create an infinite loop. For instance, the cobra program:
    		%{
    			. = .prv;	# creates an infinite loop
    		%}
    
    is a sure way to create an infinite loop, because the program body merely reverses the token position advance that is performed in the implied loop over all tokens that cobra performs when it executes this program once for each available token in the input files.
  • Line 7 checks if the current token is an open round brace. If it is not, the code jumps back on Line 8 to the location we stored in the variable q (line 5), and yields control back to cobra on line 9, so that it can advance to the next token position and repeat the program. Note that a failure to match on the symbol ( would be unusual, since in a valid C program the keyword switch is always followed by an expression in round braces.
  • If we reach line 11, we know that the current token is ( so we can move forward to the matching closing brace. This is done with the . = .jmp; statement. At this point we will be at the token matching ), and on line 12 we move past it to the next token position.
  • Line 13 checks if the token we see now is an opening curly brace. If not, we probably have a syntactically invalid C program, so we abandon the processing again, jump back to the original position remembered in variable q before yielding back control to cobra.
  • Line 18 now (re)initializes the value of q.mark to zero.
  • The while loop on lines 19-25 now counts all tokens matching the strings case or default within the body of the switch statement, not counting any such tokens that appear at deeper nesting levels, e.g., belonging to nested switch statements. We can do this by looking at the nesting level as recorded in the field .curly from the current token, and comparing it to the value that we remembered in variable q for the switch statement we are currently processing. Note that q points to a location just outside the body of the switch statement, so we have to add one to the value of q.curly to get the nesting level inside the switch body.
  • Line 24 advances us through the code of the switch statement until the condition of the loop on line 19 no longer holds, because we have left the body of the statement.
  • Line 26 prints the number of case clauses seen, together with the filename and line number of the switch statement itself.
  • Line 27, finally, restores the token position to the one recorded in variable q, so that when cobra advances the token position for the next round of processing it will point to the statement following the token matching switch that we just processed. In this way we will be able to count also switch statements that appear in code nested inside the body of the switch statement we just completed, and get a count for these as well.

Example 3

The only main language feature that we have not discussed yet is the associative array, which can be used to associate a value, string, or token reference with a text string or a value in a named array. The following, somewhat naive, example illustrates the basic concept:
	%{
		if (#float)
		{	. = .nxt;
			if (@ident)
			{	X[.txt] = 1;
				print .fnm ":" .lnr ": declaration of '" .txt "'\n";
			}
			Next;
		}
		if (@ident && X[.txt] > 0)
		{	print .fnm ":" .lnr ": use of float '" .txt "'\n";
		}
	%}
This example uses an associative array named X to remember that we have seen the string .txt. The array associates the array element in this case with a non-zero integer value. Although in this case the right-hand side of the assignment is a value, it can also be a string, or a token reference. The value stored in X is retrieved in the condition of the second if-statement. If there turns out to be no value stored for the string specified, the value returned will be zero.
The second if-statement checks for every identifier whether the corresponding text string from .txt was recorded before. If so, we know that this identifier first appeared following the C keyword float, and must therefore be a floating point variable. For simplicity here, this version ignores that variable declarations can include multiple names separated by commas, as well as initializers. The next version stores a token reference rather than an integer, so that the location of the declaration can be reported as well:
	%{
		if (#float)
		{	. = .nxt;
			if (@ident)
			{	Store[.txt] = .;	# store the current location
				print .fnm ":" .lnr ": declaration of '" .txt "'\n";
			}
			Next;
		}
		if (@ident)
		{	q = Store[.txt];
			if (q.lnr != 0)
			{	print .fnm ":" .lnr ": use of float '" .txt "' ";
				print "declared at " q.fnm ":" q.lnr "\n";
		}	}
	%}
In this version we store and retrieve a token reference, but then need to check that the retrieved value corresponds to an actual location that was set earlier. We do so by checking the line number field, which is never zero for an actual token.

Token variable and array references are preserved across runs of inline Cobra programs, which helps to make the following example work:

	%{
		# check the identifier length for all tokens
		# and remember the longest in q

		if (@ident && .len > q.len)
		{	q = .;
		}
	%}
	%{
		print "longest identifier: " q.txt " = " q.len " chars\n";
		Stop;	# stops the second run after the line is printed
	%}

(Last Updated: 8 May 2017)