Cobra Inline Programs functions

NAME

predefined functions

DESCRIPTION

Token Processing

Two predefined functions relate exclusively to token processing.
	Next	# stops processing the current token, and moves to the next
		# token in the sequence to restart the inline program
	Stop	# stops processing the current token, and does not
		# proceed to the next token
They are discussed in more detail in the manual page for tokens.

Three other functions are of a more general nature, but defined for use in Cobra scripts as equivalents of query commands:

	reset()	 # resets all marks and bounds to zero (like the query command)
	fcts()	 # marks the names in all C function definitions
	marks(n) # returns the number of marks in set n,
		 # where n is 0..3 and marks(0) refers to the current set
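
As a minimal sketch of how marks(0) might be used: the first fragment below marks every token whose text is goto, and the second reports the size of the current set of marks once and then stops. The choice of goto is only an illustration, and the sketch assumes that no marks were set beforehand.
	%{
		if (.txt == "goto")
		{	.mark++;	# add this token to the current set of marks
		}
	%}
	%{
		print "marked tokens: " marks(0) "\n";
		Stop;		# report once, not once per token
	%}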

String functions

Three predefined functions are defined for operations on text strings:
	match(s1, s2)	-- true if string s1 matches s2, where s2 can be a regular expression
	strlen(s)	-- returns the length of string s
	substr(s, n, m)	-- returns the m-character substring of s starting at n
For instance:
	%{
		if (match(.txt, "/[Yy][Yy]"))	# regex
		{	# matches if .txt contains YY yy Yy or yY
			print .fnm ":" .lnr " " .txt "\n";
		}
		if (match(.fnm, "//usr"))	# regex
		{	# matches if the filename contains /usr
			.mark++;
		}
		if (match(.fnm, "\/usr"))	# not a regex
		{	# matches if the filename equals /usr
			cnt++;
		}
		if (.txt == "/usr")		# not a regex
		{	# matches if the token text equals /usr
			first_t.mark++;
		}
	%}
Note that if regular expressions aren't required, it is often simpler to use the builtin shortcut operators #, @, ^ or ~ for referring to token text fields (see expressions), as in:
	if (#/usr)	is the same as:  if (match(.txt, "\/usr"))
	if (@const_int)	is the same as:  if (match(.typ, "const_int"))
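The other two string functions can be combined in the same way. The sketch below reports identifiers longer than 30 characters and prints only a prefix of each. It assumes that character positions in substr start at 0, which is not stated here, and the limit of 30 is arbitrary.
	%{
		if (.typ == "ident" && strlen(.txt) > 30)
		{	# assumption: position 0 is the first character of .txt
			print .fnm ":" .lnr ": " substr(.txt, 0, 30) "...\n";
		}
	%}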

Associative Array Functions

Two functions support operations on associative arrays:
	retrieve(A, n)	-- retrieves the nth index of associative array A
	size(A)		-- returns the number of elements stored in array A
So to print the value stored at the nth index, one would use:
	print "index " n " is " retrieve(A,n) " and holds: " A[retrieve(A,n)] "\n";
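The two functions might be combined to walk over all elements of an array, as in the following sketch. It assumes that positions are numbered from 0 to size(A)-1, which is not stated here. The array name Count and the variables n and key are illustrative only.
	%{
		if (.typ == "ident")
		{	Count[.txt]++;	# count occurrences of each identifier
		}
	%}
	%{
		n = 0;
		while (n < size(Count))	# assumption: positions start at 0
		{	key = retrieve(Count, n);
			print key "\t" Count[key] "\n";
			n++;
		}
		Stop;
	%}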

Lists, Queues, Stacks

There are ten functions that can be used to manipulate collections of tokens, with synonyms for most of them. Lists do not need to be declared; they are created when first accessed.
	Function:			Synonym:
	list_add_top(name, token)	list_push(name, token)	-- add token to start of list 'name'
	list_add_bot(name, token)	list_append(name, token) -- add token to end of list 'name'
	list_del_top(name)		list_pop(name)		-- remove and release token from start of list
	list_del_bot(name)		list_chop(name)		-- remove and release token from end of list
	list_get_top(name)		list_top(name)		-- return token at start of list (not removed)
	list_get_bot(name)		list_bot(name)		-- return token at end of list (not removed)
	list_new_tok()			list_tok()		-- return a new token, to use in lists
	list_rel_tok(token)		list_tok_rel(token)	-- release a token never stored on any list
	list_rel(name)						-- remove list and release all tokens on it
	list_len(name)						-- return the number of tokens in list 'name'
An example of using the list functions can be found in the Cobra rules directory: rules/play/list_test.cobra.

General Utility Functions

A general purpose function allows us to delete either entire arrays or specific elements of an array:
	unset A[el]	-- remove associative array element A[el]
	unset A		-- remove variable or array A
Two other general purpose predefined functions are:
	assert(expr)	-- to check the truth of an expression
	print args	-- the print statement
The assert function has the usual purpose and behavior, which can be helpful in debugging new Cobra programs.
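A small sketch of unset and assert used together, with a hypothetical array A and index "x" chosen only for illustration:
	%{
		A["x"] = 1;
		assert(size(A) == 1);	# one element has been stored
		unset A["x"];		# remove that element again
		print "elements left: " size(A) "\n";
		Stop;
	%}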

The print statement is used to print text and values on the standard output. It can take any number of arguments, of any type, each of which is converted to text before printing. There is no format string, as there is in C, and only two special symbol sequences are recognized to change the appearance of the output: the tab meta-character \t, and the newline character \n.
Even though all arguments to the print statement are strings, there is no need to catenate them with + operators: the catenation is automatic. Note that if you print a token value, rather than a token field, the value printed will represent the unique address of that token, as in:

	%{
		print . "\n";
		Stop;
	%}
	0x600010058

The next two utility functions cover special cases that occur less frequently. They are:

	newtok()		# to create a new lexical token, for scratch values
	set_ranges(from,to)	# to (re)define the first and last token in an input sequence
The newtok() function is useful to create a placeholder token, without modifying any existing token. It takes no arguments.
For instance:
	%{
		if (. == Begin)
		{	q = newtok();
			q.txt = "foobar";
		}
		...
	%}
The use of the function set_ranges(from, to) in combination with newtok() allows us to write Cobra programs that can completely (re)define all or some of the input sequence. For instance:
	%{
		a = newtok(); a.txt = "2";
		b = newtok(); b.txt = "+"; b.typ = "oper";
		c = newtok(); c.txt = "2";
		a.nxt = b;
		b.nxt = c;
		set_ranges(a, c);
		Stop;
	%}
	%{
		print .txt "\n";
	%}
If we run this program on arbitrary other input, the result printed would be:
	2
	+
	2
Note that to be able to execute the first part of this program, the input sequence must contain at least one token. The program then executes for that token, replaces the input token sequence with its own, and stops. If there is no token, the program cannot execute.

The final two predefined functions provide the equivalents of the save and restore query commands within scripts. They are:

	save(n, s)	# with n:1..3 and s: "", "|", "&", or "^"
	restore(n, s)	# with n:1..3 and s: "", "|", "&", or "^"

Multi-Core Support

The following collection of functions is for multi-core use. First there is a lock and unlock function for enforcing a single global lock on executions:
	lock		# make sure only one core can pass this lock
	unlock		# until the lock is released again with unlock
To see the need for the next two functions, remember that during a multi-core run each core processes a different portion of the input token sequence, and maintains separate copies of the variables and arrays that are constructed.
For final processing, when we want a single core to collect all data and report on it, we need to be able to either sum (integers) or unify (associative arrays) the data. For this we can use the predefined functions sum and a_unify.
	sum(varname)	# add up all the values of an integer variable across cores
	sum(A[el])	# add up the values of integer array elements across cores
	a_unify(n)	# unify all array data objects to be accessible on cpu n
	a_unify(a, n)	# unify the contents of array a to be accessible on cpu n
A simple example of the working of the sum function for a simple variable, when running with four cores:
	: %{ if (@ident) { x++; } %}
	: %{ y = sum(x); print x " " y "\n"; Stop; %}
	5652 23020 
	5690 23020 
	5213 23020 
	6465 23020 
	:
The call to sum will fail, of course, if the variable holds anything other than integer values.

An example that illustrates the use of both the sum and the a_unify function is the following Cobra program, which can be used to find the ten most frequently occurring trigrams of token types in the input.

	ncore 4	# use four cores in parallel
	%{
		q = .nxt;
		r = q.nxt;
		if (.typ != "" && q.typ != "" && r.typ != "")
		{	Trigram[.typ, q.typ, r.typ]++;
		}
	%}
	track start _tmp_
	%{
		if (cpu != 0)
		{	Stop;
		}
		a_unify(0);
		for (i in Trigram)
		{	print i.txt "\t" sum(Trigram[i.txt]) "\n";
		}
		Stop;
	%}
	track stop
	!sort -k2 -n < _tmp_ | tail -10; rm -f _tmp_
At the time of writing, applying this Cobra program to the Cobra sources produces the following output:
	$ cobra -f play/trigram *.[ch]
	ident,oper,chr 		209 
	const_int,oper,ident 	231 
	oper,oper,ident 	232 
	storage,type,oper 	239 
	key,const_int,oper 	250 
	storage,type,ident 	702 
	ident,oper,const_int 	1000 
	type,oper,ident 	1298 
	oper,ident,oper 	3541 
	ident,oper,ident 	5695
The processing uses two inline program fragments. The first collects the data, and is assumed to run in a multi-core setting. It also works correctly if only a single cpu is used, although in that case a simpler version of the code would suffice. Before the second program fragment is executed, we divert the output to a file called _tmp_ with a track command. We conclude the diversion after the fragment has executed, and then use regular external sort and tail commands to process the data, after which the temporary tracking file is deleted.

The second program fragment is used to process the data that was collected by the multiple cpus before -- each cpu working on a different portion of the input sequence. The cores store the data they collect in private copies of the associative array, to avoid race conditions or the chance of data corruption with simultaneous access to the same fields of the array. This means that we must unify the contents of the associative array to make all elements visible to a given cpu. We select the cpu numbered 0 for this, and halt all other cpus at the start of the second fragment.

Cpu 0 then calls a_unify, passing it its cpu number 0. Once this call completes, cpu 0 has access to all the array indices of Trigram, no matter which core added that array index. Importantly though, the values stored at the indices are not collected. Note that the value stored could be a token reference, a string, or a number, so it isn't possible to define a uniform way to unify all that data. In this case, we know the array elements are used to store integer counts, and we want to add up all those counts. That is done with the call to the predefined function sum.
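
For a single-cpu run, the simpler version mentioned earlier might look like the following sketch. It keeps the same temporary file name _tmp_ and drops the cpu test and the calls to a_unify and sum, on the assumption that with one cpu there is only one copy of the array, so the stored counts can be printed directly.
	%{
		q = .nxt;
		r = q.nxt;
		if (.typ != "" && q.typ != "" && r.typ != "")
		{	Trigram[.typ, q.typ, r.typ]++;
		}
	%}
	track start _tmp_
	%{
		for (i in Trigram)
		{	# single copy of the array: no unify or sum needed
			print i.txt "\t" Trigram[i.txt] "\n";
		}
		Stop;
	%}
	track stop
	!sort -k2 -n < _tmp_ | tail -10; rm -f _tmp_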


(Last Updated: 7 September 2020)