Mars Code: The DCAS Algorithm Revisited

The cover article of the February issue of the Communications of the ACM discussed the process that JPL followed in the development of the software for the Mars Science Laboratory mission and its Curiosity rover [1]. The spacecraft was launched from Cape Canaveral on November 26, 2011, and landed safely in Gale Crater on Mars on August 6, 2012, for a nominal mission of 23 (Earth) months. Although not required, the rover may well continue to operate considerably longer.

Some 20 months into the mission (March 2014), there have been notably few residual software defects discovered in flight, despite the fact that the core flight software for this mission was substantially larger and more complex than for any previous mission.

MSL is controlled by close to 3 Million lines of C code, which is about five times more than for the MER mission from 2003, and twenty times more than for the Mars Pathfinder mission from 1996. Yet the number of software defects discovered in-flight on MSL so far has been at least an order of magnitude smaller than for the earlier missions.
There are many factors that may have contributed to the robustness of the software. They include a restructuring of the code, the introduction of a new code review process, new certification courses for flight software developers, the use of multiple state-of-the-art static source code analyzers, and the use of more advanced software analysis techniques, such as logic model checking.
Since flight code for interplanetary spacecraft legally falls under ITAR rules, the details of the code and its analysis cannot be published. For the CACM article we therefore used a small example from the literature to illustrate the use of a logic model checker to check subtle multi-threaded code for possible race conditions, data corruption, or deadlock. All these types of software flaws have been experienced on earlier Mars missions. Because the MSL code contains about 120 parallel threads of execution, the concern about potential concurrency flaws was quite real.
Simplifying Model Checking

Model checking, like any other type of formal analysis, can be a time-consuming and somewhat erudite process. Our effort in the development of logic model checking techniques over the years has therefore focused on simplifying this process sufficiently that it is accessible to any software developer, and can be performed as a routine step that does not require specialized knowledge. Model checkers were first developed in the early eighties. The tool we developed started life as the Pan verifier at Bell Labs in 1981, and was first publicly released as Spin in 1989. Since then the tool has continued to evolve, and it has become one of the most widely used software verification tools for multi-threaded code. Originally, to use a logic model checker, the user had to manually construct a logic model of the algorithm to be checked in the specification language supported by the tool. (In the case of Spin that language is called ProMeLa, short for Process Meta Language.) The construction of that formal model can be time consuming.

To simplify the process, much effort has been devoted to the development of a method for mechanically extracting formal models directly from implementation-level C source code. We built the Modex tool for this purpose. The model extractor was first applied to a commercial product in the logic verification of the call processing code for Lucent Technologies' PathStar switch, at Bell Laboratories in 1998 and 1999. Further refinements of the basic method have been made since then.

With this approach, the formal verification of a surprisingly broad set of concurrent algorithms can today be automated. In many cases the only user requirement is the construction of a small test driver that exercises the code in a meaningful way. The test driver is much like what a developer would already construct as part of a normal unit test, so a formal verification does not impose any additional work in these cases.
Selecting an Example

To illustrate the verification process for the CACM article, we picked a concurrent algorithm from the literature with a known bug: a double-ended queue implementation in C that uses a double-word compare-and-swap (DCAS) instruction as its key synchronization primitive. The algorithm, first published in [2], had some flaws that were described in [3]. The most important flaw allows the same element to be retrieved from the queue by two different threads of execution, given a sufficiently demonic interleaving of readers and writers. In [1] we tried to show how we could use the Modex model extractor to let the Spin model checker discover that flaw, using just a single test reader and a single test writer process accessing the queue concurrently.

The article with the sample verification was proofread and reviewed in the normal way before it was published. In the first six weeks after publication it was downloaded an impressive 41,000 times from the ACM website, and of course it also appeared in print, distributed to about 100,000 ACM members in February. Many readers commented on the article, and at least two of them (credit to Yuliy Gerchikov and Thorkil Naur) checked the example from the article in detail. What aroused their curiosity was that the verification appeared to show that the flaw in the original algorithm could be exposed with just two threads of execution, and not three as shown in [3]. These two readers asked for more detail, and in the process of re-examining the example we discovered that the test driver processes we had defined for it were fatally flawed.

One flaw, which in retrospect seems unlikely to have escaped attention for so long, was a mistake in the assertion that was used to catch the known flaw in the original algorithm. The sample writer process pushed elements into the queue from the left, and the reader process pulled them out of the queue from the same side. This means that the reader process could not really expect to retrieve the elements in any predictable order, and certainly not in the order in which they were sent. (If anything it would be the reverse order, if the writer process were allowed to complete all pushes before the reader started.) The test reader process should of course pull elements from the right side of the queue, not the left. Then we can indeed assert that the elements are retrieved without loss or duplication, in the same order as sent, no matter how the steps of the reader and writer processes are interleaved.
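For concreteness, the corrected test drivers can be sketched as follows. This is only an illustration and not the exact code used in the verification runs; it assumes the return conventions used elsewhere in this note (pushLeft() returns OKAY on success, popRight() returns the popped value or EMPTY) and that the declarations from dcas.h and <assert.h> are in scope.

	#define MAX_N	3
	
	void
	writer(void)	/* pushes the values 0..MAX_N-1 from the left */
	{	int i;
	
		for (i = 0; i < MAX_N; i++)
		{	if (pushLeft(i) != OKAY)
			{	i--;	/* retry if the push did not succeed */
		}	}
	}
	
	void
	reader(void)	/* pops from the right, so FIFO order can be asserted */
	{	int i = 0, rv;
	
		while (i < MAX_N)
		{	rv = popRight();
			if (rv != EMPTY)
			{	assert(rv == i);	/* each value arrives exactly once, in order */
				i++;
		}	}
	}

The assertion in the reader is the formal statement of the property described above: with a single writer pushing from the left and a single reader popping from the right, the values must come out in the order in which they were pushed, without loss or duplication.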
Correcting this flaw quickly led the model checker to a different assertion violation, still using just one test reader and one test writer process. More alert this time, we inspected this new counter-example closely and found that it too was flawed, though in this case the flaw was caused by a (known) limitation of the model extractor.
    Synopsis: The model extractor expects all conditional tests to be side-effect free. The source code for the original DCAS algorithm did not satisfy this restriction.

    More detail: The model extractor is designed to convert the control flow of a program from C source code into Spin format, while encapsulating all C statements unchanged for execution during the model checking process. Expressions that are used in the conditional of any if, while, for, etc. statement must be side-effect free. If the conditional is an expression, the model extractor can perform some simple syntactic checks to make sure that this is the case. The expression can, for instance, be verified to be free of increment, decrement, or assignment operators with a simple syntactic check. If the conditional is a function call, though, the model extractor currently has to take it on faith that the function returns a boolean result without side-effects (i.e., is a pure function). For the example, it was important to show that one can extract a Spin model from the code without modifying the source. The original DCAS algorithm, though, contains calls to the DCAS instruction in the conditionals of if-then-else statements, and that instruction (modeled as an atomic function call in the verification test harness) is not side-effect free when the returned value is true. This means that the algorithm has to be rewritten to change:

    	if (DCAS(...)) { ... } else { ... }
    	
    into
    	int tmp;
    	tmp = DCAS(...);
    	if (tmp) { ... } else { ... }
    	
    The change is small, but without it the model checker cannot guarantee to preserve the logical soundness of the verification. And, sadly, the DCAS example illustrates this all too well.
The Corrected Example

If we perform a new verification with the corrected test drivers and the corrected encoding of the DCAS routines, the invalid counter-example disappears, and we obtain instead proof that in this case the algorithm cannot fail, no matter how the parallel executions of the sample reader and writer processes are interleaved. The proof is rendered quickly: to produce it, the model checker explores 1.2 Million reachable states in about three seconds of CPU time:
	$ modex -run dcas.c
	MODEX Version 2.3 - 05 December 2013
	(Spin Version 6.2.7 -- 24 February 2014)
		+ Partial Order Reduction
	
	Full statespace search for:
		never claim         	- (none specified)
		assertion violations	+
		cycle checks       	- (disabled by -DSAFETY)
		invalid end states	+
	
	State-vector 528 byte, depth reached 1855, errors: 0
	  1215743 states, stored
	   794967 states, matched
	  2010710 transitions (= stored+matched)
	   591362 atomic steps
	hash conflicts:       460 (resolved)
	
	Stats on memory usage (in Megabytes):
	  644.639	equivalent memory usage for states (stored*(State-vector + overhead))
	  541.585	actual memory usage for states (compression: 84.01%)
	         	state-vector as stored = 439 byte + 28 byte overhead
	 4096.000	memory used for hash table (-w29)
	    0.534	memory used for DFS stack (-m10000)
	    1.763	memory lost to fragmentation
	 4636.671	total actual memory usage
	
	pan: elapsed time 2.97 seconds
	pan: rate 409341.08 states/second
Extending the Example

We can extend the test drivers to check the correctness of the algorithm when multiple readers and writers can access the queue, still sticking with extracting the verification model from C source code alone. We then have to change the assertion, since values can now be retrieved from the queue by different readers. We can perform this check by recording every receive in a global array (with an atomic update), and verifying that no value is received more than once. We can also check that eventually all values sent are received, by checking that the global array eventually records every value sent as also received.

We define three new test processes, and the key assertion, as follows.

	#define MAX_N	3
	
	char received[MAX_N];	// non-zero when the corresponding value is received
	
	#define check_value(v)	assert(v >= 0 && v < MAX_N && received[v] == 0); received[v] = 1;
	
	void
	p1(void)
	{	int rv;
	
		while (!RightHat)
		{	/* wait */
		}
	
		rv = popLeft();
		if (rv != EMPTY)
		{	Printf("p1_rv=%d\n", rv);
			check_value(rv);
		}
	}
	
	void
	p2(void)
	{	int rv;
	
		while (!RightHat)
		{	/* wait */
		}
	
		rv = popRight();
		if (rv != EMPTY)
		{	Printf("p2_rv=%d\n", rv);
			check_value(rv);
		}
	}
	
	void
	p(void)
	{	int i, v;
	
		initialize();
	
		for (i = 0; i < MAX_N; i++)
		{	v = pushLeft(i);
			if (v != OKAY)
		{	i--;	// retry: the push did not succeed
		}	}
	}
Note that the p1 process executes a single popLeft() call, and the p2 process executes a single popRight() call. We'll instantiate two copies of each, so that all these calls can be executed in parallel. In this way we create a total of four parallel attempts to retrieve elements from the queue, without specifying the order in which these attempts are made or how the execution of these calls is interleaved.

We use a single test driver process p to insert MAX_N elements into the queue, using just pushLeft() calls in this case.

To effect these changes we modify the guidance instructions to the model extractor as follows:

	%F dcas.c
	%X -e pushLeft
	%X -e2 popRight
	%X -e2 popLeft
	%X -e initialize
	%X -e dcas_malloc
	%X -a2 p1
	%X -a2 p2
	%X -a p
	%D
	#include "dcas.h"
	%O dcas.c
We need to instantiate two copies of the popLeft() and popRight() functions to allow the calls (initiated by the p1 and p2 processes) to be executed in parallel. The rest is as before. Clearly, with these changes the complexity of the verification increases significantly: enough that a quick run of a few seconds will no longer suffice to render a verdict.

The best strategy in this case, to still obtain a result in a reasonable amount of time, is to perform random sampling of the reachable state space. This can be done with the Spin model checker by making use of the BITSTATE search option (a standard search option in the model checker for handling very large problem sizes). In addition, we now have access to a large number of different search settings that let the model checker randomly explore different parts of the state space. A good strategy here is a swarm approach, in which we use a simple preprocessor to Spin (not surprisingly called swarm) to generate a script that, within a user-defined amount of time (we picked two hours) and on a given number of processing cores (we picked 50), performs many small, randomly different searches for an assertion violation. This approach can be remarkably effective for very complex verification problems that may defy analysis by other means.

The swarm configuration file we can now use looks as follows.

	# range
	k       1       4       # optional: to restrict min and max nr of hash functions
	 
	# limits
	d       25000           # optional: to restrict the max search depth
	cpus    50              # nr available cpus (exact)
	memory  1G              # max memory per run; M=megabytes, G=gigabytes
	time    120m            # max time for all runs; h=hours, m=min, s=sec, d=days
	hash    1.5             # hash-factor (estimate)
	vector  640             # nr of bytes per state (estimate)
	speed   200000          # nr states explored per second (estimate)
	file    model           # file with the spin model to be verified
	 
	# compilation options (each line defines one complete search mode)
	# each option currently must include -DBITSTATE and -DPUTPID
	 
	-DBITSTATE -DPUTPID dcas.c              # basic dfs
	-DBITSTATE -DPUTPID -DREVERSE dcas.c    # reversed process ordering
	-DBITSTATE -DPUTPID -DT_REVERSE dcas.c  # reversed transition ordering
	-DBITSTATE -DPUTPID -DREVERSE -DT_REVERSE dcas.c        # both
	-DBITSTATE -DPUTPID -DP_RAND -DT_RAND dcas.c    # same series with randomization
	-DBITSTATE -DPUTPID -DP_RAND -DT_RAND dcas.c -DT_REVERSE
	-DBITSTATE -DPUTPID -DP_RAND -DT_RAND dcas.c -DREVERSE
	-DBITSTATE -DPUTPID -DP_RAND -DT_RAND dcas.c -DREVERSE -DT_REVERSE
	-DBITSTATE -DPUTPID -DBCS dcas.c        # with bounded context switching
	-DBITSTATE -DPUTPID -DBCS dcas.c -DREVERSE
	-DBITSTATE -DPUTPID -DBCS dcas.c -DT_REVERSE
	-DBITSTATE -DPUTPID -DBCS dcas.c -DREVERSE -DT_REVERSE
From this configuration file the swarm tool generates a script that spawns 574 small searches on 50 CPU cores to hunt for any flaw in the DCAS algorithm, using no more than the specified upper bound of two hours of CPU time.

Assertion Violation

One of the random swarm search jobs now succeeds in finding a flaw, approximately one hour into the search. The scenario that is found is not the shortest or simplest possible, but it does capture the basic type of flaw that was exposed earlier in [3]: a case where two parallel reader processes (one, pid 2, executing a popRight() call and one, pid 4, executing a popLeft() call) succeed in retrieving the same element from the queue.

For the full trail (reproduced with the model checker command ./pan -C) see: trail.html. For a numbered listing of the full model as extracted from the C sources see: model.html.
Since the ./pan -C replay option in Spin provides us with a step-by-step executable failure scenario as C code, we can at this point add any detail that helps diagnose the exact cause of the failure. We can, for instance, visualize the scenario in more detail by also capturing the values of all pointer variables that play a role in the queue updates. This is easily done by editing the generated pan.c file and adding printf statements for the relevant information to the replay code. We can then display the key states of the queue, and the values of all pointers, after each successful execution of a DCAS instruction in eight subsequent steps, in the same format that was used in Figures 6.1 through 6.5 of [3] for the (slightly different) failure scenario presented there.

Note that this scenario was found by an undirected search of the logic model checker and contains more steps than are strictly necessary to create the failure. The first few steps in the sequence are merely creating the right conditions for the fault to occur, pushing several elements into the queue with pushLeft() calls.

Two elements, V0 and V1, have now been added. The first element (V0) is popped from the queue at this point with a popRight() call. In the figure we also provide the step number in the scenario reported by the model checker where the relevant DCAS operation appears; in this case the DCAS appears in step 161 of the full scenario listing. (Not all of these steps are shown in the summary of the key portion above.)

Next, the third element, V2, is added with pushLeft(2) by the process executing function p (with pid 0).

In the remainder of the scenario, three elements are popped from the queue, two with popLeft() and one with popRight(). The last two of these calls are already in progress and pause at just the wrong places in their code, so that the failure can occur.

The first popLeft() returns the value 2 (V2) and is executed by the first instance of process p1 (pid=3).

The next call to complete is a popLeft() executed by a different instance of the p1 process (pid=4); it returns the value 1 (V1).

Now a popRight() executed by an instance of process p2 (with pid=2) completes and again returns the value 1 (V1), which violates the constraint that no queue element should be returned more than once.

The notation "(h) of popRight() of 1 by p2:2:10 in step 295" in the caption of the last figure means: in step 295 of the trail file, the instance of function p2 with pid=2 completed execution of a DCAS instruction in the popRight() function, using a helper thread with pid=10. The value eventually returned by that popRight() call was 1 (corresponding to queue element V1).
A Further Generalization

We can take the verification a step further by defining a slightly more general test harness for the DCAS algorithm. We want to avoid assuming anything about the nature of the flaw in the original algorithm from [2], and specifically about how many parallel processes or queue elements it might take to reveal it. We illustrate here how one can formalize the test process for this purpose in the syntax of the model checker, which has richer semantics for defining logic verifications of this type. This allows us to avoid saying anything about the specific order in which push and pop operations are performed by each process, or whether these operations are performed on the left or the right side of the queue.
This of course defeats our original purpose in [1], which was to show that the entire verification process can in many cases be based on C code alone. We chose the wrong example to illustrate this.
We can ask the model checker to explore all possible ways in which N processes can push and pop M different elements (carrying M different integer values). A test driver, specified in ProMeLa, that does so looks as follows.
	active [N] proctype ptest()	// create N processes
	{	byte tmp;		// a scratch variable used inside the inline functions
		short v;
	
		if
		:: _pid == 0 -> c_code { // the initialize() routine
	 		now.Dummy = (Node *) dcas_malloc(sizeof(Node));
			now.Dummy->L = now.Dummy->R = now.Dummy;
			now.LeftHat  = now.Dummy;
			now.RightHat = now.Dummy;
			}
		:: c_expr { (now.RightHat) }
		fi;
	
		do
		:: atomic { to_send < M ->	// up to M elements
				v = to_send;
				to_send++;
			};
			if
			:: pushLeft()	// sends v
			:: pushRight()	// sends v
			fi;
			printf("sent %d\n", v)
	
		:: true ->
			if
			:: popLeft()	// rcv into v
			:: popRight()	// rcv into v
			fi;
			atomic {
			if
			:: v >= 0 ->
				printf("rcv %d\n", v);
				assert(v >= 0 && v < M && val_rcv[v] == 0);
				val_rcv[v] = 1
			:: else
			fi
			}
		od
	}
The pushLeft(), pushRight(), popLeft(), and popRight() routines are instrumented as inline functions here, with the bodies of the inlines extracted by the Modex tool. We modified the extracted code slightly for this purpose, to make sure that the push routines retrieve the value to be pushed from variable v, and that the pop routines return the popped value in that same variable.

If we set N to 2 and M to 3, the verification explores a little over 15 Million reachable states in about 15 seconds, and renders the verdict that no assertion violations are possible: i.e., the algorithm is guaranteed to work correctly under these conditions.
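For reference, a run of this kind follows the standard Spin workflow, roughly as sketched below. These exact commands are an assumption and not taken from the original write-up: the model file name is the one echoed in the unreached-states listing, and the compile and run-time options (-DSAFETY, -DHC4 for hash-compact storage, -m10000, -w24) are inferred from the output that follows.

	$ spin -a model_n2_m3                   # generate pan.c from the generalized test harness (N=2, M=3)
	$ cc -O2 -DSAFETY -DHC4 -o pan pan.c    # safety-only check with hash-compact state storage
	$ ./pan -m10000 -w24                    # run-time settings matching the listing below

The verification then reports: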

	(Spin Version 6.2.7 -- 24 February 2014)
	        + Partial Order Reduction
	
	Hash-Compact 4 search for:
	        never claim             - (none specified)
	        assertion violations    +
	        cycle checks            - (disabled by -DSAFETY)
	        invalid end states      +
	
	State-vector 376 byte, depth reached 541, errors: 0
	 15241309 states, stored
	 12464615 states, matched
	 27705924 transitions (= stored+matched)
	   874965 atomic steps
	hash conflicts:   3895132 (resolved)
	
	Stats on memory usage (in Megabytes):
	 5872.239       equivalent memory usage for states (stored*(State-vector + overhead))
	  581.656       actual memory usage for states (compression: 9.91%)
	                state-vector as stored = 12 byte + 28 byte overhead
	  128.000       memory used for hash table (-w24)
	    0.534       memory used for DFS stack (-m10000)
	  710.175       total actual memory usage
	
	
	unreached in proctype ptest
	        model_n2_m3:43, state 12, "assert(0)"
	        model_n2_m3:234, state 127, "-end-"
	        (2 of 127 states)
	
	pan: elapsed time 15.1 seconds
	pan: rate 1012039.1 states/second
Note that we now explore more reachable states than we did before with the sample reader and writer processes, but the result we obtain here is stronger, because we have broadened the scope of the verification to include all four push and pop routines, and not just two (pushLeft() and popRight()). In the inline encodings of the push routines we used an assertion (assert(0)) to catch cases where the malloc routine could reach a state where the heap is exhausted. We have sized the heap large enough for this case that this cannot happen, and thus the assert(0) statements remain unreachable in the model.
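The bounded-heap idea can be sketched as follows. This is only an illustration, not the actual allocator from the test harness: the pool size and variable names are assumptions, and in the real models the equivalent assert(0) sits in the inline encodings of the push routines rather than in the allocator itself. The effect is the same in either place: any state in which the fixed-size heap runs out is flagged as an error, so if the verification completes without reaching the assertion we know the chosen pool size was sufficient.

	#include <assert.h>
	#include <stddef.h>
	#include "dcas.h"		/* assumed to declare the Node type */
	
	#define POOL_SIZE	8	/* hypothetical size, larger than any of the tests above need */
	
	static Node pool[POOL_SIZE];	/* fixed-size heap, part of the model state */
	static int  next_free;
	
	void *
	dcas_malloc(size_t n)
	{
		assert(n == sizeof(Node));	/* only Node-sized allocations are expected */
		if (next_free >= POOL_SIZE)
		{	assert(0);		/* heap exhausted: must remain unreachable */
		}
		return (void *) &pool[next_free++];
	}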

If we repeat this verification with the number of elements pushed into the queue increased to four (setting N to 2 and M to 4), the result remains unchanged. The complexity of that verification again increases significantly: the model checker must now explore about 248 Million reachable states and takes 5 minutes to render its verdict.

More interesting is the case where we increase N from 2 to 3, and keep M equal to 3. This verification too is considerably more complex, but the model checker can identify a counter-example reliably. A default bitstate verification run, for instance, successfully identifies the critical assertion violation after exploring 44 Million reachable states, in about two minutes.

	pan:1: assertion violated (((v>=0)&&(v<3))&&(val_rcv[v]==0)) (at depth 105)
	pan: wrote model_n3_m3.trail

	(Spin Version 6.2.7 -- 24 February 2014)
	Warning: Search not completed
	        + Partial Order Reduction
	
	Bit statespace search for:
	        never claim             - (none specified)
	        assertion violations    +
	        acceptance   cycles     - (not selected)
	        invalid end states      +
	
	State-vector 440 byte, depth reached 1040, errors: 1
	 44335269 states, stored
	 50831172 states, matched
	 95166441 transitions (= stored+matched)
	        0 atomic steps
	
	hash factor: 3.02734 (best if > 100.)
	
	bits set per state: 3 (-k3)
	
	Stats on memory usage (in Megabytes):
	19787.699       equivalent memory usage for states (stored*(State-vector + overhead))
	   16.000       memory used for hash array (-w27)
	    0.076       memory used for bit stack
	    0.534       memory used for DFS stack (-m10000)
	   17.413       total actual memory usage
	
	pan: elapsed time 116 seconds
	pan: rate 383788.69 states/second
The counter-example is similar to the version shown above and the one discussed in [3].

We can also perform a further test and check if, for instance, reducing the number of elements pushed into the queue from 3 to 2 can still cause the algorithm to fail. This turns out not to be the case (using N equal to 3 and M equal to 2):

	(Spin Version 6.2.7 -- 24 February 2014)
	        + Partial Order Reduction
	
	Hash-Compact 4 search for:
	        never claim             - (none specified)
	        assertion violations    +
	        cycle checks            - (disabled by -DSAFETY)
	        invalid end states      +
	
	State-vector 440 byte, depth reached 2071, errors: 0
	5.2636082e+08 states, stored
	6.8725985e+08 states, matched
	1.2136207e+09 transitions (= stored+matched)
	 48577981 atomic steps
	hash conflicts: 5.6197891e+08 (resolved)
	
	Stats on memory usage (in Megabytes):
	234925.141      equivalent memory usage for states (stored*(State-vector + overhead))
	20639.399       actual memory usage for states (compression: 8.79%)
	                state-vector as stored = 13 byte + 28 byte overhead
	 2048.000       memory used for hash table (-w28)
	    0.534       memory used for DFS stack (-m10000)
	22688.046       total actual memory usage
	
	
	unreached in proctype ptest
	        model_n3_m2:43, state 12, "assert(0)"
	        model_n3_m2:234, state 127, "-end-"
	        (2 of 127 states)
	
	pan: elapsed time 857 seconds
	pan: rate 614505.49 states/second
This full verification takes about 15 minutes and explores 526 Million reachable states.

Other tests, varying the number of processes or queue elements, can be performed, for instance for N=4 and M=2, but they do not reveal any new types of failure. Increasing the number of processes above three increases the verification complexity to the point where only statistical sampling methods can be used to perform the verifications.

As a test we ran a single BITSTATE verification for about 138 hours with this configuration, exploring over 10^11 reachable states, without encountering assertion failures.
This illustrates the main difference between approaches based on logic model checking, as used here (and as we attempted to illustrate in [1]), and those based on theorem proving techniques, as used in [3]. Logic model checkers can excel at rapidly identifying subtle bugs in small to medium-size test cases, while a verification based on theorem proving can be significantly more time consuming, but, when successful, can produce results that hold for arbitrary problem sizes.


"The purpose of analysis is not to compel belief but to suggest doubt."
Imre Lakatos [4]


Our original objective for using the analysis of the DCAS algorithm as an example in [1] was to show two things:

  1. That the information required by a logic model checker for doing exhaustive proofs of correctness can in many cases be extracted mechanically from C source code, without requiring a user to hand-craft a verification model first.
  2. That the verification itself, in many cases, can be performed very quickly.
With the gentle benefit of hindsight, we picked the wrong example to illustrate especially the second point. For the DCAS example we can show that the complete context for a verification can be extracted mechanically from C sources, but it would be a stretch to say that the final verification steps for this example can be executed quickly.

In model checking applications we train ourselves to be highly suspicious of results that fail to produce counter-example trails, seemingly showing that an application is correct. This is almost always the right course, since so few manually designed concurrent algorithms are free from subtle flaws. As we learned in this example, it is good to be suspicious of all results of verification attempts (mechanical or otherwise), and to do due diligence in verifying the validity of all results, especially the ones we expect to get. Many thanks to Yuliy Gerchikov and Thorkil Naur for reminding me!


References

  1. Holzmann, G., Mars Code, Communications of the ACM, Vol. 57, No. 2, Feb. 2014, pp. 64-73.
  2. Detlefs, D.L. et al. Even Better DCAS-based concurrent deques. In Distributed Algorithms, LNCS Vol. 1914, M. Herlihy, Ed. Springer-Verlag, Heidelberg, Germany, 2000, pp. 59-73.
  3. Doherty, S. Modelling and Verifying Non-blocking Algorithms that Use Dynamically Allocated Memory, Master's thesis, Victoria University, Wellington, New Zealand, 2004.
  4. Lakatos, I., Proofs and Refutations, 1977.


March 9, 2014.