## Mars Code: The DCAS Algorithm Revisited

The cover article of the February issue of the Communications of the ACM discussed the process that JPL followed in the development of the software for the Mars Science Laboratory mission and its Curiosity rover [1]. The spacecraft was launched from Cape Canaveral on December 26, 2011, and landed safely in Gale Crater on Mars on August 6, 2012, for a nominal mission of 23 (Earth) months. Although not required to, the rover may well continue to operate considerably longer. Some 20 months into the mission (March 2014), notably few residual software defects have been discovered in flight, despite the fact that the core flight software for this mission was substantially larger and more complex than for any previous mission. MSL is controlled by close to 3 million lines of C code, which is about five times more than the MER mission from 2003, and twenty times more than the Mars Pathfinder mission from 1996. Yet the number of software defects discovered in flight on MSL so far has been at least an order of magnitude smaller than for the earlier missions.

There are many factors that may have contributed to the robustness of the software. They include a restructuring of the code, the introduction of a new code review process, new certification courses for flight software developers, the use of multiple state-of-the-art static source code analyzers, and the use of more advanced software analysis techniques, such as logic model checking. Since flight code for interplanetary spacecraft legally falls under ITAR rules, the details of the code and its analysis cannot be published. For the CACM article we therefore used a small example from the literature to illustrate the use of a logic model checker to check subtle multi-threaded code for possible race conditions, data corruption, or deadlock. All these types of software flaws have been experienced on earlier Mars missions.
Because the MSL code contains about 120 parallel threads of execution, the concern about potential concurrency flaws was quite real.
Model checking, like any other type of formal analysis,
can be a time consuming and somewhat erudite process.
Our effort in the development of logic model checking techniques over the years
has therefore been focused on sufficiently simplifying this process so that
it can be accessible to any software developer, and performed as a routine
step that does not require any specialized knowledge.
Model checkers were first developed in the early eighties.
## Simplifying Model Checking

The tool we developed started life as the Pan verifier at Bell Labs
in 1981, and was first publicly released as Spin
in 1989.
Since then the tool has continued to evolve, and has become one of
the most widely used software verification tools for multi-threaded code.
Originally, to use a logic model checker, the user had to manually
construct a logic model of the algorithm to be checked in the specification
language that is supported by the tool. (In the case of Spin that language
is called ProMeLa, short for Process Meta Language.)
The construction of that formal model can be time consuming.
To simplify the process, much effort has been devoted to the development of a method for mechanically extracting formal models directly from implementation-level C source code. We built the Modex tool for this purpose. The model extractor was first applied to a commercial product in the logic verification of the call processing code for Lucent Technologies' PathStar switch, at Bell Laboratories in 1998 and 1999. Further refinements of the basic method have been made since then. With this approach the formal verification of a surprisingly broad set of concurrent algorithms can today be automated. The only user requirement in many of these cases is the construction of a small test driver that exercises the code in a meaningful way. The test driver is very much what a developer would construct as part of a normal unit test already, so a formal verification does not impose any additional work in these cases.
To illustrate the verification process for the CACM article,
we picked a concurrent algorithm from the literature with a known bug:
the double-sided queue implementation in C,
using a double-word compare-and-swap instruction as the key synchronization
primitive. The algorithm, first published in [2], had some flaws that were
described in [3].
The most important flaw allows the same element to be retrieved from the queue by two
different threads of execution, given a sufficiently demonic interleaving
scenario for readers and writers.
In [1] we tried to show how we could use the Modex model extractor to let the
Spin model checker discover that flaw, using just a single
test reader and writer process accessing the queue concurrently.
## Selecting an Example
The article with the sample verification was proofread and reviewed in the
normal way, before it was published. In the first six weeks after publication
the article was downloaded an impressive 41,000 times from the ACM website,
and of course the article appeared in print and was distributed to about
100,000 ACM members in February.
Many readers commented on the article, and at least
two of them (credit to: Yuliy Gerchikov and Thorkil Naur)
checked the example from the article in detail. What aroused their curiosity was
that the verification appeared to show that the flaw in the original
algorithm could be exposed with just one test reader and one test writer process.

One flaw, which in retrospect seems unlikely to have escaped attention for so long, is a mistake in the assertion that was used to catch the known flaw in the original algorithm. The sample writer process pushed elements into the queue from the left, and the reader process pulled them out of the queue from the same side. This means that the reader process could not really expect to retrieve the elements in any predictable order, and certainly not in the order in which the elements were sent. (If anything it would be the reverse order, if the writer process were allowed to complete all pushes before the reader started.) The test reader process should of course be pulling elements from the opposite side of the queue.

Correcting this flaw quickly leads the model checker to find a different assertion violation, still using just one test reader and one test writer process. More alert this time, we found on closer inspection that this new counter-example too was flawed, though in this case the flaw was caused by a (known) limitation of the model extractor. Synopsis: the model extractor expects all
conditional tests to be side-effect free. The source code for the original
DCAS algorithm did not satisfy this restriction.
A conditional such as

```c
if (DCAS(...)) { ... } else { ... }
```

must therefore be rewritten as

```c
int tmp;
tmp = DCAS(...);
if (tmp) { ... } else { ... }
```

The change is small, but without it the model checker cannot guarantee to preserve the logical soundness of the verification. And, sadly, the DCAS example illustrates this all too well.
## The Corrected Example

If we perform a new verification with the corrected test drivers and
the correct encoding of the DCAS routines, the invalid counter-example
disappears, and we obtain instead proof that in this case the algorithm
cannot fail, no matter how the parallel executions of the sample reader
and writer processes are interleaved.
The proof is rendered quickly.
To produce it, the model checker explores 1.2 million reachable
states in about three seconds of CPU time:
```
$ modex -run dcas.c
MODEX Version 2.3 - 05 December 2013
(Spin Version 6.2.7 -- 24 February 2014)
	+ Partial Order Reduction

Full statespace search for:
	never claim             - (none specified)
	assertion violations    +
	cycle checks            - (disabled by -DSAFETY)
	invalid end states      +

State-vector 528 byte, depth reached 1855, errors: 0
  1215743 states, stored
   794967 states, matched
  2010710 transitions (= stored+matched)
   591362 atomic steps
hash conflicts:       460 (resolved)

Stats on memory usage (in Megabytes):
  644.639  equivalent memory usage for states (stored*(State-vector + overhead))
  541.585  actual memory usage for states (compression: 84.01%)
           state-vector as stored = 439 byte + 28 byte overhead
 4096.000  memory used for hash table (-w29)
    0.534  memory used for DFS stack (-m10000)
    1.763  memory lost to fragmentation
 4636.671  total actual memory usage

pan: elapsed time 2.97 seconds
pan: rate 409341.08 states/second
```
We can extend the test drivers to check the correctness of the algorithm
when multiple readers and writers can access the queue, still sticking with
extracting the verification model from C source code alone.
We have to change the assertion then, since values can now
be retrieved from the queue by different readers.
We can perform this check by recording every receive in a global array
(with an atomic update), and verify that no value is received more than once.
We can also check that eventually all values sent are received,
by checking that the global array eventually records every value
sent as also received.
## Extending the Example

We define three new test processes, and the key assertion, as follows.

```c
#define MAX_N	3

char received[MAX_N];	// non-zero when the corresponding value is received

#define check_value(v)	assert(v >= 0 && v < MAX_N && received[v] == 0); received[v] = 1;

void
p1(void)
{	int rv;
	while (!RightHat) { /* wait */ }
	rv = popLeft();
	if (rv != EMPTY)
	{	Printf("p1_rv=%d\n", rv);
		check_value(rv);
	}
}

void
p2(void)
{	int rv;
	while (!RightHat) { /* wait */ }
	rv = popRight();
	if (rv != EMPTY)
	{	Printf("p2_rv=%d\n", rv);
		check_value(rv);
	}
}

void
p(void)
{	int i, v;
	initialize();
	for (i = 0; i < MAX_N; i++)
	{	v = pushLeft(i);
		if (v != OKAY)
		{	i--;	// retry
		}
	}
}
```

Note that the p1 process executes a single popLeft() call,
and the p2 process executes a single popRight() call.
We'll instantiate two copies of each, so that all these calls can be executed
in parallel.
In this way we create a total of four parallel attempts to retrieve elements
from the queue, without specifying the order in which these attempts are made
or how the execution of these calls is interleaved.
We use a single test driver process, p, to fill the queue.

To effect these changes we modify the guidance instructions to the model extractor as follows:

```
%F dcas.c
%X -e pushLeft
%X -e2 popRight
%X -e2 popLeft
%X -e initialize
%X -e dcas_malloc
%X -a2 p1
%X -a2 p2
%X -a p
%D
#include "dcas.h"
%O dcas.c
```

We need to instantiate two copies of the popLeft() and
popRight() functions to allow the calls (initiated by the
p1 and p2 processes) to be executed in parallel.
The rest is as before.
Clearly, with these changes the complexity of the verification
increases significantly. Enough that a quick run of a few seconds will no
longer suffice to render a verdict.
The best strategy in this case to still obtain a result in a reasonable
amount of time is to perform random sampling of the reachable
state space. This can be done with the Spin model checker by making use of
the swarm tool.

The swarm configuration file we can now use looks as follows.

```
# range
k	1  4	# optional: to restrict min and max nr of hash functions
# limits
d	25000	# optional: to restrict the max search depth

cpus	50	# nr available cpus (exact)
memory	1G	# max memory per run; M=megabytes, G=gigabytes
time	120m	# max time for all runs; h=hours, m=min, s=sec, d=days
hash	1.5	# hash-factor (estimate)
vector	640	# nr of bytes per state (estimate)
speed	200000	# nr states explored per second (estimate)
file	model	# file with the spin model to be verified

# compilation options (each line defines one complete search mode)
# each option currently must include -DBITSTATE and -DPUTPID

-DBITSTATE -DPUTPID dcas.c				# basic dfs
-DBITSTATE -DPUTPID -DREVERSE dcas.c			# reversed process ordering
-DBITSTATE -DPUTPID -DT_REVERSE dcas.c			# reversed transition ordering
-DBITSTATE -DPUTPID -DREVERSE -DT_REVERSE dcas.c	# both
-DBITSTATE -DPUTPID -DP_RAND -DT_RAND dcas.c		# same series with randomization
-DBITSTATE -DPUTPID -DP_RAND -DT_RAND dcas.c -DT_REVERSE
-DBITSTATE -DPUTPID -DP_RAND -DT_RAND dcas.c -DREVERSE
-DBITSTATE -DPUTPID -DP_RAND -DT_RAND dcas.c -DREVERSE -DT_REVERSE
-DBITSTATE -DPUTPID -DBCS dcas.c			# with bounded context switching
-DBITSTATE -DPUTPID -DBCS dcas.c -DREVERSE
-DBITSTATE -DPUTPID -DBCS dcas.c -DT_REVERSE
-DBITSTATE -DPUTPID -DBCS dcas.c -DREVERSE -DT_REVERSE
```

From this configuration file the swarm tool generates a script that will spawn 574 small searches on 50 cpu cores to hunt for any flaw in the DCAS algorithm, using no more than the upper bound of two hours of CPU time that is specified.
One of these searches produces a counter-example in which two processes (one
executing a popRight() call and one, pid 4, executing a popLeft() call)
succeed in retrieving the same element from the queue.
For the full trail (reproduced with the model checker command ./pan -C) see: trail.html. For a numbered listing of the full model as extracted from the C sources see: model.html.

Since the ./pan -C replay option in Spin provides us with a step by
step executable failure scenario as C code, we can add any additional detail
that can assist us in diagnosing the exact cause of the failure at this point.
We can, for instance, visualize the scenario in more detail by also capturing the
values of all pointer variables that play a role in the queue updates.
This is easily done by editing the pan.c model checker file, and adding printf statements
in the replay code for the relevant information.
We can then display the key states of the queue, and values of all pointers, after
each successful execution of a DCAS instruction in eight subsequent steps,
in the same format that was used for Figures 6.1 through 6.5 for the
(slightly different) failure scenario presented earlier in [3].
Note that this scenario was found by an undirected search of the logic model
checker and contains more steps than are strictly necessary to create the failure.
The first few steps in the sequence merely create the right conditions for
the fault to occur, by pushing several elements into the queue.

[Figures showing the queue contents and pointer values after each successful DCAS step are omitted here.]

Two elements, V0 and V1, are added first. The first element (V0) is then
popped from the queue, and next the third element, V2, is added.
In the remainder of the scenario, three elements are popped from the
queue, two of which turn out to be the same element, which triggers the
assertion violation.
We can take the verification a step further by
defining a slightly more general test harness for the DCAS algorithm.
We want to avoid assuming anything about the nature of
the flaw in the original algorithm from [2], and specifically how many
parallel processes or queue elements it might take to reveal it.
We will illustrate here how one can formalize the test process for this
purpose in the syntax of the model checker, which has a richer semantics
for defining logic verifications of this type.
This allows us to avoid saying anything about the specific order
in which push and pop operations are performed by each
process, or whether these operations are performed on the left
or on the right side of the queue.
## A Further Generalization

This of course defeats our original purpose in [1], which was to show that the entire verification process can in many cases be based on C code alone. We chose the wrong example to illustrate this.

We can ask the model checker to explore all possible ways in which N
processes can push and pop M different elements (carrying M
different integer values).
A test driver, specified in ProMeLa, that does so looks as follows.
```promela
active [N] proctype ptest()	// create N processes
{	byte tmp;	// a scratch variable used inside the inline functions
	short v;

	if
	:: _pid == 0 ->
		c_code {	// the initialize() routine
			now.Dummy = (Node *) dcas_malloc(sizeof(Node));
			now.Dummy->L = now.Dummy->R = now.Dummy;
			now.LeftHat  = now.Dummy;
			now.RightHat = now.Dummy;
		}
	:: c_expr { (now.RightHat) }
	fi;

	do
	:: atomic {
		to_send < M ->	// up to M elements
		v = to_send;
		to_send++;
	   };
	   if
	   :: pushLeft()	// sends v
	   :: pushRight()	// sends v
	   fi;
	   printf("sent %d\n", v)
	:: true ->
	   if
	   :: popLeft()		// rcv into v
	   :: popRight()	// rcv into v
	   fi;
	   atomic {
		if
		:: v >= 0 ->
			printf("rcv %d\n", v);
			assert(v >= 0 && v < M && val_rcv[v] == 0);
			val_rcv[v] = 1
		:: else
		fi
	   }
	od
}
```

The pushLeft(), pushRight(), popLeft(),
and popRight() routines are instrumented as inline functions here,
with the bodies of the inlines extracted by the Modex tool.
We've modified the extracted code slightly for this purpose to make sure
the routines retrieve the value to be pushed from variable v,
and the pop routines return the value popped in that same variable.
If we set N to 2 and M to 3, the verification explores a little over 15 million reachable states in about 15 seconds, and renders the verdict that no assertion violations are possible: i.e., the algorithm is guaranteed to work correctly under these conditions.

```
(Spin Version 6.2.7 -- 24 February 2014)
	+ Partial Order Reduction

Hash-Compact 4 search for:
	never claim             - (none specified)
	assertion violations    +
	cycle checks            - (disabled by -DSAFETY)
	invalid end states      +

State-vector 376 byte, depth reached 541, errors: 0
 15241309 states, stored
 12464615 states, matched
 27705924 transitions (= stored+matched)
   874965 atomic steps
hash conflicts:   3895132 (resolved)

Stats on memory usage (in Megabytes):
 5872.239  equivalent memory usage for states (stored*(State-vector + overhead))
  581.656  actual memory usage for states (compression: 9.91%)
           state-vector as stored = 12 byte + 28 byte overhead
  128.000  memory used for hash table (-w24)
    0.534  memory used for DFS stack (-m10000)
  710.175  total actual memory usage

unreached in proctype ptest
	model_n2_m3:43, state 12, "assert(0)"
	model_n2_m3:234, state 127, "-end-"
	(2 of 127 states)

pan: elapsed time 15.1 seconds
pan: rate 1012039.1 states/second
```

Note that we now explore more reachable states than we did before with the sample reader and writer processes, but the result we obtain here is stronger, because we have broadened the scope of the verification to include all four push and pop routines,
and not just two (pushRight() and popLeft()).
In the inline encodings of the push routines we used an assertion (assert(0))
to catch cases where the malloc routine could reach a state where the heap is
exhausted. We've sized the heap large enough for this case that this cannot
happen, and thus the assert(0) statements remain unreachable in the model.
If we repeat this verification while increasing the number of elements pushed into the queue to four (setting N to 2 and M to 4), the result remains unchanged. The complexity of that verification again increases significantly: the model checker must now explore about 248 million reachable states, and takes 5 minutes to render its verdict.

More interesting is the case where we increase N from 2 to 3, and keep M equal to 3. This verification too is considerably more complex, but the model checker can identify a counter-example reliably. A default bitstate verification run, for instance, successfully identifies the critical assertion violation after exploring 44 million reachable states, in about two minutes.

```
pan:1: assertion violated (((v>=0)&&(v<3))&&(val_rcv[v]==0)) (at depth 105)
pan: wrote model_n3_m3.trail

(Spin Version 6.2.7 -- 24 February 2014)
Warning: Search not completed
	+ Partial Order Reduction

Bit statespace search for:
	never claim             - (none specified)
	assertion violations    +
	acceptance cycles       - (not selected)
	invalid end states      +

State-vector 440 byte, depth reached 1040, errors: 1
 44335269 states, stored
 50831172 states, matched
 95166441 transitions (= stored+matched)
        0 atomic steps
hash factor: 3.02734 (best if > 100.)
bits set per state: 3 (-k3)

Stats on memory usage (in Megabytes):
 19787.699  equivalent memory usage for states (stored*(State-vector + overhead))
    16.000  memory used for hash array (-w27)
     0.076  memory used for bit stack
     0.534  memory used for DFS stack (-m10000)
    17.413  total actual memory usage

pan: elapsed time 116 seconds
pan: rate 383788.69 states/second
```

The counter-example is similar to the version shown above and the one discussed in [3]. We can also perform a further test and check if, for instance, reducing the number of elements pushed into the queue from 3 to 2 can still cause the algorithm to fail.
This turns out not to be the case (using N equal to 3 and M equal to 2):

```
(Spin Version 6.2.7 -- 24 February 2014)
	+ Partial Order Reduction

Hash-Compact 4 search for:
	never claim             - (none specified)
	assertion violations    +
	cycle checks            - (disabled by -DSAFETY)
	invalid end states      +

State-vector 440 byte, depth reached 2071, errors: 0
 5.2636082e+08 states, stored
 6.8725985e+08 states, matched
 1.2136207e+09 transitions (= stored+matched)
      48577981 atomic steps
hash conflicts: 5.6197891e+08 (resolved)

Stats on memory usage (in Megabytes):
 234925.141  equivalent memory usage for states (stored*(State-vector + overhead))
  20639.399  actual memory usage for states (compression: 8.79%)
             state-vector as stored = 13 byte + 28 byte overhead
   2048.000  memory used for hash table (-w28)
      0.534  memory used for DFS stack (-m10000)
  22688.046  total actual memory usage

unreached in proctype ptest
	model_n3_m2:43, state 12, "assert(0)"
	model_n3_m2:234, state 127, "-end-"
	(2 of 127 states)

pan: elapsed time 857 seconds
pan: rate 614505.49 states/second
```

This full verification takes about 15 minutes and explores 526 million reachable states. Other tests, varying the number of processes or queue elements, can be performed, for instance for N=4 and M=2, but they do not reveal any new types of failures. Increasing the number of processes above 3 increases the verification complexity to the point where only statistical sampling methods can be used to perform the verifications.

This illustrates the main difference between approaches based on logic model checking, as illustrated here (and as we attempted to illustrate in [1]), and those based on theorem proving techniques, as illustrated in [3]. Logic model checkers can excel at rapidly identifying subtle bugs in small to medium size test cases, while a verification based on theorem proving techniques can be significantly more time consuming, but when successful can produce results that hold for arbitrary problem sizes.
Our original objective for using the analysis of the DCAS algorithm as an example in [1] was to show two things:

- That the information required by a logic model checker for doing exhaustive proofs of correctness can in many cases be extracted mechanically from C source code, without requiring a user to hand-craft a verification model first.
- That the verification itself, in many cases, can be performed very quickly.

The required information can indeed be extracted mechanically from C sources, but it would be
a stretch to say that the final verification steps for this example can be
executed quickly.
In model checking applications we train ourselves to be highly suspicious of
results that fail to produce counter-example trails, seemingly showing that
an application is correct. This is almost always the right course, since
so few manually designed concurrent algorithms are free from subtle flaws.
As we learn in this example, though, it is good to be suspicious of
counter-examples as well.
## References

1. Holzmann, G., Mars Code, *Communications of the ACM*, Vol. 57, No. 2, Feb. 2014, pp. 64-73.
2. Detlefs, D.L., et al., Even Better DCAS-based Concurrent Deques. In *Distributed Algorithms*, LNCS Vol. 1914, M. Herlihy, Ed., Springer-Verlag, Heidelberg, Germany, 2000, pp. 59-73.
3. Doherty, S., *Modelling and Verifying Non-blocking Algorithms that Use Dynamically Allocated Memory*, Master's thesis, Victoria University, Wellington, New Zealand, 2004.
4. Lakatos, I., *Proofs and Refutations*, 1977.
March 9, 2014.