ABORT RETRY FAIL

Article from: New Scientist, No. 2265, 18 November 2000, pp. 41-43.

4 June 1996, Kourou, French Guiana. The maiden flight of the European Space Agency's Ariane 5. Just 40 seconds after take-off, a software fault causes it to veer off-course, and the rocket blows up. The $450 million worth of satellites on board are destroyed.

10 November 2000, London, England. A student is tearing her hair out in frustration. She wants to finish her essay quickly and meet her friends at the bar. Her computer, however, freezes up every time she presses "Print".

COMPUTERS CRASH. We all know it, we've all experienced it, and somewhere around the globe a computer is probably crashing right now. Sometimes the result is huge financial loss, sometimes it's just a missed appointment. Occasionally, when problems occur in systems monitoring nuclear power stations or air traffic control centres, lives are at risk.

But computer-centred disasters could soon be a thing of the past. Governments are beginning to realise that since computers are behind almost every aspect of life in a modern economy, it might just be time to make them all work properly. Measures adopted by big-budget, high-profile projects such as NASA's space shuttle programme could soon find their way to the humble PC sitting on your desk.

More than a quarter of the $90 million Information Technology Initiative announced by the US National Science Foundation in September is aimed at making computers and their software not faster, not more powerful, but simply reliable. In July, the British government gave £7 million to a collaboration which, in addition to developing crash-free hardware and software, will also research the key cause of computer crashes: human error. "This is the first time anybody has tackled computer dependability on such a broad basis. We're really cutting new turf here," says Cliff Jones of Newcastle University, who is leading the British collaboration.

The first line of defence in computer reliability protects against engineering faults in the hardware by providing back-up power supplies, disc drives and server connections. Mechanical failures are always possible, but these days they are rare. Modern systems can be built to tolerate shocks and extremes of temperature. And as an added layer of security, critical systems like those on the space shuttle will often run three or more sets of hardware in parallel.

With that solution firmly in place, the next step towards dependability is rather more difficult: developing the perfect piece of software. Finding and removing errors, or "bugs", from computer programs is time-consuming and costly. According to Tony Hoare, a researcher at Microsoft's Cambridge laboratory, up to three-quarters of the $400 billion spent every year employing computer programmers in the US goes on debugging. Even then, it is rare that a manual check will explore every conceivable action a program may perform.

Modern software's main problem is complexity. Computer programs are often elaborate million-line mathematical statements. Even the best programmers will make a handful of errors in every thousand lines of code they write. And if you change one line in any way, then you have changed the statement: the maths may not make sense any more. When programs don't work because of an inadvertent error, the problem can be almost impossible to find.

Things are now getting even more complicated because a great deal of modern software runs on networked computers. "That's an extra level of complexity because you can't predict the state of the other applications elsewhere in the network," says Gerard Holzmann, a software researcher at Bell Labs Research in Murray Hill, New Jersey.

In an effort to overcome problems like these, the researchers in Bell Labs' Computer Principles Research department have been devising ways to use software to check up on the programmers. Holzmann is currently testing a program called AX (the name stands for "Automata eXtractor") that reads the code of the program under test and uses it to create a "virtual world" for the program to work in. The software for controlling a robot arm, for example, would think that it was moving the arm and getting signals back from the arm's sensors. But in reality it would simply be connected to another computer running the AX program.

Once this virtual hardware is in place, another program, called Spin, comes into play. This does the back-breaking work of meticulously checking that the program being tested works the way it should. Spin has actually been around for a while (Holzmann released it in 1991), but before AX's arrival, the virtual environment that AX now creates had to be built by a human programmer, an option open only to those with plenty of time and money. Now that AX is on the scene, Spin can put a program through its paces quickly and cheaply. Because AX creates the virtual world from the program code, a problem with the code creates a problem with the world.

Spin looks for occasions where the program doesn't do what it should: when it finds a problem it can trace the fault back, through AX, to the error in the original program. When the programmer has put the fault right, AX creates a corrected world, and Spin checks it again.
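
To get a feel for what a model checker like Spin actually does, here is a toy sketch in Python (illustrative only: the real Spin analyses models written in its own Promela language and is vastly more sophisticated). It exhaustively explores every reachable state of a tiny two-process program with a classic locking bug, and reports the state in which both processes have entered the critical section at once:

```python
from collections import deque

# Each process: 0 = idle, 1 = saw the lock free, 2 = in critical section.
# The bug: checking the lock and taking it are two separate steps, so
# both processes can pass the check before either one takes the lock.

def successors(state):
    """All states reachable in one step from the given state."""
    pcs, lock = state
    for i in (0, 1):
        pc = list(pcs)
        if pcs[i] == 0 and not lock:       # read: lock looks free
            pc[i] = 1
            yield (tuple(pc), lock)
        elif pcs[i] == 1:                  # write: take lock, enter CS
            pc[i] = 2
            yield (tuple(pc), True)
        elif pcs[i] == 2:                  # leave CS, release lock
            pc[i] = 0
            yield (tuple(pc), False)

def check(initial, safe):
    """Breadth-first search over every reachable state; return the first
    state that violates the safety property, or None if none exists."""
    seen, frontier = {initial}, deque([initial])
    while frontier:
        s = frontier.popleft()
        if not safe(s):
            return s                       # counterexample found
        for t in successors(s):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return None

# Safety property: the two processes are never both in the critical section.
bad = check(((0, 0), False), lambda s: s[0] != (2, 2))
print(bad)  # ((2, 2), True): both processes in the critical section at once
```

Because the state space here is tiny, a simple breadth-first search covers it completely; real model checkers spend most of their ingenuity taming the enormous state spaces of networked programs.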

Spin has generated particular interest among space-exploration researchers. If bugs slip through their manual software checking process, as happened with the Pathfinder mission to Mars, it's tricky to put things right when software is running on a computer millions of kilometres out in space.

After the Pathfinder mission, Spin checked out the failed software. The results showed that it could have found many of the errors before launch. Spin has since been used to check the control software for the Cassini mission to Saturn and some of the programs for Deep Space 1. "We found software flaws in both cases," says Holzmann. Peter Gluck of NASA's Jet Propulsion Laboratory is impressed. "Model checkers like Spin will increase the robustness of the software, save money and time spent testing the software, and hopefully make missions run more smoothly," he says.

AX was initially produced to help test PathStar, a Lucent Technologies server that routes voice and Internet calls across communications networks. Now Holzmann is working to make it available for all commercial software programs. It takes AX a few seconds to convert several thousand lines of source code into a simulation, but as computers become more powerful, Holzmann expects the verification programs to run faster, allowing them to deal with ever more complex programs. Eventually, programs will be checked as the programmer enters the code.

Programs like Spin and AX only solve part of the whole problem, though. They will tell you whether the program correctly performs the operations it's been programmed to perform. But what if the programmer forgets to tell it to do the right thing at the right time? For example, the robot arm might grab an object and take it where it's supposed to go, but if the programmer forgets to include code telling it to relax its grip once that job is done, then chaos could ensue when the arm begins its next task. Leave out an important instruction like this, and the software might seem to work well when it still contains a major fault.
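
A hypothetical sketch of that robot-arm scenario, in Python, shows how such an omission slips past ordinary testing: every operation the program was told to perform succeeds, yet a simple check on the arm's state between tasks reveals the forgotten instruction:

```python
class Arm:
    """Hypothetical robot arm with a two-state gripper."""
    def __init__(self):
        self.holding = False

    def grab(self):
        self.holding = True

    def release(self):
        self.holding = False

def move_object(arm):
    arm.grab()
    # ... carry the object to its destination (succeeds every time) ...
    # BUG: the programmer forgot to call arm.release() here.

arm = Arm()
move_object(arm)

# The requirement the programmer never wrote into the code:
# between tasks, the gripper must be open.
fault = arm.holding
print("fault detected" if fault else "ok")  # prints "fault detected"
```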

Eggs and baskets

Bev Littlewood, a professor of software engineering at City University in London, who is a partner in Jones's project, is investigating ways round some of the errors that forgetful programmers make. A programmer working alone is almost bound to make a mistake. So you give the problem to a team of programmers, or better still, to more than one team and let them write different programs that can be run simultaneously. "It's the intellectual equivalent of not putting all your eggs in one basket," says Littlewood. The different programs working in parallel should provide a foolproof check for each other. When they don't agree on a result, the odd one out can be ignored.
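
The eggs-in-baskets idea can be sketched in a few lines of Python (a hypothetical example: three square-root routines stand in for programs written by three independent teams, one of them deliberately buggy, and a voter picks the majority answer):

```python
import math
from collections import Counter

def version_a(x):
    return round(math.sqrt(x), 6)             # team A: library call

def version_b(x):
    guess = x / 2 or 1.0                      # team B: Newton's method
    for _ in range(60):
        guess = (guess + x / guess) / 2
    return round(guess, 6)

def version_c(x):
    return round(math.sqrt(x + 1), 6)         # team C: a buggy version

def vote(x, versions=(version_a, version_b, version_c)):
    """Run every version and return the majority answer."""
    results = [v(x) for v in versions]
    answer, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority agreement between versions")
    return answer                             # the odd one out is ignored

print(vote(25.0))  # prints 5.0
```

Here versions A and B agree on 5.0, so team C's wrong answer is simply outvoted; only if no two versions agree does the system have to admit failure.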

But there is still a pitfall with this approach, Littlewood has found. His research revealed that even when separate teams of programmers are involved, they are likely to make their mistakes in the same parts of the program - usually a particularly complicated part. Littlewood thinks the answer to this lies in "directed diversity", where the programming teams approach the problem in different ways. They might use different mathematical tricks, or different programming languages. In this way, the strengths of C++ for dealing with one kind of problem can cover the weaknesses of Pascal in that area, and Pascal might come up trumps where C++ fails.

So the hardware's safe and the software's damn-near perfect. But there's still a final unreliable factor in the system: the flesh and bones slumped wearily in front of the flickering monitor. That's why Jones's project involves getting psychologists and cognitive scientists to teach programmers the best way to design human-computer interfaces.

Jones hopes that, as the best ways forward become clear, these solutions will start to filter through into commercial software systems. His project's researchers are collaborating with Britain's National Air Traffic Services, some major financial institutions, and other dependability researchers worldwide. Once bug-free software starts to come through, and people realise that they don't have to put up with programs that crash all the time, Jones believes the whole market will change rapidly.

When this utopia arrives, however, be prepared for the drawbacks. There'll be no more early finishes when the office computers go down. There'll be no more coffee breaks when software failure halts the production lines. There'll be no excuse for missed deadlines or undelivered orders. The crash-free computer could make your life hell.