Lessons for all Software Endeavours

“Technologists have a responsibility to ensure that technology not only does what it should but it doesn’t do what it shouldn’t.”

I write this as the details of the cause of the Boeing 737 Max air crashes are just starting to appear in the media. I hope that those of us involved in technology will be stimulated by this tragedy to raise our level of professionalism.

Two planes crashed before the pattern was observed and the majority of the world’s 737 Max fleet was grounded. The Ethiopian Airlines flight eight days ago and the Lion Air aircraft in October both crashed within minutes of take-off.

It appears that a system that was designed to prevent a crash may have actually contributed to causing it.

There is a new safety system in these aircraft designed to prevent stalling (which can happen when a plane is pitched too steeply upwards). This safety system is called MCAS (Maneuvering Characteristics Augmentation System) and it can automatically push the nose of the plane downward, without any instruction from the pilot.

A single sensor in the nose of the aircraft informs MCAS of the current angle of the aircraft. What may have happened is that the sensor failed and gave incorrect signals to MCAS, which then affected the behaviour of the plane, with disastrous consequences.

According to this BBC report, the MCAS software is likely to be downgraded so that MCAS no longer has such a strong influence on the behaviour of the plane.

And there will be changes to people and process: “There will also be changes to the cockpit warning systems, the flight crew operating manual will be updated and there will be computer-based training for pilots.”

According to this article in the Seattle Times, some of the risks were known and perhaps there was some commercial pressure to fast-track airworthiness approvals. Perhaps a management failing?

What can we learn from this?

Exploring the Root Cause

The successful use of any software involves a combination of people, process, and technology. A failing such as this may well have arisen from a combination of mistakes within each of these areas.

People: Were the pilots trained to detect and handle the circumstance of a failing sensor misinforming MCAS of the angle of the plane? Who knew that a failing sensor might cause this problem? What could/did they do about it? Was the risk of a failure scenario known but dismissed as “unlikely to happen”?

Process: Were the pilots’ operating instructions adapted to include sufficient checks of the correct behaviour of the MCAS system? Were pilots informed, or required to know, that this could happen and were they advised on how to handle it?

Complexity: As the technology that we rely upon becomes ever more sophisticated, it also becomes more complex. In complex systems we rely on just a few people who deeply understand the end-to-end capability of the technology to build it correctly. Managers and users may not have, nor make, the time to fully understand the complexities of the software that they use or are responsible for. The software may normally perform as it should, but complex systems are at a greater risk of not performing correctly in all circumstances.

When did you last hear “we should take a risk-based approach to testing” as a way of shortening schedules?

Technology: This is the area most likely to be under scrutiny. Was the technology to blame? Why was a system dependent on just one sensor allowed to risk the course of the plane? Could a sensor failure be detected? Should there have been more than one sensor, just in case one failed? Why was the logic of the software allowed to rely on a single input?
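To make the redundancy questions above concrete, here is a minimal sketch of how software might cross-check several sensor readings before acting on them. It is purely illustrative and is not how any real aircraft system works: the function name cross_checked_angle, the tolerance_deg parameter and the 5 degree default are all assumptions made up for this example.

```python
from statistics import median


def cross_checked_angle(readings, tolerance_deg=5.0):
    """Return (angle, trusted) for a list of redundant angle readings.

    Hypothetical illustration only. The value is trusted only when there
    are at least two readings and a strict majority of them lie within
    tolerance_deg of the median reading.
    """
    if not readings:
        return (None, False)
    if len(readings) < 2:
        # A single input cannot be cross-checked, so it is reported
        # but never trusted for automatic action.
        return (readings[0], False)

    mid = median(readings)
    agreeing = [r for r in readings if abs(r - mid) <= tolerance_deg]
    trusted = 2 * len(agreeing) > len(readings)
    return (mid, trusted)


# One sensor alone is never trusted, however plausible its value.
print(cross_checked_angle([74.5]))
# With three sensors, one faulty reading is outvoted by the other two.
print(cross_checked_angle([10.0, 11.0, 74.5]))
# With two sensors that disagree, the failure is detected but not resolved.
print(cross_checked_angle([10.0, 74.5]))
```

The design point is that the caller receives an explicit "trusted" flag and must decide what to do when it is false, for example alerting the operator instead of acting automatically.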

If the scenario that happened had been envisaged by the designers of the aircraft, they might have proposed that the pilot could override the system in such a circumstance, in which case we might have a failing in people and process, not technology.

The technology might have behaved incorrectly, i.e. not to specification, in which case there would have been a failure in implementation and testing.

Alternatively, the technology might have behaved to specification but the specification was wrong. If this is the case one must look again at the requirements. Were the requirements correct?
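The difference between these two failures can be shown with a toy example. The sketch below tests a made-up function both for what it should do and for what it should not do. Everything in it is hypothetical: the name nose_down_correction, the threshold, the gain and the MAX_CORRECTION_DEG limit are invented for illustration and do not describe any real flight control system.

```python
# Hypothetical upper limit on the automatic correction, in degrees.
MAX_CORRECTION_DEG = 2.0


def nose_down_correction(angle_deg, threshold_deg=15.0, gain=0.5):
    """Return an illustrative nose-down correction in degrees.

    No correction is applied at or below the threshold. Above it, the
    correction grows with the excess angle but is capped at
    MAX_CORRECTION_DEG.
    """
    if angle_deg <= threshold_deg:
        return 0.0
    return min(gain * (angle_deg - threshold_deg), MAX_CORRECTION_DEG)


def check_does_what_it_should():
    # Positive requirement: a steep angle produces a correction.
    assert nose_down_correction(17.0) == 1.0


def check_does_not_do_what_it_should_not():
    # Negative requirements: no action in normal flight, and bounded
    # authority even when the input is implausibly large.
    assert nose_down_correction(10.0) == 0.0
    assert nose_down_correction(74.5) <= MAX_CORRECTION_DEG


check_does_what_it_should()
check_does_not_do_what_it_should_not()
```

If the cap were missing from the specification, the first check would still pass and the code would be "to specification". Only a requirement that states what the system must never do, together with a test for it, would expose the gap.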

Impact

The impact on the families who have lost loved ones cannot be measured. In addition to losing their family member they will also suffer direct and indirect financial consequences. Then there is the cost to Boeing: delayed sales and possible compensation to the airlines for lost income while the majority of the 737 Max fleet is grounded. And then there is the cost to the entire airline industry and other related industries that depend on air travel. The impact of this situation is in the order of many billions of dollars. And it could be down to a poor specification/requirement, a mistake in coding/testing, a mistake in management, or a combination of these factors.

Conclusion

In an organisation with such a strong reputation for quality as Boeing, it is unlikely that a single discipline would make such a mistake. The root cause of this disaster is likely to lie in more than one of the above possible areas.

These events should be a stark reminder to anyone involved in technology provision that their activities can have grave consequences. As we become more dependent on greater sophistication from systems, the need for professionalism becomes ever more important.

Only by applying excellent quality standards to all aspects of management, specification, design, testing and implementation (which includes people and process) can technology be fully trustworthy.

Technologists have a professional responsibility to ensure that technology not only does what it should but it doesn’t do what it shouldn’t. This can only be achieved through a rigorous approach to quality.