This is the fifth post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This was the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).
The first post was a reflection, based on personal experience, on the corporate preference for building bureaucracy rather than dealing with complex reality, “Facing the dragons part 1 – corporate bureaucracies”. The second post was about the nature of complex systems, “part 2 – crucial features of complex systems”. The third followed on from part 2, and talked about the impossibility of knowing exactly how complex socio-technical systems will behave with the result that it is impossible to specify them precisely, “part 3 – I don’t know what’s going on”.
The fourth post “part 4 – a brief history of accident models” looks at accident models, i.e. the way that safety experts mentally frame accidents when they try to work out what caused them. This post looks at weaknesses of of the way that we have traditionally investigated accidents and failures, assuming neat linearity with clear cause and effect. In particular, our use of root cause analysis, and willingness to blame people for accidents is hard to justify.
The limitations of root cause analysis
Once you accept that complex systems can’t have clear and neat links between causes and effects then the idea of root cause analysis becomes impossible to sustain. “Fishbone” cause and effect diagrams (like those used in Six Sigma) illustrate traditional thinking, that it is possible to track back from an adverse event to find a root cause that was both necessary and sufficient to bring it about.
The assumption of linearity with tidy causes and effects is no more than wishful thinking. Like the Domino Model (see “part 4 – a brief history of accident models”) it encourages people to think there is a single cause, and to stop looking when they’ve found it. It doesn’t even offer the insight of the Swiss Cheese Model (also see part 4) that there can be multiple contributory causes, all of them necessary but none of them sufficient to produce an accident. That is a key idea. When complex systems go wrong there is rarely a single cause; causes are necessary, but not sufficient.
Here is a more realistic depiction of what a complex socio-technical system. It is a representation of the operations control system for an airline. The specifics don’t matter. It is simply a good illustration of how messy a real, complex system looks when we try to depict it.
This is actually very similar to the insurance finance applications diagram I drew up for Y2K (see “part 1 – corporate bureaucracies”). There was no neat linearity. My diagram looked just like this, with a similar number of nodes, or systems most of which had multiple two-way interfaces with others. And that was just at the level of applications. There was some intimidating complexity within these systems.
As there is no single cause of failure the search for a root cause can be counter-productive. There are always flaws, bugs, problems, deviances from process, variations. So you can always fix on something that has gone wrong. But it’s not really a meaningful single cause. It’s arbitrary.
The root cause is just where you decide to stop looking. The cause is not something you discover. It’s something you choose and construct. The search for a root cause can mean attention will focus on something that is not inherently dangerous, something that had previously “failed” repeatedly but without any accident. The response might prevent that particular failure and therefore ensure there’s no recurrence of an identical accident. However, introducing a change, even if it’s a fix, to one part of a complex system affects the system in unpredictable ways. The change therefore creates new possibilities for failure that are unknown, even unknowable.
It’s always been hard, even counter-intuitive, to accept that we can have accidents & disasters without any new failure of a component, or even without any technical failure that investigators can identify and without external factors interfering with the system and its operators. We can still have air crashes for which no cause is ever found. The pressure to find an answer, any plausible answer, means there has always been an overwhelming temptation to fix the blame on people, on human error.
Human error – it’s the result of a problem, not the cause
If there’s an accident you can always find someone who screwed up, or who didn’t follow the rules, the standard, or the official process. One problem with that is the same applies when everything goes well. Something that troubled me in audit was realising that every project had problems, every application had bugs when it went live, and there were always deviations from the standards. But the reason smart people were deviating wasn’t that they were irresponsible. They were doing what they had to do to deliver the project. Variation was a sign of success as much as failure. Beating people up didn’t tell us anything useful, and it was appallingly unfair.
One of the rewarding aspects of working as an IT auditor was conducting post-implementation reviews and being able to defend developers who were being blamed unfairly for problem projects. The business would give them impossible jobs, complacently assuming the developers would pick up all the flak for the inevitable problems. When auditors, like me, called them out for being cynical and irresponsible they hated it. They used to say it was because I had a developer background and was angling for my next job. I didn’t care because I was right. Working in a good audit department requires you to build up a thick skin, and some healthy arrogance.
There always was some deviation from standards, and the tougher the challenge the more obvious they would be, but these allegedly deviant developers were the only reason anything was delivered at all, albeit by cutting a few corners.
It’s an ethical issue. Saying the cause of an accident is that people screwed up is opting for an easy answer that doesn’t offer any useful insights for the future and just pushes problems down the line.
Sidney Dekker used a colourful analogy. Dumping the blame on an individual after an accident is “peeing in your pants management” (PDF, opens in new tab).
“You feel relieved, but only for a short while… you start to feel cold and clammy and nasty. And you start stinking. And, oh by the way, you look like a fool.”
Putting the blame on human error doesn’t just stink. It obscures the deeper reasons for failure. It is the result of a problem, not the cause. It also encourages organisations to push for greater automation, in the vain hope that will produce greater safety and predictability, and fewer accidents.
The ironies of automation
An important part of the motivation to automate systems is that humans are seen as unreliable & inefficient. So they are replaced by automation, but that leaves the humans with jobs that are even more complex and even more vulnerable to errors. The attempt to remove errors creates fresh possibilities for even worse errors. As Lisanne Bainbridge wrote in a 1983 paper “The ironies of automation”;
“The more advanced a control system is… the more crucial may be the contribution of the human operator.”
There are all sorts of twists to this. Automation can mean the technology does all the work and operators have to watch a machine that’s in a steady-state, with nothing to respond to. That means they can lose attention & not intervene when they need to. If intervention is required the danger is that vital alerts will be lost if the system is throwing too much information at operators. There is a difficult balance to be struck between denying operators feedback, and thus lulling them into a sense that everything is fine, and swamping them with information. Further, if the technology is doing deeply complicated processing, are the operators really equipped to intervene? Will the system allow operators to override? Bainbridge makes the further point;
“The designer who tries to eliminate the operator still leaves the operator to do the tasks which the designer cannot think how to automate.”
This is a vital point. Systems are becoming more complex and the tasks left to the humans become ever more demanding. System designers have only a very limited understanding of what people will do with their systems. They don’t know. The only certainty is that people will respond and do things that are hard, or impossible, to predict. That is bound to deviate from formal processes, which have been defined in advance, but these deviations, or variations, will be necessary to make the systems work.
Acting on the assumption that these deviations are necessarily errors and “the cause” when a complex socio-technical system fails is ethically wrong. However, there is a further twist to the problem, summed up by the Law of Stretched Systems.
Lawrence Hirschhorn’s Law of Stretched Systems is similar to the Fundamental Law of Traffic Congestion. New roads create more demand to use them, so new roads generate more traffic. Likewise, improvements to systems result in demands that the system, and the people, must do more. Hirschhorn seems to have come up with the law informally, but it has been popularised by the safety critical community, especially by David Woods and Richard Cook.
“Every system operates always at its capacity. As soon as there is some improvement, some new technology, we stretch it.”
And the corollary, furnished by Woods and Cook.
“Under resource pressure, the benefits of change are taken in increased productivity, pushing the system back to the edge of the performance envelope.”
Every change and improvement merely adds to the stress that operators are coping with. The obvious response is to place more emphasis on ergonomics and human factors, to try and ensure that the systems are tailored to the users’ needs and as easy to use as possible. That might be important, but it hardly resolved the problem. These improvements are themselves subject to the Law of Stretched Systems.
This was all first noticed in the 1990s after the First Gulf War. The US Army hadn’t been in serious combat for 18 years. Technology had advanced massively. Throughout the 1980s the army reorganised, putting more emphasis on technology and training. The intention was that the technology should ease the strain on users, reduce fatigue and be as simple to operate as possible. It didn’t pan out that way when the new army went to war. Anthony H. Cordesman and Abraham Wagner analysed in depth the lessons of the conflict. They were particularly interested in how the technology had been used.
“Virtually every advance in ergonomics was exploited to ask military personnel to do more, do it faster, and do it in more complex ways… New tactics and technology simply result in altering the pattern of human stress to achieve a new intensity and tempo of combat.”
Improvements in technology create greater demands on the technology – and the people who operate it. Competitive pressures push companies towards the limits of the system. If you introduce an enhancement to ease the strain on users then managers, or senior officers, will insist on exploiting the change. Complex socio-technical systems always operate at the limits.
This applies not only to soldiers operating high tech equipment. It applies also to the ordinary infantry soldier. In 1860 the British army was worried that troops had to carry 27kg into combat (PDF, opens in new tab). The load has now risen to 58kg. US soldiers have to carry almost 9kg of batteries alone. The Taliban called NATO troops “donkeys”.
These issues don’t apply only to the military. They’ve prompted a huge amount of new thinking in safety critical industries, in particular healthcare and air transport.
The overdose – system behaviour is not explained by the behaviour of its component technology
Remember the traditional argument that any system that was not determimistic was inherently buggy and badly designed? See “part 2 – crucial features of complex systems”.
In reality that applies only to individual components, and even then complexity & thus bugginess can be inescapable. When you’re looking at the whole socio-technical system it just doesn’t stand up.
Introducing new controls, alerts and warnings doesn’t just increase the complexity of the technology as I mentioned earlier with the MIG jet designers (see part 4). These new features add to the burden on the people. Alerts and error message can swamp users of complex systems and they miss the information they really need to know.
I can’t recommend strongly enough the story told by Bob Wachter in “The overdose: harm in a wired hospital”.
A patient at a hospital in California received an overdose of 38½ times the correct amount. Investigation showed that the technology worked fine. All the individual systems and components performed as designed. They flagged up potential errors before they happened. So someone obviously screwed up. That would have been the traditional verdict. However, the hospital allowed Wachter to interview everyone involved in each of the steps. He observed how the systems were used in real conditions, not in a demonstration or test environment. Over five articles he told a compelling story that will force any fair reader to admit “yes, I’d have probably made the same error in those circumstances”.
Happily the patient survived the overdose. The hospital staff involved were not disciplined and were allowed to return to work. The hospital had to think long and hard about how it would try to prevent such mistakes recurring. The uncomfortable truth they had to confront was that there were no simple answers. Blaming human error was a cop out. Adding more alerts would compound the problems staff were already facing; one of the causes of the mistake was the volume of alerts swamping staff making it hard, or impossible, to sift out the vital warnings from the important and the merely useful.
One of the hard lessons was that focussing on making individual components more reliable had harmed the overall system. The story is an important illustration of the maxim in the safety critical community that trying to make systems safer can make them less safe.
Some system changes were required and made, but the hospital realised that the deeper problem was organisational and cultural. They made the brave decision to allow Wachter to publicise his investigation and his series of articles is well worth reading.
The response of the safety critical community to such problems and the necessary trade offs that a practical response requires, is intriguing with important lessons for software testers. I shall turn to this in my next post, “part 6 – Safety II, a new way of looking at safety”.