This is the fourth post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This was the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).
The first post was a reflection, based on personal experience, on the corporate preference for building bureaucracy rather than dealing with complex reality, “The dragons of the unknown; part 1 – corporate bureaucracies”. The second post was about the nature of complex systems, “part 2 – crucial features of complex systems”. The third followed on from part 2, and talked about the impossibility of knowing exactly how complex socio-technical systems will behave, with the result that it is impossible to specify them precisely, “part 3 – I don’t know what’s going on”.
This post looks at accident models, i.e. the way that safety experts mentally frame accidents when they try to work out what caused them.
Why do accidents happen?
I want to take you back to the part of the world I come from, the east of Scotland. The Tay Bridge is 3.5 km long, the longest railway bridge in the United Kingdom. It’s the second railway bridge over the Tay. The first was opened in 1878 and came down in a storm in 1879, taking a train with it and killing everyone on board.
The stumps of the old bridge were left in place because of concern that removing them would disturb the riverbed. I always felt they were there as a lesson for later generations. Children in that part of Scotland can’t miss these stumps. I remember being bemused when I learned about the disaster. “Mummy, Daddy, what are those things beside the bridge? What…? Why…? How…?” So bridges could fall down. Adults could screw up. Things could go badly wrong. There might not be a happy ending. It was an important lesson in how the world worked for a small child.
Accident investigations are difficult and complex even for something like a bridge, which might appear, at first sight, to be straightforward in concept and function. The various factors that featured in the inquiry report for the Tay Bridge disaster included the bridge design, manufacture of components, construction, maintenance, previous damage, wind speed, train speed and the state of the riverbed.
These factors obviously had an impact on each other. That’s confusing enough, and it’s far worse for complex socio-technical systems. You could argue that a bridge is either safe or unsafe, usable or dangerous. It’s one or the other. There might be argument about where you would draw the line, and about the context in which the bridge might be safe, but most people would be comfortable with the idea of a line. Safety experts call that idea of a line separating the unbroken from the broken the bimodal principle (not to be confused with Gartner’s bimodal IT management); a system is either working or it is broken.
Thinking in bimodal terms becomes pointless when you are working with systems that run in a constantly flawed state, one of partial failure, when no-one knows how these systems work or even exactly how they should work. This is all increasingly recognised. But when things go wrong and we try to make sense of them there is a huge temptation to arrange our thinking in a way with which we are comfortable. We fall back on mental models that seem to make sense of complexity, however far removed they are from reality. These are the envisioned worlds I mentioned in part 1.
We home in on factors that are politically convenient, the ones that will be most acceptable to the organisation’s culture. We can see this, and also how thinking has developed, by looking at the history of the conceptual models that have been used for accident investigations.
Heinrich’s Domino Model (1931)
The Domino Model was a traditional and very influential way to help safety experts make sense of accidents. Accidents happened because one factor kicked into another, and so on down the line of dominos, as illustrated by Heinrich’s figure 3. Problems with the organisation or environment would lead to people making mistakes and doing something dangerous, which would lead to accidents and injury. It assumed neat linearity and causation. Its attraction was that it appealed to management. Take a look at the next two diagrams in the sequence, figures 4 and 5.
The model explicitly states that taking out the unsafe act will stop an accident. It encouraged investigators to look for a single cause. That was immensely attractive because it kept attention away from any mistakes the management might have made in screwing up the organisation. The chain of collapsing dominos is broken by removing the unsafe act and you don’t get an accident.
The model is consistent with beating up the workers who do the real work. But blaming the workers was only part of the problem with the Domino Model. It was nonsense to think that you could stop accidents by removing unsafe acts, variations from process, or mistakes from the chain. It didn’t have any empirical, theoretical or scientific basis. It was completely inappropriate for complex systems. Thinking in terms of a chain of events was quite simply wrong when analysing these systems. Linearity, neat causation and decomposing problems into constituent parts for separate analysis don’t work.
Despite these fundamental flaws the Domino Model was popular. Or rather, it was popular because of its flaws. It told managers what they wanted to hear. It helped organisations make sense of something they would otherwise have been unable to understand. Accepting that they were dealing with incomprehensible systems was too much to ask.
Swiss Cheese Model (barriers)
James Reason’s Swiss Cheese Model was the next step. It was an advance, but a limited one. The model did recognise that problems or mistakes wouldn’t necessarily lead to an accident. That would happen only if a series of them lined up. You can therefore stop the accident recurring by inserting a new barrier. However, the model is still based on the idea of linearity and of a sequence of cause and effect, and also the idea that you can and should decompose problems for analysis. This is a dangerously limited way of looking at what goes wrong in complex socio-technical systems, and the danger is very real with safety critical systems.
Of course, there is nothing inherently wrong with analysing systems and problems by decomposing them or relying on an assumption of cause and effect. These both have impeccably respectable histories in science and philosophy. Reducing problems to their component parts has its intellectual roots in the work of Rene Descartes. This approach implies that you can understand the behaviour of a system by looking at the behaviour of its components. Descartes’ approach (the Cartesian) fits neatly with a Newtonian scientific worldview, which holds that it is possible to identify definite causes and effects for everything that happens.
If you want to understand how a machine works and why it is broken, then these approaches are obviously valid. They don’t work when you are dealing with a complex socio-technical system. The whole is quite different from the sum of its parts. Thinking of linear flows is misleading when the different elements of a system are constantly influencing each other and adapting to feedback. Complex systems have unpredictable, emergent properties, and safety is an emergent outcome of complex socio-technical systems.
All designs are a compromise
Something that safety experts are keenly aware of is that all designs are compromises. Adding a new barrier, as envisaged by the Swiss Cheese Model, to try and close off the possibility of an accident can be counter-productive. Introducing a change, even if it’s a fix, to one part of a complex system affects the whole system in unpredictable and possibly harmful ways. The change creates new possibilities for failure that are unknown, even unknowable.
It’s not a question of regression testing. It’s bigger and deeper than that. The danger is that we create new pathways to failure. The changes might initially seem to work, to be safe, but they can have damaging results further down the line as the system adapts and evolves, as people push the system to the edges.
There’s a second danger. New alerts or controls increase the complexity with which the user has to cope. That was a problem I now recognise with our approach as auditors. We didn’t think through the implications carefully enough. If you keep throwing in fixes, controls and alerts then the user will miss the ones they really need to act on. That reduces the effectiveness, the quality and ultimately the safety of the system. I’ll come back to that later. This is an important paradox. Trying to make a system more reliable and safer can make it more dangerous and less reliable.
The designers of the Soviet Union’s MiG-29 jet fighter observed, “the safest part is the one we could leave off” (according to Sidney Dekker in his book “The Field Guide to Understanding Human Error”).
Drift into failure
A friend once commented that she could always spot one of my designs. They reflected a very pessimistic view of the world. I couldn’t know how things would go wrong, I just knew they would and my experience had taught me where to be wary. Working in IT audit made me very cautious. Not only had I completely bought into Murphy’s Law, “anything that can go wrong will go wrong”, I had my own variant: “and it will go wrong in ways I couldn’t have imagined”.
William Langewiesche is a writer and former commercial pilot who has written extensively on aviation. He provided an interesting and insightful correction to Murphy, and also to me (from his book “Inside the Sky”).
“What can go wrong usually goes right”.
There are two aspects to this. Firstly, as I have already discussed, complex socio-technical systems are always flawed. They run with problems, bugs, variations from official process, and variations in the way people behave. Despite all the problems under the surface everything seems to go fine, till one day it all goes wrong.
The second important insight is that you can have an accident even if no individual part of the system has gone wrong. Components may have always worked fine, and continue to do so, but on the day of disaster they combine in unpredictable ways to produce an accident.
Accidents can happen when all the components have been working as designed, not just when they fail. That’s a difficult lesson to learn. I’d go so far as to say we (in software development and engineering and even testing) didn’t want to learn it. That’s the reality, though, however scary it is.
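For readers who think in code, the point can be sketched in a few lines of Python. This is an illustrative toy, not something taken from the writers quoted here: the component names, the delays and the 150 ms deadline are all invented. Each component meets its own specification, yet the system as a whole misses a limit that no single component owns.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """A hypothetical component with its own, locally agreed, spec."""
    name: str
    delay_ms: int      # how long it actually takes to respond
    max_delay_ms: int  # the limit in this component's own specification

    def within_spec(self) -> bool:
        return self.delay_ms <= self.max_delay_ms


# Invented figures: every component is comfortably inside its own limit.
components = [
    Component("sensor", delay_ms=40, max_delay_ms=50),
    Component("network", delay_ms=90, max_delay_ms=100),
    Component("controller", delay_ms=45, max_delay_ms=50),
]

# Assumed end-to-end safety limit that belongs to the whole system.
SYSTEM_DEADLINE_MS = 150

all_within_spec = all(c.within_spec() for c in components)
total_delay_ms = sum(c.delay_ms for c in components)
system_meets_deadline = total_delay_ms <= SYSTEM_DEADLINE_MS

print(f"Every component within spec: {all_within_spec}")
print(f"End-to-end delay: {total_delay_ms} ms (limit {SYSTEM_DEADLINE_MS} ms)")
print(f"System meets its deadline: {system_meets_deadline}")
```

Nothing here has "failed" in the bimodal sense, and testing each component against its own spec would find nothing wrong. A simple sum is also far tidier than reality, where the components influence each other and adapt to feedback.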
Sidney Dekker developed this idea in a fascinating and important book, “Drift Into Failure”. His argument is that we are developing massively complex systems that we are incapable of understanding. It is therefore misguided to think in terms of system failure arising from a mistake by an operator or the sudden failure of part of the system.
“Drifting into failure is a gradual, incremental decline into disaster driven by environmental pressure, unruly technology and social processes that normalise growing risk. No organisation is exempt from drifting into failure. The reason is that routes to failure trace through the structures, processes and tasks that are necessary to make an organization successful. Failure does not come from the occasional, abnormal dysfunction or breakdown of these structures, processes or tasks, but is an inevitable by-product of their normal functioning. The same characteristics that guarantee the fulfillment of the organization’s mandate will turn out to be responsible for undermining that mandate…
In the drift into failure, accidents can happen without anything breaking, without anybody erring, without anyone violating rules they consider relevant.”
The idea of systems drifting into failure is a practical illustration of emergence in complex systems. The overall system adapts and changes over time, behaving in ways that could not have been predicted from analysis of the components. The fact that a system has operated safely and successfully in the past does not mean it will continue to do so. Dekker says:
“Empirical success… is no proof of safety. Past success does not guarantee future safety.”
Dekker’s argument about drifting to failure should strike a chord with anyone who has worked in large organisations. Complex systems are kept running by people who have to cope with unpredictable technology, in an environment that increasingly tolerates risk so long as disaster is averted. There is constant pressure to cut costs, to do more and do it faster. Margins are gradually shaved in tiny incremental changes, each of which seems harmless and easy to justify. The prevailing culture assumes that everything is safe, until suddenly it isn’t.
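The arithmetic of that gradual shaving can be sketched in Python. This is an illustration under invented assumptions, not Dekker’s model: the 2% cut per change, the 5% review threshold and the 50% danger level are made-up numbers. Every change is approved, because the review only compares each change with the state just before it.

```python
# Hypothetical numbers, chosen only to illustrate the pattern.
CUT_PER_CHANGE = 0.02    # each change shaves 2% off the current margin
REVIEW_THRESHOLD = 0.05  # a change is waved through if it cuts less than 5%
DANGER_LEVEL = 0.5       # fraction of the original margin regarded as unsafe


def change_is_approved(cut: float) -> bool:
    """Local review: judges a change only against the current state."""
    return cut < REVIEW_THRESHOLD


def changes_until_danger(start_margin: float = 1.0) -> int:
    """Count the approved changes needed to drift below the danger level."""
    margin = start_margin
    changes = 0
    while margin >= DANGER_LEVEL * start_margin:
        # Each individual change looks harmless and is never rejected.
        assert change_is_approved(CUT_PER_CHANGE)
        margin *= 1 - CUT_PER_CHANGE
        changes += 1
    return changes


drift_steps = changes_until_danger()
print(f"Margin halved after {drift_steps} individually harmless changes")
```

With these made-up figures it takes 35 approved changes to lose half the original margin, and at no point does any single decision look like the one that caused the problem.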
Langewiesche followed up his observation with a highly significant second point:
“What can go wrong usually goes right – and people just naturally draw the wrong conclusions.”
When it all does go wrong the danger is that we look for quick and simple answers. We focus on the components that we notice are flawed, without noticing all the times everything went right even with those flaws. We don’t think about the way people have been keeping systems running despite the problems, or how the culture has been taking them closer and closer to the edge. We then draw the wrong conclusions. Complex systems and the people operating them are constantly being pushed to the limits, which is an important idea that I will return to.
It is vital that we understand this idea of drift, and how people are constantly having to work with complex systems under pressure. Once we accept these ideas it becomes clear that if we want to avoid drawing the wrong conclusions we have to be sceptical about traditional approaches to accident investigation. I’m talking specifically about root cause analysis, and the notion that “human error” is a meaningful explanation for accidents and problems. I will talk about these in my next post, “part 5 – accident investigations and treating people fairly”.