This is the sixth post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This was the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).
The first post was a reflection, based on personal experience, on the corporate preference for building bureaucracy rather than dealing with complex reality, “Facing the dragons part 1 – corporate bureaucracies”. The second post was about the nature of complex systems, “part 2 – crucial features of complex systems”. The third followed on from part 2, and talked about the impossibility of knowing exactly how complex socio-technical systems will behave with the result that it is impossible to specify them precisely, “part 3 – I don’t know what’s going on”.
The fourth post, “part 4 – a brief history of accident models”, looks at accident models, i.e. the way that safety experts mentally frame accidents when they try to work out what caused them.
The fifth post, “part 5 – accident investigations and treating people fairly”, looks at weaknesses of the way that we have traditionally investigated accidents and failures, assuming neat linearity with clear cause and effect. In particular, our use of root cause analysis, and willingness to blame people for accidents is hard to justify.
This post looks at the response of the safety critical community to such problems and the necessary trade-offs that a practical response requires. The result, Safety II, is intriguing and has important lessons for software testers.
More safety means less feedback
In 2017 nobody was killed on a scheduled passenger flight (sadly that won’t be the case in 2018). That prompted the South China Morning Post to produce this striking graphic, which I’ve reproduced here in butchered form. Please, please look at the original. My version is just a crude taster.
Increasing safety is obviously good news, but it poses a problem for safety professionals. If you rely on accidents for feedback then reducing accidents will choke off the feedback you need to keep improving, to keep safe. The safer that systems become the less data is available. Remember what William Langewiesche said (see part 4).
“What can go wrong usually goes right – and then people draw the wrong conclusions.”
If accidents have become rare, but are extremely serious when they do occur, then it will be highly counter-productive if investigators pick out people’s actions that deviated from, or adapted, the procedures that management or designers assumed were being followed.
These deviations are always present in complex socio-technical systems that are running successfully and it is misleading to focus on them as if they were a necessary and sufficient cause when there is an accident. The deviations may have been a necessary cause of that particular accident, but in a complex system they were almost certainly not sufficient. These very deviations may have previously ensured the overall system would work. Removing the deviation will not necessarily make the system safer.
There might be fewer opportunities to learn from things going wrong, but there’s a huge amount to learn from all the cases that go right, provided we look. We need to try and understand the patterns, the constraints and the factors that are likely to amplify desired emergent behaviour and those that will dampen the undesirable or dangerous. In order to create a better understanding of how complex socio-technical systems can work safely we have to look at how people are using them when everything works, not just when there are accidents.
Safety II – learning from what goes right
Complex systems and accidents might be beyond our comprehension but that doesn’t mean we should just accept that “shit happens”. That is too flippant and fatalistic, two words that you can never apply to the safety critical community.
Safety I is shorthand for the old safety world view, which focused on failure. Its utility has been hindered by the relative lack of feedback from things going wrong, and the danger that paying insufficient attention to how and why things normally go right will lead to the wrong lessons being learned from the failures that do occur.
Safety I assumed linear cause and effect with root causes (see part 5). It was therefore prone to reaching a dangerously simplistic verdict of human error.
This diagram illustrates the focus of Safety I on the unusual, on the bad outcomes. I have copied, and slightly adapted, the Safety I and Safety II diagrams from a document produced by Eurocontrol (the European Organisation for the Safety of Air Navigation), “From Safety-I to Safety-II: A White Paper” (PDF, opens in new tab).
Incidentally, I don’t know why Safety I and Safety II are routinely illustrated using a normal distribution with the Safety I focus kicking in at two standard deviations. I haven’t been able to find a satisfactory explanation for that. I assume that this is simply for illustrative purposes.
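For a sense of scale, here is a minimal sketch of what that cut-off would mean if the curve were taken literally. It assumes a standard normal distribution and that Safety I looks only at the single tail beyond two standard deviations. Neither assumption comes from the white paper, the function name is my own, and the figures say nothing about real accident rates, which are far lower.

```python
import math


def normal_tail_fraction(z: float) -> float:
    """Fraction of a normal distribution lying more than z standard
    deviations out on ONE side of the mean (hypothetical helper)."""
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))


# The single "bad outcomes" tail that the Safety I diagram highlights,
# assuming the cut-off sits at exactly two standard deviations.
tail = normal_tail_fraction(2.0)

# Everything else: the outcomes that Safety II also wants to study.
body = 1.0 - tail

print(f"Safety I focus (one tail beyond 2 sd): {tail:.1%}")
print(f"Remaining outcomes:                    {body:.1%}")
```

Taken at face value, the diagram has Safety I studying a little over 2% of outcomes and ignoring the rest, which is the point of the illustration rather than a measurement.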
If Safety I wants to prevent bad outcomes, in contrast Safety II looks at how good outcomes are reached. Safety II is rooted in a more realistic understanding of complex systems than Safety I and extends the focus to what goes right in systems. That entails a detailed examination of what people are doing with the system in the real world to keep it running. Instead of people being regarded as a weakness and a source of danger, Safety II assumes that people, and the adaptations they introduce to systems and processes, are the very reasons we usually get good, safe outcomes.
If we’ve been involved in the development of the system we might think that we have a good understanding of how the system should be working, but users will always, and rightly, be introducing variations that designers and testers had never envisaged. The old, Safety I, way of thinking regarded these variations as mistakes, but they are needed to keep the systems safe and efficient. We expect systems to be both, which leads on to the next point.
There’s a principle in safety critical systems called ETTO, the Efficiency-Thoroughness Trade-Off. It was devised by Erik Hollnagel, though it might be more accurate to say he made it explicit and popularised the idea. The idea should be very familiar to people who have worked with complex systems. Hollnagel argues that it is impossible to maximise both efficiency and thoroughness. I’m usually reluctant to cite Wikipedia as a source, but its article on ETTO explains it more succinctly than Hollnagel himself did.
“There is a trade-off between efficiency or effectiveness on one hand, and thoroughness (such as safety assurance and human reliability) on the other. In accordance with this principle, demands for productivity tend to reduce thoroughness while demands for safety reduce efficiency.”
Making the system more efficient makes it less likely that it will achieve its important goals. Chasing these goals comes at the expense of efficiency. That has huge implications for safety critical systems. Safety requires some redundancy, duplication and fallbacks. These are inefficient. Efficiencies eliminate margins of error, with potentially dangerous results.
ETTO recognises the tension between organisations’ need to deliver a safe, reliable product or service, and the pressure to do so at the lowest cost possible. In practice, the conflict in goals is usually fully resolved only at the sharp end, where people do the real work and run the systems.
As an example, an airline might offer a punctuality bonus to staff. For an airline safety obviously has the highest priority, but if it was an absolute priority, the only consideration, then it could not contemplate any incentive that would encourage crews to speed up turnarounds on the ground, or to persevere with a landing when prudence would dictate a “go around”. In truth, if safety were an absolute priority, with no element of risk being tolerated, would planes ever take off?
People are under pressure to make the systems efficient, but they are expected to keep the system safe, which inevitably introduces inefficiencies. This tension results in a constant, shifting, pattern of trade-offs and compromises. The danger, as “drift into failure” predicts (see part 4), is that this can lead to a gradual erosion of safety margins.
The old view of safety was to constrain people, reducing variability in the way they use systems. Variability was a human weakness. In Safety II variability in the way that people use the system is seen as a way to ensure the system adapts to stay effective. Humans aren’t seen as a potential weakness, but as a source of flexibility and resilience. Instead of saying “they didn’t follow the set process therefore that caused the accident”, the Safety II approach means asking “why would that have seemed like the right thing to do at the time? Was that normally a safe action?”. Investigations need to learn through asking questions, not making judgments – a lesson it was vital I learned as an inexperienced auditor.
Emergence means that the behaviour of a complex system can’t be predicted from the behaviour of its components. Testers therefore have to think very carefully about when we should apply simple pass or fail criteria. The safety critical community explicitly reject the idea of pass/fail, or the bimodal principle as they call it (see part 4). A flawed component can still be useful. A component working exactly as the designers, and even the users, intended can still contribute to disaster. It all depends on the context, what is happening elsewhere in the system, and testers need to explore the relationships between components and try to learn how people will respond.
Safety is an emergent property of the system. It’s not possible to design it into a system, to build it, or implement it. The system’s rules, controls, and constraints can prevent safety from emerging, but at best they can only enable it. They can create the potential for people to keep the system safe but they cannot guarantee it. Safety depends on user responses and adaptations.
Adaptation means the system is constantly changing as the problem changes, as the environment changes, and as the operators respond to change with their own variations. People manage safety with countless small adjustments.
There is a popular internet meme, “we don’t make mistakes – we do variations”. It is particularly relevant to the safety critical community, who have picked up on it because it neatly encapsulates their thinking, e.g. this article by Steven Shorrock, “Safety-II and Just Culture: Where Now?”. Shorrock, in line with others in the safety critical community, argues that if the corporate culture is to be just and treat people fairly then it is important that the variations that users introduce are understood, rather than being used as evidence to punish them when there is an accident. Pinning the blame on people is not only an abdication of responsibility, it is unjust. As I’ve already argued (see part 5), it’s an ethical issue.
Operator adjustments are vital to keep systems working and safe, which brings us to the idea of trust. A well-designed system has to trust the users to adapt appropriately as the problem changes. The designers and testers can’t know the problems the users will face in the wild. They have to confront the fact that dangerous dragons are lurking in the unknown, and the system has to trust the users with the freedom to stay in the safe zone, clear of the dragons, and out of the disastrous tail of the bell curve that illustrates Safety II.
Safety II and Cynefin
If you’re familiar with Cynefin then you might wonder about Safety II moving away from a focus on the tail of the distribution. Cynefin helps us understand that the tail is where we can find opportunities as well as threats. It’s worth stressing that Safety II does encompass Safety I and the dangerous tail of the distribution. It must not be a binary choice of focusing on either the tail or the body. We have to try to understand not only what happens in the tail, how people and systems can inadvertently end up there, but also what operators do to keep out of the tail.
The Cynefin framework and Safety II share a similar perspective on complexity and the need to allow for, and encourage, variation. I have written about Cynefin elsewhere, e.g. in two articles I wrote for the Association for Software Testing, and there isn’t room to repeat that here. However, I do strongly recommend that testers familiarise themselves with the framework.
To sum it up very briefly, Cynefin helps us to make sense of problems by assigning them to one of four different categories, the obvious, the complicated (the obvious and complicated being related in that problems have predictable causes and resolutions), the complex and the chaotic. Depending on the category different approaches are required. In the case of software development the challenge is to learn more about the problem in order to turn it from a complex activity into a complicated one that we can manage more easily.
Applying Cynefin would result in more emphasis on what’s happening in the tails of the distribution, because that’s where we will find the threats to be avoided and the opportunities to be exploited. Nevertheless, Cynefin isn’t like the old Safety I just because they both focus on the tails. They embody totally different worldviews.
Safety II is an alternative way of looking at accidents, failure and safety. It is not THE definitive way, that renders all others dated, false and heretical. The Safety I approach still has its place, but it’s important to remember its limitations.
Everything flows and nothing abides
Thinking about linear cause and effect, and decomposing components are still vital in helping us understand how different parts of the system work, but they offer only a very limited and incomplete view of what we should be trying to learn. They provide a way of starting to build our understanding, but we mustn’t stop there.
We also have to venture out into the realms of the unknown and often unknowable, to try to understand more about what might happen when the components combine with each other and with humans in complex socio-technical systems. This is when objects become processes, when static elements become part of a flow that is apparent only when we zoom out to take in a bigger picture in time and space.
The idea of understanding objects by stepping back and looking at how they flow and mutate over time has a long philosophical and scientific history. 2,500 years ago Heraclitus wrote:
“Everything flows and nothing abides. Everything gives way and nothing stays fixed.”
Professor Michael McIntyre (Professor of Atmospheric Dynamics, Cambridge University) put it well in a fascinating BBC documentary, “The secret life of waves”.
“If we want to understand things in depth we usually need to think of them both as objects and as dynamic processes and see how it all fits together. Understanding means being able to see something from more than one viewpoint.”
In my next post “part 7 – resilience requires people” I will discuss some of the implications for software testing of the issues I have raised here, in particular how people keep systems going and how we deal with the inevitability of failure. That will lead us to resilience engineering.