The dragons of the unknown; part 7 – resilience requires people

Introduction

This is the seventh post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This was the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).

The first post was a reflection, based on personal experience, on the corporate preference for building bureaucracy rather than dealing with complex reality, “facing the dragons part 1 – corporate bureaucracies”. Part 2 was about the nature of complex systems. The third followed on from part 2, and talked about the impossibility of knowing exactly how complex socio-technical systems will behave with the result that it is impossible to specify them precisely, “I don’t know what’s going on”.

Part 4, “a brief history of accident models”, looked at accident models, i.e. the way that safety experts mentally frame accidents when they try to work out what caused them.

The fifth post, “accident investigations and treating people fairly”, looked at weaknesses in the way that we have traditionally investigated accidents and failures, assuming neat linearity with clear cause and effect. In particular, our use of root cause analysis, and willingness to blame people for accidents is hard to justify.

Part six, “Safety II, a new way of looking at safety”, looks at the response of the safety critical community to such problems and the necessary trade-offs that a practical response requires. The result, Safety II, is intriguing and has important lessons for software testers.

This post is about the importance of system resilience and the vital role that people play in keeping systems going.

Robustness versus resilience

The idea of resilience is where Safety II and Cynefin come together in a practical way for software development. Safety critical professionals have become closely involved in the field of resilience engineering. Dave Snowden, Cynefin’s creator, places great emphasis on the need for systems in complex environments to be resilient.

First, I’d like to make an important distinction, between robustness and resilience. The example Snowden uses is that a seawall is robust but a salt marsh is resilient. A seawall is a barrier to large waves and storms. It protects the land behind, but if it fails it does so catastrophically. A salt marsh protects inland areas by acting as a buffer, absorbing storm waves rather than repelling them. It might deteriorate over time but it won’t fail suddenly and disastrously.

Designing for robustness entails trying to prevent failure. Designing for resilience recognises that failure is inevitable in some form but tries to make that failure containable. Resilience means that recovery should be swift and relatively easy when it does occur, and crucially, it means that failure can be detected quickly, or even in advance so that operators have a chance to respond.

What struck me about the resilience engineering approach is that it matches the way that we managed the highly complex insurance financial applications I mentioned in “part 2 – crucial features of complex systems”. We had never heard of resilience engineering, and the site standards were of limited use. We had to feel our way, finding an appropriate response as the rapid pace of development created terrifying new complexity on top of a raft of ancient legacy applications.

The need for efficient processing of the massive batch runs had to be balanced against the need to detect the constant flow of small failures early, to stop them turning into major problems, and also against the pressing need to facilitate recovery when we inevitably hit serious failure. We also had to think about what “failure” really meant in a context where 100% (or even 98%) accuracy was an unrealistic dream that would distract us from providing flawed but valuable systems to our users within the timescales that were dictated by commercial pressures.
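The balance described above can be illustrated with a small sketch. This is not the way our insurance systems were built; it is a hypothetical Python example, and the names `run_batch`, `BatchResult`, `max_failure_rate` and `min_sample` are all assumptions made for illustration. The idea is that individual failures are contained and recorded rather than stopping the run, while a rising failure rate is detected early enough to halt before a small problem becomes a major one.

```python
from dataclasses import dataclass, field


@dataclass
class BatchResult:
    processed: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)  # (record, error message) pairs
    halted: bool = False


def run_batch(records, handler, max_failure_rate=0.1, min_sample=20):
    """Process records one by one, containing individual failures.

    The default thresholds are arbitrary illustrative values, not recommendations.
    """
    result = BatchResult()
    for seen, record in enumerate(records, start=1):
        try:
            result.processed.append(handler(record))
        except Exception as exc:
            # Contain the failure: set the record aside so it can be
            # investigated and rerun later, and keep the batch moving.
            result.quarantined.append((record, str(exc)))
        # Early detection: once enough records have been seen, stop if
        # failures are no longer a trickle but a flood.
        if seen >= min_sample and len(result.quarantined) / seen > max_failure_rate:
            result.halted = True
            break
    return result
```

Even in a toy like this the trade-off is visible. Every check and every quarantined record costs processing time, and the thresholds can only be set sensibly by people who know what sort of imperfection the users can live with.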

An increasing challenge for testers will be to look for information about how systems fail, and test for resilience rather than robustness. Liz Keogh, in this talk on “Safe-to-Fail”, makes a similar point.

“Testers are really, really good at spotting failure scenarios… they are awesomely imaginative at calamity… Devs are problem solvers. They spot patterns. Testers spot holes in patterns… I have a theory that other people who are in critical positions, like compliance and governance people are also really good at this.”

Developing for resilience means that tolerance for failure becomes more important than a vain attempt to prevent failure altogether. This tolerance often requires greater redundancy. Stripping out redundancy and maximising the efficiency of systems has a downside. Greater efficiency can make applications brittle and inflexible. When problems hit they hit hard and recovery can be difficult.

However, redundancy itself adds to the complexity of systems and can create unexpected ways for them to fail. In our massively complex insurance finance systems a constant threat was that the safeguards we introduced to make the systems resilient might result in the processing runs failing to complete in time and disrupting other essential applications.

The ETTO principle (see part 6, “Safety II – learning from what goes right”) describes the dilemmas we were constantly having to deal with. But the problems we faced were more complex than a simple trade-off; sacrificing efficiency would not necessarily lead to greater effectiveness. Poorly thought out safeguards could harm both efficiency and effectiveness.

We had to nurse those systems carefully. That is a crucial idea to understand. Complex systems require constant attention by skilled people and these people are an indispensable means of controlling the systems.

Ashby’s Law of Requisite Variety

Ashby’s Law of Requisite Variety is also known as The First Law of Cybernetics.

“The complexity of a control system must be equal to or greater than the complexity of the system it controls.”

A stable system needs as much variety in the control mechanisms as there is in the system itself. This does not mean as much variety as the external reality that the system is attempting to manage – a thermostat is just on or off, it isn’t directly controlling the temperature, just whether the heating is on or off.
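A toy model can make the law concrete. The following Python sketch is only an illustration under stated assumptions: the function name `best_outcome_variety` and the rule that the outcome is `(disturbance + response) mod n` are both invented for the example. It tries every possible regulator strategy and reports the smallest number of distinct outcomes that any of them can achieve.

```python
from itertools import product


def best_outcome_variety(n_disturbances: int, n_responses: int) -> int:
    """Smallest number of distinct outcomes any regulator strategy can reach.

    Toy rule (an assumption for illustration): the outcome is
    (disturbance + response) mod n_disturbances.
    """
    disturbances = range(n_disturbances)
    best = n_disturbances
    # A strategy chooses one response for each possible disturbance.
    # Brute force over all strategies is fine for small numbers.
    for strategy in product(range(n_responses), repeat=n_disturbances):
        outcomes = {(d + strategy[d]) % n_disturbances for d in disturbances}
        best = min(best, len(outcomes))
    return best


if __name__ == "__main__":
    for responses in (1, 2, 3, 6):
        print(responses, "responses ->", best_outcome_variety(6, responses), "outcomes")
```

With six possible disturbances, a regulator with one response is left with six different outcomes, two responses cut that to three, three responses to two, and only a regulator with six responses can hold the system to a single outcome. The controller needs as much variety as the disturbances it has to absorb.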

The implication for complex socio-technical systems is that the controlling mechanism must include humans if it is to be stable precisely because the system includes humans. The control mechanism has to be as complex and sophisticated as the system itself. It’s one of those “laws” that looks trivially obvious when it is unpacked, but whose implications can easily be missed unless we turn our minds to the problem. We must therefore trust expertise, trust the expert operators, and learn what they have to do to keep the system running.

I like the analogy of an orchestra’s conductor. It’s a flawed analogy (all models are flawed, though some are useful). The point is that you need a flexible, experienced human to make sense of the complexity and constantly adjust the system to keep it working and useful.

Really know the users

I have learned that it is crucially important to build a deep understanding of the user representatives and the world they work in. This is often not possible, but when I have been able to do it the effort has always paid off. If you can find good contacts in the user community you can learn a huge amount from them. Respect deep expertise and try to acquire it yourself if possible.

When I moved into the world of insurance finance systems I had very bright, enthusiastic, young (but experienced) users who took the time to immerse me in their world. I was responsible for the development, not just the testing. The users wanted me to understand them, their motivation, the pressures on them, where they wanted to get to, the risks they worried about, what kept them awake at night. It wasn’t about record-keeping. It was all about understanding risks and exposures. They wanted to set prices accurately, to compete aggressively using cheap prices for good risks and high prices for the poor risks.

That much was obvious, but I hadn’t understood the deep technical problems and complexities of unpacking the risk factors and the associated profits and losses. Understanding those problems and the concerns of my users was essential to delivering something valuable. The time spent learning from them allowed me to understand not only why imperfection was acceptable and chasing perfection was dangerous, but also what sort of imperfection was acceptable.

Building good, lasting relations with my users was perhaps the best thing I ever did for my employers and it paid huge dividends over the next few years.

We shouldn’t be thinking only about the deep domain experts though. It’s also vital to look at what happens at the sharp end with operational users, perhaps lowly and stressed, carrying out the daily routine. If we don’t understand these users, the pressures and distractions they face, and how they have to respond then we don’t understand the system that matters, the wider complex, socio-technical system.

Testers should be trying to learn more from experts working on human factors and ergonomics and user experience. I’ll finish this section with just a couple of examples of the level of thought and detail that such experts put into the design of aeroplane cockpits.

Boeing is extremely concerned about the danger of overloading cockpit crew with so much information that they pay insufficient attention to the most urgent warnings. The designers therefore only use the colour red in cockpits when the pilot has to take urgent action to keep the plane safe. Red appears only for events like engine fires and worse. Less urgent alerts use other colours and are less dramatic. Pilots know that if they ever see a red light or red text then they have to act. [The original version of this was written before the Boeing 737 MAX crashes. These have raised concerns about Boeing’s practices that I will return to later.]

A second and less obvious example of the level of detailed thought that goes into flight deck designs is that analog speed dials are widely considered safer than digital displays. Pilots can glance at the dial and see that the airspeed is in the right zone given all the other factors (e.g. height, weight and distance to landing) at the same time as they are processing a blizzard of other information.

A digital display isn’t as valuable. (See Edwin Hutchins’ “How a cockpit remembers its speeds”, Cognitive Science 19, 1995.) It might offer more precise information, but it is less useful to pilots when they really need to know about the aircraft’s speed during a landing, a time when they have to deal with many other demands. In a highly complex environment it is more important to be useful than accurate. Safe is more important than precise.

The speed dial that I have used as an illustration is also a good example both of beneficial user variations and of the perils of piling in extra features. The tabs surrounding the dial are known as speed bugs. Originally pilots improvised with grease pencils or tape to mark the higher and lower limits of the airspeed that would be safe for landing that flight. Designers picked up on that and added movable plastic tabs. Unfortunately, they went too far and added tabs for every eventuality, thus bringing visual clutter into what had been a simple solution. (See Donald Norman’s “Turn signals are the facial expressions of automobiles”, chapter 16, “Coffee cups in the cockpit”, Basic Books, 1993.)

We need a better understanding of what will help people make the system work, and what is likely to trip them up. That entails respect for the users and their expertise. We must not only trust them; we must also never lose our own humility about what we can realistically know.

As Jens Rasmussen put it (in a much quoted talk at the IEEE Standards Workshop on Human Factors and Nuclear Safety in 1981, which I have not been able to track down):

“The operator’s role is to make up for holes in designers’ work.”

Testers should be ready to explore and try to explain these holes, the gap between the designers’ limited knowledge and the reality that the users will deal with. We have to try to think about what the system as found will be like. We must not restrict ourselves to the system as imagined.

Lessons from resilience engineering

There is a huge amount to learn from resilience engineering. This community has a significant overlap with the safety critical community. The resilience engineering literature is vast and growing. However, for a quick flavour of what might be useful for testers it’s worth looking at the four principles of Erik Hollnagel’s Functional Resonance Analysis Method (FRAM). FRAM tries to provide a way to model complex socio-technical systems so that we can gain a better understanding of likely outcomes.

    • Success and failure are equivalent. They can happen in similar ways.

      It is dangerously misleading to assume that the system is bimodal, that it is either working or broken. Any factor that is present in a failure can equally be present in success.

    • Success, failure and normal outcomes are all emergent qualities of the whole system.

      We cannot learn about what will happen in a complex system by observing only the individual components.

    • People must constantly make small adjustments to keep the system running.

      These changes are both essential for the successful operation of the system and a contributory cause of failure. Changes are usually approximate adjustments, based on experience, rather than precise, calculated changes. An intrinsic feature of complex systems is that small changes can have a dramatic effect on the overall system. A change to one variable or function will always affect others.

    • “Functional resonance” is the detectable result of unexpected interaction of normal variations.

Functional resonance is a particularly interesting concept. Resonance is the engineering term for the effect we get when different things vibrate with the same frequency. If an object is struck or moved suddenly it will vibrate at its natural frequency. If the object producing the force is also vibrating at the same frequency the result is resonance, and the effect of the impact can be amplified dramatically.

Resonance is the effect you see if you push a child on a swing. If your pushes match the motion of the swing you quickly amplify the motion. If your timing is wrong you dampen the swing’s motion. Resonance can produce unpredictable results. A famous example is the danger that marching troops can bring a bridge down if the rhythm of their marching coincides with the natural frequency at which the bridge vibrates.
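The swing can be put into numbers using the standard textbook model of a driven, damped oscillator, x'' + γx' + ω0²x = (F/m)cos(ωt). The Python sketch below is only an illustration: the function name `steady_state_amplitude` and the parameter values are arbitrary assumptions, not measurements of any real swing or bridge. It computes the steady-state amplitude for a given driving frequency.

```python
import math


def steady_state_amplitude(drive_freq, natural_freq=1.0, damping=0.1, force_per_mass=1.0):
    """Steady-state amplitude of x'' + damping*x' + natural_freq**2 * x
    = force_per_mass * cos(drive_freq * t).

    All default values are arbitrary, chosen only to show the effect.
    """
    detuning = natural_freq ** 2 - drive_freq ** 2
    return force_per_mass / math.sqrt(detuning ** 2 + (damping * drive_freq) ** 2)


if __name__ == "__main__":
    for freq in (0.5, 1.0, 2.0):
        print(f"drive frequency {freq}: amplitude {steady_state_amplitude(freq):.2f}")
```

With these values the same push produces an amplitude of roughly 10 when it matches the natural frequency, but only about 1.3 at half that frequency and about 0.33 at double it. Nothing about the force has changed, only its timing relative to the system it acts on.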

Learning about functional resonance means learning about the way that different variables combine to amplify or dampen the effect that each has, producing outcomes that would have been entirely unpredictable from looking at their behaviour individually.

Small changes can lead to drastically different outcomes at different times depending on what else is happening. The different variables in the system will be coupled in potentially significant ways the designers did not understand. These variables can reinforce, or play off each other, unpredictably.

Safety is a control problem – a matter of controlling these interactions, which means we have to understand them first. But, as we have seen, the answer can’t be to keep adding controls to try and achieve greater safety. Safety is not only a control problem, it is also an emergent and therefore unpredictable property (see appendix). That’s not a comfortable combination for the safety critical community.

Although it is impossible to predict emergent behaviour in a complex system it is possible to learn about the sort of impact that changes and user actions might have. FRAM is not a model for testers. However, it does provide a useful illustration of the approach being taken by safety experts who are desperate to learn and gain a better understanding of how systems might work.

Good testers are surely well placed to reach out and offer their skills and experience. It is, after all, the job of testers to learn about systems and tell a “compelling story” (as Messrs Bach & Bolton put it) to the people who need to know. They need the feedback that we can provide, but if it is to be useful we all have to accept that it cannot be exact.

Lotfi Zadeh, a US mathematician, computer scientist and engineer, introduced the idea of fuzzy logic. He made this deeply insightful observation, quoted in Daniel McNeill and Paul Freiberger’s book “Fuzzy Logic”.

“As complexity rises, precise statements lose meaning, and meaningful statements lose precision.”

Zadeh’s maxim has come to be known as the Law of Incompatibility. If we are dealing with complex socio-technical systems we can be meaningful or we can be precise. We cannot be both; they are incompatible in such a context. It might be hard to admit we can say nothing with certainty, but the truth is that meaningful statements cannot be precise. If we say “yes, we know” then we are misleading the people who are looking for guidance. To pretend otherwise is bullshitting.

In the eighth post of this series, “How we look at complex systems”, I will talk about the way we choose to look at complex systems, the mental models that we build to try and understand them, and the relevance of DevOps.

Appendix – is safety an emergent property?

In this series I have repeatedly referred to safety as being an emergent property of complex adaptive systems. For beginners trying to get their heads round this subject it is an important point to take on board.

However, the nature of safety is rather more nuanced. Erik Hollnagel argues that safety is a state of the whole system, rather than one of the system’s properties. Further, we consciously work towards that state of safety, trying to manipulate the system to achieve the desired state. Therefore safety is not emergent; it is a resultant state, a deliberate result. On the other hand, a lack of safety is an emergent property because it arises from unpredictable and undesirable adaptations of the system and its users.

Other safety experts differ and regard safety as being emergent. For the purpose of this blog I will stick with the idea that it is emergent. However, it is worth bearing Hollnagel’s argument in mind. I am quite happy to think of safety being a state of a system because my training and experience lead me to think of states as being more transitory than properties, but I don’t feel sufficiently strongly to stop referring to safety as being an emergent property.

The dragons of the unknown; part 6 – Safety II, a new way of looking at safety

Introduction

This is the sixth post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This was the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).

The first post was a reflection, based on personal experience, on the corporate preference for building bureaucracy rather than dealing with complex reality, “Facing the dragons part 1 – corporate bureaucracies”. The second post was about the nature of complex systems, “part 2 – crucial features of complex systems”. The third followed on from part 2, and talked about the impossibility of knowing exactly how complex socio-technical systems will behave with the result that it is impossible to specify them precisely, “part 3 – I don’t know what’s going on”.

The fourth post, “part 4 – a brief history of accident models”, looks at accident models, i.e. the way that safety experts mentally frame accidents when they try to work out what caused them.

The fifth post, “part 5 – accident investigations and treating people fairly”, looks at weaknesses of the way that we have traditionally investigated accidents and failures, assuming neat linearity with clear cause and effect. In particular, our use of root cause analysis, and willingness to blame people for accidents is hard to justify.

This post looks at the response of the safety critical community to such problems and the necessary trade offs that a practical response requires. The result, Safety II, is intriguing and has important lessons for software testers.

More safety means less feedback

In 2017 nobody was killed on a scheduled passenger flight (sadly that won’t be the case in 2018). That prompted the South China Morning Post to produce this striking graphic, which I’ve reproduced here in butchered form. Please, please look at the original. My version is just a crude taster.

Increasing safety is obviously good news, but it poses a problem for safety professionals. If you rely on accidents for feedback then reducing accidents will choke off the feedback you need to keep improving, to keep safe. The safer that systems become the less data is available. Remember what William Langewiesche said (see part 4).

“What can go wrong usually goes right – and then people draw the wrong conclusions.”

If accidents have become rare, but are extremely serious when they do occur, then it will be highly counter-productive if investigators pick out people’s actions that deviated from, or adapted, the procedures that management or designers assumed were being followed.

These deviations are always present in complex socio-technical systems that are running successfully and it is misleading to focus on them as if they were a necessary and sufficient cause when there is an accident. The deviations may have been a necessary cause of that particular accident, but in a complex system they were almost certainly not sufficient. These very deviations may have previously ensured the overall system would work. Removing the deviation will not necessarily make the system safer.

There might be fewer opportunities to learn from things going wrong, but there’s a huge amount to learn from all the cases that go right, provided we look. We need to try and understand the patterns, the constraints and the factors that are likely to amplify desired emergent behaviour and those that will dampen the undesirable or dangerous. In order to create a better understanding of how complex socio-technical systems can work safely we have to look at how people are using them when everything works, not just when there are accidents.

Safety II – learning from what goes right

Complex systems and accidents might be beyond our comprehension but that doesn’t mean we should just accept that “shit happens”. That is too flippant and fatalistic, two words that you can never apply to the safety critical people.

Safety I is shorthand for the old safety world view, which focused on failure. Its utility has been hindered by the relative lack of feedback from things going wrong, and the danger that paying insufficient attention to how and why things normally go right will lead to the wrong lessons being learned from the failures that do occur.

Safety I assumed linear cause and effect with root causes (see part 5). It was therefore prone to reaching a dangerously simplistic verdict of human error.

This diagram illustrates the focus of Safety I on the unusual, on the bad outcomes. I have copied, and slightly adapted, the Safety I and Safety II diagrams from a document produced by Eurocontrol (The European Organisation for the Safety of Air Navigation), “From Safety-I to Safety-II: A White Paper”.

Incidentally, I don’t know why Safety I and Safety II are routinely illustrated using a normal distribution with the Safety I focus kicking in at two standard deviations. I haven’t been able to find a satisfactory explanation for that. I assume that this is simply for illustrative purposes.

If Safety I wants to prevent bad outcomes, in contrast Safety II looks at how good outcomes are reached. Safety II is rooted in a more realistic understanding of complex systems than Safety I and extends the focus to what goes right in systems. That entails a detailed examination of what people are doing with the system in the real world to keep it running. Instead of people being regarded as a weakness and a source of danger, Safety II assumes that people, and the adaptations they introduce to systems and processes, are the very reasons we usually get good, safe outcomes.

If we’ve been involved in the development of the system we might think that we have a good understanding of how the system should be working, but users will always, and rightly, be introducing variations that designers and testers had never envisaged. The old, Safety I, way of thinking regarded these variations as mistakes, but they are needed to keep the systems safe and efficient. We expect systems to be both, which leads on to the next point.

There’s a principle in safety critical systems called ETTO, the Efficiency Thoroughness Trade Off. It was devised by Erik Hollnagel, though it might be more accurate to say he made it explicit and popularised the idea. The idea should be very familiar to people who have worked with complex systems. Hollnagel argues that it is impossible to maximise both efficiency and thoroughness. I’m usually reluctant to cite Wikipedia as a source, but its article on ETTO explains it more succinctly than Hollnagel himself did.

“There is a trade-off between efficiency or effectiveness on one hand, and thoroughness (such as safety assurance and human reliability) on the other. In accordance with this principle, demands for productivity tend to reduce thoroughness while demands for safety reduce efficiency.”

Making the system more efficient makes it less likely that it will achieve its important goals. Chasing these goals comes at the expense of efficiency. That has huge implications for safety critical systems. Safety requires some redundancy, duplication and fallbacks. These are inefficient. Efficiencies eliminate margins of error, with potentially dangerous results.

ETTO recognises the tension between organisations’ need to deliver a safe, reliable product or service, and the pressure to do so at the lowest cost possible. In practice, the conflict in goals is usually fully resolved only at the sharp end, where people do the real work and run the systems.

As an example, an airline might offer a punctuality bonus to staff. For an airline, safety obviously has the highest priority, but if it were an absolute priority, the only consideration, then it could not contemplate any incentive that would encourage crews to speed up turnarounds on the ground, or to persevere with a landing when prudence would dictate a “go around”. In truth, if safety were an absolute priority, with no element of risk being tolerated, would planes ever take off?

People are under pressure to make the systems efficient, but they are expected to keep the system safe, which inevitably introduces inefficiencies. This tension results in a constant, shifting, pattern of trade-offs and compromises. The danger, as “drift into failure” predicts (see part 4), is that this can lead to a gradual erosion of safety margins.

The old view of safety was to constrain people, reducing variability in the way they use systems. Variability was a human weakness. In Safety II variability in the way that people use the system is seen as a way to ensure the system adapts to stay effective. Humans aren’t seen as a potential weakness, but as a source of flexibility and resilience. Instead of saying “they didn’t follow the set process therefore that caused the accident”, the Safety II approach means asking “why would that have seemed like the right thing to do at the time? Was that normally a safe action?”. Investigations need to learn through asking questions, not making judgments – a lesson it was vital I learned as an inexperienced auditor.

Emergence means that the behaviour of a complex system can’t be predicted from the behaviour of its components. Testers therefore have to think very carefully about when we should apply simple pass or fail criteria. The safety critical community explicitly reject the idea of pass/fail, or the bimodal principle as they call it (see part 4). A flawed component can still be useful. A component working exactly as the designers, and even the users, intended can still contribute to disaster. It all depends on the context, what is happening elsewhere in the system, and testers need to explore the relationships between components and try to learn how people will respond.

Safety is an emergent property of the system. It’s not possible to design it into a system, to build it, or implement it. The system’s rules, controls, and constraints might prevent safety from emerging, but at best they can only enable it. They can create the potential for people to keep the system safe but they cannot guarantee it. Safety depends on user responses and adaptations.

Adaptation means the system is constantly changing as the problem changes, as the environment changes, and as the operators respond to change with their own variations. People manage safety with countless small adjustments.

There is a popular internet meme, “we don’t make mistakes – we do variations”. It is particularly relevant to the safety critical community, who have picked up on it because it neatly encapsulates their thinking, e.g. this article by Steven Shorrock, “Safety-II and Just Culture: Where Now?”. Shorrock, in line with others in the safety critical community, argues that if the corporate culture is to be just and treat people fairly then it is important that the variations that users introduce are understood, rather than being used as evidence to punish them when there is an accident. Pinning the blame on people is not only an abdication of responsibility, it is unjust. As I’ve already argued (see part 5), it’s an ethical issue.

Operator adjustments are vital to keep systems working and safe, which brings us to the idea of trust. A well-designed system has to trust the users to adapt appropriately as the problem changes. The designers and testers can’t know the problems the users will face in the wild. They have to confront the fact that dangerous dragons are lurking in the unknown, and the system has to trust the users with the freedom to stay in the safe zone, clear of the dragons, and out of the disastrous tail of the bell curve that illustrates Safety II.

Safety II and Cynefin

If you’re familiar with Cynefin then you might wonder about Safety II moving away from a focus on the tail of the distribution. Cynefin helps us understand that the tail is where we can find opportunities as well as threats. It’s worth stressing that Safety II does encompass Safety I and the dangerous tail of the distribution. It must not be a binary choice of focusing on either the tail or the body. We have to try to understand not only what happens in the tail, how people and systems can inadvertently end up there, but also what operators do to keep out of the tail.

The Cynefin framework and Safety II share a similar perspective on complexity and the need to allow for, and encourage, variation. I have written about Cynefin elsewhere, e.g. in two articles I wrote for the Association for Software Testing, and there isn’t room to repeat that here. However, I do strongly recommend that testers familiarise themselves with the framework.

To sum it up very briefly, Cynefin helps us to make sense of problems by assigning them to one of four different categories, the obvious, the complicated (the obvious and complicated being related in that problems have predictable causes and resolutions), the complex and the chaotic. Depending on the category different approaches are required. In the case of software development the challenge is to learn more about the problem in order to turn it from a complex activity into a complicated one that we can manage more easily.

Applying Cynefin would result in more emphasis on what’s happening in the tails of the distribution, because that’s where we will find the threats to be avoided and the opportunities to be exploited. Nevertheless, Cynefin isn’t like the old Safety I just because they both focus on the tails. They embody totally different worldviews.

Safety II is an alternative way of looking at accidents, failure and safety. It is not THE definitive way, that renders all others dated, false and heretical. The Safety I approach still has its place, but it’s important to remember its limitations.

Everything flows and nothing abides

Thinking about linear cause and effect, and decomposing components are still vital in helping us understand how different parts of the system work, but they offer only a very limited and incomplete view of what we should be trying to learn. They provide a way of starting to build our understanding, but we mustn’t stop there.

We also have to venture out into the realms of the unknown and often unknowable, to try to understand more about what might happen when the components combine with each other and with humans in complex socio-technical systems. This is when objects become processes, when static elements become part of a flow that is apparent only when we zoom out to take in a bigger picture in time and space.

The idea of understanding objects by stepping back and looking at how they flow and mutate over time has a long philosophical and scientific history. 2,500 years ago Heraclitus wrote.

“Everything flows and nothing abides. Everything gives way and nothing stays fixed.”

Professor Michael McIntyre (Professor of Atmospheric Dynamics, Cambridge University) put it well in a fascinating BBC documentary, “The secret life of waves”.

“If we want to understand things in depth we usually need to think of them both as objects and as dynamic processes and see how it all fits together. Understanding means being able to see something from more than one viewpoint.”

In my next post “part 7 – resilience requires people” I will discuss some of the implications for software testing of the issues I have raised here, in particular how people keep systems going, and dealing with the inevitability of failure. That will lead us to resilience engineering.

The dragons of the unknown; part 5 – accident investigations and treating people fairly

Introduction

This is the fifth post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This was the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).

The first post was a reflection, based on personal experience, on the corporate preference for building bureaucracy rather than dealing with complex reality, “Facing the dragons part 1 – corporate bureaucracies”. The second post was about the nature of complex systems, “part 2 – crucial features of complex systems”. The third followed on from part 2, and talked about the impossibility of knowing exactly how complex socio-technical systems will behave with the result that it is impossible to specify them precisely, “part 3 – I don’t know what’s going on”.

The fourth post “part 4 – a brief history of accident models” looks at accident models, i.e. the way that safety experts mentally frame accidents when they try to work out what caused them. This post looks at weaknesses in the way that we have traditionally investigated accidents and failures, assuming neat linearity with clear cause and effect. In particular, our use of root cause analysis, and willingness to blame people for accidents is hard to justify.

The limitations of root cause analysis

root cause (fishbone) diagram

Once you accept that complex systems can’t have clear and neat links between causes and effects then the idea of root cause analysis becomes impossible to sustain. “Fishbone” cause and effect diagrams (like those used in Six Sigma) illustrate traditional thinking, that it is possible to track back from an adverse event to find a root cause that was both necessary and sufficient to bring it about.

The assumption of linearity with tidy causes and effects is no more than wishful thinking. Like the Domino Model (see “part 4 – a brief history of accident models”) it encourages people to think there is a single cause, and to stop looking when they’ve found it. It doesn’t even offer the insight of the Swiss Cheese Model (also see part 4) that there can be multiple contributory causes, all of them necessary but none of them sufficient to produce an accident. That is a key idea. When complex systems go wrong there is rarely a single cause; causes are necessary, but not sufficient.
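The difference between a necessary cause and a sufficient one can be made concrete with a small sketch. The Python below is purely illustrative. The factor names are hypothetical, and the assumption that an accident needs every factor to be present at once is a deliberate simplification, not a model of any real system.

```python
# Hypothetical contributory factors; the names are invented for illustration.
FACTORS = ("worn_component", "missed_inspection", "storm", "time_pressure")


def accident_occurs(present):
    """Toy assumption: the accident happens only if every factor is present."""
    return all(factor in present for factor in FACTORS)


def is_necessary(factor):
    """A factor is necessary if removing it from the full set prevents the accident."""
    others = set(FACTORS) - {factor}
    return not accident_occurs(others)


def is_sufficient(factor):
    """A factor is sufficient if it produces the accident entirely on its own."""
    return accident_occurs({factor})


if __name__ == "__main__":
    for factor in FACTORS:
        print(factor, "necessary:", is_necessary(factor), "sufficient:", is_sufficient(factor))
```

In this toy model every factor is necessary and none is sufficient, so an investigator could stop at any one of them and call it the root cause.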

complex airline system

Here is a more realistic depiction of what a complex socio-technical system looks like. It is a representation of the operations control system for an airline. The specifics don’t matter. It is simply a good illustration of how messy a real, complex system looks when we try to depict it.

This is actually very similar to the insurance finance applications diagram I drew up for Y2K (see “part 1 – corporate bureaucracies”). There was no neat linearity. My diagram looked just like this, with a similar number of nodes, or systems most of which had multiple two-way interfaces with others. And that was just at the level of applications. There was some intimidating complexity within these systems.

As there is no single cause of failure the search for a root cause can be counter-productive. There are always flaws, bugs, problems, deviances from process, variations. So you can always fix on something that has gone wrong. But it’s not really a meaningful single cause. It’s arbitrary.

The root cause is just where you decide to stop looking. The cause is not something you discover. It’s something you choose and construct. The search for a root cause can mean attention will focus on something that is not inherently dangerous, something that had previously “failed” repeatedly but without any accident. The response might prevent that particular failure and therefore ensure there’s no recurrence of an identical accident. However, introducing a change, even if it’s a fix, to one part of a complex system affects the system in unpredictable ways. The change therefore creates new possibilities for failure that are unknown, even unknowable.

It’s always been hard, even counter-intuitive, to accept that we can have accidents & disasters without any new failure of a component, or even without any technical failure that investigators can identify and without external factors interfering with the system and its operators. We can still have air crashes for which no cause is ever found. The pressure to find an answer, any plausible answer, means there has always been an overwhelming temptation to fix the blame on people, on human error.

Human error – it’s the result of a problem, not the cause

If there’s an accident you can always find someone who screwed up, or who didn’t follow the rules, the standard, or the official process. One problem with that is the same applies when everything goes well. Something that troubled me in audit was realising that every project had problems, every application had bugs when it went live, and there were always deviations from the standards. But the reason smart people were deviating wasn’t that they were irresponsible. They were doing what they had to do to deliver the project. Variation was a sign of success as much as failure. Beating people up didn’t tell us anything useful, and it was appallingly unfair.

One of the rewarding aspects of working as an IT auditor was conducting post-implementation reviews and being able to defend developers who were being blamed unfairly for problem projects. The business would give them impossible jobs, complacently assuming the developers would pick up all the flak for the inevitable problems. When auditors, like me, called them out for being cynical and irresponsible they hated it. They used to say it was because I had a developer background and was angling for my next job. I didn’t care because I was right. Working in a good audit department requires you to build up a thick skin, and some healthy arrogance.

There were always deviations from standards, and the tougher the challenge the more obvious they would be, but these allegedly deviant developers were the only reason anything was delivered at all, albeit by cutting a few corners.

It’s an ethical issue. Saying the cause of an accident is that people screwed up is opting for an easy answer that doesn’t offer any useful insights for the future and just pushes problems down the line.

Sidney Dekker used a colourful analogy. Dumping the blame on an individual after an accident is “peeing in your pants management”.

“You feel relieved, but only for a short while… you start to feel cold and clammy and nasty. And you start stinking. And, oh by the way, you look like a fool.”

Putting the blame on human error doesn’t just stink. It obscures the deeper reasons for failure. It is the result of a problem, not the cause. It also encourages organisations to push for greater automation, in the vain hope that will produce greater safety and predictability, and fewer accidents.

The ironies of automation

An important part of the motivation to automate systems is that humans are seen as unreliable and inefficient. So they are replaced by automation, but that leaves the humans with jobs that are even more complex and even more vulnerable to errors. The attempt to remove errors creates fresh possibilities for even worse errors. As Lisanne Bainbridge wrote in a 1983 paper “The ironies of automation”;

“The more advanced a control system is… the more crucial may be the contribution of the human operator.”

There are all sorts of twists to this. Automation can mean the technology does all the work and operators have to watch a machine that’s in a steady-state, with nothing to respond to. That means they can lose attention & not intervene when they need to. If intervention is required the danger is that vital alerts will be lost if the system is throwing too much information at operators. There is a difficult balance to be struck between denying operators feedback, and thus lulling them into a sense that everything is fine, and swamping them with information. Further, if the technology is doing deeply complicated processing, are the operators really equipped to intervene? Will the system allow operators to override? Bainbridge makes the further point;

“The designer who tries to eliminate the operator still leaves the operator to do the tasks which the designer cannot think how to automate.”

This is a vital point. Systems are becoming more complex and the tasks left to the humans become ever more demanding. System designers have only a very limited understanding of what people will do with their systems. They don’t know. The only certainty is that people will respond and do things that are hard, or impossible, to predict. That is bound to deviate from formal processes, which have been defined in advance, but these deviations, or variations, will be necessary to make the systems work.

Acting on the assumption that these deviations are necessarily errors and “the cause” when a complex socio-technical system fails is ethically wrong. However, there is a further twist to the problem, summed up by the Law of Stretched Systems.

Stretched systems

Lawrence Hirschhorn’s Law of Stretched Systems is similar to the Fundamental Law of Traffic Congestion. New roads create more demand to use them, so new roads generate more traffic. Likewise, improvements to systems result in demands that the system, and the people, must do more. Hirschhorn seems to have come up with the law informally, but it has been popularised by the safety critical community, especially by David Woods and Richard Cook.

“Every system operates always at its capacity. As soon as there is some improvement, some new technology, we stretch it.”

And the corollary, furnished by Woods and Cook.

“Under resource pressure, the benefits of change are taken in increased productivity, pushing the system back to the edge of the performance envelope.”

Every change and improvement merely adds to the stress that operators are coping with. The obvious response is to place more emphasis on ergonomics and human factors, to try and ensure that the systems are tailored to the users’ needs and as easy to use as possible. That might be important, but it hardly resolves the problem. These improvements are themselves subject to the Law of Stretched Systems.

This was all first noticed in the 1990s after the First Gulf War. The US Army hadn’t been in serious combat for 18 years. Technology had advanced massively. Throughout the 1980s the army reorganised, putting more emphasis on technology and training. The intention was that the technology should ease the strain on users, reduce fatigue and be as simple to operate as possible. It didn’t pan out that way when the new army went to war. Anthony H. Cordesman and Abraham Wagner analysed in depth the lessons of the conflict. They were particularly interested in how the technology had been used.

“Virtually every advance in ergonomics was exploited to ask military personnel to do more, do it faster, and do it in more complex ways… New tactics and technology simply result in altering the pattern of human stress to achieve a new intensity and tempo of combat.”

Improvements in technology create greater demands on the technology – and the people who operate it. Competitive pressures push companies towards the limits of the system. If you introduce an enhancement to ease the strain on users then managers, or senior officers, will insist on exploiting the change. Complex socio-technical systems always operate at the limits.

This applies not only to soldiers operating high tech equipment. It applies also to the ordinary infantry soldier. In 1860 the British army was worried that troops had to carry 27kg into combat. The load has now risen to 58kg. US soldiers have to carry almost 9kg of batteries alone. The Taliban called NATO troops “donkeys”.

These issues don’t apply only to the military. They’ve prompted a huge amount of new thinking in safety critical industries, in particular healthcare and air transport.

The overdose – system behaviour is not explained by the behaviour of its component technology

Remember the traditional argument that any system that was not deterministic was inherently buggy and badly designed? See “part 2 – crucial features of complex systems”.

In reality that applies only to individual components, and even then complexity & thus bugginess can be inescapable. When you’re looking at the whole socio-technical system it just doesn’t stand up.

Introducing new controls, alerts and warnings doesn’t just increase the complexity of the technology, as I mentioned earlier with the MiG jet designers (see part 4). These new features add to the burden on the people. Alerts and error messages can swamp users of complex systems and they miss the information they really need to know.

I can’t recommend strongly enough the story told by Bob Wachter in “The overdose: harm in a wired hospital”.

A patient at a hospital in California received an overdose of 38½ times the correct amount. Investigation showed that the technology worked fine. All the individual systems and components performed as designed. They flagged up potential errors before they happened. So someone obviously screwed up. That would have been the traditional verdict. However, the hospital allowed Wachter to interview everyone involved in each of the steps. He observed how the systems were used in real conditions, not in a demonstration or test environment. Over five articles he told a compelling story that will force any fair reader to admit “yes, I’d have probably made the same error in those circumstances”.

Happily the patient survived the overdose. The hospital staff involved were not disciplined and were allowed to return to work. The hospital had to think long and hard about how it would try to prevent such mistakes recurring. The uncomfortable truth they had to confront was that there were no simple answers. Blaming human error was a cop out. Adding more alerts would compound the problems staff were already facing; one of the causes of the mistake was the volume of alerts swamping staff, making it hard, or impossible, to sift out the vital warnings from the important and the merely useful.

One of the hard lessons was that focussing on making individual components more reliable had harmed the overall system. The story is an important illustration of the maxim in the safety critical community that trying to make systems safer can make them less safe.

Some system changes were required and made, but the hospital realised that the deeper problem was organisational and cultural. They made the brave decision to allow Wachter to publicise his investigation and his series of articles is well worth reading.

The response of the safety critical community to such problems and the necessary trade offs that a practical response requires, is intriguing with important lessons for software testers. I shall turn to this in my next post, “part 6 – Safety II, a new way of looking at safety”.

The dragons of the unknown; part 4 – a brief history of accident models

Introduction

This is the fourth post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This was the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).

The first post was a reflection, based on personal experience, on the corporate preference for building bureaucracy rather than dealing with complex reality, “The dragons of the unknown; part 1 – corporate bureaucracies”. The second post was about the nature of complex systems, “part 2 – crucial features of complex systems”. The third followed on from part 2, and talked about the impossibility of knowing exactly how complex socio-technical systems will behave with the result that it is impossible to specify them precisely, “part 3 – I don’t know what’s going on”.

This post looks at accident models, i.e. the way that safety experts mentally frame accidents when they try to work out what caused them.

Why do accidents happen?

I want to take you back to the part of the world I come from, the east of Scotland. The Tay Bridge is 3.5 km long, the longest railway bridge in the United Kingdom. It’s the second railway bridge over the Tay. The first was opened in 1878 and came down in a storm in 1879, taking a train with it and killing everyone on board.

The stumps of the old bridge were left in place because of concern that removing them would disturb the riverbed. I always felt they were there as a lesson for later generations. Children in that part of Scotland can’t miss these stumps. I remember being bemused when I learned about the disaster. “Mummy, Daddy, what are those things beside the bridge? What…? Why…? How…?” So bridges could fall down. Adults could screw up. Things could go badly wrong. There might not be a happy ending. It was an important lesson in how the world worked for a small child.

Accident investigations are difficult and complex even for something like a bridge, which might appear, at first sight, to be straightforward in concept and function. The various factors that featured in the inquiry report for the Tay Bridge disaster included the bridge design, manufacture of components, construction, maintenance, previous damage, wind speed, train speed and the state of the riverbed.

These factors obviously had an impact on each other. That’s confusing enough, and it’s far worse for complex socio-technical systems. You could argue that a bridge is either safe or unsafe, usable or dangerous. It’s one or the other. There might be argument about where you would draw the line, and about the context in which the bridge might be safe, but most people would be comfortable with the idea of a line. Safety experts call that idea of a line separating the unbroken from the broken the bimodal principle (not to be confused with Gartner’s bimodal IT management); a system is either working or it is broken.

Thinking in bimodal terms becomes pointless when you are working with systems that run in a constantly flawed state, one of partial failure, when no-one knows how these systems work or even exactly how they should work. This is all increasingly recognised. But when things go wrong and we try to make sense of them there is a huge temptation to arrange our thinking in a way with which we are comfortable. We fall back on mental models that seem to make sense of complexity, however far removed they are from reality. These are the envisioned worlds I mentioned in part 1.

We home in on factors that are politically convenient, the ones that will be most acceptable to the organisation’s culture. We can see this, and also how thinking has developed, by looking at the history of the conceptual models that have been used for accident investigations.

Heinrich’s Domino Model (1931)

Domino Model (fig 3)

The Domino Model was a traditional and very influential way to help safety experts make sense of accidents. Accidents happened because one factor kicked into another, and so on down the line of dominos, as illustrated by Heinrich’s figure 3. Problems with the organisation or environment would lead to people making mistakes and doing something dangerous, which would lead to accidents and injury. It assumed neat linearity & causation. Its attraction was that it appealed to management. Take a look at the next two diagrams in the sequence, figures 4 and 5.

Domino Model (figs 4-5)

The model explicitly states that taking out the unsafe act will stop an accident. It encouraged investigators to look for a single cause. That was immensely attractive because it kept attention away from any mistakes the management might have made in screwing up the organisation. The chain of collapsing dominos is broken by removing the unsafe act and you don’t get an accident.

The model is consistent with beating up the workers who do the real work. But blaming the workers was only part of the problem with the Domino Model. It was nonsense to think that you could stop accidents by removing unsafe acts, variations from process, or mistakes from the chain. It didn’t have any empirical, theoretical or scientific basis. It was completely inappropriate for complex systems. Thinking in terms of a chain of events was quite simply wrong when analysing these systems. Linearity, neat causation and decomposing problems into constituent parts for separate analysis don’t work.

Despite these fundamental flaws the Domino Model was popular. Or rather, it was popular because of its flaws. It told managers what they wanted to hear. It helped organisations make sense of something they would otherwise have been unable to understand. Accepting that they were dealing with incomprehensible systems was too much to ask.

Swiss Cheese Model (barriers)

James Reason’s Swiss Cheese Model was the next step and it was an advance, but limited. The model did recognise that problems or mistakes wouldn’t necessarily lead to an accident. That would happen only if a series of them lined up. You can therefore stop the accident recurring by inserting a new barrier. However, the model is still based on the idea of linearity and of a sequence of cause and effect, and also the idea that you can and should decompose problems for analysis. This is a dangerously limited way of looking at what goes wrong in complex socio-technical systems, and the danger is very real with safety critical systems.

Of course, there is nothing inherently wrong with analysing systems and problems by decomposing them or relying on an assumption of cause and effect. These both have impeccably respectable histories in science and philosophy. Reducing problems to their component parts has its intellectual roots in the work of Rene Descartes. This approach implies that you can understand the behaviour of a system by looking at the behaviour of its components. Descartes’ approach (the Cartesian) fits neatly with a Newtonian scientific worldview, which holds that it is possible to identify definite causes and effects for everything that happens.

If you want to understand how a machine works and why it is broken, then these approaches are obviously valid. They don’t work when you are dealing with a complex socio-technical system. The whole is quite different from the sum of its parts. Thinking of linear flows is misleading when the different elements of a system are constantly influencing each other and adapting to feedback. Complex systems have unpredictable, emergent properties, and safety is an emergent outcome of complex socio-technical systems.

All designs are a compromise

Something that safety experts are keenly aware of is that all designs are compromises. Adding a new barrier, as envisaged by the Swiss Cheese Model, to try and close off the possibility of an accident can be counter-productive. Introducing a change, even if it’s a fix, to one part of a complex system affects the whole system in unpredictable and possibly harmful ways. The change creates new possibilities for failure, that are unknown, even unknowable.

It’s not a question of regression testing. It’s bigger and deeper than that. The danger is that we create new pathways to failure. The changes might initially seem to work, to be safe, but they can have damaging results further down the line as the system adapts and evolves, as people push the system to the edges.

There’s a second danger. New alerts or controls increase the complexity with which the user has to cope. That was a problem I now recognise with our approach as auditors. We didn’t think through the implications carefully enough. If you keep throwing in fixes, controls and alerts then the user will miss the ones they really need to act on. That reduces the effectiveness, the quality and ultimately the safety of the system. I’ll come back to that later. This is an important paradox. Trying to make a system more reliable and safer can make it more dangerous and less reliable.

The designers of the Soviet Union’s MiG-29 jet fighter observed, “the safest part is the one we could leave off” (according to Sidney Dekker in his book “The field guide to understanding human error”).

Drift into failure

A friend once commented that she could always spot one of my designs. They reflected a very pessimistic view of the world. I couldn’t know how things would go wrong, I just knew they would and my experience had taught me where to be wary. Working in IT audit made me very cautious. Not only had I completely bought into Murphy’s Law, “anything that can go wrong will go wrong”, I had my own variant; “and it will go wrong in ways I couldn’t have imagined”.

William Langewiesche is a writer and former commercial pilot who has written extensively on aviation. He provided an interesting and insightful correction to Murphy, and also to me (from his book “Inside the Sky”).

“What can go wrong usually goes right”.

There are two aspects to this. Firstly, as I have already discussed, complex socio-technical systems are always flawed. They run with problems, bugs, and variations both from official process and in the way people behave. Despite all the problems under the surface everything seems to go fine, till one day it all goes wrong.

The second important insight is that you can have an accident even if no individual part of the system has gone wrong. Components may have always worked fine, and continue to do so, but on the day of disaster they combine in unpredictable ways to produce an accident.

Accidents can happen when all the components have been working as designed, not just when they fail. That’s a difficult lesson to learn. I’d go so far as to say we (in software development and engineering and even testing) didn’t want to learn it. However, that’s the reality, however scary it is.
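For software people, a familiar small-scale version of this is two components that each meet their own specification but misbehave in combination. The sketch below is a hypothetical illustration only. The timeout and reply limits are invented, and the assumption is a client that resends a request whenever its timeout expires while talking to a server that is allowed longer than that to reply.

```python
# Hypothetical specifications, invented for this illustration.
CLIENT_TIMEOUT_S = 2.0      # client spec: resend if no reply within 2 seconds
SERVER_MAX_REPLY_S = 5.0    # server spec: reply to every request within 5 seconds


def server_meets_spec(reply_time_s):
    """The server is working as designed if it replies within its own limit."""
    return reply_time_s <= SERVER_MAX_REPLY_S


def requests_sent(reply_time_s, max_attempts=3):
    """Count the requests a spec-compliant client sends before the first reply arrives."""
    attempts = 1
    waited = CLIENT_TIMEOUT_S
    # The client resends each time its timeout expires with no reply yet.
    while waited < reply_time_s and attempts < max_attempts:
        attempts += 1
        waited += CLIENT_TIMEOUT_S
    return attempts


if __name__ == "__main__":
    reply_time = 3.0
    print("server within spec:", server_meets_spec(reply_time))
    print("requests sent:", requests_sent(reply_time))
```

With a three second reply the server is inside its limit and the client is following its timeout rule, yet the same request is sent twice. Neither component has failed. The problem exists only in the combination.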

Sidney Dekker developed this idea in a fascinating and important book, “Drift Into Failure”. His argument is that we are developing massively complex systems that we are incapable of understanding. It is therefore misguided to think in terms of system failure arising from a mistake by an operator or the sudden failure of part of the system.

“Drifting into failure is a gradual, incremental decline into disaster driven by environmental pressure, unruly technology and social processes that normalise growing risk. No organisation is exempt from drifting into failure. The reason is that routes to failure trace through the structures, processes and tasks that are necessary to make an organization successful. Failure does not come from the occasional, abnormal dysfunction or breakdown of these structures, processes or tasks, but is an inevitable by-product of their normal functioning. The same characteristics that guarantee the fulfillment of the organization’s mandate will turn out to be responsible for undermining that mandate…

In the drift into failure, accidents can happen without anything breaking, without anybody erring, without anyone violating rules they consider relevant.”

The idea of systems drifting into failure is a practical illustration of emergence in complex systems. The overall system adapts and changes over time, behaving in ways that could not have been predicted from analysis of the components. The fact that a system has operated safely and successfully in the past does not mean it will continue to do so. Dekker says;

“Empirical success… is no proof of safety. Past success does not guarantee future safety.”

Dekker’s argument about drifting to failure should strike a chord with anyone who has worked in large organisations. Complex systems are kept running by people who have to cope with unpredictable technology, in an environment that increasingly tolerates risk so long as disaster is averted. There is constant pressure to cut costs, to do more and do it faster. Margins are gradually shaved in tiny incremental changes, each of which seems harmless and easy to justify. The prevailing culture assumes that everything is safe, until suddenly it isn’t.

Langewiesche followed up his observation with a highly significant second point;

“What can go wrong usually goes right – and people just naturally draw the wrong conclusions.”

When it all does go wrong the danger is that we look for quick and simple answers. We focus on the components that we notice are flawed, without noticing all the times everything went right even with those flaws. We don’t think about the way people have been keeping systems running despite the problems, or how the culture has been taking them closer and closer to the edge. We then draw the wrong conclusions. Complex systems and the people operating them are constantly being pushed to the limits, which is an important idea that I will return to.

It is vital that we understand this idea of drift, and how people are constantly having to work with complex systems under pressure. Once we start to accept these ideas it starts to become clear that if we want to avoid drawing the wrong conclusions we have to be sceptical about traditional approaches to accident investigation. I’m talking specifically about root cause analysis, and the notion that “human error” is a meaningful explanation for accidents and problems. I will talk about these in my next post, “part 5 – accident investigations and treating people fairly”.

The dragons of the unknown; part 3 – I don’t know what’s going on

Introduction

This is the third post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This was the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).

The first post was a reflection, based on personal experience, on the corporate preference for building bureaucracy rather than dealing with complex reality, “The dragons of the unknown; part 1 – corporate bureaucracies”. The second post was about the nature of complex systems, “part 2 – crucial features of complex systems”. This one follows on from part 2, which talked about the impossibility of knowing exactly how complex socio-technical systems will behave with the result that it is impossible to specify them precisely.

The starting point for a system audit

When we audited a live system the specifications of the requirements and design didn’t matter – they really didn’t. This was because;

  • specs were a partial and flawed picture of what was required at the time that the system was built,
  • they were not necessarily relevant to the business risks and problems facing the company at the time of the audit,
  • the system’s compliance, or failure to comply, with the specs told us nothing useful about what the system was doing or should be doing (we genuinely didn’t care about “compliance”),
  • we never thought it was credible that the specs would have been updated to reflect subsequent changes,
  • we were interested in the real behaviour of the people using the system, not what the analysts and designers thought they would or should be doing.

It was therefore a complete waste of time in a tightly time boxed audit if we waded through the specs. Context driven testers have been fascinated when I’ve explained that we started with a blank sheet of paper. The flows we were interested in were the things that mattered to the people that mattered.

We would identify a key person and ask them to talk us through the business context of the system, sketching out how it fitted into its environment. The interfaces were where we always expected things to go wrong. The scope of the audit was dictated by the sketches of the people who mattered, not the system documentation.

We might have started with a blank sheet but we were highly rigorous. We used a structured methods modelling technique called IDEF0 to make sense of what we were learning, and to communicate that understanding back to the auditees to confirm that it made sense.

We were constantly asking, “How do you know that the system will do what we want? How will you get the outcomes you need? What must never happen? How does the system prevent that? What must always happen? How does the system ensure that?” It’s a similar approach to the safety critical idea of always events and never events, which is particularly popular in medical safety circles.

We were dealing with financial systems. Our concern could be summarised as; how can we be confident in the processing integrity of the system? How do we know that the processing is complete, accurate, authorised and timely? It was almost a mantra; complete, accurate, authorised and timely.

These are all constrained by each other, and informed by the context, i.e. sufficiently accurate for business objectives given the need to provide the information within an acceptable time. We had to understand the current context. Context was everything.
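As a rough illustration of that mantra, the four questions can be written down as explicit checks over a batch of payments. This is only a sketch under stated assumptions: the `Payment` fields, the `integrity_findings` function, the tolerance and the deadline are all hypothetical names and values invented for the example, not anything taken from the systems we audited. The tolerance parameter reflects the point that accuracy only has to be sufficient for the business context.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Payment:
    ref: str                  # hypothetical payment reference
    amount_pence: int
    approver: Optional[str]   # None means nobody authorised it
    processed_at: datetime


def integrity_findings(expected, processed, deadline, tolerance_pence=0):
    """Return (ref, finding) pairs for payments that fail a control check.

    expected maps each payment reference to the amount that should be paid.
    """
    by_ref = {p.ref: p for p in processed}
    findings = []
    for ref, expected_amount in expected.items():
        payment = by_ref.get(ref)
        if payment is None:
            findings.append((ref, "incomplete: not processed"))
            continue
        if abs(payment.amount_pence - expected_amount) > tolerance_pence:
            findings.append((ref, "inaccurate: amount outside tolerance"))
        if payment.approver is None:
            findings.append((ref, "unauthorised: no approver recorded"))
        if payment.processed_at > deadline:
            findings.append((ref, "late: processed after deadline"))
    return findings


# Illustrative data only.
deadline = datetime(2018, 11, 12, 17, 0)
expected = {"A1": 10_000, "A2": 25_000, "A3": 5_000}
processed = [
    Payment("A1", 10_000, "approver-1", deadline - timedelta(hours=1)),
    Payment("A2", 25_040, None, deadline + timedelta(hours=2)),
]
findings = integrity_findings(expected, processed, deadline, tolerance_pence=50)
for ref, finding in findings:
    print(ref, finding)
```

Tightening the tolerance to zero adds an accuracy finding for A2. That is the trade-off in miniature: what counts as a finding depends on the context.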

Once we had a good idea of the processes, the outputs, the key risks and the controls that were needed, we would attack the system to see if we could force it to do what it shouldn’t, or prevent it doing what it was required to do. We would try to approach the testing with the mindset of a dishonest or irresponsible user. At that time I had never heard of exploratory testing. Training in that would have been invaluable.

We would also speak to ordinary users and watch them in action. Our interviews, observations, and our own testing told us far more about the system and how it was being used than the formal system documentation could. It also told us more than we could learn from the developers who looked after the systems. They would often be taken by surprise by what we discovered about how users were really working with their systems.

We were always asking questions to help us identify the controls that would give us the right outcomes. This is very similar to the way experts look at safety critical systems. Safety is a control problem, a question of ensuring there are mechanisms or practices in place that will keep the system and its users from straying into dangerous territory. System developers cannot know how their systems will be used as part of a complex socio-technical system. They might think they do, but users will always take the system into unknown territory.

“System as imagined” versus “system as found”

The safety critical community makes an important distinction between the system as imagined and the system as found. The imagined system is neat and tidy. It is orderly, without noise, confusion and distraction. Real people are absent, or not meaningfully represented.

A user who is working with a system for several hours a day for years on end will know all about the short cuts, hacks and vulnerabilities that are available. They make the system do things that the designers never imagined. They will understand the gaps in the system, the gaps in the designers’ understanding. Where the system falls short, users have to use their own ingenuity. These user variations are usually beneficial and help the system work as the business requires. They can also be harmful and provide opportunities for fraud, a big concern in an insurance company. Large insurers receive and pay out millions of pounds a day, with nothing tangible changing hands. They have always been vulnerable to fraud, both by employees and outsiders.

I investigated one careful user who stole over a million pounds, slice by slice, several thousand pounds a week, year after year, all without attracting any attention. He was exposed only by an anonymous tip off. It was always a real buzz working on those cases trying to work out exactly what the culprit had done and how they’d exploited the systems (note the plural – the best opportunities usually exploited loopholes between interfacing systems).

What shocked me about that particular case was that the fraudster hadn’t grabbed the money and run. He had settled in for a long term career looting systems we had thought were essentially sound and free of significant bugs. He was confident that he would never be caught. After piecing together the evidence I knew that he was right. There was nothing in the systems to stop him or to reveal what he had done, unless we happened to investigate him in detail.

Without the anonymous tip from someone he had double crossed he would certainly have got away with it. That forced me to realise that I had very little idea what was happening out in the wild, in the system as found.

The system as found is messy. People are distracted and working under pressure. What matters is the problems and the environment the people are dealing with, and the way they have to respond and adapt to make the system work in the mess.

There are three things you really shouldn’t say to IT auditors. In ascending facepalm order.

“But we thought audit would expect …”.

“But the requirements didn’t specify…”.

“But users should never do that”.

The last was the one that really riled me. Developers never know what users will do. They think they do, but they can’t know with any certainty. Developers don’t have the right mindset to think about what real users will do. Our (very unscientific and unevidenced) rule of thumb was as follows. 10% of people will never steal, regardless of the temptation. 10% will always try to steal, so systems must produce and retain the evidence to ensure they will be caught. The other 80% will be honest so long as we don’t put temptation in their way, so we have to explore the application to find the points of weakness that will tempt users.

Aside from their naivety, in auditors’ eyes, regarding fraud and deliberate abuse of the system, developers, and even business analysts, don’t understand the everyday pressures users will be under when they are working with complex socio-technical systems. Nobody knows how these systems really work. It’s nothing to be ashamed of. It’s the reality and we have to be honest about that.

One of the reasons I was attracted to working in audit and testing, and lost my enthusiasm for working in information security, was that these roles required me to think about what was really going on. How is this bafflingly complex organisation working? We can’t know for sure. It’s not a predictable, deterministic machine. All we can say confidently is that certain factors are more likely to produce good outcomes and others are more likely to give us bad outcomes.

If anyone does claim they do fully understand a complex socio-technical system then one of the following applies.

  • They’re bullshitting, which is all too common, and are happy to appear more confident than they have any right to be. Sadly it’s a good look in many organisations.
  • They’re deluded and genuinely have no idea of the true complexity.
  • They understand only part of the system – probably one of the less complex parts, and they’re ignoring the rest. In fairness, they might have made a conscious decision to focus only on the part that they can understand. However, other people might not appreciate the significance of that qualification, and no-one might spot that the self-professed expert has defined the problem in a way that is understandable but not realistic.
  • They did have a good understanding of the system once upon a time, when it was simpler, before it evolved into a complex beast.

It is widely believed that mapmakers in the Middle Ages would fill in the empty spaces with dragons. It’s not true. It’s just a myth, but it is a nice idea. It is a neat analogy because the unknown is scary and potentially dangerous. That’s been picked up by people working with safety critical systems, specifically the resilience engineering community. They use phrases like “jousting with dragons” and “facing the dragons at the borderlands”.

Safety critical experts use this metaphor of dangerous dragons for reasons I have been outlining in this series. Safety critical systems are complex socio-technical systems. Nobody can specify how these systems will behave, what people will have to do to keep them running, running safely. The users will inevitably take these systems into unknown, and therefore dangerous, territory. That has huge implications for safety critical systems. I want to look at how the safety community has responded to the problem of trying to understand why systems can have bad outcomes when they can’t even know how systems are supposed to behave. I will pick that up in later posts in this series.

In the next post I will talk about the mental models we use to try and understand failures and accidents, “part 4 – a brief history of accident models”.

The dragons of the unknown; part 2 – crucial features of complex systems

Introduction

This is the second post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This was the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).

The first post was a reflection, based on personal experience, on the corporate preference for building bureaucracy rather than dealing with complex reality, “The dragons of the unknown; part 1 – corporate bureaucracies”. This post is about the nature of complex systems and argues that complex, modern IT systems are essentially complex adaptive systems. The post discusses some features that have significant implications for testing. We have been slow to recognise the implications of these features.

Complex systems are probabilistic (stochastic) not deterministic

A deterministic system will always produce the same output, starting from a given initial state and receiving the same input. Probabilistic, or stochastic, systems are inherently unpredictable and therefore non-deterministic. Stochastic is defined by the Oxford English Dictionary as “having a random probability distribution or pattern that may be analysed statistically but may not be predicted precisely.”
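The distinction can be made concrete with a minimal sketch. Both functions below are hypothetical, and the Gaussian noise is simply an assumed stand-in for everything in a real system that we cannot observe or control.

```python
import random
import statistics


def deterministic_step(state: int, value: int) -> int:
    """Same initial state and same input always give the same output."""
    return state + value


def stochastic_step(state: int, value: int, rng: random.Random) -> float:
    """Same state and input, but the output also depends on random noise."""
    return state + value + rng.gauss(0.0, 1.0)


rng = random.Random()  # deliberately unseeded
runs = [stochastic_step(10, 5, rng) for _ in range(10_000)]

print(deterministic_step(10, 5), deterministic_step(10, 5))  # identical every time
print(runs[:3])                                              # differs between runs
print(statistics.mean(runs), statistics.stdev(runs))         # stable summary statistics
```

No single stochastic output can be predicted, but across many runs the mean and spread settle close to 15 and 1, which is the sense in which such a process can be analysed statistically.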

Traditionally, non-determinism meant a system was badly designed, inherently buggy, and untestable. Testers needed deterministic systems to do their job. It was therefore the job of designers to produce systems that were deterministic, and testers would demonstrate whether or not the systems met that benchmark. Any non-determinism meant a bug had to be removed.

Is that right or nonsense? Well, neither, or rather it depends on the context you choose. It depends what you choose to look at. You can restrict yourself to a context where determinism holds true, or you can expand your horizons. The traditional approach to determinism is correct, but only within carefully defined limits.

You can argue, quite correctly, that a computer program cannot have the properties of a true complex system. A program does what it’s coded to do: outputs can always be predicted from the inputs, provided you’re clever enough and you have enough time. For a single, simple program that is certainly true. A fearsomely complicated program might not be meaningfully deterministic, but we can respond constructively to that with careful design, and sensitivity to the needs of testing and maintenance. However, the wider we draw the context beyond individual programs, the weaker our confidence becomes that we can know what should happen.

Once you’re looking at complex socio-technical systems, i.e. systems where people interact with complex technology, then any reasonable confidence that we can predict outcomes accurately has evaporated. These are the reasons.

Even if the system is theoretically still deterministic we don’t have brains the size of a planet, so for practical purposes the system becomes non-deterministic.

The safety critical systems community likes to talk about tractable and intractable systems. They know that the complex socio-technical systems they work with are intractable, which means that they can’t even describe with confidence how they are supposed to work (a problem I will return to). Does that rule out the possibility of offering a meaningful opinion about whether they are working as intended?

That has huge implications for testing artificial intelligence, autonomous vehicles and other complex technologies. Of course testers will have to offer the best information they can, but they shouldn’t pretend they can say these systems are working “as intended” because the danger is that we are assuming some artificial and unrealistic definition of “as intended” that will fit the designers’ limited understanding of what the system will do. I will be returning to that. We don’t know what complex systems will do.

In a deeply complicated system things will change that we are unaware of. There will always be factors we don’t know about, or whose impact we can’t know about. Y2K changed the way I thought about systems. Experience had made us extremely humble and modest about what we knew, but there was a huge amount of stuff we didn’t even know we didn’t know. At the end of the lengthy, meticulous job of fixing and testing we thought we’d allowed for everything, in the high risk, date sensitive areas at least. We were amazed how many fresh problems we found when we got hold of a dedicated mainframe LPAR, effectively our own mainframe, and booted it up with future dates.

We discovered that there were vital elements (operating system utilities, old vendor tools etc) lurking in the underlying infrastructure that didn’t look like they could cause a problem but which interacted with application code in ways we could not have predicted when run with Y2K dates. The fixed systems had run satisfactorily with overrides to the system date in test environments that were built to mirror production, but they crashed when they ran on a mainframe running at future system dates. We were experts, but we hadn’t known what we didn’t know.

The behaviour of these vastly complicated systems was indistinguishable from complex, unpredictable systems. When a test passes with such a system there are strict limits to what we should say with confidence about the system.

As Michael Bolton tweeted;

“A ‘passing’ test doesn’t mean ‘no problem’. It means ‘no problem *observed*. This time. With these inputs. So far. On my machine’.”
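A toy example shows how easily this happens with dates. The function below is hypothetical and deliberately naive; it is not taken from the systems described here. It stores years as two digits, and every check run with 1990s dates passes.

```python
def policy_age_years(start_yy: int, current_yy: int) -> int:
    """Age of a policy using two-digit years (the classic Y2K assumption)."""
    return current_yy - start_yy


# Checks run at "current" dates in the 1990s all pass.
assert policy_age_years(85, 95) == 10
assert policy_age_years(85, 99) == 14

# The same code at a future system date: 2000 stored as 00.
print(policy_age_years(85, 0))  # prints -85, not the 15 the business would expect
```

The passing checks were true, but only for the inputs and dates that were tried.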

So, even if you look at the system from a narrow technical perspective, the computerised system only, the argument that a good system has to be deterministic is weak. We’ve traditionally tested systems as if they were calculators, which should always produce the same answers from the same sequence of button presses. That is a limited perspective. When you factor in humans then the ideal of determinism disintegrates.

In any case there are severe limits to what we can say about the whole system from our testing of the components. A complex system behaves differently from the aggregation of its components. It is more than the sum. That brings us to an important feature of complex systems. They are emergent. I’ll discuss this in the next section.

My point here is that the system that matters is the wider system. In the case of safety critical systems, the whole, wider system decides whether people live or die.

Instead of thinking of systems as being deterministic, we have to accept that complex socio-technical systems are stochastic. Any conclusions we reach should reflect probability rather than certainty. We cannot know what will happen, just what is likely. We have to learn about the factors that are likely to tip the balance towards good outcomes, and those that are more likely to give us bad outcomes.

I can’t stress strongly enough that lack of determinism in socio-technical systems is not a flaw, it’s an intrinsic part of the systems. We must accept that and work with it. I must also stress that I am not dismissing the idea of determinism or of trying to learn as much as possible about the behaviour of individual programs and components. If we lose sight of what is happening within these it becomes even more confusing when we try to look at a bigger picture. Likewise, I am certainly not arguing against Test Driven Development, which is a valuable approach for coding. Cling to determinism whenever you can, but accept its limits – and abandon all hope that it will be available when you have to learn about the behaviour of complex socio-technical systems.

We have to deal with whole systems as well as components, and that brings me to the next point. It’s no good thinking about breaking the system down into its components and assuming we can learn all we need to by looking at them individually. Complex systems have emergent behaviour.

Complex systems are emergent; the whole is greater than the sum of the parts

It doesn’t make sense to talk of an H2O molecule being wet. Wetness is what you get from massive quantities of them. The behaviour or the nature of the components in isolation doesn’t tell you about the behaviour or nature of the whole. However, the whole is entirely consistent with the components. The H2O molecules are subject to the Periodic Law and that remains so regardless of whether they are combined. But once they are combined they become water, which is unquestionably wet and is governed by the laws of fluid dynamics. If you look at the behaviour of free surface water in the oceans under the influence of wind then you are dealing with a stochastic process. The development of an individual wave is unpredictable, but reasonable predictions can be made about a long series of waves.

As you draw back and look at the wider picture, rather than the low level components you see that the components are combining in ways that couldn’t possibly have been predicted simply by looking at the components and trying to extrapolate.

Starlings offer another good illustration of emergence. These birds combine in huge flocks to form murmurations, amazing, constantly evolving aerial patterns that look as if a single brain is in control. The individual birds are aware of only seven others, rather than the whole murmuration. They concentrate on those neighbours and respond to their movements. Their behaviour isn’t any different from what they can do on their own. However well you understood the individual starling and its behaviour you could not possibly predict what these birds do together.


Likewise with computer systems, even if all of the components are well understood and working as intended the behaviour of the whole is different from what you’d expect from simply looking at these components. This applies especially when humans are in the loop. Not only is the whole different from the sum of the parts, the whole system will evolve and adapt unpredictably as people find out what they have to do to make the system work, as they patch it and cover for problems and as they try to make it work better. This is more than a matter of changing code to enhance the system. It is about how people work with the system.

Safety is an emergent property of complex systems. The safety critical experts know that they cannot offer a meaningful opinion just by looking at the individual components. They have to look at how the whole system works.

In complex systems success & failure are not absolutes

Success & failure are not absolutes. A system might be flawed, even broken, but still valuable to someone. There is no right, simple answer to the question “Is it working? Are the figures correct?”

Appropriate answers might be “I don’t know. It depends. What do you mean by ‘working’? What is ‘correct’? Who is it supposed to be working for?”

The insurance finance systems I used to work on were notoriously difficult to understand and manipulate. 100% accuracy was never a serious, practicable goal. As I wrote in “Fix on failure – a failure to understand failure”;

“With complex financial applications an honest and constructive answer to the question ‘is the application correct?’ would be some variant on ‘what do you mean by correct?’, or ‘I don’t know. It depends’. It might be possible to say the application is definitely not correct if it is producing obvious garbage. But the real difficulty is distinguishing between the seriously inaccurate, but plausible, and the acceptably inaccurate that is good enough to be useful. Discussion of accuracy requires understanding of critical assumptions, acceptable margins of error, confidence levels, the nature and availability of oracles, and the business context of the application.”

I once had to lead a project to deliver a new sub-system that would be integrated into the main financial decision support system. There were two parallel projects, each tackling a different line of insurance. I would then be responsible for integrating the new sub-systems into the overall system, a big job in itself.

The other project manager wanted to do his job perfectly. I wanted to do whatever was necessary to build an acceptable system in time. I succeeded. The other guy delivered late and missed the implementation window. I had to carry on with the integration without his beautiful baby.

By the time the next window came around there were no developers available to make the changes needed to bring it all up to date. The same happened next time, and then the next time, and then… and eventually it was scrapped without ever going live.

If you compared the two sub-systems in isolation there was no question that the other man’s was far better than the one I lashed together. Mine was flawed but gave the business what they needed, when they needed it. The other was slightly more accurate but far more elegant, logical, efficient and lovingly crafted. And it was utterly useless. The whole decision support system was composed of sub-systems like mine, flawed, full of minor errors, needing constant nursing, but extremely valuable to the business. If we had chased perfection we would never have been able to deliver anything useful. Even if we had ever achieved perfection it would have been fleeting as the shifting sands of the operational systems that fed it introduced new problems.

The difficult lesson we had to learn was that flaws might have been uncomfortable but they were an inescapable feature of these systems. If they were to be available when the business needed them they had to run with all these minor flaws.

Richard Cook expanded on this point in his classic, and highly influential article from 1998 “How complex systems fail”. He put it succinctly.

“Complex systems run in degraded mode.”

Cook’s arguments ring true to those who have worked with complex systems, but they haven’t been widely appreciated in the circles of senior management where budgets, plans and priorities are set.

Complex systems are impossible to specify precisely

Cook’s 1998 paper is important, and I strongly recommend it, but it wasn’t quite ground breaking. John Gall wrote a slightly whimsical and comical book that elaborated on the same themes back in 1975, “Systemantics; how systems work and especially how they fail”. Despite the jokey tone he made serious arguments about the nature of complex systems and the way that organisations deal, and fail to deal, with them. Here is a selection of his observations.

“Large systems usually operate in failure mode.”

“The behaviour of complex systems… living or non-living, is unpredictable.”

“People in systems do not do what the system says they are doing.”

“Failure to function as expected is an intrinsic feature of systems.”

John Gall wrote that fascinating and hugely entertaining book more than forty years ago. He nailed it when he discussed the problems we’d face with complex socio-technical systems. How can we say the system is working properly if we know neither how it is working nor how it is supposed to work? Or what the people are doing within the system?

The complex systems we have to deal with are usually socio-technical systems. They operate in a social setting, with humans. People make the systems work and they have to make decisions under pressure in order to keep the system running. Different people will do different things. Even the same person might act differently at different times. That makes the outcomes from such a system inherently unpredictable. How can we specify such a system? What does it even mean to talk of specifying an unpredictable system?

That’s something that the safety critical experts focus on. People die because software can trip up humans even when it is working smoothly as designed. This has received a lot of attention in medical circles. I’ll come back to that in a later post.

That is the reality of complex socio-technical systems. These systems are impossible to specify with complete accuracy or confidence, and certainly not at the start of any development. Again, this is not a bug, but an inescapable feature of complex socio-technical systems. Any failure may well be in our expectations, a flaw in our assumptions and knowledge, and not necessarily the system. If we are to work responsibly with complex IT systems we have to recognise that they are complex adaptive systems; the whole is different from the sum of the parts, they have emergent behaviour, and their behaviour is not predictable.

This reflected my experience with the insurance finance systems, especially for Y2K, and it was also something I had to think seriously about when I was an IT auditor. I will turn to that in my next post, “part 3 – I don’t know what’s going on”.