This is the seventh post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This was the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).
The first post was a reflection, based on personal experience, on the corporate preference for building bureaucracy rather than dealing with complex reality, “facing the dragons part 1 – corporate bureaucracies”. Part 2 was about the nature of complex systems. The third followed on from part 2, and talked about the impossibility of knowing exactly how complex socio-technical systems will behave with the result that it is impossible to specify them precisely, “I don’t know what’s going on”.
Part 4 “a brief history of accident models”, looked at accident models, i.e. the way that safety experts mentally frame accidents when they try to work out what caused them.
The fifth post, “accident investigations and treating people fairly”, looked at weaknesses in the way that we have traditionally investigated accidents and failures, assuming neat linearity with clear cause and effect. In particular, our use of root cause analysis, and willingness to blame people for accidents is hard to justify.
Part six “Safety II, a new way of looking at safety” looks at the response of the safety critical community to such problems and the necessary trade offs that a practical response requires. The result, Safety II, is intriguing and has important lessons for software testers.
This post is about the importance of system resilience and the vital role that people play in keeping systems going.
Robustness versus resilience
The idea of resilience is where Safety II and Cynefin come together in a practical way for software development. Safety critical professionals have become closely involved in the field of resilience engineering. Dave Snowden, Cynefin’s creator, places great emphasis on the need for systems in complex environments to be resilient.
First, I’d like to make an important distinction, between robustness and resilience. The example Snowden uses is that a seawall is robust but a salt marsh is resilient. A seawall is a barrier to large waves and storms. It protects the land behind, but if it fails it does so catastrophically. A salt marsh protects inland areas by acting as a buffer, absorbing storm waves rather than repelling them. It might deteriorate over time but it won’t fail suddenly and disastrously.
Designing for robustness entails trying to prevent failure. Designing for resilience recognises that failure is inevitable in some form but tries to make that failure containable. Resilience means that recovery should be swift and relatively easy when it does occur, and crucially, it means that failure can be detected quickly, or even in advance so that operators have a chance to respond.
What struck me about the resilience engineering approach is that it matches the way that we managed the highly complex insurance financial applications I mentioned in “part 2 – crucial features of complex systems”. We had never heard of resilience engineering, but the site standards were of limited use. We had to feel our way, finding an appropriate response as the rapid pace of development created terrifying new complexity on top of a raft of ancient legacy applications.
The need for efficient processing of the massive batch runs had to be balanced against the need to detect the constant flow of small failures early, to stop them turning into major problems, and also against the pressing need to facilitate recovery when we inevitably hit serious failure. We also had to think about what “failure” really meant in a context where 100% (or even 98%) accuracy was an unrealistic dream that would distract us from providing flawed but valuable systems to our users within the timescales that were dictated by commercial pressures.
An increasing challenge for testers will be to look for information about how systems fail, and test for resilience rather than robustness. Liz Keogh, in this talk on “Safe-to-Fail” makes a similar point.
“Testers are really, really good at spotting failure scenarios… they are awesomely imaginative at calamity… Devs are problem solvers. They spot patterns. Testers spot holes in patterns… I have a theory that other people who are in critical positions, like compliance and governance people are also really good at this.”
Developing for resilence means that tolerance for failure becomes more important than a vain attempt to prevent failure altogether. This tolerance often requires greater redundancy. Stripping out redundancy and maximizing the efficiency of systems has a downside. Greater efficiency can make applications brittle and inflexible. When problems hit they hit hard and recovery can be difficult.
However, redundancy itself adds to the complexity of systems and can create unexpected ways for them to fail. In our massively complex insurance finance systems a constant threat was that the safeguards we introduced to make the systems resilient might result in the processing runs failing to complete in time and disrupting other essential applications.
The ETTO principle (see part 6 , “Safety II – learning from what goes right”) describes the dilemmas we were constantly having to deal with. But the problems we faced were more complex than a simple trade off, sacrificing efficiency would not necessarily lead to greater effectiveness. Poorly thought out safeguards could harm both efficiency and effectiveness.
We had to nurse those systems carefully. That is a crucial idea to understand. Complex systems require constant attention by skilled people and these people are an indispensable means of controlling the systems.
Ashby’s Law of Requisite Variety
Ashby’s Law of Requisite Variety is also known as The First Law of Cybernetics.
“The complexity of a control system must be equal to or greater than the complexity of the system it controls.”
A stable system needs as much variety in the control mechanisms as there is in the system itself. This does not mean as much variety as the external reality that the system is attempting to manage – a thermostat is just on or off, it isn’t directly controlling the temperature, just whether the heating is on or off.
The implication for complex socio-technical systems is that the controlling mechanism must include humans if it is to be stable precisely because the system includes humans. The control mechanism has to be as complex and sophisticated as the system itself. It’s one of those “laws” that looks trivially obvious when it is unpacked, but whose implications can easily be missed unless we turn our minds to the problem and its implications. We must therefore trust expertise, trust the expert operators, and learn what they have to do to keep the system running.
I like the analogy of an orchestra’s conductor. It’s a flawed analogy (all models are flawed, though some are useful). The point is that you need a flexible, experienced human to make sense of the complexity and constantly adjust the system to keep it working and useful.
Really know the users
I have learned that it is crucially important to build a deep understanding of the user representatives and the world they work in. This is often not possible, but when I have been able to do it the effort has always paid off. If you can find good contacts in the user community you can learn a huge amount from them. Respect deep expertise and try to acquire it yourself if possible.
When I moved into the world of insurance finance systems I had very bright, enthusiastic, young (but experienced) users who took the time to immerse me in their world. I was responsible for the development, not just the testing. The users wanted me to understand them, their motivation, the pressures on them, where they wanted to get to, the risks they worried about, what kept them awake at night. It wasn’t about record-keeping. It was all about understanding risks and exposures. They wanted to set prices accurately, to compete aggressively using cheap prices for good risks and high prices for the poor risks.
That much was obvious, but I hadn’t understood the deep technical problems and complexities of unpacking the risk factors and the associated profits and losses. Understanding those problems and the concerns of my users was essential to delivering something valuable. The time spent learning from them allowed me to understand not only why imperfection was acceptable and chasing perfection was dangerous, but also what sort of imperfection was acceptable.
Building good, lasting relations with my users was perhaps the best thing I ever did for my employers and it paid huge dividends over the next few years.
We shouldn’t be thinking only about the deep domain experts though. It’s also vital to look at what happens at the sharp end with operational users, perhaps lowly and stressed, carrying out the daily routine. If we don’t understand these users, the pressures and distractions they face, and how they have to respond then we don’t understand the system that matters, the wider complex, socio-technical system.
Testers should be trying to learn more from experts working on human factors and ergonomics and user experience. I’ll finish this section with just a couple of examples of the level of thought and detail that such experts put into the design of aeroplane cockpits.
Boeing is extremely concerned about the danger of overloading cockpit crew with so much information that they pay insufficient attention to the most urgent warnings. The designers therefore only use the colour red in cockpits when the pilot has to take urgent action to keep the plane safe. Red appears only for events like engine fires and worse. Less urgent alerts use other colours and are less dramatic. Pilots know that if they ever see a red light or red text then they have to act. [The original version of this was written before the Boeing 737 MAX crashes. These have raised concerns about Boeing’s practices that I will return to later.]
A second and less obvious example of the level of detailed thought that goes into flight deck designs is that analog speed dials are widely considered safer than digital displays. Pilots can glance at the dial and see that the airspeed is in the right zone given all the other factors (e.g. height, weight and distance to landing) at the same time as they are processing a blizzard of other information.
A digital display isn’t as valuable. (See Edwin Hutchins’ “How a cockpit remembers its speeds“, Cognitive Science 19, 1995.) It might offer more precise information, but it is less useful to pilots when they really need to know about the aircraft’s speed during a landing, a time when they have to deal with many other demands. In a highly complex environment it is more important to be useful than accurate. Safe is more important than precise.
The speed dial that I have used as an illustration is also a good example both of beneficial user variations and of the perils of piling in extra features. The tabs surrounding the dial are known as speed bugs. Originally pilots improvised with grease pencils or tape to mark the higher and lower limits of the airspeed that would be safe for landing that flight. Designers picked up on that and added movable plastic tabs. Unfortunately, they went too far and added tabs for every eventuality, thus bringing visual clutter into what had been a simple solution. (See Donald Norman’s “Turn signals are the facial expressions of automobiles“, chapter 16, “Coffee cups in the cockpit”, Basic Books, 1993.)
We need a better understanding of what will help people make the system work, and what is likely to trip them up. That entails respect for the users and their expertise. We must not only trust them we must never lose our own humility about what we can realistically know.
As Jens Rasmussen put it (in a much quoted talk at the IEEE Standards Workshop on Human Factors and Nuclear Safety in 1981 – I have not been able to track this down).
“The operator’s role is to make up for holes in designers’ work.”
Testers should be ready to explore and try to explain these holes, the gap between the designers’ limited knowledge and the reality that the users will deal with. We have to try to think about what the system as found will be like. We must not restrict ourselves to the system as imagined.
Lessons from resilience engineering
There is a huge amount to learn from resilience engineering. This community has a significant overlap with the safety critical community. The resilience engineering literature is vast and growing. However, for a quick flavour of what might be useful for testers it’s worth looking at the four principles of Erik Hollnagel’s Functional Resonance Analysis Method (FRAM). FRAM tries to provide a way to model complex socio-technical systems so that we can gain a better understanding of likely outcomes.
- Success and failure are equivalent. They can happen in similar ways.
It is dangerously misleading to assume that the system is bimodal, that it is either working or broken. Any factor that is present in a failure can equally be present in success.
- Success, failure and normal outcomes are all emergent qualities of the whole system.
We cannot learn about what will happen in a complex system by observing only the individual components.
- People must constantly make small adjustments to keep the system running.
These changes are both essential for the successful operation of the system, but also a contributory cause of failure. Changes are usually approximate adjustments, based on experience, rather than precise, calculated changes. An intrinsic feature of complex systems is that small changes can have a dramatic effect on the overall system. A change to one variable or function will always affect others.
- “Functional resonance” is the detectable result of unexpected interaction of normal variations.
Functional resonance is a particularly interesting concept. Resonance is the engineering term for the effect we get when different things vibrate with the same frequency. If an object is struck or moved suddenly it will vibrate at its natural frequency. If the object producing the force is also vibrating at the same frequency the result is resonance, and the effect of the impact can be amplified dramatically.
Resonance is the effect you see if you push a child on a swing. If your pushes match the motion of the swing you quickly amplify the motion. If your timing is wrong you dampen the swing’s motion. Resonance can produce unpredictable results. A famous example is the danger that marching troops can bring a bridge down if the rhythm of their marching coincides with the natural frequency at which the bridge vibrates.
Learning about functional resonance means learning about the way that different variables combine to amplify or dampen the effect that each has, producing outcomes that would have been entirely unpredictable from looking at their behaviour individually.
Small changes can lead to drastically different outcomes at different times depending on what else is happening. The different variables in the system will be coupled in potentially significant ways the designers did not understand. These variables can reinforce, or play off each other, unpredictably.
Safety is a control problem – a matter of controlling these interactions, which means we have to understand them first. But, as we have seen, the answer can’t be to keep adding controls to try and achieve greater safety. Safety is not only a control problem, it is also an emergent and therefore unpredictable property (see appendix). That’s not a comfortable combination for the safety critical community.
Although it is impossible to predict emergent behaviour in a complex system it is possible to learn about the sort of impact that changes and user actions might have. FRAM is not a model for testers. However, it does provide a useful illustration of the approach being taken by safety experts who are desperate to learn and gain a better understanding of how systems might work.
Good testers are surely well placed to reach out and offer their skills and experience. It is, after all, the job of testers to learn about systems and tell a “compelling story” (as Messrs Bach & Bolton put it) to the people who need to know. They need the feedback that we can provide, but if it is to be useful we all have to accept that it cannot be exact.
Lotfi Zadeh, a US mathematician, computer scientist and engineer introduced the idea of fuzzy logic. He made this deeply insightful observation, quoted in Daniel McNeill and Paul Freiberger’s book “Fuzzy Logic”.
“As complexity rises, precise statements lose meaning, and meaningful statements lose precision.”
Zadeh’s maxim has come to be known as the Law of Incompatibility. If we are dealing with complex socio-technical systems we can be meaningful or we can be precise. We cannot be both; they are incompatible in such a context. It might be hard to admit we can say nothing with certainty, but the truth is that meaningful statements cannot be precise. If we say “yes, we know” then we are misleading the people who are looking for guidance. To pretend otherwise is bullshitting.
In the eighth post of this series, “How we look at complex systems”, I will talk about the way we choose to look at complex systems, the mental models that we build to try and understand them, and the relevance of Devops.
Appendix – is safety an emergent property?
In this series I have repeatedly referred to safety as being an emergent property of complex adaptive systems. For beginners trying to get their heads round this subject it is an important point to take on board.
However, the nature of safety is rather more nuanced. Erik Hollnagel argues that safety is a state of the whole system, rather than one of the system’s properties. Further, we consciously work towards that state of safety, trying to manipulate the system to achieve the desired state. Therefore safety is not emergent; it is a resultant state, a deliberate result. On the other hand, a lack of safety is an emergent property because it arises from unpredictable and undesirable adaptions of the system and its users.
Other safety experts differ and regard safety as being emergent.For the purpose of this blog I will stick with the idea that it is emergent. However, it is worth bearing Hollnagel’s argument in mind. I am quite happy to think of safety being a state of a system because my training and experience lead me to think of states as being more transitory than properties, but I don’t feel sufficiently strongly to stop referring to safety as being an emergent property.