March 7, 2019 by James Christie

The dragons of the unknown; part 9 – learning to live with the unknowable

Introduction

This is the ninth and final post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This was the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).

The first post was a reflection, based on personal experience, on the corporate preference for building bureaucracy rather than dealing with complex reality, “facing the dragons part 1 – corporate bureaucracies”. Part 2 was about the nature of complex systems. The third followed on from part 2, and talked about the impossibility of knowing exactly how complex socio-technical systems will behave with the result that it is impossible to specify them precisely, “I don’t know what’s going on”.

Part 4 “a brief history of accident models”, looked at accident models, i.e. the way that safety experts mentally frame accidents when they try to work out what caused them.

The fifth post, “accident investigations and treating people fairly”, looked at weaknesses in the way that we have traditionally investigated accidents and failures, assuming neat linearity with clear cause and effect. In particular, our use of root cause analysis, and willingness to blame people for accidents is hard to justify.

Part six “Safety II, a new way of looking at safety” looks at the response of the safety critical community to such problems and the necessary trade offs that a practical response requires. The result, Safety II, is intriguing and has important lessons for software testers.

The seventh post “Resilience requires people” is about the importance of system resilience and the vital role that people play in keeping systems going.

The eighth post “How we look at complex systems” is about the way we choose to look at complex systems, the mental models that we build to try and understand them, and the relevance of Devops.

This final post will try to draw all these strands together and present some thoughts about the future of testing as we are increasingly confronted with complex systems that are beyond our ability to comprehend.

Computing will become more complex

Even if we choose to focus on the simpler problems, rather than help users understand complexity, the reality is that computing is only going to get more complex. The problems that users of complex socio-technical systems have to grapple with will inevitably get more difficult and more intractable. The choice is whether we want to remain relevant, but uncomfortable, or go for comfortable bullshit that we feel we can handle. Remember Zadeh’s Law of Incompatibility (see part 7 – resilience requires people). “As complexity increases, precise statements lose their meaning, and meaningful statements lose precision”. Quantum computing, artificial intelligence and adaptive algorithms are just three of the areas of increasing importance whose inherent complexity will make it impossible for testers to offer opinions that are both precise and meaningful.

Quantum computing, in particular, is fascinating. By its very nature it is probabilistic, not deterministic. The idea that well designed and written programs should always produce the same output from the same data is relevant only to digital computers (and even then the maxim has to be heavily qualified in practice); it never holds true at any level for quantum computers. I wrote about this in “Quantum computing; a whole new field of bewilderment”.

The final quote from that article, “perplexity is the beginning of knowledge”, applies not only to quantum computing but also to artificial intelligence and the fiendish complexity of algorithms processing big data. One of the features of quantum computing is the way that changing a single qubit, the equivalent of digital bytes, will trigger changes in other qubits. This is entanglement, but the same word is now being used to describe the incomprehensible complexity of modern digital systems. Various writers have talked about this being the Age of Entanglement, eg Samuel Arbesman, in his book “Overcomplicated: Technology at the Limits of Comprehension)”, Emmet Connolly, in an article “Design in the Age of Entanglement” and Danny Hillis, in an article “The Enlightenment is Dead, Long Live the Entanglement”.

The purist in me disapproves of recycling a precise term from quantum science to describe loosely a phenomenon in digital computing. However, it does serve a useful role. It is a harsh and necessary reminder and warning that modern systems have developed beyond our ability to understand them. They are no more comprehensible than quantum systems, and as Richard Feynman is popularly, though possibly apocryphally, supposed to have said; “If you think you understand quantum physics, you don’t understand quantum physics.”

So the choice for testers will increasingly be to decide how we respond to Zadeh’s Law. Do we offer answers that are clear, accurate, precise and largely useless to the people who lose sleep at night worrying about risks? Or do we say “I don’t know for sure, and I can’t know, but this is what I’ve learned about the dangers lurking in the unknown, and what I’ve learned about how people will try to stay clear of these dangers, and how we can help them”?

If we go for the easy options and restrict our role to problems which allow definite answers then we will become irrelevant. We might survive as process drones, holders of a “bullshit job” that fits neatly into the corporate bureaucracy but offers little of value. That will be tempting in the short to medium term. Large organisations often value protocol and compliance more highly than they value technical expertise. That’s a tough problem to deal with. We have to confront that and communicate why that worldview isn’t just dated, it’s wrong. It’s not merely a matter of fashion.

If we’re not offering anything of real value then there are two possible dangers. We will be replaced by people prepared to do poor work cheaper; if you’re doing nothing useful then there is always someone who can undercut you. Or we will be increasingly replaced by automation because we have opted to stay rooted in the territory where machines can be more effective, or at least efficient.

If we fail to deal with complexity the danger is that mainstream testing will be restricted to “easy” jobs – the dull, boring jobs. When I moved into internal audit I learned to appreciate the significance of all the important systems being inter-related. It was where systems interfaced, and when people were involved that they got interesting. The finance systems with which I worked may have been almost entirely batch based, but they performed a valuable role for the people with whom we were constantly discussing the behaviour of these systems. Anything standalone was neither important nor particularly interesting. Anything that didn’t leave smart people scratching their heads and worrying was likely to be boring. Inter-connectedness and complexity will only increase and however difficult testing becomes it won’t be boring – so long as we are doing a useful job.

If we want to work with the important, interesting systems then we have to deal with complexity and the sort of problems the safety people have been wrestling with. There will always be a need for people to learn and inform others about complex systems. The American economist Tyler Cowen in his book “Average is Over” states the challenge clearly. We will need interpreters of complex systems.

“They will become clearing houses for and evaluators of the work of others… They will hone their skills of seeking out, absorbing and evaluating information… They will be translators of the truths coming out of our network of machines… At least for a while they will be the only people who will have a clear notion of what is going on.”

I’m not comfortable with the idea of truths coming out of machines, and we should resist the idea that we can ever be entirely clear about what is going on. But the need for experts who can interpret complex systems is clear. Society will look for them. Testers should aspire to be among those valuable specialists. The complexity of these systems will be beyond the ability of any one person to comprehend, but perhaps these interpreters, in addition to deploying their own skills, will be able to act like a conductor of an orchestra, to return to the analogy I used in part seven (Resilience requires people). Conductors are talented musicians in their own right, but they call on the expertise of different specialists, blending their contribution to produce something of value to the audience. Instead of a piece of music the interpreter tester would produce a story that sheds light on the system, guiding the people who need to know.

Testers in the future will have to be confident and assertive when they try to educate others about complexity, the inexplicable and the unknowable. Too often in corporate life a lack of certainty has been seen as a weakness. We have to stand our ground and insist on our right to be heard and taken seriously when we argue that certainty cannot be available if we want to talk about the risks and problems that matter. My training and background meant I couldn’t keep quiet when I saw problems, that were being ignored because no-one knew how to deal with them. As Better Software said about me, I’m never afraid to voice my opinion. better software says I am never afraid to voice my opinion

Never be afraid to speak out, to explain why your experience and expertise make your opinions valuable, however uncomfortable these may be for others. That’s what you’re paid for, not to provide comforting answers. The metaphor of facing the dragons of the unknown is extremely important. People will have to face these dragons. Testers have a responsibility to try and shed as much light as possible on those dragons lurking in the darkness beyond what we can see and understand easily. If we concentrate only on what we can know and say with certainty it means we walk away from offering valuable, heavily qualified advice about the risks, threats & opportunities that matter to people. Our job should entail trying to help and protect people. As Jerry Weinberg said in “Secrets of Consulting”;

“No matter what they tell you, it’s always a people problem.”

February 1, 2019 by James Christie

The dragons of the unknown; part 7 – resilience requires people

Introduction

This is the seventh post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This was the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).

Part 4 “a brief history of accident models”, looked at accident models, i.e. the way that safety experts mentally frame accidents when they try to work out what caused them.

This post is about the importance of system resilience and the vital role that people play in keeping systems going.

Robustness versus resilience

The idea of resilience is where Safety II and Cynefin come together in a practical way for software development. Safety critical professionals have become closely involved in the field of resilience engineering. Dave Snowden, Cynefin’s creator, places great emphasis on the need for systems in complex environments to be resilient.

First, I’d like to make an important distinction, between robustness and resilience. The example Snowden uses is that a seawall is robust but a salt marsh is resilient. A seawall is a barrier to large waves and storms. It protects the land behind, but if it fails it does so catastrophically. A salt marsh protects inland areas by acting as a buffer, absorbing storm waves rather than repelling them. It might deteriorate over time but it won’t fail suddenly and disastrously.

Designing for robustness entails trying to prevent failure. Designing for resilience recognises that failure is inevitable in some form but tries to make that failure containable. Resilience means that recovery should be swift and relatively easy when it does occur, and crucially, it means that failure can be detected quickly, or even in advance so that operators have a chance to respond.

What struck me about the resilience engineering approach is that it matches the way that we managed the highly complex insurance financial applications I mentioned in “part 2 – crucial features of complex systems”. We had never heard of resilience engineering, but the site standards were of limited use. We had to feel our way, finding an appropriate response as the rapid pace of development created terrifying new complexity on top of a raft of ancient legacy applications.

The need for efficient processing of the massive batch runs had to be balanced against the need to detect the constant flow of small failures early, to stop them turning into major problems, and also against the pressing need to facilitate recovery when we inevitably hit serious failure. We also had to think about what “failure” really meant in a context where 100% (or even 98%) accuracy was an unrealistic dream that would distract us from providing flawed but valuable systems to our users within the timescales that were dictated by commercial pressures.

An increasing challenge for testers will be to look for information about how systems fail, and test for resilience rather than robustness. Liz Keogh, in this talk on “Safe-to-Fail” makes a similar point.

“Testers are really, really good at spotting failure scenarios… they are awesomely imaginative at calamity… Devs are problem solvers. They spot patterns. Testers spot holes in patterns… I have a theory that other people who are in critical positions, like compliance and governance people are also really good at this.”

Developing for resilence means that tolerance for failure becomes more important than a vain attempt to prevent failure altogether. This tolerance often requires greater redundancy. Stripping out redundancy and maximizing the efficiency of systems has a downside. Greater efficiency can make applications brittle and inflexible. When problems hit they hit hard and recovery can be difficult.

However, redundancy itself adds to the complexity of systems and can create unexpected ways for them to fail. In our massively complex insurance finance systems a constant threat was that the safeguards we introduced to make the systems resilient might result in the processing runs failing to complete in time and disrupting other essential applications.

The ETTO principle (see part 6 , “Safety II – learning from what goes right”) describes the dilemmas we were constantly having to deal with. But the problems we faced were more complex than a simple trade off, sacrificing efficiency would not necessarily lead to greater effectiveness. Poorly thought out safeguards could harm both efficiency and effectiveness.

We had to nurse those systems carefully. That is a crucial idea to understand. Complex systems require constant attention by skilled people and these people are an indispensable means of controlling the systems.

Ashby’s Law of Requisite Variety

Ashby’s Law of Requisite Variety is also known as The First Law of Cybernetics.

“The complexity of a control system must be equal to or greater than the complexity of the system it controls.”

A stable system needs as much variety in the control mechanisms as there is in the system itself. This does not mean as much variety as the external reality that the system is attempting to manage – a thermostat is just on or off, it isn’t directly controlling the temperature, just whether the heating is on or off.

The implication for complex socio-technical systems is that the controlling mechanism must include humans if it is to be stable precisely because the system includes humans. The control mechanism has to be as complex and sophisticated as the system itself. It’s one of those “laws” that looks trivially obvious when it is unpacked, but whose implications can easily be missed unless we turn our minds to the problem and its implications. We must therefore trust expertise, trust the expert operators, and learn what they have to do to keep the system running.

I like the analogy of an orchestra’s conductor. It’s a flawed analogy (all models are flawed, though some are useful). The point is that you need a flexible, experienced human to make sense of the complexity and constantly adjust the system to keep it working and useful.

Really know the users

I have learned that it is crucially important to build a deep understanding of the user representatives and the world they work in. This is often not possible, but when I have been able to do it the effort has always paid off. If you can find good contacts in the user community you can learn a huge amount from them. Respect deep expertise and try to acquire it yourself if possible.

When I moved into the world of insurance finance systems I had very bright, enthusiastic, young (but experienced) users who took the time to immerse me in their world. I was responsible for the development, not just the testing. The users wanted me to understand them, their motivation, the pressures on them, where they wanted to get to, the risks they worried about, what kept them awake at night. It wasn’t about record-keeping. It was all about understanding risks and exposures. They wanted to set prices accurately, to compete aggressively using cheap prices for good risks and high prices for the poor risks.

That much was obvious, but I hadn’t understood the deep technical problems and complexities of unpacking the risk factors and the associated profits and losses. Understanding those problems and the concerns of my users was essential to delivering something valuable. The time spent learning from them allowed me to understand not only why imperfection was acceptable and chasing perfection was dangerous, but also what sort of imperfection was acceptable.

Building good, lasting relations with my users was perhaps the best thing I ever did for my employers and it paid huge dividends over the next few years.

We shouldn’t be thinking only about the deep domain experts though. It’s also vital to look at what happens at the sharp end with operational users, perhaps lowly and stressed, carrying out the daily routine. If we don’t understand these users, the pressures and distractions they face, and how they have to respond then we don’t understand the system that matters, the wider complex, socio-technical system.

Testers should be trying to learn more from experts working on human factors and ergonomics and user experience. I’ll finish this section with just a couple of examples of the level of thought and detail that such experts put into the design of aeroplane cockpits.

Boeing is extremely concerned about the danger of overloading cockpit crew with so much information that they pay insufficient attention to the most urgent warnings. The designers therefore only use the colour red in cockpits when the pilot has to take urgent action to keep the plane safe. Red appears only for events like engine fires and worse. Less urgent alerts use other colours and are less dramatic. Pilots know that if they ever see a red light or red text then they have to act. [The original version of this was written before the Boeing 737 MAX crashes. These have raised concerns about Boeing’s practices that I will return to later.]

A second and less obvious example of the level of detailed thought that goes into flight deck designs is that analog speed dials are widely considered safer than digital displays. Pilots can glance at the dial and see that the airspeed is in the right zone given all the other factors (e.g. height, weight and distance to landing) at the same time as they are processing a blizzard of other information.

A digital display isn’t as valuable. (See Edwin Hutchins’ “How a cockpit remembers its speeds“, Cognitive Science 19, 1995.) It might offer more precise information, but it is less useful to pilots when they really need to know about the aircraft’s speed during a landing, a time when they have to deal with many other demands. In a highly complex environment it is more important to be useful than accurate. Safe is more important than precise.

The speed dial that I have used as an illustration is also a good example both of beneficial user variations and of the perils of piling in extra features. The tabs surrounding the dial are known as speed bugs. Originally pilots improvised with grease pencils or tape to mark the higher and lower limits of the airspeed that would be safe for landing that flight. Designers picked up on that and added movable plastic tabs. Unfortunately, they went too far and added tabs for every eventuality, thus bringing visual clutter into what had been a simple solution. (See Donald Norman’s “Turn signals are the facial expressions of automobiles“, chapter 16, “Coffee cups in the cockpit”, Basic Books, 1993.)

We need a better understanding of what will help people make the system work, and what is likely to trip them up. That entails respect for the users and their expertise. We must not only trust them we must never lose our own humility about what we can realistically know.

As Jens Rasmussen put it (in a much quoted talk at the IEEE Standards Workshop on Human Factors and Nuclear Safety in 1981 – I have not been able to track this down).

“The operator’s role is to make up for holes in designers’ work.”

Testers should be ready to explore and try to explain these holes, the gap between the designers’ limited knowledge and the reality that the users will deal with. We have to try to think about what the system as found will be like. We must not restrict ourselves to the system as imagined.

Lessons from resilience engineering

There is a huge amount to learn from resilience engineering. This community has a significant overlap with the safety critical community. The resilience engineering literature is vast and growing. However, for a quick flavour of what might be useful for testers it’s worth looking at the four principles of Erik Hollnagel’s Functional Resonance Analysis Method (FRAM). FRAM tries to provide a way to model complex socio-technical systems so that we can gain a better understanding of likely outcomes.

- Success and failure are equivalent. They can happen in similar ways.
  It is dangerously misleading to assume that the system is bimodal, that it is either working or broken. Any factor that is present in a failure can equally be present in success.

- Success, failure and normal outcomes are all emergent qualities of the whole system.
  We cannot learn about what will happen in a complex system by observing only the individual components.

- People must constantly make small adjustments to keep the system running.
  These changes are both essential for the successful operation of the system, but also a contributory cause of failure. Changes are usually approximate adjustments, based on experience, rather than precise, calculated changes. An intrinsic feature of complex systems is that small changes can have a dramatic effect on the overall system. A change to one variable or function will always affect others.

- “Functional resonance” is the detectable result of unexpected interaction of normal variations.

Functional resonance is a particularly interesting concept. Resonance is the engineering term for the effect we get when different things vibrate with the same frequency. If an object is struck or moved suddenly it will vibrate at its natural frequency. If the object producing the force is also vibrating at the same frequency the result is resonance, and the effect of the impact can be amplified dramatically.

Resonance is the effect you see if you push a child on a swing. If your pushes match the motion of the swing you quickly amplify the motion. If your timing is wrong you dampen the swing’s motion. Resonance can produce unpredictable results. A famous example is the danger that marching troops can bring a bridge down if the rhythm of their marching coincides with the natural frequency at which the bridge vibrates.

Learning about functional resonance means learning about the way that different variables combine to amplify or dampen the effect that each has, producing outcomes that would have been entirely unpredictable from looking at their behaviour individually.

Small changes can lead to drastically different outcomes at different times depending on what else is happening. The different variables in the system will be coupled in potentially significant ways the designers did not understand. These variables can reinforce, or play off each other, unpredictably.

Safety is a control problem – a matter of controlling these interactions, which means we have to understand them first. But, as we have seen, the answer can’t be to keep adding controls to try and achieve greater safety. Safety is not only a control problem, it is also an emergent and therefore unpredictable property (see appendix). That’s not a comfortable combination for the safety critical community.

Although it is impossible to predict emergent behaviour in a complex system it is possible to learn about the sort of impact that changes and user actions might have. FRAM is not a model for testers. However, it does provide a useful illustration of the approach being taken by safety experts who are desperate to learn and gain a better understanding of how systems might work.

Good testers are surely well placed to reach out and offer their skills and experience. It is, after all, the job of testers to learn about systems and tell a “compelling story” (as Messrs Bach & Bolton put it) to the people who need to know. They need the feedback that we can provide, but if it is to be useful we all have to accept that it cannot be exact.

Lotfi Zadeh, a US mathematician, computer scientist and engineer introduced the idea of fuzzy logic. He made this deeply insightful observation, quoted in Daniel McNeill and Paul Freiberger’s book “Fuzzy Logic”.

“As complexity rises, precise statements lose meaning, and meaningful statements lose precision.”

Zadeh’s maxim has come to be known as the Law of Incompatibility. If we are dealing with complex socio-technical systems we can be meaningful or we can be precise. We cannot be both; they are incompatible in such a context. It might be hard to admit we can say nothing with certainty, but the truth is that meaningful statements cannot be precise. If we say “yes, we know” then we are misleading the people who are looking for guidance. To pretend otherwise is bullshitting.

In the eighth post of this series, “How we look at complex systems”, I will talk about the way we choose to look at complex systems, the mental models that we build to try and understand them, and the relevance of Devops.

Appendix – is safety an emergent property?

In this series I have repeatedly referred to safety as being an emergent property of complex adaptive systems. For beginners trying to get their heads round this subject it is an important point to take on board.

However, the nature of safety is rather more nuanced. Erik Hollnagel argues that safety is a state of the whole system, rather than one of the system’s properties. Further, we consciously work towards that state of safety, trying to manipulate the system to achieve the desired state. Therefore safety is not emergent; it is a resultant state, a deliberate result. On the other hand, a lack of safety is an emergent property because it arises from unpredictable and undesirable adaptions of the system and its users.

Other safety experts differ and regard safety as being emergent. For the purpose of this blog I will stick with the idea that it is emergent. However, it is worth bearing Hollnagel’s argument in mind. I am quite happy to think of safety being a state of a system because my training and experience lead me to think of states as being more transitory than properties, but I don’t feel sufficiently strongly to stop referring to safety as being an emergent property.

James Christie's Blog

thoughts of a software testing consultant from Scotland

“Law of Incompatibility”

The dragons of the unknown; part 9 – learning to live with the unknowable

Introduction

Computing will become more complex

The dragons of the unknown; part 7 – resilience requires people

Introduction

Robustness versus resilience

Ashby’s Law of Requisite Variety

Really know the users

Lessons from resilience engineering

Appendix – is safety an emergent property?