January 12, 2021 by James Christie

Bugs are about more than code

Introduction

Recently I have had to think carefully about the nature of software systems, especially complex ones, and the bugs they contain. In doing so my thinking has been guided by certain beliefs I hold about complex software systems. These beliefs, or principles, are based on my practical experience but also on my studies, which, as well as teaching me much that I didn’t know, have helped me to make sense of what I have done and seen at work. Here are three vital principles I hold to be true.

Principle 1

Complex systems are not like calculators, which are predictable, deterministic instruments, i.e. they will always give the same answer from the same inputs. Complex systems are not predictable. We can only predict what they will probably do, but we cannot be certain. It is particularly important to remember this when working with complex socio-technical systems, i.e. complex systems, in the wider sense, that include humans, which are operated by people or require people to make them work. That covers most, or all, complex software systems.

Principle 2

Complex systems are more than the sum of their parts, or at least they are different. A system can be faulty even if all the individual programs,or components, are working correctly. The individual elements can combine with each other, and with the users, in unexpected and possibly damaging ways that could not have been predicted from inspecting the components separately.

Conversely, a system can be working satisfactorily even if some of the components are flawed. This inevitably means that the software code itself, however important it is, cannot be the only factor that determines the quality of the system.

Principle 3

Individual programs in a system can produce harmful outcomes even if their code was written perfectly. The outcome depends on how the different components, factors and people work together over the life of the system. Perfectly written code can cause a failure long after it has been released when there are changes to the technical, legal, or commercial environment in which the system runs.

The consequences

Bugs in complex systems are therefore inevitable. The absence of bugs in the past does not mean they are absent now, and certainly not that the system will be bug free in the future. The challenge is partly to find bugs, learn from them, and help users to learn how they can use the system safely. But testers should also try to anticipate future bugs, how they might arise, where the system is vulnerable, and learn how and whether users and operators will be able to detect problems and respond. They must then have the communication skills to explain what they have found to the people who need to know.

What we must not do is start off from the assumption that particular elements of the system are reliable and that any problems must have their roots elsewhere in the system. That mindset tends to push blame towards the unfortunate people who operate a flawed system.

Bugs and the Post Office Horizon scandal

Over the last few months I have spent a lot of time on issues raised by the Post Office Horizon scandal. For a detailed account of the scandal I strongly recommend the supplement that Private Eye has produced, written by Richard Brooks and Nick Wallis, “Justice lost in the post”.

When I have been researching this affair I have realised, time and again, how the Post Office and Fujitsu, the outsourced IT services supplier, ignored the three principles I outlined. While trawling through the judgment of Mr Justice Fraser in Bates v Post Office Ltd (No 6: Horizon Issues, i.e. the second of the two court cases brought by the Justice For Subpostmasters Alliance), which should be compulsory reading for Computer Science students, I was struck by the judge’s discussion of the nature of bugs in computer systems. You can find the full 313 page judgment here [PDF, opens in new tab].

The definition of a bug was at the heart of the second court case. The Post Office, and Fujitsu (the outsourced IT services supplier) argued that a bug is a coding error, and the word should not apply to other problems. The counsel for the claimants, i.e. the subpostmasters and subpostmistresses who had been victims of the flawed system, took a broader view; a bug is anything that means the software does not operate as users, or the corporation, expect.

After listening to both sides Fraser came down emphatically on the claimants’ side.

“26 The phrase ‘bugs, errors or defects’ is sufficiently wide to capture the many different faults or characteristics by which a computer system might not work correctly… Computer professionals will often refer simply to ‘code’, and a software bug can refer to errors within a system’s source code, but ‘software bugs’ has become more of a general term and is not restricted, in my judgment, to meaning an error or defect specifically within source code, or even code in an operating system.

Source code is not the only type of software used in a system, particularly in a complex system such as Horizon which uses numerous applications or programmes too. Reference data is part of the software of most modern systems, and this can be changed without the underlying code necessarily being changed. Indeed, that is one of the attractions of reference data. Software bug means something within a system that causes it to cause an incorrect or unexpected result. During Mr de Garr Robinson’s cross-examination of Mr Roll, he concentrated on ‘code’ very specifically and carefully [de Garr Robinson was the lawyer representing the Post Office and Roll was a witness for the claimants who gave evidence about problems with Horizon that he had seen when he worked for Fujitsu]. There is more to the criticisms levelled at Horizon by the claimants than complaints merely about bugs within the Horizon source code.

27 Bugs, errors or defects is not a phrase restricted solely to something contained in the source code, or any code. It includes, for example, data errors, data packet errors, data corruption, duplication of entries, errors in reference data and/or the operation of the system, as well as a very wide type of different problems or defects within the system. ‘Bugs, errors or defects’ is wide enough wording to include a wide range of alleged problems with the system.”

The determination of the Post Office and Fujitsu to limit the definition of bugs to source code was part of a policy of blaming users for all errors that were not obviously caused by the source code. This is clear from repeated comments from witnesses and from Mr Justice Fraser in the judgment. “User error” was the default explanation for all problems.

Phantom transactions or bugs?

This stance of blaming the users if they were confused by Horizon’s design was taken to an extreme with “phantom transactions”. These were transactions generated by the system but which were recorded as if they had been made by a user (see in particular paragraphs 209 to 214 of Fraser’s judgment).

In paragraph 212 Fraser refers to a Fujitsu problem report.

“However, the conclusion reached by Fujitsu and recorded in the PEAK was as follows:

‘Phantom transactions have not been proven in circumstances which preclude user error. In all cases where these have occurred a user error related cause can be attributed to the phenomenon.'”

This is striking. These phantom transactions had been observed by Royal Mail engineers. They were known to exist. But they were dismissed as a cause of problems unless it could be proven that user error was not responsible. If Fujitsu could imagine a scenario where user error might have been responsible for a problem they would rule out the possibility that a phantom transaction could have been the cause, even if the phantom had occurred. The PEAK (error report) would simply be closed off, whether or not the subpostmaster agreed.

This culture of blaming users rather than software was illustrated by a case of the system “working as designed” when its behaviour clearly confused and misled users. In fact the system was acting contrary to user commands. In certain circumstances if a user entered the details for a transaction, but did not commit it, the system would automatically complete the transaction with no further user intervention, which might result in a loss to the subpostmaster.

The Post Office, in a witness statement, described this as a “design quirk”. However, the Post Office’s barrister, Mr de Garr Robinson, in his cross-examination of Jason Coyne, an IT consultant hired by the subpostmasters, was able to convince Nick Wallis (one of the authors of “Justice lost in the post”) that there wasn’t a bug.

“Mr de Garr Robinson directs Mr Coyne to Angela van den Bogerd’s witness statement which notes this is a design quirk of Horizon. If a bunch of products sit in a basket for long enough on the screen Horizon will turn them into a sale automatically.

‘So this isn’t evidence of Horizon going wrong, is it?’ asks Mr de Garr Robinson. ‘It is an example of Horizon doing what it was supposed to do.’

‘It is evidence of the system doing something without the user choosing to do it.’ retorts Mr Coyne.

But that is not the point. It is not a bug in the system.”

Not a bug? I would contest that very strongly. If I were auditing a system with this “quirk” I would want to establish the reasons for the system’s behaviour. Was this feature deliberately designed into the system? Or was it an accidental by-product of the system design? Whatever the answer, it would be simply the start of a more detailed scrutiny of technical explanations, understanding of the nature of bugs, the reasons for a two-stage committal of data, and the reasons why those two stages were not always applied. I would not consider “working as designed” to be an acceptable answer.

The Post Office’s failure to grasp the nature of complex systems

A further revealing illustration of the Post Office’s attitude towards user error came in a witness statement provided for the Common Issues trial, the first of the two court cases brought by the Justice For Subpostmasters Alliance. This first trial was about the contractual relationship between the Post Office and subpostmasters. The statement came from Angela van den Bogerd. At the time she was People Services Director for the Post Office, but over the previous couple of decades she had been in senior technical positions, responsible for Horizon and its deployment. She described herself in court as “not an IT expert”. That is an interesting statement to consider alongside some of the comments in her witness statement.

“[78]… the Subpostmaster has complete control over the branch accounts and transactions only enter the branch accounts with the Subpostmaster’s (or his assistant’s) knowledge.

[92] I describe Horizon to new users as a big calculator. I think this captures the essence of the system in that it records the transactions inputted into it, and then adds or subtracts from the branch cash or stock holdings depending on whether it was a credit or debit transaction.”

“Complete control”? That confirms her admission that she is not an IT expert. I would never have been bold, or reckless, enough to claim that I was in complete control of any complex IT system for which I was responsible. The better I understood the system the less inclined I would be to make such a claim. Likening Horizon to a calculator is particularly revealing. See Principle 1 above. When I have tried to explain the nature of complex systems I have also used the calculator analogy, but as an illustration of what a complex system is not.

If a senior manager responsible for Horizon could use such a fundamentally mistaken analogy, and be prepared to insert it in a witness statement for a court case, it reveals how poorly equipped the Post Office management was to deal with the issues raised by Horizon. When we are confronted by complexity it is a natural reaction to try and construct a mental model that simplifies the problems and makes them understandable. This can be helpful. Indeed it is often essential if we are too make any sense of complexity. I have written about this here in my blog series “Dragons of the unknown”.

However, even the best models become dangerous if we lose sight of their limitations and start to think that they are exact representations of reality. They are no longer fallible aids to understanding, but become deeply deceptive.

If you think a computer system is like a calculator then you will approach problems with the wrong attitude. Calculators are completely reliable. Errors are invariably the result of users’ mistakes, “finger trouble”. That is exactly how senior Post Office managers, like Angela van den Bogerd, regarded the Horizon problems.

BugsZero

The Horizon scandal has implications for the argument that software developers can produce systems that have no bugs, that zero bugs is an attainable target. Arlo Belshee is a prominent exponent of this idea, of BugsZero as it is called. Here is a short introduction.

Before discussing anyone’s approach to bugs it is essential that we are clear what they mean by a bug. Belshee has a useful definition, which he provided in this talk in Singapore in 2016. (The conference website has a useful introduction to the talk.)

3:50 “The definition (of a bug) I use is anything that would frustrate, confuse or annoy a human and is potentially visible to a human other than the person who is currently actively writing (code).”

This definition is close to Justice Fraser’s (see above); “a bug is anything that means the software does not operate as users, or the corporation, expect”. However, I think that both definitions are limited.

BugsZero is a big topic, and I don’t have the time or expertise to do it justice, but for the purposes of this blog I’m happy to concede that it is possible for good coders to deliver exactly what they intend to, so that the code itself, within a particular program, will not act in ways that will “frustrate, confuse or annoy a human”, or at least a human who can detect the problem. That is the limitation of the definition. Not all faults with complex software will be detected. Some are not even detectable. Our inability to see them does not mean they are absent. Bugs can produce incorrect but plausible answers to calculations, or they can corrupt data, without users being able to see that a problem exists.

I speak from experience here. It might even be impossible for technical system experts to identify errors with confidence. It is not always possible to know whether a complex system is accurate. The insurance finance systems I used to work on were notoriously difficult to understand and manipulate. 100% accuracy was never a serious, practicable goal. As I wrote in “Fix on failure – a failure to understand failure”;

“With complex financial applications an honest and constructive answer to the question ‘is the application correct?’ would be some variant on ‘what do you mean by correct?’, or ‘I don’t know. It depends’. It might be possible to say the application is definitely not correct if it is producing obvious garbage. But the real difficulty is distinguishing between the seriously inaccurate, but plausible, and the acceptably inaccurate that is good enough to be useful. Discussion of accuracy requires understanding of critical assumptions, acceptable margins of error, confidence levels, the nature and availability of oracles, and the business context of the application.”

It is therefore misleading to define bugs as being potentially visible to users. Nevertheless, Belshee’s definition is useful provided that that qualification is accepted. However, in the same talk, Belshee goes on to make further statements I do not accept.

19:55 “A bug is an encoded developer mistake.”

28:50 “A bug is a mistake by a developer.”

This is a developer-centric view of systems. It is understandable if developers focus on the bugs for which they are responsible. However, if you look at the systems, and bugs, from the other end, from the perspective of users when a bug has resulted in frustration, confusion or annoyance, the responsibility for the problem is irrelevant. The disappointed human is uninterested in whether the problem is with the coding, the design, the interaction of programs or components, or whatever. All that matters is that the system is buggy.

There is a further complication. The coder may well have delivered code that was perfect when it was written and released. But perfect code can create later problems if the world in which it operates changes. See Principle 3 above. This aspect of software is not sufficiently appreciated; it has caused me a great deal of trouble in my career (see the section “Across time, not just at a point in time” in this blog, about working with Big Data).

Belshee does say that developers should take responsibility for bugs that manifest themselves elswhere, even if their code was written correctly. He also makes it clear, when talking about fault tolerant systems (17:22 in the talk above), that faults can arise “when the behaviour of the world is not as we expect”.

However he also says that the system “needs to work exactly as the developer thought if it’s going to recover”. That’s too high a bar for complex socio-technical systems. The most anyone can say, and it’s an ambitious target, is that the programs have been developed exactly as the developers intended. Belshee is correct at the program level; if the programs were not built as the developers intended then recovery will be very much harder. But at the system level we need to be clear and outspoken about the limits of what we can know, and about the inevitability that bugs are present.

If we start to raise hopes that systems might be perfect and bug-free because we believe that we can deliver perfectly written code then we are setting ourselves up for unpleasant recriminations when our users suffer from bugs. It is certainly laudable to eradicate sloppy and cavalier coding and it might be noble for developers to be willing to assume responsibility for all bugs. But it could leave them exposed to legal recriminations if the wider world believes that software developers can and should` ensure systems are free of bugs. This is where lawyers might become involved and that is why I’m unhappy about the name BugsZero, and the undeliverable promise that it implies.

Unnoticed and undetectable bugs in the Horizon case

The reality that a bug might be real and damaging but not detected by users, or even detectable by them, was discussed in the Horizon case.

“[972] Did the Horizon IT system itself alert Subpostmasters of such bugs, errors or defects… and if so how?

[973] Answer: Although the experts were agreed that the extent to which any IT system can automatically alert its users to bugs within the system itself is necessarily limited, and although Horizon has automated checks which would detect certain bugs, they were also agreed that there are types of bugs which would not be detected by such checks. Indeed, the evidence showed that some bugs lay undiscovered in the Horizon system for years. This issue is very easy, therefore, to answer. The correct answer is very short. The answer… is ‘No, the Horizon system did not alert SPMs’. The second part of the issue does not therefore arise.”

That is a significant extract from an important judgment. A senior judge directly addressed the question of system reliability and pronounced that he is satisfied that a complex system cannot be expected to have adequate controls to warn users of all errors.

This is more than an abstract, philosophical debate about proof, evidence and what we can know. In England there is a legal presumption that computer evidence is reliable. This made a significant contribution to the Horizon scandal. Both parties in a court case are obliged to disclose documents which might either support or undermine their case, so that the other side has a chance to inspect and challenge them. The Post Office and Fujitsu did not disclose anything that would have cast doubt on their computer evidence. That failure to disclose meant it was impossible for the subpostmasters being prosecuted to challenge the presumption that the evidence was reliable. The subpostmasters didn’t know about the relevant system problems, and they didn’t even know that that knowledge had been withheld from them.

Replacing the presumption of computer reliability

There are two broad approaches that can be taken in response to the presumption that computer evidence is reliable and the ease with which it can be abused, apart of course from ignoring the problem and accepting that injustice is a price worth paying for judicial convenience. England can scrap the presumption, which would require the party seeking to use the evidence to justify its reliability. Or the rules over disclosure can be strengthened to try and ensure that all relevant information about systems is revealed. Some blend of the two approaches seems most likely.

I have recently contributed to a paper entitled “Recommendations for the probity of computer evidence”. It has been submitted to the Ministry of Justice, which is responsible for the courts in England & Wales, and is available from the Digital Evidence and Electronic Signature Law Review.

The paper argues that the presumption of computer reliability should be replaced by a two stage approach when reliability is challenged. The provider of the data should first be required to provide evidence to demonstrate that they are in control of their systems, that they record and track all bugs, fixes, changes and releases, and that they have implemented appropriate security standards and processes.

If the party wishing to rely on the computer evidence cannot provide a satisfactory response in this first stage then the presumption of reliability should be reversed. The second stage would require them to prove that none of the failings revealed in the first stage might affect the reliability of the computer evidence.

Whatever approach is taken, IT professionals would have to offer an opinion on their systems. How reliable are the systems? What relevant evidence might there be that systems are reliable, or unreliable? Can they demonstrate that they are in control of their systems? Can they reassure senior managers who will have to put their name to a legal document and will be very keen to avoid the humiliation that has been heaped upon Post Office and Fujitsu executives, with the possibility of worse to come?

A challenging future

The extent to which we can rely on computers poses uncomfortable challenges for the English law now, but it will be an increasingly difficult problem for the IT world over the coming years. What can we reasonably say about the systems we work with? How reliable are they? What do we know about the bugs? Are we sufficiently clear about the nature of our systems to brief managers who will have to appear in court, or certify legal documents?

It will be essential that developers and testers are clear in their own minds, and in their communications, about what bugs are. They are not just coding errors, and we must try to ensure people outside IT understand that. Testers must also be able to communicate clearly what they have learned about systems, and they must never say or do anything that suggests systems will be free of bugs.

Testers will have to think carefully about the quality of their evidence, not just about the quality of their systems. How good is our evidence? How far can go in making confident statements of certainty? What do we still not know, and what is the significance of that missing knowledge? Much of this will be a question of good management. But organisations will need good testers, very good testers, who can explain what we know, and what we don’t know, about complex systems; testers who have the breadth of knowledge, the insight, and the communication skills to tell a convincing story to those who require the unvarnished truth.

We will need confident, articulate testers who can explain that a lack of certainty about how complex systems will behave is an honest, clear sighted, statement of truth. It is not an admission of weakness or incompetence. Too many people in IT have built good careers on bullshitting, on pretending they are more confident and certain than they have any right to be. Systems will inevitably become more complex. IT people will increasingly be dragged into litigation, and as the Post Office and Fujitsu executives have found, misplaced and misleading confidence and bluster in court have excruciating personal consequences. Unpleasant though these consequences are, they hardly compare with the tragedies endured by the subpostmasters and subpostmistresses, whose lives were ruined by corporations who insisted that their complex software was reliable.

The future might be difficult, even stressful, for software testers, but they will have a valuable, essential role to play in helping organisations and users to gain a better understanding of the fallible nature of software. To say the future will be interesting is an understatement; it will present exciting challenges and there should be great opportunities for the best testers.

June 5, 2020 by James Christie

Teachers, children, testers and leaders (2013)

This article appeared in the March 2013 edition of Testing Planet, which is published by the wonderful Ministry of Testing, one of the most exciting developments in software testing over the last 20 years.

I’m moving this article onto my blog from my website, which will shortly be decommissioned. The article was written in January 2013. Looking at it again I see that I was starting to develop arguments I fleshed out over the next couple of years as part of the Stop 29119 campaign against the testing standard, ISO 29119.

The article

“A tester is someone who knows things can be different” – Gerald Weinberg.

Leaders aren’t necessarily people who do things, or order other people about. To me the important thing about leaders is that they enable other people to do better, whether by inspiration, by example or just by telling them how things can be different – and better. The difference between a leader and a manager is like the difference between a great teacher and, well, the driver of the school bus. Both take children places, but a teacher can take children on a journey that will transform their whole life.

My first year or so in working life after I left university was spent in a fog of confusion. I struggled to make sense of the way companies worked; I must be more stupid than I’d always thought. All these people were charging around, briskly getting stuff done, making money and keeping the world turning; they understood what they were doing and what was going on. They must be smarter than me.

Gradually it dawned on me that very many of them hadn’t a clue. They were no wiser than me. They didn’t really know what was going on either. They thought they did. They had their heads down, working hard, convinced they were contributing to company profits, or at least keeping the losses down.

The trouble was their efforts often didn’t have much to do with the objectives of the organisation, or the true goals of the users and the project in the case of IT. Being busy was confused with being useful. Few people were capable of sitting back, looking at what was going on and seeing what was valuable as opposed to mere work creation.

I saw endless cases of poor work, sloppy service and misplaced focus. I became convinced that we were all working hard doing unnecessary, and even harmful, things for users who quite rightly were distinctly ungrateful. It wasn’t a case of the end justifying the means; it was almost the reverse. The means were only loosely connected to the ends, and we were focussing obsessively on the means without realising that our efforts were doing little to help us achieve our ends.

Formal processes didn’t provide a clear route to our goal. Following the process had become the goal itself. I’m not arguing against processes; just the attitude we often bring to them, confusing the process with the destination, the map with the territory. The quote from Gerald Weinberg absolutely nails the right attitude for testers to bring to their work. There are twin meanings. Testers should know there is a difference between what people expect, or assume, and what really is. They should also know that there is a difference between what is, and what could be.

Testers usually focus on the first sort of difference; seeing the product for what it really is and comparing that to what the users and developers expected. However, the second sort of difference should follow on naturally. What could the product be? What could we be doing better?

Testers have to tell a story, to communicate not just the reality to the stakeholders, but also a glimpse of what could be. Organisations need people who can bring clear headed thinking to confusion, standing up and pointing out that something is wrong, that people are charging around doing the wrong things, that things could be better. Good testers are well suited by instinct to seeing what positive changes are possible. Communicating these possibilities, dispelling the fog, shining a light on things that others would prefer to remain in darkness; these are all things that testers can and should do. And that too is a form of leadership, every bit as much as standing up in front of the troops and giving a rousing speech.

In Hans Christian’s Andersen’s story, the Emperor’s New Clothes, who showed a glimpse of leadership? Not the emperor, not his courtiers; it was the young boy who called out the truth, that the Emperor was wearing no clothes at all. If testers are not prepared to tell it like it is, to explain why things are different from what others are pretending, to explain how they could be better then we diminish and demean our profession. Leaders do not have to be all-powerful figures. They can be anyone who makes a difference; teachers, children. Or even testers.

June 5, 2020 by James Christie

Quality isn’t something, it provides something (2012)

This article appeared in the July 2012 edition of Testing Planet, which is published by the wonderful Ministry of Testing, one of the most exciting developments in software testing over the last 20 years.

The article was written in June 2012, but I don’t think it has dated. It’s about the way we think and work with other people. These are timeless problems. The idea behind E-prime is particularly interesting. Dispensing with the verb “to be” isn’t something to get obsessive or ideological about, but testers should be aware of the important distinction between the way something is and the way it behaves. The original article had only four references so I have checked them, converted them to hyperlinks, and changing the link to Lera Boroditsky’s paper to a link to her TED talk on the same subject.

The article

A few weeks ago two colleagues, who were having difficulty working together, asked me to act as peacekeeper in a tricky looking meeting in which they were going to try and sort out their working relationship. I’ll call them Tony and Paul. For various reasons they were sparking off each and creating antagonism that was damaging the whole team.

An hour’s discussion seemed to go reasonably well; Tony talking loudly and passionately, while Paul spoke calmly and softly. Just as I thought we’d reached an accommodation that would allow us all to work together Tony blurted out, “you are cold and calculating, Paul, that’s the problem”.

Paul reacted as if he’d been slapped in the face, made his excuses and left the meeting. I then spent another 20 minutes talking Tony through what had happened, before separately speaking to Paul about how we should respond.

I told Tony that if he’d wanted to make the point I’d inferred from his comments, and from the whole meeting, then he should have said “your behaviour and attitude towards me throughout this meeting, and when we work together, strike me as cold and calculating, and that makes me very uncomfortable”.

“But I meant that!”, Tony replied. Sadly, he hadn’t said that. Paul had heard the actual words and reacted to them, rather than applying the more dispassionate analysis I had used as an observer. Paul meanwhile found Tony’s exuberant volatility disconcerting, and responded to him in a very studied and measured style that unsettled Tony.

Tony committed two sins. Firstly, he didn’t acknowledge the two way nature of the problem. It should have been about how he reacted to Paul, rather than trying to dump all the responsibility onto Paul.

Secondly, he said that Paul is cold and calculating, rather than acting in a way Tony found cold, and calculating at a certain time, in certain circumstances.

I think we’d all see a huge difference between being “something”, and behaving in a “something” way at a certain time, in a certain situation. The verb “to be” gives us this problem. It can mean, and suggest, many different things and can create fog where we need clarity.

Some languages, such as Spanish, maintain a useful distinction between different forms of “to be” depending on whether one is talking about something’s identity or just a temporary attribute or state.

The way we think obviously shapes the language we speak, but increasingly scientists are becoming aware of how the language we use shapes the way that we think. [See this 2017 TED talk, “How Language Shapes Thought”, by Lera Boroditsky]

The problem we have with “to be” has great relevance to testers. I don’t just mean treating people properly, however much importance we rightly attach to working successfully with others. More than that, if we shy away from “to be” then it helps us think more carefully and constructively as testers.

This topic has stretched bigger brains than mine, in the fields of philosophy, psychology and linguistics. Just google “general semantics” if you want to give your brain a brisk workout. You might find it tough stuff, but I don’t think you have to master the underlying concept to benefit from its lessons.

Don’t think of it as intellectual navel gazing. All this deep thought has produced some fascinating results, in particular something called E-prime, a form of English that totally dispenses with “to be” in all its forms; no “I am”, “it is”, or “you are”. Users of E-prime don’t simply replace the verb with an alternative. That doesn’t work. It forces you to think and articulate more clearly what you want to say. [See this classic paper by Kellogg, “Speaking in E-prime” PDF, opens in new tab].

“The banana is yellow” becomes “the banana looks yellow”, which starts to change the meaning. “Banana” and “yellow” are not synonyms. The banana’s yellowness becomes apparent only because I am looking at it, and once we introduce the observer we can acknowledge that the banana appears yellow to us now. Tomorrow the banana might appear brown to me as it ripens. Last week it would have looked green.

You probably wouldn’t disagree with any of that, but you might regard it as a bit abstract and pointless. However, shunning “to be” helps us to think more clearly about the products we test, and the information that we report. E-prime therefore has great practical benefits.

The classic definition of software quality came from Gerald Weinburg in his book “Quality Software Management: Systems Thinking”.

“Quality is value to some person”.

Weinburg’s definition reflects some of the clarity of thought that E-prime requires, though he has watered it down somewhat to produce a snappy aphorism. The definition needs to go further, and “is” has to go!

Weinburg makes the crucial point that we must not regard quality as some intrinsic, absolute attribute. It arises from the value it provides to some person. Once you start thinking along those lines you naturally move on to realising that quality provides value to some person, at some moment in time, in a certain context.

Thinking and communicating in E-prime stops us making sweeping, absolute statements. We can’t say “this feature is confusing”. We have to use a more valuable construction such as “this feature confused me”. But we’re just starting. Once we drop the final, total condemnation of saying the feature is confusing, and admit our own involvement, it becomes more natural to think about and explain the reasons. “This feature confused me … when I did … because of …”.

Making the observer, the time and the context explicit help us by limiting or exposing hidden assumptions. We might or might not find these assumptions valid, but we need to test them, and we need to know about them so we understand what we are really learning as we test the product.

E-prime fits neatly with the scientific method and with the provisional and experimental nature of good testing. Results aren’t true or false. The evidence we gather matches our hypothesis, and therefore gives us greater confidence in our knowledge of the product, or it fails to match up and makes us reconsider what we thought we knew. [See this classic paper by Kellogg & Bourland, “Working with E-prime – some practical notes” PDF, opens in new tab].

Scientific method cannot be accommodated in traditional script-driven testing, which reflects a linear, binary, illusory worldview, pretending to be absolute. It tries to deal in right and wrong, pass and fail, true and false. Such an approach fits in neatly with traditional development techniques which fetishise the rigours of project management, rather than the rigours of the scientific method.

This takes us back to general semantics, which coined the well known maxim that the map is not the territory. Reality and our attempts to model and describe it differ fundamentally from each other. We must not confuse them. Traditional techniques fail largely because they confuse the map with the territory. [See this “Less Wrong” blog post].

In attempting to navigate their way through a complex landscape, exponents of traditional techniques seek the comfort of a map that turns messy, confusing reality into something they can understand and that offers the illusion of being manageable. However, they are managing the process, not the underlying real work. The plan is not the work. The requirements specification is not the requirements. The map is not the territory.

Adopting E-prime in our thinking and communication will probably just make us look like the pedantic awkward squad on a traditional project. But on agile or lean developments E-prime comes into its own. Testers must contribute constructively, constantly, and above all, early. E-prime helps us in all of this. It makes us clarify our thoughts and helps us understand that we gain knowledge provisionally, incrementally and never with absolute certainty.

I was not consciously deploying E-prime during and after the fractious meeting I described earlier. But I had absorbed the precepts sufficiently to instinctively realise that I had two problems; Tony’s response to Paul’s behaviour, and Paul’s response to Tony’s outburst. I really didn’t see it as a matter of “uh oh – Tony is stupid”.

E-prime purists will look askance at my failure to eliminate all forms of “to be” in this article. I checked my writing to ensure that I’ve written what I meant to, and said only what I can justify. Question your use of the verb, and weed out those hidden assumptions and sweeping, absolute statements that close down thought, rather than opening it up. Don’t think you have to be obsessive about it. As far as I am concerned, that would be silly!

June 3, 2020 by James Christie

Business logic security testing (2009)

This article appeared in the June 2009 edition of Testing Experience magazine and the October 2009 edition of Security Acts magazine.

If you choose to read the article please bear in mind that it was written in January 2009 and is therefore inevitably dated in some respects. In particular, ISACA has restructured COBIT, but it remains a useful source. Overall I think the arguments I made in this article are still valid.

The references in the article were all structured for a paper magazine. They were not set up as hyperlinks and I have not tried to recreate them and check out whether they still work.

The article

When I started in IT in the 80s the company for which I worked had a closed network restricted to about 100 company locations with no external connections.

Security was divided neatly into physical security, concerned with the protection of the physical assets, and logical security, concerned with the protection of data and applications from abuse or loss.

When applications were built the focus of security was on internal application security. The arrangements for physical security were a given, and didn’t affect individual applications.

There were no outsiders to worry about who might gain access, and so long as the common access controls software was working there was no need for analysts or designers to worry about unauthorized internal access.

Security for the developers was therefore a matter of ensuring that the application reflected the rules of the business; rules such as segregation of responsibilities, appropriate authorization levels, dual authorization of high value payments, reconciliation of financial data.

The world quickly changed and relatively simple, private networks isolated from the rest of the world gave way to more open networks with multiple external connections and to web applications.

Security consequently acquired much greater focus. However, it began to seem increasingly detached from the work of developers. Security management and testing became specialisms in their own right, and not just an aspect of technical management and support.

We developers and testers continued to build our applications, comforted by the thought that the technical security experts were ensuring that the network perimeter was secure.photo of business logic security article header

Nominally security testing was a part of non-functional testing. In reality, it had become somewhat detached from conventional testing.

According to the glossary of the British Computer Society’s Special Interest Group in Software Testing (BCS SIGIST) [1], security testing is determining whether the application meets the specified security requirements.

SIGIST also says that security entails the preservation of confidentiality, integrity and availability of information. Availability means ensuring that authorized users have access to information and associated assets when required. Integrity means safeguarding the accuracy and completeness of information and processing methods. Confidentiality means ensuring that information is accessible only to those authorized to have access.

Penetration testing, and testing the security of the network and infrastructure, are all obviously important, but if you look at security in the round, bearing in mind wider definitions of security (such as SIGIST’s), then these activities can’t be the whole of security testing.

Some security testing has to consist of routine functional testing that is purely a matter of how the internals of the application work. Security testing that is considered and managed as an exercise external to the development, an exercise that follows the main testing, is necessarily limited. It cannot detect defects that are within the application rather than on the boundary.

Within the application, insecure design features or insecure coding might be detected without any deep understanding of the application’s business role. However, like any class of requirements, security requirements will vary from one application to another, depending on the job the application has to do.

If there are control failures that reflect poorly applied or misunderstood business logic, or business rules, then will we as functional testers detect that? Testers test at the boundaries. Usually we think in terms of boundary values for the data, the boundary of the application or the network boundary with the outside world.

Do we pay enough attention to the boundary of what is permissible user behavior? Do we worry enough about abuse by authorized users, employees or outsiders who have passed legitimately through the network and attempt to subvert the application, using it in ways never envisaged by the developers?

I suspect that we do not, and this must be a matter for concern. A Gartner report of 2005 [2] claimed that 75% of attacks are at the application level, not the network level. The types of threats listed in the report all arise from technical vulnerabilities, such as command injection and buffer overflows.

Such application layer vulnerabilities are obviously serious, and must be addressed. However, I suspect too much attention has been given to them at the expense of vulnerabilities arising from failure to implement business logic correctly.

This is my main concern in this article. Such failures can offer great scope for abuse and fraud. Security testing has to be about both the technology and the business.

Problem of fraud and insider abuse

It is difficult to come up with reliable figures about fraud because of its very nature. According to PriceWaterhouseCoopers in 2007 [3] the average loss to fraud by companies worldwide over the two years from 2005 was $2.4 million (their survey being biased towards larger companies). This is based on reported fraud, and PWC increased the figure to $3.2 million to allow for unreported frauds.

In addition to the direct costs there were average indirect costs in the form of management time of $550,000 and substantial unquantifiable costs in terms of damage to the brand, staff morale, reduced share prices and problems with regulators.

PWC stated that 76% of their respondents reported the involvement of an outside party, implying that 24% were purely internal. However, when companies were asked for details on one or two frauds, half of the perpetrators were internal and half external.

It would be interesting to know the relative proportions of frauds (by number and value) which exploited internal applications and customer facing web applications but I have not seen any statistics for these.

The U.S. Secret Service and CERT Coordination Center have produced an interesting series of reports on “illicit cyber activity”. In their 2004 report on crimes in the US banking and finance sector [4] they reported that in 70% of the cases the insiders had exploited weaknesses in applications, processes or procedures (such as authorized overrides). 78% of the time the perpetrators were authorized users with active accounts, and in 43% of cases they were using their own account and password.

The enduring problem with fraud statistics is that many frauds are not reported, and many more are not even detected. A successful fraud may run for many years without being detected, and may never be detected. A shrewd fraudster will not steal enough money in one go to draw attention to the loss.

I worked on the investigation of an internal fraud at a UK insurance company that had lasted 8 years, as far back as we were able to analyze the data and produce evidence for the police. The perpetrator had raised 555 fraudulent payments, all for less than £5,000 and had stolen £1.1 million pounds by the time that we received an anonymous tip off.

The control weaknesses related to an abuse of the authorization process, and a failure of the application to deal appropriately with third party claims payments, which were extremely vulnerable to fraud. These weaknesses would have been present in the original manual process, but the users and developers had not taken the opportunities that a new computer application had offered to introduce more sophisticated controls.

No-one had been negligent or even careless in the design of the application and the surrounding procedures. The trouble was that the requirements had focused on the positive functions of the application, and on replicating the functionality of the previous application, which in turn had been based on the original manual process. There had not been sufficient analysis of how the application could be exploited.

Problem of requirements and negative requirements

Earlier I was careful to talk about failure to implement business logic correctly, rather than implementing requirements. Business logic and requirements will not necessarily be the same.

The requirements are usually written as “the application must do” rather than “the application must not…”. Sometimes the “must not” is obvious to the business. It “goes without saying” – that dangerous phrase!

However, the developers often lack the deep understanding of business logic that users have, and they design and code only the “must do”, not even being aware of the implicit corollary, the “must not”.

As a computer auditor I reviewed a sales application which had a control to ensure that debts couldn’t be written off without review by a manager. At the end of each day a report was run to highlight debts that had been cleared without a payment being received. Any discrepancies were highlighted for management action.

I noticed that it was possible to overwrite the default of today’s date when clearing a debt. Inserting a date in the past meant that the money I’d written off wouldn’t appear on any control report. The report for that date had been run already.

When I mentioned this to the users and the teams who built and tested the application the initial reaction was “but you’re not supposed to do that”, and then they all tried blaming each other. There was a prolonged discussion about the nature of requirements.

The developers were adamant that they’d done nothing wrong because they’d built the application exactly as specified, and the users were responsible for the requirements.

The testers said they’d tested according to the requirements, and it wasn’t their fault.

The users were infuriated at the suggestion that they should have to specify every last little thing that should be obvious – obvious to them anyway.

The reason I was looking at the application, and looking for that particular problem, was because we knew that a close commercial rival had suffered a large fraud when a customer we had in common had bribed an employee of our rival to manipulate the sales control application. As it happened there was no evidence that the same had happened to us, but clearly we were vulnerable.

Testers should be aware of missing or unspoken requirements, implicit assumptions that have to be challenged and tested. Such assumptions and requirements are a particular problem with security requirements, which is why the simple SIGIST definition of security testing I gave above isn’t sufficient – security testing cannot be only about testing the formal security requirements.

However, testers, like developers, are working to tight schedules and budgets. We’re always up against the clock. Often there is barely enough time to carry out all the positive testing that is required, never mind thinking through all the negative testing that would be required to prove that missing or unspoken negative requirements have been met.

Fraudsters, on the other hand, have almost unlimited time to get to know the application and see where the weaknesses are. Dishonest users also have the motivation to work out the weaknesses. Even people who are usually honest can be tempted when they realize that there is scope for fraud.

If we don’t have enough time to do adequate negative testing to see what weaknesses could be exploited than at least we should be doing a quick informal evaluation of the financial sensitivity of the application and alerting management, and the internal computer auditors, that there is an element of unquantifiable risk. How comfortable are they with that?

If we can persuade project managers and users that we need enough time to test properly, then what can we do?

CobiT and OWASP

If there is time, there are various techniques that testers can adopt to try and detect potential weaknesses or which we can encourage the developers and users to follow to prevent such weaknesses.

I’d like to concentrate on the CobiT (Control Objectives for Information and related Technology) guidelines for developing and testing secure applications (CobiT 4.1 2007 [5]), and the CobiT IT Assurance Guide [6], and the OWASP (Open Web Application Security Project) Testing Guide [7].

Together, CobiT and OWASP cover the whole range of security testing. They can be used together, CobiT being more concerned with what applications do, and OWASP with how applications work.

They both give useful advice about the internal application controls and functionality that developers and users can follow. They can also be used to provide testers with guidance about test conditions. If the developers and users know that the testers will be consulting these guides then they have an incentive to ensure that the requirements and build reflect this advice.

CobiT implicitly assumes a traditional, big up-front design, Waterfall approach. Nevertheless, it’s still potentially useful for Agile practitioners, and it is possible to map from CobiT to Agile techniques, see Gupta [8].

The two most relevant parts are in the CobiT IT Assurance Guide [6]. This is organized into domains, the most directly relevant being “Acquire and Implement” the solution. This is really for auditors, guiding them through a traditional development, explaining the controls and checks they should be looking for at each stage.

It’s interesting as a source of ideas, and as an alternative way of looking at the development, but unless your organization has mandated the developers to follow CobiT there’s no point trying to graft this onto your project.

Of much greater interest are the six CobiT application controls. Whereas the domains are functionally separate and sequential activities, a life-cycle in effect, the application controls are statements of intent that apply to the business area and the application itself. They can be used at any stage of the development. They are;

AC1 Source Data Preparation and Authorization

AC2 Source Data Collection and Entry

AC3 Accuracy, Completeness and Authenticity Checks

AC4 Processing Integrity and Validity

AC5 Output Review, Reconciliation and Error Handling

AC6 Transaction Authentication and Integrity

Each of these controls has stated objectives, and tests that can be made against the requirements, the proposed design and then on the built application. Clearly these are generic statements potentially applicable to any application, but they can serve as a valuable prompt to testers who are willing to adapt them to their own application. They are also a useful introduction for testers to the wider field of business controls.

CobiT rather skates over the question of how the business requirements are defined, but these application controls can serve as a useful basis for validating the requirements.

Unfortunately the CobiT IT Assurance Guide can be downloaded for free only by members of ISACA (Information Systems Audit and Control Association) and costs $165 for non-members to buy. Try your friendly neighborhood Internal Audit department! If they don’t have a copy, well maybe they should.

If you are looking for a more constructive and proactive approach to the requirements then I recommend the Open Web Application Security Project (OWASP) Testing Guide [7]. This is an excellent, accessible document covering the whole range of application security, both technical vulnerabilities and business logic flaws.

It offers good, practical guidance to testers. It also offers a testing framework that is basic, and all the better for that, being simple and practical.

The OWASP testing framework demands early involvement of the testers, and runs from before the start of the project to reviews and testing of live applications.

Phase 1: Before Deployment begins

1A: Review policies and standards

1B: Develop measurement and metrics criteria (ensure traceability)

Phase 2: During definition and design

2A: Review security requirements

2B: Review design and architecture

2C: Create and review UML models

2D: Create and review threat models

Phase 3: During development

3A: Code walkthroughs

3B: Code reviews

Phase 4: During development

4A: Application penetration testing

4B: Configuration management testing

Phase 5: Maintenance and operations

5A: Conduct operational management reviews

5B: Conduct periodic health checks

5C: Ensure change verification

OWASP suggests four test techniques for security testing; manual inspections and reviews, code reviews, threat modeling and penetration testing. The manual inspections are reviews of design, processes, policies, documentation and even interviewing people; everything except the source code, which is covered by the code reviews.

A feature of OWASP I find particularly interesting is its fairly explicit admission that the security requirements may be missing or inadequate. This is unquestionably a realistic approach, but usually testing models blithely assume that the requirements need tweaking at most.

The response of OWASP is to carry out what looks rather like reverse engineering of the design into the requirements. After the design has been completed testers should perform UML modeling to derive use cases that “describe how the application works.

In some cases, these may already be available”. Obviously in many cases these will not be available, but the clear implication is that even if they are available they are unlikely to offer enough information to carry out threat modeling.
The feature most likely to be missing is the misuse case. These are the dark side of use cases! As envisaged by OWASP the misuse cases shadow the use cases, threatening them, then being mitigated by subsequent use cases.

The OWASP framework is not designed to be a checklist, to be followed blindly. The important point about using UML is that it permits the tester to decompose and understand the proposed application to the level of detail required for threat modeling, but also with the perspective that threat modeling requires; i.e. what can go wrong? what must we prevent? what could the bad guys get up to?

UML is simply a means to that end, and was probably chosen largely because that is what most developers are likely to be familiar with, and therefore UML diagrams are more likely to be available than other forms of documentation. There was certainly some debate in the OWASP community about what the best means of decomposition might be.

Personally, I have found IDEF0 a valuable means of decomposing applications while working as a computer auditor. Full details of this technique can be found at http://www.idef.com [9].

It entails decomposing an application using a hierarchical series of diagrams, each of which has between three and six functions. Each function has inputs, which are transformed into outputs, depending on controls and mechanisms.
Is IDEF0 as rigorous and effective as UML? No, I wouldn’t argue that. When using IDEF0 we did not define the application in anything like the detail that UML would entail. Its value was in allowing us to develop a quick understanding of the crucial functions and issues, and then ask pertinent questions.

Given that certain inputs must be transformed into certain outputs, what are the controls and mechanisms required to ensure that the right outputs are produced?

In working out what the controls were, or ought to be, we’d run through the mantra that the output had to be accurate, complete, authorized, and timely. “Accurate” and “complete” are obvious. “Authorized” meant that the output must have been created or approved by people with the appropriate level of authority. “Timely” meant that the output must not only arrive in the right place, but at the right time. One could also use the six CobiT application controls as prompts.

In the example I gave above of the debt being written off I had worked down to the level of detail of “write off a debt” and looked at the controls required to produce the right output, “cancelled debts”. I focused on “authorized”, “complete” and “timely”.

Any sales operator could cancel a debt, but that raised the item for management review. That was fine. The problem was with “complete” and “timely”. All write-offs had to be collected for the control report, which was run daily. Was it possible to ensure some write-offs would not appear? Was it possible to over-key the default of the current date? It was possible. If I did so, would the write-off appear on another report? No. The control failure therefore meant that the control report could be easily bypassed.

The testing that I was carrying out had nothing to do with the original requirements. They were of interest, but not really relevant to what I was trying to do. I was trying to think like a dishonest employee, looking for a weakness I could exploit.

The decomposition of the application is the essential first step of threat modeling. Following that, one should analyze the assets for importance, explore possible vulnerabilities and threats, and create mitigation strategies.

I don’t want to discuss these in depth. There is plenty of material about threat modeling available. OWASP offers good guidance, [10] and [11]. Microsoft provides some useful advice [12], but its focus is on technical security, whereas OWASP looks at the business logic too. The OWASP testing guide [7] has a section devoted to business logic that serves as a useful introduction.

OWASP’s inclusion of mitigation strategies in the version of threat modeling that it advocates for testers is interesting. This is not normally a tester’s responsibility. However, considering such strategies is a useful way of planning the testing. What controls or protections should we be testing for? I think it also implicitly acknowledges that the requirements and design may well be flawed, and that threat modeling might not have been carried out in circumstances where it really should have been.

This perception is reinforced by OWASP’s advice that testers should ensure that threat models are created as early as possible in the project, and should then be revisited as the application evolves.

What I think is particularly valuable about the application control advice in CobIT and OWASP is that they help us to focus on security as an attribute that can, and must, be built into applications. Security testing then becomes a normal part of functional testing, as well as a specialist technical exercise. Testers must not regard security as an audit concern, with the testing being carried out by quasi-auditors, external to the development.

Getting the auditors on our side

I’ve had a fairly unusual career in that I’ve spent several years in each of software development, IT audit, IT security management, project management and test management. I think that gives me a good understanding of each of these roles, and a sympathetic understanding of the problems and pressures associated with them. It’s also taught me how they can work together constructively.

In most cases this is obvious, but the odd one out is the IT auditor. They have the reputation of being the hard-nosed suits from head office who come in to bayonet the wounded after a disaster! If that is what they do then they are being unprofessional and irresponsible. Good auditors should be pro-active and constructive. They will be happy to work with developers, users and testers to help them anticipate and prevent problems.

Auditors will not do your job for you, and they will rarely be able to give you all the answers. They usually have to spread themselves thinly across an organization, inevitably concentrating on the areas with problems and which pose the greatest risk.

They should not be dictating the controls, but good auditors can provide useful advice. They can act as a valuable sounding board, for bouncing ideas off. They can also be used as reinforcements if the testers are coming under irresponsible pressure to restrict the scope of security testing. Good auditors should be the friend of testers, not our enemy. At least you may be able to get access to some useful, but expensive, CobiT material.

Auditors can give you a different perspective and help you ask the right questions, and being able to ask the right questions is much more important than any particular tool or method for testers.

This article tells you something about CobiT and OWASP, and about possible new techniques for approaching testing of security. However, I think the most important lesson is that security testing cannot be a completely separate specialism, and that security testing must also include the exploration of the application’s functionality in a skeptical and inquisitive manner, asking the right questions.

Validating the security requirements is important, but so is exposing the unspoken requirements and disproving the invalid assumptions. It is about letting management see what the true state of the application is – just like the rest of testing.

References

[1] British Computer Society’s Special Interest Group in Software Testing (BCS SIGIST) Glossary.

[2] Gartner Inc. “Now Is the Time for Security at the Application Level” (NB PDF download), 2005.

[3] PriceWaterhouseCoopers. “Economic crime- people, culture and controls. The 4th biennial Global Economic Crime Survey”.

[4] US Secret Service. “Insider Threat Study: Illicit Cyber Activity in the Banking and Finance Sector”.

[5] IT Governance Institute. CobiT 4.1, 2007.

[6] IT Governance Institute. CobiT IT Assurance Guide (not free), 2007.

[7] Open Web Application Security Project. OWASP Testing Guide, V3.0, 2008.

[8] Gupta, S. “SOX Compliant Agile Processes”, Agile Alliance Conference, Agile 2008.

[9] IDEF0 Function Modeling Method.

[10] Open Web Application Security Project. OWASP Threat Modeling, 2007.

[11] Open Web Application Security Project. OWASP Code Review Guide “Application Threat Modeling”, 2009.

[12] Microsoft. “Improving Web Application Security: Threats and Countermeasures”, 2003.

March 7, 2019 by James Christie

The dragons of the unknown; part 9 – learning to live with the unknowable

Introduction

This is the ninth and final post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This was the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).

The first post was a reflection, based on personal experience, on the corporate preference for building bureaucracy rather than dealing with complex reality, “facing the dragons part 1 – corporate bureaucracies”. Part 2 was about the nature of complex systems. The third followed on from part 2, and talked about the impossibility of knowing exactly how complex socio-technical systems will behave with the result that it is impossible to specify them precisely, “I don’t know what’s going on”.

Part 4 “a brief history of accident models”, looked at accident models, i.e. the way that safety experts mentally frame accidents when they try to work out what caused them.

The fifth post, “accident investigations and treating people fairly”, looked at weaknesses in the way that we have traditionally investigated accidents and failures, assuming neat linearity with clear cause and effect. In particular, our use of root cause analysis, and willingness to blame people for accidents is hard to justify.

Part six “Safety II, a new way of looking at safety” looks at the response of the safety critical community to such problems and the necessary trade offs that a practical response requires. The result, Safety II, is intriguing and has important lessons for software testers.

The seventh post “Resilience requires people” is about the importance of system resilience and the vital role that people play in keeping systems going.

The eighth post “How we look at complex systems” is about the way we choose to look at complex systems, the mental models that we build to try and understand them, and the relevance of Devops.

This final post will try to draw all these strands together and present some thoughts about the future of testing as we are increasingly confronted with complex systems that are beyond our ability to comprehend.

Computing will become more complex

Even if we choose to focus on the simpler problems, rather than help users understand complexity, the reality is that computing is only going to get more complex. The problems that users of complex socio-technical systems have to grapple with will inevitably get more difficult and more intractable. The choice is whether we want to remain relevant, but uncomfortable, or go for comfortable bullshit that we feel we can handle. Remember Zadeh’s Law of Incompatibility (see part 7 – resilience requires people). “As complexity increases, precise statements lose their meaning, and meaningful statements lose precision”. Quantum computing, artificial intelligence and adaptive algorithms are just three of the areas of increasing importance whose inherent complexity will make it impossible for testers to offer opinions that are both precise and meaningful.

Quantum computing, in particular, is fascinating. By its very nature it is probabilistic, not deterministic. The idea that well designed and written programs should always produce the same output from the same data is relevant only to digital computers (and even then the maxim has to be heavily qualified in practice); it never holds true at any level for quantum computers. I wrote about this in “Quantum computing; a whole new field of bewilderment”.

The final quote from that article, “perplexity is the beginning of knowledge”, applies not only to quantum computing but also to artificial intelligence and the fiendish complexity of algorithms processing big data. One of the features of quantum computing is the way that changing a single qubit, the equivalent of digital bytes, will trigger changes in other qubits. This is entanglement, but the same word is now being used to describe the incomprehensible complexity of modern digital systems. Various writers have talked about this being the Age of Entanglement, eg Samuel Arbesman, in his book “Overcomplicated: Technology at the Limits of Comprehension)”, Emmet Connolly, in an article “Design in the Age of Entanglement” and Danny Hillis, in an article “The Enlightenment is Dead, Long Live the Entanglement”.

The purist in me disapproves of recycling a precise term from quantum science to describe loosely a phenomenon in digital computing. However, it does serve a useful role. It is a harsh and necessary reminder and warning that modern systems have developed beyond our ability to understand them. They are no more comprehensible than quantum systems, and as Richard Feynman is popularly, though possibly apocryphally, supposed to have said; “If you think you understand quantum physics, you don’t understand quantum physics.”

So the choice for testers will increasingly be to decide how we respond to Zadeh’s Law. Do we offer answers that are clear, accurate, precise and largely useless to the people who lose sleep at night worrying about risks? Or do we say “I don’t know for sure, and I can’t know, but this is what I’ve learned about the dangers lurking in the unknown, and what I’ve learned about how people will try to stay clear of these dangers, and how we can help them”?

If we go for the easy options and restrict our role to problems which allow definite answers then we will become irrelevant. We might survive as process drones, holders of a “bullshit job” that fits neatly into the corporate bureaucracy but offers little of value. That will be tempting in the short to medium term. Large organisations often value protocol and compliance more highly than they value technical expertise. That’s a tough problem to deal with. We have to confront that and communicate why that worldview isn’t just dated, it’s wrong. It’s not merely a matter of fashion.

If we’re not offering anything of real value then there are two possible dangers. We will be replaced by people prepared to do poor work cheaper; if you’re doing nothing useful then there is always someone who can undercut you. Or we will be increasingly replaced by automation because we have opted to stay rooted in the territory where machines can be more effective, or at least efficient.

If we fail to deal with complexity the danger is that mainstream testing will be restricted to “easy” jobs – the dull, boring jobs. When I moved into internal audit I learned to appreciate the significance of all the important systems being inter-related. It was where systems interfaced, and when people were involved that they got interesting. The finance systems with which I worked may have been almost entirely batch based, but they performed a valuable role for the people with whom we were constantly discussing the behaviour of these systems. Anything standalone was neither important nor particularly interesting. Anything that didn’t leave smart people scratching their heads and worrying was likely to be boring. Inter-connectedness and complexity will only increase and however difficult testing becomes it won’t be boring – so long as we are doing a useful job.

If we want to work with the important, interesting systems then we have to deal with complexity and the sort of problems the safety people have been wrestling with. There will always be a need for people to learn and inform others about complex systems. The American economist Tyler Cowen in his book “Average is Over” states the challenge clearly. We will need interpreters of complex systems.

“They will become clearing houses for and evaluators of the work of others… They will hone their skills of seeking out, absorbing and evaluating information… They will be translators of the truths coming out of our network of machines… At least for a while they will be the only people who will have a clear notion of what is going on.”

I’m not comfortable with the idea of truths coming out of machines, and we should resist the idea that we can ever be entirely clear about what is going on. But the need for experts who can interpret complex systems is clear. Society will look for them. Testers should aspire to be among those valuable specialists. The complexity of these systems will be beyond the ability of any one person to comprehend, but perhaps these interpreters, in addition to deploying their own skills, will be able to act like a conductor of an orchestra, to return to the analogy I used in part seven (Resilience requires people). Conductors are talented musicians in their own right, but they call on the expertise of different specialists, blending their contribution to produce something of value to the audience. Instead of a piece of music the interpreter tester would produce a story that sheds light on the system, guiding the people who need to know.

Testers in the future will have to be confident and assertive when they try to educate others about complexity, the inexplicable and the unknowable. Too often in corporate life a lack of certainty has been seen as a weakness. We have to stand our ground and insist on our right to be heard and taken seriously when we argue that certainty cannot be available if we want to talk about the risks and problems that matter. My training and background meant I couldn’t keep quiet when I saw problems, that were being ignored because no-one knew how to deal with them. As Better Software said about me, I’m never afraid to voice my opinion. better software says I am never afraid to voice my opinion

Never be afraid to speak out, to explain why your experience and expertise make your opinions valuable, however uncomfortable these may be for others. That’s what you’re paid for, not to provide comforting answers. The metaphor of facing the dragons of the unknown is extremely important. People will have to face these dragons. Testers have a responsibility to try and shed as much light as possible on those dragons lurking in the darkness beyond what we can see and understand easily. If we concentrate only on what we can know and say with certainty it means we walk away from offering valuable, heavily qualified advice about the risks, threats & opportunities that matter to people. Our job should entail trying to help and protect people. As Jerry Weinberg said in “Secrets of Consulting”;

“No matter what they tell you, it’s always a people problem.”

March 7, 2019 by James Christie

The dragons of the unknown; part 8 – how we look at complex systems

Introduction

This is the eighth post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This was the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).

Part 4 “a brief history of accident models”, looked at accident models, i.e. the way that safety experts mentally frame accidents when they try to work out what caused them.

The seventh post “Resilience requires people” is about the importance of system resilience and the vital role that people play in keeping systems going.

This eighth post is about the way we choose to look at complex systems, the mental models that we build to try and understand them, and the relevance of Devops.

Choosing what we look at

The ideas I’ve been writing about resonated strongly with me when I first read about the safety and resilience engineering communities. What unites them is a serious, mature awareness of the importance of their work. Compared to these communities I sometimes feel as if normal software developers and testers are like children playing with cool toys while the safety critical engineers are the adults worrying about the real world.

The complex insurance finance systems I worked with were part of a wider system with correspondingly more baffling complexity. Remember the comments of Professor Michael McIntyre (in part six, “Safety II, a new way of looking at safety”).

“If we want to understand things in depth we usually need to think of them both as objects and as dynamic processes and see how it all fits together. Understanding means being able to see something from more than one viewpoint.”

If we zoom out for a wider perspective in both space and time we can see that objects which looked simple and static are part of a fluid, dynamic process. We can choose where we place the boundaries of the systems we want to learn about. We should make that decision based on where we can offer most value, not where the answers are easiest. We should not be restricting ourselves to problems that allow us to make definite, precise statements. We shouldn’t be looking only where the light is good, but also in the darkness. We should be peering out into the unknown where there may be dragons and dangers lurking.
drunkard looking under the streetlight
Taking the wider perspective, the insurance finance systems for which I was responsible were essentially control mechanisms to allow statisticians to continually monitor and fine tune the rates, the price at which we sold insurance. They were continually searching for patterns, trying to understand how the different variables played off each other. We made constant small adjustments to keep the systems running effectively. We had to react to business problems that the systems revealed to our users, and to technical problems caused by all the upstream feeding applications. Nobody could predict the exact result of adjustments, but we learned to predict confidently the direction; good or bad.

The idea of testing these systems with a set of test cases, with precisely calculated expected results, was laughably naïve. These systems weren’t precise or accurate in a simple book-keeping sense, but they were extremely valuable. If we as developers and testers were to do a worthwhile job for our users we couldn’t restrict ourselves to focusing on whether the outputs from individual programs matched our expectations, which were no more likely to be “correct” (whatever that might mean in context) than the output.

Remember, these systems were performing massively complex calculations on huge volumes of data and thus producing answers that were not available any other way. We could predict how an individual record would be processed, but putting small numbers of records through the systems would tell us nothing worthwhile. Rounding errors would swamp any useful information. A change to a program that introduced a serious bug would probably produce a result that was indistinguishable from correct output with a small sample of data, but introduce serious and unacceptable error when we were dealing with the usual millions of records.

We couldn’t spot patterns from a hundred records using programs designed to tease out patterns from datasets with millions of records. We couldn’t specify expected outputs from systems that are intended to help us find out about unknown unknowns.

The only way to generate predictable output was to make unrealistic assumptions about the input data, to distort it so it would fit what we thought we knew. We might do that in unit testing but it was pointless in more rigorous later testing. We had to lift our eyes and understand the wider context, the commercial need to compete in the insurance marketplace with rates that were set with reasonable confidence in the accuracy of the pricing of the risks, rather than being set by guesswork, as had traditionally been the case. We were totally reliant on the expertise of our users, who in turn were totally reliant on our technical skills and experience.

I had one difficult, but enlightening, discussion with a very experienced and thoughtful user. I asked her to set acceptance criteria for a new system. She refused and continued to refuse even when I persisted. I eventually grasped why she wouldn’t set firm criteria. She couldn’t possibly articulate a full set of criteria. Inevitably she would only be able to state a small set of what might be relevant. It was misleading to think only in terms of a list of things that should be acceptable. She also had to think about the relationships between different factors. The permutations were endless, and spotting what might be wrong was a matter of experience and deep tacit knowledge.

This user was also concerned that setting a formal set of acceptance criteria would lead me and my team to focus on that list, which would mean treating the limited set of knowledge that was explicit as if it were the whole. We would be looking for confirmation of what we expected, rather than trying to help her shed light on the unknown.

Dealing with the wider context and becoming comfortable with the reality that we were dealing with the unknown was intellectually demanding and also rewarding. We had to work very closely with our users and we built strong, respectful and trusting relationships that ran deep and lasted long. When we hit serious problems, those good relations were vital. We could all work together, confident in each other’s abilities. These relationships have lasted many years, even though none of us still work for the same company.

We had to learn as much as possible from the users. This learning process was never ending. We were all learning, both users and developers, all the time. The more we learned about our systems the better we could understand the marketplace. The more we learned about how the business was working in the outside world the better our fine tuning of the systems.

Devops – a future reminiscent of my past?

With these complex insurance finance systems the need for constant learning dominated the whole development lifecyle to such an extent that we barely thought in terms of a testing phase. Some of our automated tests were built into the production system to monitor how it was running. We never talked of “testing in production”. That was a taboo phrase. Constant monitoring? Learning in production? These were far more diplomatic ways of putting it. However, the frontier between development and production was so blurred and arbitrary that we once, under extreme pressure of time, went to the lengths of using what were officially test runs to feed the annual high level business planning. This was possible only because of a degree of respect and trust between users, developers and operations staff that I’ve never seen before or since.

That close working relationship didn’t happen by chance. Our development team was pulled out of Information Services, the computing function, and assigned to the business, working side by side with the insurance statisticians. Our contact in operations wasn’t similarly seconded, but he was permanently available and was effectively part of the team.

The normal development standards and methods did not apply to our work. The company recognised that they were not appropriate and we were allowed to feel our way and come up with methods that would work for us. I wrote more about this a few years ago in “Adventures with Big Data”.

When Devops broke onto the scene I was fascinated. It is a response not only to the need for continuous delivery, but also to the problems posed by working with increasingly complex and intractable systems. I could identify with so much; the constant monitoring, learning about the system in production, breaking down traditional structures and barriers, different disciplines working more closely together. None of that seemed new to me. These had felt like a natural way to develop the deeply complicated insurance finance systems that would inevitably evolve into creatures as complex as the business environment in which they helped us to survive.

I’ve found Noah Sussman’s work very helpful. He has explicitly linked Devops with the ideas I have been discussing (in this whole series) that have emerged from the resilience engineering and safety critical communities. In particular, Sussman has picked up on an argument that Sidney Dekker has been making, notably in his book “Safety Differently”, that nobody can have a clear idea of how complex sociotechnical systems are working. There cannot be a single, definitive and authoritative (ie canonical) description of the system. The view that each expert has, as they try to make the system work, is valid but it is inevitably incomplete. Sussman put it as follows in his blog series “Software as Narrative”.

“At the heart of Devops is the admission that no single actor can ever obtain a ‘canonical view’ of an incident that took place during operations within an intractably complex sociotechnical system such as a software organization, hospital, airport or oil refinery (for instance).”

Dekker describes this as ontological relativism. The terminology from philosophy might seem intimidating, but anyone who has puzzled their way through a production problem in a complex system in the middle of the night should be able to identify with it. Brian Fay (in “Contemporary Philosophy of Social Science”) defines ontological relativism as meaning “reality itself is thought to be determined by the particular conceptual scheme of those living within it”.

If you’ve ever been alone in the deep of the night, trying to make sense of an intractable problem that has brought a complex system down, you’ll know what it feels like to be truly alone, to be dependent on your own skills and experience. The system documentation is of limited help. The insights of other people aren’t much use. They aren’t there, and the commentary they’ve offered in the past reflected their own understanding that they have constructed of how the system works. The reality that you have to deal with is what you are able to make sense of. What matters is your understanding, your own mental model.

I was introduced to this idea that we use mental models to help us gain traction with intractable systems by David Woods’ work. He (along with co-authors Paul Feltovich, Robert Hoffman and Axel Roesler) introduced me to the “envisaged worlds” that I mentioned in part one of this series. Woods expanded on this in “Behind Human Error” (chapter six), co-written with Sidney Dekker, Richard Cook, Leila Johannesen and Nadine Sarter.

These mental models are potentially dangerous, as I explained in part one. They are invariably oversimplified. They are partial, flawed and (to use the word favoured by Woods et al) they are “buggy”. But it is an oversimplification to dismiss them as useless because they are oversimplified; they are vitally important. We have to be aware of their limitations, and our own instinctive desire to make them too simple, but we need them to get anywhere when we work with complex systems.

Without these mental models we would be left bemused and helpless when confronted with deep complexity. Rather than thinking of these models as attempts to form precise representations of systems it is far more useful to treat them as heuristics, which are (as defined by James Bach, I think), a useful but fallible way to solve a problem or make a decision.

David Woods is a member of Snafucatchers, which describes itself as “a consortium of industry leaders and researchers united in the common cause of understanding and coping with the immense levels of complexity involved in the operation of critical digital services.”

Snafucatchers produced an important report in 2017, “STELLA – Report from the SNAFUcatchers Workshop on Coping With Complexity”. The workshop and report looked at how experts respond to anomalies and failures with complex systems. It’s well worth reading and reflecting on. The report discusses mental models and adds an interesting refinement, the line of representation.
Above the line of representation we have the parts of the overall system that are visible; the people, their actions and interactions. The line itself has the facilities and tools that allow us to monitor and manage what is going on below the line. We build our mental models of how the system is working and use the information from the screens we see, and the controls available to us to operate the system. However, what we see and manipulate is not the system itself.

There is a mass of artifacts under the line that we can never directly see working. We see only the representation that is available to us at the level of the line. Everything else is out of sight and the representations that are available to us offer us only the chance to peer through a keyhole as we try to make sense of the system below. There has always been a large and invisible substructure in complex IT systems that was barely visible or understood. With internet systems this has grown enormously.

The green line is the line of representation. It is composed of terminal display screens, keyboards, mice, trackpads, and other interfaces. The software and hardware (collectively, the technical artifacts) running below the line cannot be seen or controlled directly. Instead, every interaction crossing the line is mediated by a representation. This is true as well for people in the using world who interact via representations on their computer screens and send keystrokes and mouse movements.

A somewhat startling consequence of this is that what is below the line is inferred from people’s mental models of The System.

And those models of the system are based on the partial representation that is visible to us above the line.

An important consequence of this is that people interacting with the system are critically dependent on their mental models of that system – models that are sure to be incomplete, buggy (see Woods et al above, “Behind Human Error”), and quickly become stale. When a technical system surprises us, it is most often because our mental models of that system are flawed.

This has important implications for teams working with complex systems. The system is constantly adapting and evolving. The mental models that people use must also constantly be revised and refined if they are to remain useful. Each of these individual models represents the reality that each operator understands. All the models are different, but all are equally valid, as ontological relativism tells us. As each team member has a different, valid model it is important that they work together closely, sharing their models so they can co-operate effectively.

This is a world in which traditional corporate bureaucracy with clear, fixed lines of command and control, with detailed and prescriptive processes, is redundant. It offers little of value – only an illusion of control for those at the top, and it hinders the people who are doing the most valuable work (see “part 1 – corporate bureaucracies”).

For those who work with complex, sociotechnical systems the flexibility, the co-operative teamwork, the swifter movement and, above all, the realism of Devops offer greater promise. My experience with deeply complex systems has persuaded me that such an approach is worthwhile. But just as these complex systems will constantly change so must the way we respond. There is no magic, definitive solution that will always work for us. We will always have to adapt, to learn and change if we are to remain relevant.

It is important that developers and testers pay close attention to the work of people like the Snafucatchers. They are offering the insights, the evidence and the arguments that will help us to adapt in a world that will never stop adapting.

In the final part of this series, part 9 “Learning to live with the unknowable” I will try to draw all these strands together and present some thoughts about the future of testing as we are increasingly confronted with complex systems that are beyond our ability to comprehend.

February 1, 2019 by James Christie

The dragons of the unknown; part 7 – resilience requires people

Introduction

This is the seventh post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This was the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).

Part 4 “a brief history of accident models”, looked at accident models, i.e. the way that safety experts mentally frame accidents when they try to work out what caused them.

This post is about the importance of system resilience and the vital role that people play in keeping systems going.

Robustness versus resilience

The idea of resilience is where Safety II and Cynefin come together in a practical way for software development. Safety critical professionals have become closely involved in the field of resilience engineering. Dave Snowden, Cynefin’s creator, places great emphasis on the need for systems in complex environments to be resilient.

First, I’d like to make an important distinction, between robustness and resilience. The example Snowden uses is that a seawall is robust but a salt marsh is resilient. A seawall is a barrier to large waves and storms. It protects the land behind, but if it fails it does so catastrophically. A salt marsh protects inland areas by acting as a buffer, absorbing storm waves rather than repelling them. It might deteriorate over time but it won’t fail suddenly and disastrously.

Designing for robustness entails trying to prevent failure. Designing for resilience recognises that failure is inevitable in some form but tries to make that failure containable. Resilience means that recovery should be swift and relatively easy when it does occur, and crucially, it means that failure can be detected quickly, or even in advance so that operators have a chance to respond.

What struck me about the resilience engineering approach is that it matches the way that we managed the highly complex insurance financial applications I mentioned in “part 2 – crucial features of complex systems”. We had never heard of resilience engineering, but the site standards were of limited use. We had to feel our way, finding an appropriate response as the rapid pace of development created terrifying new complexity on top of a raft of ancient legacy applications.

The need for efficient processing of the massive batch runs had to be balanced against the need to detect the constant flow of small failures early, to stop them turning into major problems, and also against the pressing need to facilitate recovery when we inevitably hit serious failure. We also had to think about what “failure” really meant in a context where 100% (or even 98%) accuracy was an unrealistic dream that would distract us from providing flawed but valuable systems to our users within the timescales that were dictated by commercial pressures.

An increasing challenge for testers will be to look for information about how systems fail, and test for resilience rather than robustness. Liz Keogh, in this talk on “Safe-to-Fail” makes a similar point.

“Testers are really, really good at spotting failure scenarios… they are awesomely imaginative at calamity… Devs are problem solvers. They spot patterns. Testers spot holes in patterns… I have a theory that other people who are in critical positions, like compliance and governance people are also really good at this.”

Developing for resilence means that tolerance for failure becomes more important than a vain attempt to prevent failure altogether. This tolerance often requires greater redundancy. Stripping out redundancy and maximizing the efficiency of systems has a downside. Greater efficiency can make applications brittle and inflexible. When problems hit they hit hard and recovery can be difficult.

However, redundancy itself adds to the complexity of systems and can create unexpected ways for them to fail. In our massively complex insurance finance systems a constant threat was that the safeguards we introduced to make the systems resilient might result in the processing runs failing to complete in time and disrupting other essential applications.

The ETTO principle (see part 6 , “Safety II – learning from what goes right”) describes the dilemmas we were constantly having to deal with. But the problems we faced were more complex than a simple trade off, sacrificing efficiency would not necessarily lead to greater effectiveness. Poorly thought out safeguards could harm both efficiency and effectiveness.

We had to nurse those systems carefully. That is a crucial idea to understand. Complex systems require constant attention by skilled people and these people are an indispensable means of controlling the systems.

Ashby’s Law of Requisite Variety

Ashby’s Law of Requisite Variety is also known as The First Law of Cybernetics.

“The complexity of a control system must be equal to or greater than the complexity of the system it controls.”

A stable system needs as much variety in the control mechanisms as there is in the system itself. This does not mean as much variety as the external reality that the system is attempting to manage – a thermostat is just on or off, it isn’t directly controlling the temperature, just whether the heating is on or off.

The implication for complex socio-technical systems is that the controlling mechanism must include humans if it is to be stable precisely because the system includes humans. The control mechanism has to be as complex and sophisticated as the system itself. It’s one of those “laws” that looks trivially obvious when it is unpacked, but whose implications can easily be missed unless we turn our minds to the problem and its implications. We must therefore trust expertise, trust the expert operators, and learn what they have to do to keep the system running.

I like the analogy of an orchestra’s conductor. It’s a flawed analogy (all models are flawed, though some are useful). The point is that you need a flexible, experienced human to make sense of the complexity and constantly adjust the system to keep it working and useful.

Really know the users

I have learned that it is crucially important to build a deep understanding of the user representatives and the world they work in. This is often not possible, but when I have been able to do it the effort has always paid off. If you can find good contacts in the user community you can learn a huge amount from them. Respect deep expertise and try to acquire it yourself if possible.

When I moved into the world of insurance finance systems I had very bright, enthusiastic, young (but experienced) users who took the time to immerse me in their world. I was responsible for the development, not just the testing. The users wanted me to understand them, their motivation, the pressures on them, where they wanted to get to, the risks they worried about, what kept them awake at night. It wasn’t about record-keeping. It was all about understanding risks and exposures. They wanted to set prices accurately, to compete aggressively using cheap prices for good risks and high prices for the poor risks.

That much was obvious, but I hadn’t understood the deep technical problems and complexities of unpacking the risk factors and the associated profits and losses. Understanding those problems and the concerns of my users was essential to delivering something valuable. The time spent learning from them allowed me to understand not only why imperfection was acceptable and chasing perfection was dangerous, but also what sort of imperfection was acceptable.

Building good, lasting relations with my users was perhaps the best thing I ever did for my employers and it paid huge dividends over the next few years.

We shouldn’t be thinking only about the deep domain experts though. It’s also vital to look at what happens at the sharp end with operational users, perhaps lowly and stressed, carrying out the daily routine. If we don’t understand these users, the pressures and distractions they face, and how they have to respond then we don’t understand the system that matters, the wider complex, socio-technical system.

Testers should be trying to learn more from experts working on human factors and ergonomics and user experience. I’ll finish this section with just a couple of examples of the level of thought and detail that such experts put into the design of aeroplane cockpits.

Boeing is extremely concerned about the danger of overloading cockpit crew with so much information that they pay insufficient attention to the most urgent warnings. The designers therefore only use the colour red in cockpits when the pilot has to take urgent action to keep the plane safe. Red appears only for events like engine fires and worse. Less urgent alerts use other colours and are less dramatic. Pilots know that if they ever see a red light or red text then they have to act. [The original version of this was written before the Boeing 737 MAX crashes. These have raised concerns about Boeing’s practices that I will return to later.]

A second and less obvious example of the level of detailed thought that goes into flight deck designs is that analog speed dials are widely considered safer than digital displays. Pilots can glance at the dial and see that the airspeed is in the right zone given all the other factors (e.g. height, weight and distance to landing) at the same time as they are processing a blizzard of other information.

A digital display isn’t as valuable. (See Edwin Hutchins’ “How a cockpit remembers its speeds“, Cognitive Science 19, 1995.) It might offer more precise information, but it is less useful to pilots when they really need to know about the aircraft’s speed during a landing, a time when they have to deal with many other demands. In a highly complex environment it is more important to be useful than accurate. Safe is more important than precise.

The speed dial that I have used as an illustration is also a good example both of beneficial user variations and of the perils of piling in extra features. The tabs surrounding the dial are known as speed bugs. Originally pilots improvised with grease pencils or tape to mark the higher and lower limits of the airspeed that would be safe for landing that flight. Designers picked up on that and added movable plastic tabs. Unfortunately, they went too far and added tabs for every eventuality, thus bringing visual clutter into what had been a simple solution. (See Donald Norman’s “Turn signals are the facial expressions of automobiles“, chapter 16, “Coffee cups in the cockpit”, Basic Books, 1993.)

We need a better understanding of what will help people make the system work, and what is likely to trip them up. That entails respect for the users and their expertise. We must not only trust them we must never lose our own humility about what we can realistically know.

As Jens Rasmussen put it (in a much quoted talk at the IEEE Standards Workshop on Human Factors and Nuclear Safety in 1981 – I have not been able to track this down).

“The operator’s role is to make up for holes in designers’ work.”

Testers should be ready to explore and try to explain these holes, the gap between the designers’ limited knowledge and the reality that the users will deal with. We have to try to think about what the system as found will be like. We must not restrict ourselves to the system as imagined.

Lessons from resilience engineering

There is a huge amount to learn from resilience engineering. This community has a significant overlap with the safety critical community. The resilience engineering literature is vast and growing. However, for a quick flavour of what might be useful for testers it’s worth looking at the four principles of Erik Hollnagel’s Functional Resonance Analysis Method (FRAM). FRAM tries to provide a way to model complex socio-technical systems so that we can gain a better understanding of likely outcomes.

- Success and failure are equivalent. They can happen in similar ways.
  It is dangerously misleading to assume that the system is bimodal, that it is either working or broken. Any factor that is present in a failure can equally be present in success.

- Success, failure and normal outcomes are all emergent qualities of the whole system.
  We cannot learn about what will happen in a complex system by observing only the individual components.

- People must constantly make small adjustments to keep the system running.
  These changes are both essential for the successful operation of the system, but also a contributory cause of failure. Changes are usually approximate adjustments, based on experience, rather than precise, calculated changes. An intrinsic feature of complex systems is that small changes can have a dramatic effect on the overall system. A change to one variable or function will always affect others.

- “Functional resonance” is the detectable result of unexpected interaction of normal variations.

Functional resonance is a particularly interesting concept. Resonance is the engineering term for the effect we get when different things vibrate with the same frequency. If an object is struck or moved suddenly it will vibrate at its natural frequency. If the object producing the force is also vibrating at the same frequency the result is resonance, and the effect of the impact can be amplified dramatically.

Resonance is the effect you see if you push a child on a swing. If your pushes match the motion of the swing you quickly amplify the motion. If your timing is wrong you dampen the swing’s motion. Resonance can produce unpredictable results. A famous example is the danger that marching troops can bring a bridge down if the rhythm of their marching coincides with the natural frequency at which the bridge vibrates.

Learning about functional resonance means learning about the way that different variables combine to amplify or dampen the effect that each has, producing outcomes that would have been entirely unpredictable from looking at their behaviour individually.

Small changes can lead to drastically different outcomes at different times depending on what else is happening. The different variables in the system will be coupled in potentially significant ways the designers did not understand. These variables can reinforce, or play off each other, unpredictably.

Safety is a control problem – a matter of controlling these interactions, which means we have to understand them first. But, as we have seen, the answer can’t be to keep adding controls to try and achieve greater safety. Safety is not only a control problem, it is also an emergent and therefore unpredictable property (see appendix). That’s not a comfortable combination for the safety critical community.

Although it is impossible to predict emergent behaviour in a complex system it is possible to learn about the sort of impact that changes and user actions might have. FRAM is not a model for testers. However, it does provide a useful illustration of the approach being taken by safety experts who are desperate to learn and gain a better understanding of how systems might work.

Good testers are surely well placed to reach out and offer their skills and experience. It is, after all, the job of testers to learn about systems and tell a “compelling story” (as Messrs Bach & Bolton put it) to the people who need to know. They need the feedback that we can provide, but if it is to be useful we all have to accept that it cannot be exact.

Lotfi Zadeh, a US mathematician, computer scientist and engineer introduced the idea of fuzzy logic. He made this deeply insightful observation, quoted in Daniel McNeill and Paul Freiberger’s book “Fuzzy Logic”.

“As complexity rises, precise statements lose meaning, and meaningful statements lose precision.”

Zadeh’s maxim has come to be known as the Law of Incompatibility. If we are dealing with complex socio-technical systems we can be meaningful or we can be precise. We cannot be both; they are incompatible in such a context. It might be hard to admit we can say nothing with certainty, but the truth is that meaningful statements cannot be precise. If we say “yes, we know” then we are misleading the people who are looking for guidance. To pretend otherwise is bullshitting.

In the eighth post of this series, “How we look at complex systems”, I will talk about the way we choose to look at complex systems, the mental models that we build to try and understand them, and the relevance of Devops.

Appendix – is safety an emergent property?

In this series I have repeatedly referred to safety as being an emergent property of complex adaptive systems. For beginners trying to get their heads round this subject it is an important point to take on board.

However, the nature of safety is rather more nuanced. Erik Hollnagel argues that safety is a state of the whole system, rather than one of the system’s properties. Further, we consciously work towards that state of safety, trying to manipulate the system to achieve the desired state. Therefore safety is not emergent; it is a resultant state, a deliberate result. On the other hand, a lack of safety is an emergent property because it arises from unpredictable and undesirable adaptions of the system and its users.

Other safety experts differ and regard safety as being emergent. For the purpose of this blog I will stick with the idea that it is emergent. However, it is worth bearing Hollnagel’s argument in mind. I am quite happy to think of safety being a state of a system because my training and experience lead me to think of states as being more transitory than properties, but I don’t feel sufficiently strongly to stop referring to safety as being an emergent property.

October 25, 2018 by James Christie

The dragons of the unknown; part 6 – Safety II, a new way of looking at safety

Introduction

This is the sixth post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This was the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).

The first post was a reflection, based on personal experience, on the corporate preference for building bureaucracy rather than dealing with complex reality, “Facing the dragons part 1 – corporate bureaucracies”. The second post was about the nature of complex systems, “part 2 – crucial features of complex systems”. The third followed on from part 2, and talked about the impossibility of knowing exactly how complex socio-technical systems will behave with the result that it is impossible to specify them precisely, “part 3 – I don’t know what’s going on”.

The fourth post, “part 4 – a brief history of accident models”, looks at accident models, i.e. the way that safety experts mentally frame accidents when they try to work out what caused them.

The fifth post, “part 5 – accident investigations and treating people fairly”, looks at weaknesses of the way that we have traditionally investigated accidents and failures, assuming neat linearity with clear cause and effect. In particular, our use of root cause analysis, and willingness to blame people for accidents is hard to justify.

This post looks at the response of the safety critical community to such problems and the necessary trade offs that a practical response requires. The result, Safety II, is intriguing and has important lessons for software testers.

More safety means less feedback

In 2017 nobody was killed on a scheduled passenger flight (sadly that won’t be the case in 2018). That prompted the South China Morning Post to produce this striking graphic, which I’ve reproduced here in butchered form. Please, please look at the original. My version is just a crude taster.

Increasing safety is obviously good news, but it poses a problem for safety professionals. If you rely on accidents for feedback then reducing accidents will choke off the feedback you need to keep improving, to keep safe. The safer that systems become the less data is available. Remember what William Langewiesche said (see part 4).

“What can go wrong usually goes right – and then people draw the wrong conclusions.”

If accidents have become rare, but are extremely serious when they do occur, then it will be highly counter-productive if investigators pick out people’s actions that deviated from, or adapted, the procedures that management or designers assumed were being followed.

These deviations are always present in complex socio-technical systems that are running successfully and it is misleading to focus on them as if they were a necessary and sufficient cause when there is an accident. The deviations may have been a necessary cause of that particular accident, but in a complex system they were almost certainly not sufficient. These very deviations may have previously ensured the overall system would work. Removing the deviation will not necessarily make the system safer.

There might be fewer opportunities to learn from things going wrong, but there’s a huge amount to learn from all the cases that go right, provided we look. We need to try and understand the patterns, the constraints and the factors that are likely to amplify desired emergent behaviour and those that will dampen the undesirable or dangerous. In order to create a better understanding of how complex socio-technical systems can work safely we have to look at how people are using them when everything works, not just when there are accidents.

Safety II – learning from what goes right

Complex systems and accidents might be beyond our comprehension but that doesn’t mean we should just accept that “shit happens”. That is too flippant and fatalistic, two words that you can never apply to the safety critical people.

Safety I is shorthand for the old safety world view, which focused on failure. Its utility has been hindered by the relative lack of feedback from things going wrong, and the danger that paying insufficient attention to how and why things normally go right will lead to the wrong lessons being learned from the failures that do occur.

Safety I assumed linear cause and effect with root causes (see part 5). It was therefore prone to reaching a dangerously simplistic verdict of human error.

This diagram illustrates the focus of Safety I on the unusual, on the bad outcomes. I have copied, and slightly adapted, the Safety I and Safety II diagrams from a document produced by Eurocontrol, (The European Organisation for the Safety of Air Navigation) “From Safety-I to Safety-II: A White Paper” (PDF, opens in new tab).

Incidentally, I don’t know why Safety I and Safety II are routinely illustrated using a normal distribution with the Safety I focus kicking in at two standard deviations. I haven’t been able to find a satisfactory explanation for that. I assume that this is simply for illustrative purposes.

If Safety I wants to prevent bad outcomes, in contrast Safety II looks at how good outcomes are reached. Safety II is rooted in a more realistic understand of complex systems than Safety I and extends the focus to what goes right in systems. That entails a detailed examination of what people are doing with the system in the real world to keep it running. Instead of people being regarded as a weakness and a source of danger, Safety II assumes that people, and the adaptations they introduce to systems and processes, are the very reasons we usually get good, safe outcomes.

If we’ve been involved in the development of the system we might think that we have a good understanding of how the system should be working, but users will always, and rightly, be introducing variations that designers and testers had never envisaged. The old, Safety I, way of thinking regarded these variations as mistakes, but they are needed to keep the systems safe and efficient. We expect systems to be both, which leads on to the next point.

There’s a principle in safety critical systems called ETTO, the Efficiency Thoroughness Trade Off. It was devised by Erik Hollnagel, though it might be more accurate to say he made it explicit and popularised the idea. The idea should be very familiar to people who have worked with complex systems. Hollnagel argues that it is impossible to maximise both efficiency and thoroughness. I’m usually reluctant to cite Wikipedia as a source, but its article on ETTO explains it more succinctly than Hollnagel himself did.

“There is a trade-off between efficiency or effectiveness on one hand, and thoroughness (such as safety assurance and human reliability) on the other. In accordance with this principle, demands for productivity tend to reduce thoroughness while demands for safety reduce efficiency.”

Making the system more efficient makes it less likely that it will achieve its important goals. Chasing these goals comes at the expense of efficiency. That has huge implications for safety critical systems. Safety requires some redundancy, duplication and fallbacks. These are inefficient. Efficiencies eliminate margins of error, with potentially dangerous results.

ETTO recognises the tension between organisations’ need to deliver a safe, reliable product or service, and the pressure to do so at the lowest cost possible. In practice, the conflict in goals is usually fully resolved only at the sharp end, where people do the real work and run the systems.

airline job ad As an example, an airline might offer a punctuality bonus to staff. For an airline safety obviously has the highest priority, but if it was an absolute priority, the only consideration, then it could not contemplate any incentive that would encourage crews to speed up turnarounds on the ground, or to persevere with a landing when prudence would dictate a “go around”. In truth, if safety were an absolute priority, with no element of risk being tolerated, would planes ever take off?

People are under pressure to make the systems efficient, but they are expected to keep the system safe, which inevitably introduces inefficiencies. This tension results in a constant, shifting, pattern of trade-offs and compromises. The danger, as “drift into failure” predicts (see part 4), is that this can lead to a gradual erosion of safety margins.

The old view of safety was to constrain people, reducing variability in the way they use systems. Variability was a human weakness. In Safety II variability in the way that people use the system is seen as a way to ensure the system adapts to stay effective. Humans aren’t seen as a potential weakness, but as a source of flexibility and resilience. Instead of saying “they didn’t follow the set process therefore that caused the accident”, the Safety II approach means asking “why would that have seemed like the right thing to do at the time? Was that normally a safe action?”. Investigations need to learn through asking questions, not making judgments – a lesson it was vital I learned as an inexperienced auditor.

Emergence means that the behaviour of a complex system can’t be predicted from the behaviour of its components. Testers therefore have to think very carefully about when we should apply simple pass or fail criteria. The safety critical community explicitly reject the idea of pass/fail, or the bimodal principle as they call it (see part 4). A flawed component can still be useful. A component working exactly as the designers, and even the users, intended can still contribute to disaster. It all depends on the context, what is happening elsewhere in the system, and testers need to explore the relationships between components and try to learn how people will respond.

Safety is an emergent property of the system. It’s not possible to design it into a system, to build it, or implement it. The system’s rules, controls, and constraints might prevent safety emerging, but they can only enable it. They can create the potential for people to keep the system safe but they cannot guarantee it. Safety depends on user responses and adaptations.

Adaptation means the system is constantly changing as the problem changes, as the environment changes, and as the operators respond to change with their own variations. People manage safety with countless small adjustments.

we don't make mistakes There is a popular internet meme, “we don’t make mistakes – we do variations”. It is particularly relevant to the safety critical community, who have picked up on it because it neatly encapsulates their thinking, e.g. this article by Steven Shorrock, “Safety-II and Just Culture: Where Now?”. Shorrock, in line with others in the safety critical community, argues that if the corporate culture is to be just and treat people fairly then it is important that the variations that users introduce are understood, rather than being used as evidence to punish them when there is an accident. Pinning the blame on people is not only an abdication of responsibility, it is unjust. As I’ve already argued (see part 5), it’s an ethical issue.

Operator adjustments are vital to keep systems working and safe, which brings us to the idea of trust. A well-designed system has to trust the users to adapt appropriately as the problem changes. The designers and testers can’t know the problems the users will face in the wild. They have to confront the fact that dangerous dragons are lurking in the unknown, and the system has to trust the users with the freedom to stay in the safe zone, clear of the dragons, and out of the disastrous tail of the bell curve that illustrates Safety II.

Safety II and Cynefin

If you’re familiar with Cynefin then you might wonder about Safety II moving away from a focus on the tail of the distribution. Cynefin helps us understand that the tail is where we can find opportunities as well as threats. It’s worth stressing that Safety II does encompass Safety I and the dangerous tail of the distribution. It must not be a binary choice of focusing on either the tail or the body. We have to try to understand not only what happens in the tail, how people and systems can inadvertently end up there, but also what operators do to keep out of the tail.

The Cynefin framework and Safety II share a similar perspective on complexity and the need to allow for, and encourage, variation. I have written about Cynefin elsewhere, e.g. in two articles I wrote for the Association for Software Testing, and there isn’t room to repeat that here. However, I do strongly recommend that testers familiarise themselves with the framework.

To sum it up very briefly, Cynefin helps us to make sense of problems by assigning them to one of four different categories, the obvious, the complicated (the obvious and complicated being related in that problems have predictable causes and resolutions), the complex and the chaotic. Depending on the category different approaches are required. In the case of software development the challenge is to learn more about the problem in order to turn it from a complex activity into a complicated one that we can manage more easily.

Applying Cynefin would result in more emphasis on what’s happening in the tails of the distribution, because that’s where we will find the threats to be avoided and the opportunities to be exploited. Nevertheless, Cynefin isn’t like the old Safety I just because they both focus on the tails. They embody totally different worldviews.

Safety II is an alternative way of looking at accidents, failure and safety. It is not THE definitive way, that renders all others dated, false and heretical. The Safety I approach still has its place, but it’s important to remember its limitations.

Everything flows and nothing abides

Thinking about linear cause and effect, and decomposing components are still vital in helping us understand how different parts of the system work, but they offer only a very limited and incomplete view of what we should be trying to learn. They provide a way of starting to build our understanding, but we mustn’t stop there.

We also have to venture out into the realms of the unknown and often unknowable, to try to understand more about what might happen when the components combine with each other and with humans in complex socio-technical systems. This is when objects become processes, when static elements become part of a flow that is apparent only when we zoom out to take in a bigger picture in time and space.

The idea of understanding objects by stepping back and looking at how they flow and mutate over time has a long, philosophical and scientific history. 2,500 years ago Heraclitus wrote.

“Everything flows and nothing abides. Everything gives way and nothing stays fixed.”

Professor Michael McIntyre (Professor of Atmospheric Dynamics, Cambridge University) put it well in a fascinating BBC documentary, “The secret life of waves”.

“If we want to understand things in depth we usually need to think of them both as objects and as dynamic processes and see how it all fits together. Understanding means being able to see something from more than one viewpoint.”

In my next post “part 7 – resilience requires people” I will discuss some of the implications for software testing of the issues I have raised here, in particular how people keep systems going, and dealing with the inevitability of failure. That will lead us to resilience engineering. everything flows

October 21, 2018 by James Christie

The dragons of the unknown; part 5 – accident investigations and treating people fairly

Introduction

This is the fifth post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This was the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).

The fourth post “part 4 – a brief history of accident models” looks at accident models, i.e. the way that safety experts mentally frame accidents when they try to work out what caused them. This post looks at weaknesses in the way that we have traditionally investigated accidents and failures, assuming neat linearity with clear cause and effect. In particular, our use of root cause analysis, and willingness to blame people for accidents is hard to justify.

The limitations of root cause analysis

Once you accept that complex systems can’t have clear and neat links between causes and effects then the idea of root cause analysis becomes impossible to sustain. “Fishbone” cause and effect diagrams (like those used in Six Sigma) illustrate traditional thinking, that it is possible to track back from an adverse event to find a root cause that was both necessary and sufficient to bring it about.

The assumption of linearity with tidy causes and effects is no more than wishful thinking. Like the Domino Model (see “part 4 – a brief history of accident models”) it encourages people to think there is a single cause, and to stop looking when they’ve found it. It doesn’t even offer the insight of the Swiss Cheese Model (also see part 4) that there can be multiple contributory causes, all of them necessary but none of them sufficient to produce an accident. That is a key idea. When complex systems go wrong there is rarely a single cause; causes are necessary, but not sufficient.

Here is a more realistic depiction of what a complex socio-technical system. It is a representation of the operations control system for an airline. The specifics don’t matter. It is simply a good illustration of how messy a real, complex system looks when we try to depict it.

This is actually very similar to the insurance finance applications diagram I drew up for Y2K (see “part 1 – corporate bureaucracies”). There was no neat linearity. My diagram looked just like this, with a similar number of nodes, or systems most of which had multiple two-way interfaces with others. And that was just at the level of applications. There was some intimidating complexity within these systems.

As there is no single cause of failure the search for a root cause can be counter-productive. There are always flaws, bugs, problems, deviances from process, variations. So you can always fix on something that has gone wrong. But it’s not really a meaningful single cause. It’s arbitrary.

The root cause is just where you decide to stop looking. The cause is not something you discover. It’s something you choose and construct. The search for a root cause can mean attention will focus on something that is not inherently dangerous, something that had previously “failed” repeatedly but without any accident. The response might prevent that particular failure and therefore ensure there’s no recurrence of an identical accident. However, introducing a change, even if it’s a fix, to one part of a complex system affects the system in unpredictable ways. The change therefore creates new possibilities for failure that are unknown, even unknowable.

It’s always been hard, even counter-intuitive, to accept that we can have accidents & disasters without any new failure of a component, or even without any technical failure that investigators can identify and without external factors interfering with the system and its operators. We can still have air crashes for which no cause is ever found. The pressure to find an answer, any plausible answer, means there has always been an overwhelming temptation to fix the blame on people, on human error.

Human error – it’s the result of a problem, not the cause

If there’s an accident you can always find someone who screwed up, or who didn’t follow the rules, the standard, or the official process. One problem with that is the same applies when everything goes well. Something that troubled me in audit was realising that every project had problems, every application had bugs when it went live, and there were always deviations from the standards. But the reason smart people were deviating wasn’t that they were irresponsible. They were doing what they had to do to deliver the project. Variation was a sign of success as much as failure. Beating people up didn’t tell us anything useful, and it was appallingly unfair.

One of the rewarding aspects of working as an IT auditor was conducting post-implementation reviews and being able to defend developers who were being blamed unfairly for problem projects. The business would give them impossible jobs, complacently assuming the developers would pick up all the flak for the inevitable problems. When auditors, like me, called them out for being cynical and irresponsible they hated it. They used to say it was because I had a developer background and was angling for my next job. I didn’t care because I was right. Working in a good audit department requires you to build up a thick skin, and some healthy arrogance.

There always was some deviation from standards, and the tougher the challenge the more obvious they would be, but these allegedly deviant developers were the only reason anything was delivered at all, albeit by cutting a few corners.

It’s an ethical issue. Saying the cause of an accident is that people screwed up is opting for an easy answer that doesn’t offer any useful insights for the future and just pushes problems down the line.

Sidney Dekker used a colourful analogy. Dumping the blame on an individual after an accident is “peeing in your pants management” (PDF, opens in new tab).

“You feel relieved, but only for a short while… you start to feel cold and clammy and nasty. And you start stinking. And, oh by the way, you look like a fool.”

Putting the blame on human error doesn’t just stink. It obscures the deeper reasons for failure. It is the result of a problem, not the cause. It also encourages organisations to push for greater automation, in the vain hope that will produce greater safety and predictability, and fewer accidents.

The ironies of automation

An important part of the motivation to automate systems is that humans are seen as unreliable and inefficient. So they are replaced by automation, but that leaves the humans with jobs that are even more complex and even more vulnerable to errors. The attempt to remove errors creates fresh possibilities for even worse errors. As Lisanne Bainbridge wrote in a 1983 paper “The ironies of automation” (PDF, opens in new tab);

“The more advanced a control system is… the more crucial may be the contribution of the human operator.”

There are all sorts of twists to this. Automation can mean the technology does all the work and operators have to watch a machine that’s in a steady-state, with nothing to respond to. That means they can lose attention & not intervene when they need to. If intervention is required the danger is that vital alerts will be lost if the system is throwing too much information at operators. There is a difficult balance to be struck between denying operators feedback, and thus lulling them into a sense that everything is fine, and swamping them with information. Further, if the technology is doing deeply complicated processing, are the operators really equipped to intervene? Will the system allow operators to override? Bainbridge makes the further point;

“The designer who tries to eliminate the operator still leaves the operator to do the tasks which the designer cannot think how to automate.”

This is a vital point. Systems are becoming more complex and the tasks left to the humans become ever more demanding. System designers have only a very limited understanding of what people will do with their systems. They don’t know. The only certainty is that people will respond and do things that are hard, or impossible, to predict. That is bound to deviate from formal processes, which have been defined in advance, but these deviations, or variations, will be necessary to make the systems work.

Acting on the assumption that these deviations are necessarily errors and “the cause” when a complex socio-technical system fails is ethically wrong. However, there is a further twist to the problem, summed up by the Law of Stretched Systems.

Stretched systems

Lawrence Hirschhorn’s Law of Stretched Systems is similar to the Fundamental Law of Traffic Congestion. New roads create more demand to use them, so new roads generate more traffic. Likewise, improvements to systems result in demands that the system, and the people, must do more. Hirschhorn seems to have come up with the law informally, but it has been popularised by the safety critical community, especially by David Woods and Richard Cook.

“Every system operates always at its capacity. As soon as there is some improvement, some new technology, we stretch it.”

And the corollary, furnished by Woods and Cook.

“Under resource pressure, the benefits of change are taken in increased productivity, pushing the system back to the edge of the performance envelope.”

Every change and improvement merely adds to the stress that operators are coping with. The obvious response is to place more emphasis on ergonomics and human factors, to try and ensure that the systems are tailored to the users’ needs and as easy to use as possible. That might be important, but it hardly resolved the problem. These improvements are themselves subject to the Law of Stretched Systems.

This was all first noticed in the 1990s after the First Gulf War. The US Army hadn’t been in serious combat for 18 years. Technology had advanced massively. Throughout the 1980s the army reorganised, putting more emphasis on technology and training. The intention was that the technology should ease the strain on users, reduce fatigue and be as simple to operate as possible. It didn’t pan out that way when the new army went to war. Anthony H. Cordesman and Abraham Wagner analysed in depth the lessons of the conflict. They were particularly interested in how the technology had been used.

“Virtually every advance in ergonomics was exploited to ask military personnel to do more, do it faster, and do it in more complex ways… New tactics and technology simply result in altering the pattern of human stress to achieve a new intensity and tempo of combat.”

Improvements in technology create greater demands on the technology – and the people who operate it. Competitive pressures push companies towards the limits of the system. If you introduce an enhancement to ease the strain on users then managers, or senior officers, will insist on exploiting the change. Complex socio-technical systems always operate at the limits.

This applies not only to soldiers operating high tech equipment. It applies also to the ordinary infantry soldier. In 1860 the British army was worried that troops had to carry 27kg into combat (PDF, opens in new tab). The load has now risen to 58kg. US soldiers have to carry almost 9kg of batteries alone. The Taliban called NATO troops “donkeys”.

These issues don’t apply only to the military. They’ve prompted a huge amount of new thinking in safety critical industries, in particular healthcare and air transport.

The overdose – system behaviour is not explained by the behaviour of its component technology

Remember the traditional argument that any system that was not determimistic was inherently buggy and badly designed? See “part 2 – crucial features of complex systems”.

In reality that applies only to individual components, and even then complexity & thus bugginess can be inescapable. When you’re looking at the whole socio-technical system it just doesn’t stand up.

Introducing new controls, alerts and warnings doesn’t just increase the complexity of the technology as I mentioned earlier with the MIG jet designers (see part 4). These new features add to the burden on the people. Alerts and error message can swamp users of complex systems and they miss the information they really need to know.

I can’t recommend strongly enough the story told by Bob Wachter in “The overdose: harm in a wired hospital”.

A patient at a hospital in California received an overdose of 38½ times the correct amount. Investigation showed that the technology worked fine. All the individual systems and components performed as designed. They flagged up potential errors before they happened. So someone obviously screwed up. That would have been the traditional verdict. However, the hospital allowed Wachter to interview everyone involved in each of the steps. He observed how the systems were used in real conditions, not in a demonstration or test environment. Over five articles he told a compelling story that will force any fair reader to admit “yes, I’d have probably made the same error in those circumstances”.

Happily the patient survived the overdose. The hospital staff involved were not disciplined and were allowed to return to work. The hospital had to think long and hard about how it would try to prevent such mistakes recurring. The uncomfortable truth they had to confront was that there were no simple answers. Blaming human error was a cop out. Adding more alerts would compound the problems staff were already facing; one of the causes of the mistake was the volume of alerts swamping staff making it hard, or impossible, to sift out the vital warnings from the important and the merely useful.

One of the hard lessons was that focussing on making individual components more reliable had harmed the overall system. The story is an important illustration of the maxim in the safety critical community that trying to make systems safer can make them less safe.

Some system changes were required and made, but the hospital realised that the deeper problem was organisational and cultural. They made the brave decision to allow Wachter to publicise his investigation and his series of articles is well worth reading.

The response of the safety critical community to such problems and the necessary trade offs that a practical response requires, is intriguing with important lessons for software testers. I shall turn to this in my next post, “part 6 – Safety II, a new way of looking at safety”.

October 17, 2018 by James Christie

The dragons of the unknown; part 4 – a brief history of accident models

Introduction

This is the fourth post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This was the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).

The first post was a reflection, based on personal experience, on the corporate preference for building bureaucracy rather than dealing with complex reality, “The dragons of the unknown; part 1 – corporate bureaucracies”. The second post was about the nature of complex systems, “part 2 – crucial features of complex systems”. The third followed on from part 2, and talked about the impossibility of knowing exactly how complex socio-technical systems will behave with the result that it is impossible to specify them precisely, “part 3 – I don’t know what’s going on”.

This post looks at accident models, i.e. the way that safety experts mentally frame accidents when they try to work out what caused them.

Why do accidents happen?

Taybridge_from_law_02SEP05 I want to take you back to the part of the world I come from, the east of Scotland. The Tay Bridge is 3.5 km long, the longest railway bridge in the United Kingdom. It’s the second railway bridge over the Tay. The first was opened in 1878 and came down in a storm in 1879, taking a train with it and killing everyone on board.

The stumps of the old bridge were left in place because of concern that removing them would disturb the riverbed. I always felt they were there as a lesson for later generations. Children in that part of Scotland can’t miss these stumps. I remember being bemused when I learned about the disaster. “Mummy, Daddy, what are those things beside the bridge? What…? Why…? How…?” So bridges could fall down. Adults could screw up. Things could go badly wrong. There might not be a happy ending. It was an important lesson in how the world worked for a small child.

Accident investigations are difficult and complex even for something like a bridge, which might appear, at first sight, to be straightforward in concept and function. The various factors that featured in the inquiry report for the Tay Bridge disaster included the bridge design, manufacture of components, construction, maintenance, previous damage, wind speed, train speed and the state of the riverbed.

These factors obviously had an impact on each other. That’s confusing enough, and it’s far worse for complex socio-technical systems. You could argue that a bridge is either safe or unsafe, usable or dangerous. It’s one or the other. There might be argument about where you would draw the line, and about the context in which the bridge might be safe, but most people would be comfortable with the idea of a line. Safety experts call that idea of a line separating the unbroken from the broken as the bimodal principle (not to be confused with Gartner’s bimodal IT management); a system is either working or it is broken. Tay Bridge 2.0 'pass'

Thinking in bimodal terms becomes pointless when you are working with systems that run in a constantly flawed state, one of partial failure, when no-one knows how these systems work or even exactly how they should work. This is all increasingly recognised. But when things go wrong and we try to make sense of them there is a huge temptation to arrange our thinking in a way with which we are comfortable, we fall back on mental models that seem to make sense of complexity, however far removed they are from reality. These are the envisioned worlds I mentioned in part 1.

We home in on factors that are politically convenient, the ones that will be most acceptable to the organisation’s culture. We can see this, and also how thinking has developed, by looking at the history of the conceptual models that have been used for accident investigations.

Heinrich’s Domino Model (1931)

Domino Model (fig 3) The Domino Model was a traditional and very influential way to help safety experts make sense of accidents. Accidents happened because one factor kicked into another, and so on down the line of dominos, as illustrated by Heinrich’s figure 3. Problems with the organisation or environment would lead to people making mistakes and doing something dangerous, which would lead to accidents and injury. It assumed neat linearity & causation. Its attraction was that it appealed to management. Take a look at the next two diagrams in the sequence, figures 4 and 5.

Domino Model (figs 4-5) The model explicitly states that taking out the unsafe act will stop an accident. It encouraged investigators to look for a single cause. That was immensely attractive because it kept attention away from any mistakes the management might have made in screwing up the organisation. The chain of collapsing dominos is broken by removing the unsafe act and you don’t get an accident.

The model is consistent with beating up the workers who do the real work. But blaming the workers was only part of the problem with the Domino Model. It was nonsense to think that you could stop accidents by removing unsafe acts, variations from process, or mistakes from the chain. It didn’t have any empirical, theoretical or scientific basis. It was completely inappropriate for complex systems. Thinking in terms of a chain of events was quite simply wrong when analysing these systems. Linearity, neat causation and decomposing problems into constituent parts for separate analysis don’t work.

Despite these fundamental flaws the Domino Model was popular. Or rather, it was popular because of its flaws. It told managers what they wanted to hear. It helped organisations make sense of something they would otherwise have been unable to understand. Accepting that they were dealing with incomprehensible systems was too much to ask.

Swiss Cheese Model (barriers)

James Reason’s Swiss Cheese Model was the next step and it was an advance, but limited. The model did recognise that problems or mistakes wouldn’t necessarily lead to an accident. That would happen only if a series of them lined up. You can therefore stop the accident recurring by inserting a new barrier. However, the model is still based on the idea of linearity and of a sequence of cause and effect, and also the idea that you can and should decompose problems for analysis. This is a dangerously limited way of looking at what goes wrong in complex socio-technical systems, and the danger is very real with safety critical systems.

Of course, there is nothing inherently wrong with analysing systems and problems by decomposing them or relying on an assumption of cause and effect. These both have impeccably respectable histories in science and philosophy. Reducing problems to their component parts has its intellectual roots in the work of Rene Descartes. This approach implies that you can understand the behaviour of a system by looking at the behaviour of its components. Descartes’ approach (the Cartesian) fits neatly with a Newtonian scientific worldview, which holds that it is possible to identify definite causes and effects for everything that happens.

If you want to understand how a machine works and why it is broken, then these approaches are obviously valid. They don’t work when you are dealing with a complex socio-technical system. The whole is quite different from the sum of its parts. Thinking of linear flows is misleading when the different elements of a system are constantly influencing each other and adapting to feedback. Complex systems have unpredictable, emergent properties, and safety is an emergent outcome of complex socio-technical systems.

All designs are a compromise

Something that safety experts are keenly aware of is that all designs are compromises. Adding a new barrier, as envisaged by the Swiss Cheese Model, to try and close off the possibility of an accident can be counter-productive. Introducing a change, even if it’s a fix, to one part of a complex system affects the whole system in unpredictable and possibly harmful ways. The change creates new possibilities for failure, that are unknown, even unknowable.

It’s not a question of regression testing. It’s bigger and deeper than that. The danger is that we create new pathways to failure. The changes might initially seem to work, to be safe, but they can have damaging results further down the line as the system adapts and evolves, as people push the system to the edges.

There’s a second danger. New alerts or controls increase the complexity with which the user has to cope. That was a problem I now recognise with our approach as auditors. We didn’t think through the implications carefully enough. If you keep throwing in fixes, controls and alerts then the user will miss the ones they really need to act on. That reduces the effectiveness, the quality and ultimately the safety of the system. I’ll come back to that later. This is an important paradox. Trying to make a system more reliable and safer can make it more dangerous and less reliable. MiG-29

The designers of the Soviet Union’s MiG-29 jet fighter observed, “the safest part is the one we could leave off”, (according to Sidney Dekker in his book “The field guide to understanding human error”).

Drift into failure

A friend once commented that she could always spot one of my designs. They reflected a very pessimistic view of the world. I couldn’t know how things would go wrong, I just knew they would and my experience had taught me where to be wary. Working in IT audit made me very cautious. Not only had I completely bought into Murphy’s Law, “anything that can go wrong will go wrong” I had my own variant; “and it will go wrong in ways I couldn’t have imagined”.

William Langewiesche is a writer and former commercial pilot who has written extensively on aviation. He provided an interesting and insightful correction to Murphy, and also to me (from his book “Inside the Sky”).

“What can go wrong usually goes right”.

There are two aspects to this. Firstly, as I have already discussed, complex socio-technical systems are always flawed. They run with problems, bugs, variations from official process, and in the way people behave. Despite all the problems under the surface everything seems to go fine, till one day it all goes wrong.

The second important insight is that you can have an accident even if no individual part of the system has gone wrong. Components may have always worked fine, and continue to do so, but on the day of disaster they combine in unpredictable ways to produce an accident.

Accidents can happen when all the components have been working as designed, not just when they fail. That’s a difficult lesson to learn. I’d go so far as to say we (in software development and engineering and even testing) didn’t want to learn it. However, that’s the reality, however scary it is.

Sidney Dekker developed this idea in a fascinating and important book, “Drift Into Failure”. His argument is that we are developing massively complex systems that we are incapable of understanding. It is therefore misguided to think in terms of system failure arising from a mistake by an operator or the sudden failure of part of the system.

“Drifting into failure is a gradual, incremental decline into disaster driven by environmental pressure, unruly technology and social processes that normalise growing risk. No organisation is exempt from drifting into failure. The reason is that routes to failure trace through the structures, processes and tasks that are necessary to make an organization successful. Failure does not come from the occasional, abnormal dysfunction or breakdown of these structures, processes or tasks, but is an inevitable by-product of their normal functioning. The same characteristics that guarantee the fulfillment of the organization’s mandate will turn out to be responsible for undermining that mandate…

In the drift into failure, accidents can happen without anything breaking, without anybody erring, without anyone violating rules they consider relevant.”

The idea of systems drifting into failure is a practical illustration of emergence in complex systems. The overall system adapts and changes over time, behaving in ways that could not have been predicted from analysis of the components. The fact that a system has operated safely and successfully in the past does not mean it will continue to do so. Dekker says;

“Empirical success… is no proof of safety. Past success does not guarantee future safety.”

Dekker’s argument about drifting to failure should strike a chord with anyone who has worked in large organisations. Complex systems are kept running by people who have to cope with unpredictable technology, in an environment that increasingly tolerates risk so long as disaster is averted. There is constant pressure to cut costs, to do more and do it faster. Margins are gradually shaved in tiny incremental changes, each of which seems harmless and easy to justify. The prevailing culture assumes that everything is safe, until suddenly it isn’t.

Langewiesche followed up his observation with a highly significant second point;

“What can go wrong usually goes right – and people just naturally draw the wrong conclusions.”

When it all does go wrong the danger is that we look for quick and simple answers. We focus on the components that we notice are flawed, without noticing all the times everything went right even with those flaws. We don’t think about the way people have been keeping systems running despite the problems, or how the culture has been taking them closer and closer to the edge. We then draw the wrong conclusions. Complex systems and the people operating them are constantly being pushed to the limits, which is an important idea that I will return to.

It is vital that we understand this idea of drift, and how people are constantly having to work with complex systems under pressure. Once we start to accept these ideas it starts to become clear that if we want to avoid drawing the wrong conclusions we have to be sceptical about traditional approaches to accident investigation. I’m talking specifically about root cause analysis, and the notion that “human error” is a meaningful explanation for accidents and problems. I will talk about these in my next post, “part 5 – accident investigations and treating people fairly”.