In defence of lobbying

Lobbying of politicians has acquired a dirty name, and that’s a pity. In fact, I find it rather irritating. Lobbying means campaigning, providing expert analysis and briefing to politicians, who are usually generalists and need that guidance from experts. Good politicians even come looking for it, to gain a better understanding of complex problems.

That’s when lobbying is being done well and responsibly. Some of my recent work counts as lobbying, related to the law of evidence in England and the Post Office Horizon scandal. I’m proud of that. It is a worthwhile activity and can help the political process work better.

Lobbying is quite different from schmoozing old pals to give other chums contracts without the tedious bureaucracy of open tenders. It’s different from offering politicians highly paid sinecures in the hope they’ll work on your behalf rather than their constituents’. It is certainly different from handing out bribes. Lobbying is not the same as corruption, and corruption is what we’ve been seeing lately in the UK.

Bugs are about more than code

Introduction

Recently I have had to think carefully about the nature of software systems, especially complex ones, and the bugs they contain. In doing so my thinking has been guided by certain beliefs I hold about complex software systems. These beliefs, or principles, are based on my practical experience but also on my studies, which, as well as teaching me much that I didn’t know, have helped me to make sense of what I have done and seen at work. Here are three vital principles I hold to be true.

Principle 1

Complex systems are not like calculators, which are predictable, deterministic instruments, i.e. they will always give the same answer from the same inputs. Complex systems are not predictable in that sense. We can say what they will probably do, but we cannot be certain. It is particularly important to remember this when working with complex socio-technical systems, i.e. complex systems in the wider sense that include humans, systems that are operated by people or require people to make them work. That covers most, or all, complex software systems.
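
To make the contrast concrete, here is a minimal Python sketch, entirely hypothetical and not drawn from any system mentioned in this post. The calculator-like function always gives the same answer; the toy ledger, with no locking around a shared balance, gives answers that depend on how the operating system happens to interleave its threads.

```python
import sys
import threading

def calculator_add(a, b):
    """Calculator-like: the same inputs always give the same answer."""
    return a + b

class TinyLedger:
    """A deliberately naive shared ledger with no locking."""
    def __init__(self):
        self.balance = 0

    def credit(self, amount):
        # Unprotected read-modify-write: the thread interleaving, not the
        # code of any single thread, decides the final balance.
        current = self.balance
        current = current + amount
        self.balance = current

def run_once(threads=4, updates=20_000):
    ledger = TinyLedger()
    workers = [
        threading.Thread(target=lambda: [ledger.credit(1) for _ in range(updates)])
        for _ in range(threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return ledger.balance  # "should" be threads * updates

if __name__ == "__main__":
    sys.setswitchinterval(1e-6)  # encourage frequent thread switches
    print(calculator_add(2, 2))            # always 4
    print({run_once() for _ in range(5)})  # usually several different totals
```

Run it a few times; the set of ledger totals will usually vary from run to run, and that variability, not any particular number, is the point.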

Principle 2

Complex systems are more than the sum of their parts, or at least they are different. A system can be faulty even if all the individual programs, or components, are working correctly. The individual elements can combine with each other, and with the users, in unexpected and possibly damaging ways that could not have been predicted from inspecting the components separately.

Conversely, a system can be working satisfactorily even if some of the components are flawed. This inevitably means that the software code itself, however important it is, cannot be the only factor that determines the quality of the system.
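
As a hedged illustration of Principle 2, here are two invented Python components that each satisfy their own specification, yet produce a faulty system when combined, because their assumptions about units were never reconciled. The names and figures are made up for the example.

```python
# Component A: a parcel weigher that reports weight in kilograms,
# exactly as its own specification requires.
def weigh_parcel_kg(items):
    return sum(item["weight_kg"] for item in items)

# Component B: a postage calculator whose specification says the weight
# argument is in grams. It is also correct on its own terms.
def postage_pence(weight_g):
    return 95 if weight_g <= 100 else 145

# The "system": both parts behave exactly as specified, but the integration
# silently passes kilograms where grams are expected.
def price_parcel(items):
    weight = weigh_parcel_kg(items)   # kilograms...
    return postage_pence(weight)      # ...treated as grams

if __name__ == "__main__":
    parcel = [{"weight_kg": 0.40}, {"weight_kg": 0.35}]
    # 750 g should cost 145 pence; the combined system charges 95 pence.
    print(price_parcel(parcel))
```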

Principle 3

Individual programs in a system can produce harmful outcomes even if their code was written perfectly. The outcome depends on how the different components, factors and people work together over the life of the system. Perfectly written code can cause a failure long after it has been released when there are changes to the technical, legal, or commercial environment in which the system runs.
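
A small, hypothetical Python example of Principle 3: the code below was exactly right when it was written (the UK standard VAT rate really was 17.5% before January 2011), yet without a single change to the code it produces wrong figures once the world changes around it.

```python
from datetime import date

# Correct when written: the UK standard VAT rate really was 17.5% at the time.
VAT_RATE = 0.175

def price_with_vat(net_price):
    """Adds VAT to a net price, exactly as its original specification required."""
    return round(net_price * (1 + VAT_RATE), 2)

if __name__ == "__main__":
    # From 4 January 2011 the standard rate rose to 20%. The function still
    # does precisely what its author intended, yet every figure it produces
    # after that date is wrong.
    print(date(2012, 6, 1), price_with_vat(100.00))  # 117.5, but should be 120.0
```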

The consequences

Bugs in complex systems are therefore inevitable. The absence of bugs in the past does not mean they are absent now, and certainly not that the system will be bug free in the future. The challenge is partly to find bugs, learn from them, and help users to learn how they can use the system safely. But testers should also try to anticipate future bugs, how they might arise, where the system is vulnerable, and learn how and whether users and operators will be able to detect problems and respond. They must then have the communication skills to explain what they have found to the people who need to know.

What we must not do is start off from the assumption that particular elements of the system are reliable and that any problems must have their roots elsewhere in the system. That mindset tends to push blame towards the unfortunate people who operate a flawed system.

Bugs and the Post Office Horizon scandal

Over the last few months I have spent a lot of time on issues raised by the Post Office Horizon scandal. For a detailed account of the scandal I strongly recommend the supplement that Private Eye has produced, written by Richard Brooks and Nick Wallis, “Justice lost in the post”.

Researching this affair, I have realised, time and again, how the Post Office and Fujitsu, the outsourced IT services supplier, ignored the three principles I outlined. While trawling through the judgment of Mr Justice Fraser in Bates v Post Office Ltd (No 6: Horizon Issues, i.e. the second of the two court cases brought by the Justice For Subpostmasters Alliance), which should be compulsory reading for Computer Science students, I was struck by the judge’s discussion of the nature of bugs in computer systems. You can find the full 313 page judgment here [PDF, opens in new tab].

The definition of a bug was at the heart of the second court case. The Post Office and Fujitsu argued that a bug is a coding error and that the word should not apply to other problems. The counsel for the claimants, i.e. the subpostmasters and subpostmistresses who had been victims of the flawed system, took a broader view; a bug is anything that means the software does not operate as users, or the corporation, expect.

After listening to both sides Fraser came down emphatically on the claimants’ side.

“26 The phrase ‘bugs, errors or defects’ is sufficiently wide to capture the many different faults or characteristics by which a computer system might not work correctly… Computer professionals will often refer simply to ‘code’, and a software bug can refer to errors within a system’s source code, but ‘software bugs’ has become more of a general term and is not restricted, in my judgment, to meaning an error or defect specifically within source code, or even code in an operating system.

Source code is not the only type of software used in a system, particularly in a complex system such as Horizon which uses numerous applications or programmes too. Reference data is part of the software of most modern systems, and this can be changed without the underlying code necessarily being changed. Indeed, that is one of the attractions of reference data. Software bug means something within a system that causes it to cause an incorrect or unexpected result. During Mr de Garr Robinson’s cross-examination of Mr Roll, he concentrated on ‘code’ very specifically and carefully [de Garr Robinson was the lawyer representing the Post Office and Roll was a witness for the claimants who gave evidence about problems with Horizon that he had seen when he worked for Fujitsu]. There is more to the criticisms levelled at Horizon by the claimants than complaints merely about bugs within the Horizon source code.

27 Bugs, errors or defects is not a phrase restricted solely to something contained in the source code, or any code. It includes, for example, data errors, data packet errors, data corruption, duplication of entries, errors in reference data and/or the operation of the system, as well as a very wide type of different problems or defects within the system. ‘Bugs, errors or defects’ is wide enough wording to include a wide range of alleged problems with the system.”
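
The judge’s point about reference data is worth dwelling on for developers and testers: behaviour can change, and go wrong, without a line of source code changing. Here is a minimal, hypothetical Python sketch, with invented product codes and rates and no connection to Horizon’s actual design:

```python
import json

# Source code: unchanged between the two runs below, and arguably "correct".
def transaction_fee(product_code, amount, fee_table):
    rate = fee_table[product_code]      # the rate comes from reference data
    return round(amount * rate, 2)

if __name__ == "__main__":
    # Reference data, release 1: loaded from configuration, not compiled in.
    fees_v1 = json.loads('{"STAMPS": 0.00, "GIRO": 0.01}')
    print(transaction_fee("GIRO", 250.00, fees_v1))   # 2.50

    # A later reference data release mistypes a rate. Not a line of code has
    # changed, yet the system now produces wrong figures.
    fees_v2 = json.loads('{"STAMPS": 0.00, "GIRO": 0.1}')
    print(transaction_fee("GIRO", 250.00, fees_v2))   # 25.00
```

By the narrow “coding error” definition nothing has gone wrong in the second run; by any definition that matters to a user, or a court, it has.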

The determination of the Post Office and Fujitsu to limit the definition of bugs to source code was part of a policy of blaming users for all errors that were not obviously caused by the source code. This is clear from repeated comments from witnesses and from Mr Justice Fraser in the judgment. “User error” was the default explanation for all problems.

This stance was taken to an extreme with “phantom transactions”. These were transactions generated by the system but which were recorded as if they had been made by a user (see in particular paragraphs 209 to 214 of Fraser’s judgment).

In paragraph 212 Fraser refers to a Fujitsu problem report.

“However, the conclusion reached by Fujitsu and recorded in the PEAK was as follows:

‘Phantom transactions have not been proven in circumstances which preclude user error. In all cases where these have occurred a user error related cause can be attributed to the phenomenon.'”

This is striking. These phantom transactions had been observed by Royal Mail engineers. They were known to exist. But they were dismissed as a cause of problems unless it could be proven that user error was not responsible. If Fujitsu could imagine a scenario where user error might have been responsible for a problem they would rule out the possibility that a phantom transaction could have been the cause, even if the phantom had occurred. The PEAK (error report) would simply be closed off, whether or not the subpostmaster agreed.

This culture of blaming users rather than software was illustrated by a case of the system “working as designed” when its behaviour clearly confused and misled users. In fact the system was acting contrary to user commands. In certain circumstances if a user entered the details for a transaction, but did not commit it, the system would automatically complete the transaction with no further user intervention, which might result in a loss to the subpostmaster.
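
To make that behaviour concrete, here is a deliberately crude, hypothetical Python sketch of a timeout-driven auto-commit. It is not Horizon’s design and the timings are invented; it only shows how a system can be “working as designed” while committing a sale the user never chose to complete.

```python
import time

class Basket:
    """A hypothetical point-of-sale basket with a timeout-driven auto-commit."""
    TIMEOUT_SECONDS = 2  # absurdly short, purely for the demonstration

    def __init__(self):
        self.items = []
        self.committed = False
        self.last_activity = time.monotonic()

    def add_item(self, description, pence):
        self.items.append((description, pence))
        self.last_activity = time.monotonic()

    def poll(self):
        # Called periodically by the system. If the user walks away without
        # pressing "commit", the sale is completed anyway -- "as designed".
        if self.items and not self.committed:
            if time.monotonic() - self.last_activity > self.TIMEOUT_SECONDS:
                self.committed = True
                print("Sale committed automatically:", self.items)

if __name__ == "__main__":
    basket = Basket()
    basket.add_item("postage", 145)
    time.sleep(2.5)   # the user has stepped away and confirmed nothing
    basket.poll()     # the system completes the transaction on its own
```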

The Post Office, in a witness statement, described this as a “design quirk”. However, the Post Office’s barrister, Mr de Garr Robinson, in his cross-examination of Jason Coyne, an IT consultant hired by the subpostmasters, was able to convince Nick Wallis (one of the authors of “Justice lost in the post”) that there wasn’t a bug.

“Mr de Garr Robinson directs Mr Coyne to Angela van den Bogerd’s witness statement which notes this is a design quirk of Horizon. If a bunch of products sit in a basket for long enough on the screen Horizon will turn them into a sale automatically.

‘So this isn’t evidence of Horizon going wrong, is it?’ asks Mr de Garr Robinson. ‘It is an example of Horizon doing what it was supposed to do.’

‘It is evidence of the system doing something without the user choosing to do it.’ retorts Mr Coyne.

But that is not the point. It is not a bug in the system.”

Not a bug? I would contest that very strongly. If I were auditing a system with this “quirk” I would want to establish the reasons for the system’s behaviour. Was this feature deliberately designed into the system? Or was it an accidental by-product of the system design? Whatever the answer, it would be simply the start of a more detailed scrutiny of technical explanations, understanding of the nature of bugs, the reasons for a two-stage committal of data, and the reasons why those two stages were not always applied. I would not consider “working as designed” to be an acceptable answer.

The Post Office’s failure to grasp the nature of complex systems

A further revealing illustration of the Post Office’s attitude towards user error came in a witness statement provided for the Common Issues trial, the first of the two court cases brought by the Justice For Subpostmasters Alliance. This first trial was about the contractual relationship between the Post Office and subpostmasters. The statement came from Angela van den Bogerd. At the time she was People Services Director for the Post Office, but over the previous couple of decades she had been in senior technical positions, responsible for Horizon and its deployment. She described herself in court as “not an IT expert”. That is an interesting statement to consider alongside some of the comments in her witness statement.

“[78]… the Subpostmaster has complete control over the branch accounts and transactions only enter the branch accounts with the Subpostmaster’s (or his assistant’s) knowledge.

[92] I describe Horizon to new users as a big calculator. I think this captures the essence of the system in that it records the transactions inputted into it, and then adds or subtracts from the branch cash or stock holdings depending on whether it was a credit or debit transaction.”

“Complete control”? That confirms her admission that she is not an IT expert. I would never have been bold, or reckless, enough to claim that I was in complete control of any complex IT system for which I was responsible. The better I understood the system the less inclined I would be to make such a claim. Likening Horizon to a calculator is particularly revealing. See Principle 1 above. When I have tried to explain the nature of complex systems I have also used the calculator analogy, but as an illustration of what a complex system is not.

If a senior manager responsible for Horizon could use such a fundamentally mistaken analogy, and be prepared to insert it in a witness statement for a court case, it reveals how poorly equipped the Post Office management was to deal with the issues raised by Horizon. When we are confronted by complexity it is a natural reaction to try and construct a mental model that simplifies the problems and makes them understandable. This can be helpful. Indeed it is often essential if we are to make any sense of complexity. I have written about this here in my blog series “Dragons of the unknown”.

However, even the best models become dangerous if we lose sight of their limitations and start to think that they are exact representations of reality. They are no longer fallible aids to understanding, but become deeply deceptive.

If you think a computer system is like a calculator then you will approach problems with the wrong attitude. Calculators are completely reliable. Errors are invariably the result of users’ mistakes, “finger trouble”. That is exactly how senior Post Office managers, like Angela van den Bogerd, regarded the Horizon problems.

BugsZero

The Horizon scandal has implications for the argument that software developers can produce systems that have no bugs, that zero bugs is an attainable target. Arlo Belshee is a prominent exponent of this idea, of BugsZero as it is called. Here is a short introduction.

Before discussing anyone’s approach to bugs it is essential that we are clear what they mean by a bug. Belshee has a useful definition, which he provided in this talk in Singapore in 2016. (The conference website has a useful introduction to the talk.)

3:50 “The definition (of a bug) I use is anything that would frustrate, confuse or annoy a human and is potentially visible to a human other than the person who is currently actively writing (code).”

This definition is close to Justice Fraser’s (see above); “a bug is anything that means the software does not operate as users, or the corporation, expect”. However, I think that both definitions are limited.

BugsZero is a big topic, and I don’t have the time or expertise to do it justice, but for the purposes of this blog I’m happy to concede that it is possible for good coders to deliver exactly what they intend to, so that the code itself, within a particular program, will not act in ways that will “frustrate, confuse or annoy a human”, or at least a human who can detect the problem. That is the limitation of the definition. Not all faults with complex software will be detected. Some are not even detectable. Our inability to see them does not mean they are absent. Bugs can produce incorrect but plausible answers to calculations, or they can corrupt data, without users being able to see that a problem exists.

I speak from experience here. It might even be impossible for technical system experts to identify errors with confidence. It is not always possible to know whether a complex system is accurate. The insurance finance systems I used to work on were notoriously difficult to understand and manipulate. 100% accuracy was never a serious, practicable goal. As I wrote in “Fix on failure – a failure to understand failure”;

“With complex financial applications an honest and constructive answer to the question ‘is the application correct?’ would be some variant on ‘what do you mean by correct?’, or ‘I don’t know. It depends’. It might be possible to say the application is definitely not correct if it is producing obvious garbage. But the real difficulty is distinguishing between the seriously inaccurate, but plausible, and the acceptably inaccurate that is good enough to be useful. Discussion of accuracy requires understanding of critical assumptions, acceptable margins of error, confidence levels, the nature and availability of oracles, and the business context of the application.”
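
A tiny, invented example of the “seriously inaccurate, but plausible” problem: accumulating currency amounts in binary floating point. The figures are made up, but the effect is real, and a user looking only at the output has no oracle to tell the slightly wrong total from the right one.

```python
from decimal import Decimal

# A thousand identical small amounts, as they might appear in a data feed.
amounts = ["0.10"] * 1000

float_total = sum(float(a) for a in amounts)     # binary floating point
exact_total = sum(Decimal(a) for a in amounts)   # exact decimal arithmetic

# The float total is plausible -- it looks like roughly the right money --
# but it is not exactly 100.00, and nothing on screen says so.
print(float_total)
print(exact_total)
```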

It is therefore misleading to define bugs as being potentially visible to users. Nevertheless, Belshee’s definition is useful provided that qualification is accepted. However, in the same talk, Belshee goes on to make further statements I do not accept.

19:55 “A bug is an encoded developer mistake.”

28:50 “A bug is a mistake by a developer.”

This is a developer-centric view of systems. It is understandable if developers focus on the bugs for which they are responsible. However, if you look at the systems, and bugs, from the other end, from the perspective of users when a bug has resulted in frustration, confusion or annoyance, the responsibility for the problem is irrelevant. The disappointed human is uninterested in whether the problem is with the coding, the design, the interaction of programs or components, or whatever. All that matters is that the system is buggy.

There is a further complication. The coder may well have delivered code that was perfect when it was written and released. But perfect code can create later problems if the world in which it operates changes. See Principle 3 above. This aspect of software is not sufficiently appreciated; it has caused me a great deal of trouble in my career (see the section “Across time, not just at a point in time” in this blog, about working with Big Data).

Belshee does say that developers should take responsibility for bugs that manifest themselves elsewhere, even if their code was written correctly. He also makes it clear, when talking about fault tolerant systems (17:22 in the talk above), that faults can arise “when the behaviour of the world is not as we expect”.

However he also says that the system “needs to work exactly as the developer thought if it’s going to recover”. That’s too high a bar for complex socio-technical systems. The most anyone can say, and it’s an ambitious target, is that the programs have been developed exactly as the developers intended. Belshee is correct at the program level; if the programs were not built as the developers intended then recovery will be very much harder. But at the system level we need to be clear and outspoken about the limits of what we can know, and about the inevitability that bugs are present.

If we start to raise hopes that systems might be perfect and bug-free because we believe that we can deliver perfectly written code, then we are setting ourselves up for unpleasant recriminations when our users suffer from bugs. It is certainly laudable to eradicate sloppy and cavalier coding and it might be noble for developers to be willing to assume responsibility for all bugs. But it could leave them exposed to legal recriminations if the wider world believes that software developers can and should ensure systems are free of bugs. This is where lawyers might become involved and that is why I’m unhappy about the name BugsZero, and the undeliverable promise that it implies.

Unnoticed and undetectable bugs in the Horizon case

The reality that a bug might be real and damaging but not detected by users, or even detectable by them, was discussed in the Horizon case.

“[972] Did the Horizon IT system itself alert Subpostmasters of such bugs, errors or defects… and if so how?

[973] Answer: Although the experts were agreed that the extent to which any IT system can automatically alert its users to bugs within the system itself is necessarily limited, and although Horizon has automated checks which would detect certain bugs, they were also agreed that there are types of bugs which would not be detected by such checks. Indeed, the evidence showed that some bugs lay undiscovered in the Horizon system for years. This issue is very easy, therefore, to answer. The correct answer is very short. The answer… is ‘No, the Horizon system did not alert SPMs’. The second part of the issue does not therefore arise.”

That is a significant extract from an important judgment. A senior judge directly addressed the question of system reliability and pronounced that he is satisfied that a complex system cannot be expected to have adequate controls to warn users of all errors.

This is more than an abstract, philosophical debate about proof, evidence and what we can know. In England there is a legal presumption that computer evidence is reliable. This made a significant contribution to the Horizon scandal. Both parties in a court case are obliged to disclose documents which might either support or undermine their case, so that the other side has a chance to inspect and challenge them. The Post Office and Fujitsu did not disclose anything that would have cast doubt on their computer evidence. That failure to disclose meant it was impossible for the subpostmasters being prosecuted to challenge the presumption that the evidence was reliable. The subpostmasters didn’t know about the relevant system problems, and they didn’t even know that that knowledge had been withheld from them.

Replacing the presumption of computer reliability

There are two broad approaches that can be taken in response to the presumption that computer evidence is reliable and the ease with which it can be abused, apart of course from ignoring the problem and accepting that injustice is a price worth paying for judicial convenience. England can scrap the presumption, which would require the party seeking to use the evidence to justify its reliability. Or the rules over disclosure can be strengthened to try and ensure that all relevant information about systems is revealed. Some blend of the two approaches seems most likely.

I have recently contributed to a paper entitled “Recommendations for the probity of computer evidence”. It has been submitted to the Ministry of Justice, which is responsible for the courts in England & Wales, and is available from the Digital Evidence and Electronic Signature Law Review.

The paper argues that the presumption of computer reliability should be replaced by a two stage approach when reliability is challenged. The provider of the data should first be required to provide evidence to demonstrate that they are in control of their systems, that they record and track all bugs, fixes, changes and releases, and that they have implemented appropriate security standards and processes.

If the party wishing to rely on the computer evidence cannot provide a satisfactory response in this first stage then the presumption of reliability should be reversed. The second stage would require them to prove that none of the failings revealed in the first stage might affect the reliability of the computer evidence.

Whatever approach is taken, IT professionals would have to offer an opinion on their systems. How reliable are the systems? What relevant evidence might there be that systems are reliable, or unreliable? Can they demonstrate that they are in control of their systems? Can they reassure senior managers who will have to put their name to a legal document and will be very keen to avoid the humiliation that has been heaped upon Post Office and Fujitsu executives, with the possibility of worse to come?

A challenging future

The extent to which we can rely on computers poses uncomfortable challenges for the English law now, but it will be an increasingly difficult problem for the IT world over the coming years. What can we reasonably say about the systems we work with? How reliable are they? What do we know about the bugs? Are we sufficiently clear about the nature of our systems to brief managers who will have to appear in court, or certify legal documents?

It will be essential that developers and testers are clear in their own minds, and in their communications, about what bugs are. They are not just coding errors, and we must try to ensure people outside IT understand that. Testers must also be able to communicate clearly what they have learned about systems, and they must never say or do anything that suggests systems will be free of bugs.

Testers will have to think carefully about the quality of their evidence, not just about the quality of their systems. How good is our evidence? How far can we go in making confident statements of certainty? What do we still not know, and what is the significance of that missing knowledge? Much of this will be a question of good management. But organisations will need good testers, very good testers, who can explain what we know, and what we don’t know, about complex systems; testers who have the breadth of knowledge, the insight, and the communication skills to tell a convincing story to those who require the unvarnished truth.

We will need confident, articulate testers who can explain that a lack of certainty about how complex systems will behave is an honest, clear-sighted statement of truth. It is not an admission of weakness or incompetence. Too many people in IT have built good careers on bullshitting, on pretending they are more confident and certain than they have any right to be. Systems will inevitably become more complex. IT people will increasingly be dragged into litigation, and as the Post Office and Fujitsu executives have found, misplaced and misleading confidence and bluster in court have excruciating personal consequences. Unpleasant though these consequences are, they hardly compare with the tragedies endured by the subpostmasters and subpostmistresses, whose lives were ruined by corporations who insisted that their complex software was reliable.

The future might be difficult, even stressful, for software testers, but they will have a valuable, essential role to play in helping organisations and users to gain a better understanding of the fallible nature of software. To say the future will be interesting is an understatement; it will present exciting challenges and there should be great opportunities for the best testers.

Teachers, children, testers and leaders (2013)

This article appeared in the March 2013 edition of Testing Planet, which is published by the wonderful Ministry of Testing, one of the most exciting developments in software testing over the last 20 years.

I’m moving this article onto my blog from my website, which will shortly be decommissioned. The article was written in January 2013. Looking at it again I see that I was starting to develop arguments I fleshed out over the next couple of years as part of the Stop 29119 campaign against the testing standard, ISO 29119.

The article

“A tester is someone who knows things can be different” – Gerald Weinberg.

Leaders aren’t necessarily people who do things, or order other people about. To me the important thing about leaders is that they enable other people to do better, whether by inspiration, by example or just by telling them how things can be different – and better. The difference between a leader and a manager is like the difference between a great teacher and, well, the driver of the school bus. Both take children places, but a teacher can take children on a journey that will transform their whole life.

My first year or so in working life after I left university was spent in a fog of confusion. I struggled to make sense of the way companies worked; I must be more stupid than I’d always thought. All these people were charging around, briskly getting stuff done, making money and keeping the world turning; they understood what they were doing and what was going on. They must be smarter than me.

Gradually it dawned on me that very many of them hadn’t a clue. They were no wiser than me. They didn’t really know what was going on either. They thought they did. They had their heads down, working hard, convinced they were contributing to company profits, or at least keeping the losses down.

The trouble was their efforts often didn’t have much to do with the objectives of the organisation, or the true goals of the users and the project in the case of IT. Being busy was confused with being useful. Few people were capable of sitting back, looking at what was going on and seeing what was valuable as opposed to mere work creation.

I saw endless cases of poor work, sloppy service and misplaced focus. I became convinced that we were all working hard doing unnecessary, and even harmful, things for users who quite rightly were distinctly ungrateful. It wasn’t a case of the end justifying the means; it was almost the reverse. The means were only loosely connected to the ends, and we were focussing obsessively on the means without realising that our efforts were doing little to help us achieve our ends.

Formal processes didn’t provide a clear route to our goal. Following the process had become the goal itself. I’m not arguing against processes; just the attitude we often bring to them, confusing the process with the destination, the map with the territory. The quote from Gerald Weinberg absolutely nails the right attitude for testers to bring to their work. There are twin meanings. Testers should know there is a difference between what people expect, or assume, and what really is. They should also know that there is a difference between what is, and what could be.

Testers usually focus on the first sort of difference; seeing the product for what it really is and comparing that to what the users and developers expected. However, the second sort of difference should follow on naturally. What could the product be? What could we be doing better?

Testers have to tell a story, to communicate not just the reality to the stakeholders, but also a glimpse of what could be. Organisations need people who can bring clear headed thinking to confusion, standing up and pointing out that something is wrong, that people are charging around doing the wrong things, that things could be better. Good testers are well suited by instinct to seeing what positive changes are possible. Communicating these possibilities, dispelling the fog, shining a light on things that others would prefer to remain in darkness; these are all things that testers can and should do. And that too is a form of leadership, every bit as much as standing up in front of the troops and giving a rousing speech.

In Hans Christian Andersen’s story, “The Emperor’s New Clothes”, who showed a glimpse of leadership? Not the emperor, not his courtiers; it was the young boy who called out the truth, that the Emperor was wearing no clothes at all. If testers are not prepared to tell it like it is, to explain why things are different from what others are pretending, to explain how they could be better, then we diminish and demean our profession. Leaders do not have to be all-powerful figures. They can be anyone who makes a difference; teachers, children. Or even testers.

Quality isn’t something, it provides something (2012)

This article appeared in the July 2012 edition of Testing Planet, which is published by the wonderful Ministry of Testing, one of the most exciting developments in software testing over the last 20 years.

The article was written in June 2012, but I don’t think it has dated. It’s about the way we think and work with other people. These are timeless problems. The idea behind E-prime is particularly interesting. Dispensing with the verb “to be” isn’t something to get obsessive or ideological about, but testers should be aware of the important distinction between the way something is and the way it behaves. The original article had only four references so I have checked them, converted them to hyperlinks, and changed the link to Lera Boroditsky’s paper to a link to her TED talk on the same subject.

The article

A few weeks ago two colleagues, who were having difficulty working together, asked me to act as peacekeeper in a tricky looking meeting in which they were going to try and sort out their working relationship. I’ll call them Tony and Paul. For various reasons they were sparking off each other and creating antagonism that was damaging the whole team.

An hour’s discussion seemed to go reasonably well; Tony talking loudly and passionately, while Paul spoke calmly and softly. Just as I thought we’d reached an accommodation that would allow us all to work together Tony blurted out, “you are cold and calculating, Paul, that’s the problem”.

Paul reacted as if he’d been slapped in the face, made his excuses and left the meeting. I then spent another 20 minutes talking Tony through what had happened, before separately speaking to Paul about how we should respond.

I told Tony that if he’d wanted to make the point I’d inferred from his comments, and from the whole meeting, then he should have said “your behaviour and attitude towards me throughout this meeting, and when we work together, strike me as cold and calculating, and that makes me very uncomfortable”.

“But I meant that!”, Tony replied. Sadly, he hadn’t said that. Paul had heard the actual words and reacted to them, rather than applying the more dispassionate analysis I had used as an observer. Paul meanwhile found Tony’s exuberant volatility disconcerting, and responded to him in a very studied and measured style that unsettled Tony.

Tony committed two sins. Firstly, he didn’t acknowledge the two way nature of the problem. It should have been about how he reacted to Paul, rather than trying to dump all the responsibility onto Paul.

Secondly, he said that Paul is cold and calculating, rather than acting in a way Tony found cold, and calculating at a certain time, in certain circumstances.

I think we’d all see a huge difference between being “something”, and behaving in a “something” way at a certain time, in a certain situation. The verb “to be” gives us this problem. It can mean, and suggest, many different things and can create fog where we need clarity.

Some languages, such as Spanish, maintain a useful distinction between different forms of “to be” depending on whether one is talking about something’s identity or just a temporary attribute or state.

The way we think obviously shapes the language we speak, but increasingly scientists are becoming aware of how the language we use shapes the way that we think. [See this 2017 TED talk, “How Language Shapes Thought”, by Lera Boroditsky]

The problem we have with “to be” has great relevance to testers. I don’t just mean treating people properly, however much importance we rightly attach to working successfully with others. More than that, if we shy away from “to be” then it helps us think more carefully and constructively as testers.

This topic has stretched bigger brains than mine, in the fields of philosophy, psychology and linguistics. Just google “general semantics” if you want to give your brain a brisk workout. You might find it tough stuff, but I don’t think you have to master the underlying concept to benefit from its lessons.

Don’t think of it as intellectual navel gazing. All this deep thought has produced some fascinating results, in particular something called E-prime, a form of English that totally dispenses with “to be” in all its forms; no “I am”, “it is”, or “you are”. Users of E-prime don’t simply replace the verb with an alternative. That doesn’t work. It forces you to think and articulate more clearly what you want to say. [See this classic paper by Kellogg, “Speaking in E-prime” PDF, opens in new tab].

“The banana is yellow” becomes “the banana looks yellow”, which starts to change the meaning. “Banana” and “yellow” are not synonyms. The banana’s yellowness becomes apparent only because I am looking at it, and once we introduce the observer we can acknowledge that the banana appears yellow to us now. Tomorrow the banana might appear brown to me as it ripens. Last week it would have looked green.

You probably wouldn’t disagree with any of that, but you might regard it as a bit abstract and pointless. However, shunning “to be” helps us to think more clearly about the products we test, and the information that we report. E-prime therefore has great practical benefits.

The classic definition of software quality came from Gerald Weinberg in his book “Quality Software Management: Systems Thinking”.

“Quality is value to some person”.

Weinberg’s definition reflects some of the clarity of thought that E-prime requires, though he has watered it down somewhat to produce a snappy aphorism. The definition needs to go further, and “is” has to go!

Weinberg makes the crucial point that we must not regard quality as some intrinsic, absolute attribute. It arises from the value it provides to some person. Once you start thinking along those lines you naturally move on to realising that quality provides value to some person, at some moment in time, in a certain context.

Thinking and communicating in E-prime stops us making sweeping, absolute statements. We can’t say “this feature is confusing”. We have to use a more valuable construction such as “this feature confused me”. But we’re just starting. Once we drop the final, total condemnation of saying the feature is confusing, and admit our own involvement, it becomes more natural to think about and explain the reasons. “This feature confused me … when I did … because of …”.

Making the observer, the time and the context explicit helps us by limiting or exposing hidden assumptions. We might or might not find these assumptions valid, but we need to test them, and we need to know about them so we understand what we are really learning as we test the product.

E-prime fits neatly with the scientific method and with the provisional and experimental nature of good testing. Results aren’t true or false. The evidence we gather matches our hypothesis, and therefore gives us greater confidence in our knowledge of the product, or it fails to match up and makes us reconsider what we thought we knew. [See this classic paper by Kellogg & Bourland, “Working with E-prime – some practical notes” PDF, opens in new tab].

Scientific method cannot be accommodated in traditional script-driven testing, which reflects a linear, binary, illusory worldview, pretending to be absolute. It tries to deal in right and wrong, pass and fail, true and false. Such an approach fits in neatly with traditional development techniques which fetishise the rigours of project management, rather than the rigours of the scientific method.

This takes us back to general semantics, which coined the well known maxim that the map is not the territory. Reality and our attempts to model and describe it differ fundamentally from each other. We must not confuse them. Traditional techniques fail largely because they confuse the map with the territory. [See this “Less Wrong” blog post].

In attempting to navigate their way through a complex landscape, exponents of traditional techniques seek the comfort of a map that turns messy, confusing reality into something they can understand and that offers the illusion of being manageable. However, they are managing the process, not the underlying real work. The plan is not the work. The requirements specification is not the requirements. The map is not the territory.

Adopting E-prime in our thinking and communication will probably just make us look like the pedantic awkward squad on a traditional project. But on agile or lean developments E-prime comes into its own. Testers must contribute constructively, constantly, and above all, early. E-prime helps us in all of this. It makes us clarify our thoughts and helps us understand that we gain knowledge provisionally, incrementally and never with absolute certainty.

I was not consciously deploying E-prime during and after the fractious meeting I described earlier. But I had absorbed the precepts sufficiently to instinctively realise that I had two problems; Tony’s response to Paul’s behaviour, and Paul’s response to Tony’s outburst. I really didn’t see it as a matter of “uh oh – Tony is stupid”.

E-prime purists will look askance at my failure to eliminate all forms of “to be” in this article. I checked my writing to ensure that I’ve written what I meant to, and said only what I can justify. Question your use of the verb, and weed out those hidden assumptions and sweeping, absolute statements that close down thought, rather than opening it up. Don’t think you have to be obsessive about it. As far as I am concerned, that would be silly!

An abdication of managerial responsibility?

The two recent Boeing 737 MAX crashes have been grimly absorbing for software developers and testers. It seems that the crashes were caused by the MCAS system, which should prevent a stall, responding to false data from a sensor by forcing the planes into steep dives despite the attempts of the pilots to make the planes climb. The MCAS problem may have been a necessary condition for disaster, but it clearly was not sufficient. There were many other factors involved. Most strikingly, it seems that MCAS itself may have been working as specified but there were problems in the original design and the way it interfaces with the sensor and crew.

I have no wish to go into all this in serious detail (yet), but I read an article on the Bloomberg website, “Boeing’s 737 Max software outsourced to $9-an-hour engineers” which contained many sentences and phrases that jumped off the screen at me. These snippets all point towards issues that concern me, that I’ve been talking and writing about recently, or that I’ve been long aware of. I’d like to run through them. I’ll use a brief quote from the Bloomberg article in each section before discussing the implications. All software designers and testers should reflect on these issues.

The commoditization of software development and testing

“Boeing has also expanded a design center in Moscow. At a meeting with a chief 787 engineer in 2008, one staffer complained about sending drawings back to a team in Russia 18 times before they understood that the smoke detectors needed to be connected to the electrical system, said Cynthia Cole, a former Boeing engineer who headed the engineers’ union from 2006 to 2010.

‘Engineering started becoming a commodity’, said Vance Hilderman, who co-founded a company called TekSci that supplied aerospace contract engineers and began losing work to overseas competitors in the early 2000s.”

The threat of testing becoming a commodity has been a long standing concern amongst testers. To a large extent we’re already there. However, I’d assumed, naively perhaps, that this was a route chosen by organisations that could get away with poor testing, in the short term at least. I was deeply concerned to see it happening in a safety critical industry.

To summarise the problem, if software development and testing are seen as commodities, bought and sold on the basis of price, then commercial pressures will push quality downwards. The inevitable pressure sends cost and prices spiralling down to the level set by the lowest cost supplier, regardless of value. Testing is particularly vulnerable. When the value of the testing is low then whatever cost does remain becomes more visible and harder to justify.

There is pressure to keep reducing costs, and if you’re getting little value from testing just about any cost-cutting measure is going to look attractive. If you head down the route of outsourcing, offshoring and increasing commoditization, losing sight of value, you will lock yourself into a vicious circle of poor quality.

Iain McCowatt’s EuroSTAR webinar on “The commoditization of testing” is worth watching.

ETTO – the efficiency-thoroughness trade-off

“…the planemakers say global design teams add efficiency as they work around the clock.”

Ah! There we have it! Efficiency. Isn’t that a good thing? Of course it is. But there is an inescapable trade-off, and organisations must understand what they are doing. There is a tension between the need to deliver a safe, reliable product or service, and the pressure to do so at the lowest cost possible. The idea of ETTO, the efficiency-thoroughness trade-off, was popularised by Erik Hollnagel.

Making the organisation more efficient means it is less likely to achieve its important goals; the pursuit of efficiency eliminates margins of error and engineering redundancy, with potentially dangerous results. Conversely, pursuing vital goals such as safety comes at the expense of efficiency. This is well recognised in safety critical industries, obviously including air transport. I’ve discussed this further in my blog, “The dragons of the unknown; part 6 – Safety II, a new way of looking at safety”.

Drift into failure

“’Boeing was doing all kinds of things, everything you can imagine, to reduce cost, including moving work from Puget Sound, because we’d become very expensive here,’ said Rick Ludtke, a former Boeing flight controls engineer laid off in 2017. ‘All that’s very understandable if you think of it from a business perspective. Slowly over time it appears that’s eroded the ability for Puget Sound designers to design.’”

“Slowly over time”. That’s the crucial phrase. Organisations drift gradually into failure. People are working under pressure, constantly making the trade off between efficiency and thoroughness. They keep the show on the road, but the pressure never eases. So margins are increasingly shaved. The organisation finds new and apparently smarter ways of working. Redundancy is eliminated. The workers adapt the official processes. The organisation seems efficient, profitable and safe. Then BANG! Suddenly it isn’t. The factors that had made it successful turn out to be responsible for disaster.

“Drifting into failure” is an important concept to understand for anyone working with complex systems that people will have to use, and for anyone trying to make sense of how big organisations should work, and really do work. See my blog “The dragons of the unknown; part 4 – a brief history of accident models” for a quick introduction to the drift into failure. The idea was developed by Sidney Dekker. Check out his work.

Conway’s Law

“But outsourcing has long been a sore point for some Boeing engineers, who, in addition to fearing job losses, say it has led to communications issues and mistakes.”

This takes me to one of my favourites, Conway’s Law. In essence it states that the design of a system mirrors the design, and especially the communication structure, of the organisation that produces it. It’s not a normative rule, saying that this should (or shouldn’t) happen. It merely says that it generally does happen. Traditionally the organisation’s design shaped the technology. Nowadays the causation might be reversed, with the technology shaping the organisation. Conway’s Law was intended as a sound heuristic, never a hard and fast rule.

[Slide from one of my courses: Conway’s Law]

Perhaps it is less generally applicable today, but for large, long established corporations I think it still generally holds true.

I’m going to let you in on a little trade secret of IT auditors. Conway’s Law was a huge influence on the way we audited systems and development projects.

[Slide from one of my courses: a corollary to Conway’s Law]

Audits were always strictly time boxed. We had to be selective in how we used our time and what we looked at. Modern internal auditing is risk based, meaning we would focus on the risks that posed the greatest threat to the organisation, concentrating on the areas most exposed to risk and looking for assurance that the risks were being managed effectively.

Conway’s Law guided the auditors towards low hanging fruit. We knew that we were most likely to find problems at the interfaces, and these were likely to be particularly serious. This was also my experience as a test manager. In both jobs I saw the same pattern unfold when different development teams, or different companies worked on different parts of a project.

Development teams would be locked into their delivery schedule before the high level requirements were clear and complete, or even mutually consistent. The different teams, especially if they were in different companies, based their estimates on assumptions that were flawed, or inconsistent with other teams’ assumptions. Under pressure to reduce estimates and deliver quickly, each team might assume they’d be able to do the minimum necessary, especially at the interfaces; other teams would pick up the trickier stuff.

This would create gaps at the interfaces, and cries of “but I thought you were going to do that – we can’t possibly cope in time”. Or the data that was passed from one suite couldn’t be processed by the next one. Both might have been built correctly to their separate specs, but they weren’t consistent. The result would be last minute solutions, hastily lashed together, with inevitable bugs and unforeseen problems down the line – ready to be exposed by the auditors.
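
A small, invented Python example of the kind of interface failure described above: each component is faithful to its own team’s specification, but the two specifications were never reconciled, so the hand-off between them breaks. The record format and field names are made up for the illustration.

```python
from datetime import datetime

# Team A's extract, built and tested to Team A's spec: dates as DD/MM/YYYY.
def export_payment(amount_pence, when):
    return f"{amount_pence},{when.strftime('%d/%m/%Y')}"

# Team B's loader, built and tested to Team B's spec: dates as YYYY-MM-DD.
def import_payment(record):
    amount, date_text = record.split(",")
    return int(amount), datetime.strptime(date_text, "%Y-%m-%d").date()

if __name__ == "__main__":
    record = export_payment(2500, datetime(2021, 3, 14))
    try:
        import_payment(record)   # each part "correct", the interface broken
    except ValueError as error:
        print("Integration failure:", error)
```

Both teams’ unit tests would pass; the failure only appears when the pieces are bolted together, which is exactly where the auditors learned to look.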

Splitting the work across continents and suppliers always creates big management problems. You have to be prepared for these. The additional co-ordination, chasing, reporting and monitoring takes a lot of effort. This all poses big problems for test managers, who have to be strong, perceptive and persuasive to ensure that the testing is planned consistently across the whole solution.

It is tempting, but dangerous, to allow the testing to be segmented. The different sub-systems are tested according to the assumptions that the build teams find convenient. That might be the easy option at the planning stage, but it doesn’t seem so clever when the whole system is bolted together and crashes as the full implications emerge of all those flawed assumptions, long after they should have been identified and challenged.

Outsourcing and global teams don’t provide a quick fix. Without strong management and a keen awareness of the risks it’s a sure way to let serious problems slip through into production. Surely safety critical industries would be smarter, more responsible? I learned all this back in the 1990s. It’s not new, and when I read Bloomberg’s account of Boeing’s engineering practices I swore, quietly and angrily.

Consequences

“During the crashes of Lion Air and Ethiopian Airlines planes that killed 346 people, investigators suspect, the MCAS system pushed the planes into uncontrollable dives because of bad data from a single sensor.

That design violated basic principles of redundancy for generations of Boeing engineers, and the company apparently never tested to see how the software would respond, Lemme said. ‘It was a stunning fail,’ he said. ‘A lot of people should have thought of this problem – not one person – and asked about it.’”
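
The “basic principles of redundancy” referred to here can be illustrated with a toy sketch. This is emphatically not Boeing’s code, and real flight control logic is vastly more complicated; it is only a hedged illustration, with invented numbers, of why safety critical designs usually cross-check redundant sensors rather than act on a single reading.

```python
from statistics import median

def pitch_command_single(aoa_deg, threshold_deg=15.0):
    """Acts on one angle-of-attack reading: a single faulty sensor is
    enough to trigger nose-down trim."""
    return "NOSE_DOWN" if aoa_deg > threshold_deg else "HOLD"

def pitch_command_voted(aoa_readings_deg, threshold_deg=15.0):
    """Acts on the median of redundant readings, so one wildly wrong
    sensor cannot drive the decision on its own."""
    return "NOSE_DOWN" if median(aoa_readings_deg) > threshold_deg else "HOLD"

if __name__ == "__main__":
    healthy, faulty = 4.0, 74.0   # degrees; the faulty value is nonsense
    print(pitch_command_single(faulty))                     # NOSE_DOWN
    print(pitch_command_voted([healthy, healthy, faulty]))  # HOLD
```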

So the consequences of commoditization, ETTO, the drift into failure and complacency about developing and testing complex, safety critical systems with global teams all came together disastrously in the Lion Air and Ethiopian Airlines crashes.

A lot of people should certainly have thought of this problem. As a former IT auditor I thought of this passage by Norman Marks, a distinguished commentator on auditing. Writing about risk-based auditing he said;

A jaw-dropping moment happened when I explained my risk assessment and audit plan to the audit committee of the oil company where I was CAE (Tosco Corp.). The CEO asked whether I had considered risks relating to the blending of gasoline, diesel, and jet fuel.

As it happened, I had — but it was not considered high risk; it was more a compliance issue than anything else. But, when I talked to the company’s executives I heard that when Exxon performed an enterprise-wide risk assessment, this area had been identified as their #1 risk!

Poorly-blended jet fuel could lead to Boeing 747s dropping out of the sky into densely-packed urban areas — with the potential to bankrupt the largest (at that time) company in the world. A few years later, I saw the effect of poor blending of diesel fuel when Southern California drivers had major problems and fingers were pointed at us as well as a few other oil companies.

In training courses, when I’ve been talking about the big risks that keep the top management awake at night I’ve used this very example; planes crashing. In big corporations it’s easy for busy people to obsess about the smaller risks, those that delay projects, waste money, or disrupt day to day work. These problems hit us all the time. Disasters happen rarely and we can lose sight of the way the organisation is drifting into catastrophic failure.

That’s where auditors, and I believe testers too, come in. They should be thinking about these big risks. In the case of Boeing the engineers, developers and testers should have spoken out about the problems. The internal auditors should certainly have been looking out for it, and these are the people who have the organisational independence and power to object. They have to be listened to.

An abdication of management responsibility?

“Boeing also has disclosed that it learned soon after Max deliveries began in 2017 that a warning light that might have alerted crews to the issue with the sensor wasn’t installed correctly in the flight-display software. A Boeing statement in May, explaining why the company didn’t inform regulators at the time, said engineers had determined it wasn’t a safety issue.

‘Senior company leadership,’ the statement added, ‘was not involved in the review.’”

Senior management was not involved in the review. Doubtless there are a host of reasons why they were not involved. The bottom line, however, is that it was their responsibility. I spent six years as an IT auditor. In that time only one of my audits led to the group’s chief auditor using that nuclear phrase, which incidentally was not directed at IT management. A very senior executive was accused of “abdicating managerial responsibility”. The result was a spectacular display of bad temper and attempted intimidation of the auditors. We didn’t back down. That controversy related to shady behaviour at a subsidiary where the IT systems were being abused and frauds had become routine. It hardly compared to a management culture that led to hundreds of avoidable deaths.

One of the core tenets of Safety II, the new way of looking at safety, is that there is never a single, root cause for failure in complex systems. There are always multiple causes, all of them necessary, but none of them sufficient, on their own, for disaster. The Boeing 737-MAX case bears that out. No one person was responsible. No single act led to disaster. The fault lies with the corporate culture as a whole, with a culture of leadership that abdicated responsibility, that “wasn’t involved”.

David Graeber’s “The Utopia of Rules: On Technology, Stupidity, and the Secret Joys of Bureaucracy”

When I gave my talk at CAST 2014 in New York, “Standards – promoting quality or restricting competition?” I was concentrating on the economic aspects of standards. They are often valuable, but they can be damaging and restrict competition if they are misused. A few months later I bought “The Utopia of Rules: On Technology, Stupidity, and the Secret Joys of Bureaucracy” by David Graeber, Professor of Anthropology at the London School of Economics. I was familiar with Graeber as a challenging and insightful writer. I drew on his work when I wrote “Testing: valuable or bullshit?“. The Utopia of Rules also inspired the blog article I wrote recently, “Frozen in time – grammar and testing standards” in which I discussed the similarity between grammar textbooks and standards, which both codify old usages and practices that no longer match the modern world.

What I hadn’t expected from The Utopia of Rules was how strongly it would support the arguments I made at CAST.

Certification and credentialism

Graeber makes the same argument I deployed against certification. It is being used increasingly to enrich special interests without benefiting society. On page 23 Graeber writes:

Almost every endeavor that used to be considered an art (best learned through doing) now requires formal professional training and a certificate of completion… In some cases, these new training requirements can only be described as outright scams, as when lenders, and those prepared to set up the training programs, jointly lobby the government to insist that, say, all pharmacists be henceforth required to pass some additional qualifying examination, forcing thousands already practicing the profession into night school, which these pharmacists know many will only be able to afford with the help of high-interest student loans. By doing this, lenders are in effect legislating themselves a cut of most pharmacists’ subsequent incomes.

To be clear, my stance on ISTQB training is that it educates testers in a legitimate, though very limited, vision of testing. My objection is to any marketing of the qualification as a certification of testing ability, rather than confirmation that the tester has passed an exam associated with a particular training course. I object even more strongly to any argument that possession of the certificate should be a requirement for employment, or for contracting out testing services. It is reasonable to talk of scams when the ability of good testers to earn a living is damaged.

What is the point of it all?

Graeber has interesting insights into how bureaucrats can be vague about the values of the bureaucracy: why does the organisation exist? Bureaucrats focus on efficient execution of rational processes, but what is the point of it all? Often the means become the ends: efficiency is an end in itself.

I didn’t argue that point at CAST, but I have done so many times in other talks and articles (e.g. “Teddy bear methods”). If people are doing a difficult, stressful job and you give them prescriptive methods, processes or standards then they will focus on ticking their way down the list. The end towards which they are working becomes compliance with the process, rather than helping the organisation reach its goal. They see their job as producing the outputs from the process, rather than the outcomes the stakeholders want. I gave a talk in London in June 2015 to the British Computer Society’s Special Interest Group in Software Testing in which I argued that testing lacks guiding principles (PDF, opens in a new tab) and ISO 29119 in particular does not offer clear guidance about the purpose of testing.

In a related argument Graeber makes a point that will be familiar to those who have criticised the misuse of testing metrics.

…from inside the system, the algorithms and mathematical formulae by which the world comes to be assessed become, ultimately, not just measures of value, but the source of value itself.

Rent extraction

The most controversial part of my CAST talk was my argument that the pressure to adopt testing standards was entirely consistent with rent seeking in economic theory. Rent seeking, or rent extraction, is what people do when they exploit failings in the market, or rig the market for their own benefit by lobbying for regulation that happens to benefit them. Instead of creating wealth, they take it from other people in a way that is legal, but which is detrimental to the economy, and society, as a whole.

This argument riled some people who took it as a personal attack on their integrity. I’m not going to dwell on that point. I meant no personal slur. Rent seeking is just a feature of modern economies. Saying so is merely being realistic. David Graeber argued the point even more strongly.

The process of financialization has meant that an ever-increasing proportion of corporate profits come in the form of rent extraction of one sort or another. Since this is ultimately little more than legalized extortion, it is accompanied by ever-increasing accumulation of rules and regulations… At the same time, some of the profits from rent extraction are recycled to select portions of the professional classes, or to create new cadres of paper-pushing corporate bureaucrats. This helps a phenomenon I have written about elsewhere: the continual growth, in recent decades, of apparently meaningless, make-work, “bullshit jobs” — strategic vision coordinators, human resources consultants, legal analysts, and the like — despite the fact that even those who hold such positions are half the time secretly convinced they contribute nothing to the enterprise.

In 2014 I wrote about “bullshit jobs”, prompted partly by one of Graeber’s articles. It’s an important point. It is vital that testers define their job so that it offers real value, and they are not merely bullshit functionaries of the corporate bureaucracy.

Utopian bureaucracies

I have believed for a long time that adopting highly prescriptive methods or standards for software development and testing places unfair pressure on people, who are set up to fail. Graeber makes exactly the same point.

Bureaucracies public and private appear — for whatever historical reasons — to be organized in such a way as to guarantee that a significant proportion of actors will not be able to perform their tasks as expected. It’s in this sense that I’ve said one can fairly say that bureaucracies are utopian forms of organization. After all, is this not what we always say of utopians: that they have a naïve faith in the perfectibility of human nature and refuse to deal with humans as they actually are? Which is, are we not also told, what leads them to set impossible standards and then blame the individuals for not living up to them? But in fact all bureaucracies do this, insofar as they set demands they insist are reasonable, and then, on discovering that they are not reasonable (since a significant number of people will always be unable to perform as expected), conclude that the problem is not with the demands themselves but with the individual inadequacy of each particular human being who fails to live up to them.

Testing standards such as ISO 29119, and its predecessor IEEE 829, don’t reflect what developers and testers do, or rather should be doing. They are at odds with the way people think and work in organisations. These standards attempt to represent a highly complex, sometimes chaotic, process in a defined, repeatable model. The end product is usually of dubious quality, late and over budget. Any review of the development will find constant deviations from the standard. The suppliers, and defenders, of the standard can then breathe a sigh of relief. The sacred standard was not followed. It was the team’s fault. If only they’d done it by the book! The possibility that the developers’ and testers’ apparent sins were the only reason anything was produced at all is never considered. This is a dreadful way to treat people, but in many organisations it has been normal for several decades.

Loss of communication

All of the previous arguments by Graeber were entirely consistent with my own thoughts about how corporate bureaucracies operate. It was fascinating to see an anthropologist’s perspective, but it didn’t teach me anything that was really new about how testers work in corporations. However, later in the book Graeber developed two arguments that gave me new insights.

Understanding what is happening in a complex, social situation needs effective two-way communication. This requires effort, “interpretive labor”. The greater the degree of compulsion, and the greater the bureaucratic regime of rules and forms, the less need there is for such two-way communication. Those who can simply issue orders that must be obeyed don’t have to take the trouble to understand the complexities of the situation they’re managing.

…within relations of domination, it is generally the subordinates who are effectively relegated the work of understanding how the social relations in question really work. … It’s those who do not have the power to hire and fire who are left with the work of figuring out what actually did go wrong so as to make sure it doesn’t happen again.

This ties in with the previous argument about utopian bureaucracies. If you impose an inappropriate standard then poor results will be attributed to the inevitable failure to comply. There is no need for senior managers to understand more, and no need to listen to the complaints, the “excuses”, of the people who do understand what is happening. Interestingly, Graeber’s argument about interpretive labor is consistent with regulatory theory. Good regulation of complex situations requires ongoing communication between the regulator and the regulated. I explained this in the talk on testing principles I mentioned above (slides 38 and 39).

Fear of play

My second new insight from Graeber arrived when he discussed the nature of play and how it relates to bureaucracies. Anthropologists try to maintain a distinction between games and play, a distinction that is easier to maintain in English than in languages like French and German, which use the same word for both. A game has boundaries, set rules and a predetermined conclusion. Play is more free-form and creative. Novelties and surprising results emerge from the act of playing. It is a random, unpredictable and potentially destructive activity. Graeber finishes his discussion of play and games with this striking observation:

What ultimately lies behind the appeal of bureaucracy is fear of play.

Put simply, and rather simplistically, Graeber means that we use bureaucracy to escape the terror of chaotic reality, to bring a semblance (an illusion?) of control to the uncontrollable.

This gave me a tantalising new insight into the reasons people build bureaucratic regimes in organisations. It sent me off into a whole new field of reading on the anthropology of games and play. This has fascinating implications for the debate about standards and testing. We shy away from play, but it is through play that we learn. I don’t have time now to do the topic justice, and it’s much too big and important a subject to be tacked on to the end of this article, but I will return to it. It is yet another example of the way anthropology can help us understand what we are doing as testers. As a starting point I can heartily recommend David Graeber’s book, “The Utopia of Rules”.

A single source of truth?

Lately in a chatroom for the International Society for Software Testing there has been some discussion about the idea of a “single source of truth”. I’m familiar with this in the sense of database design. Every piece of data is stored once and the design precludes the possibility of inconsistency, of alternative versions of the same data. That makes sense in this narrow context, but the discussion revealed that the phrase is now being used in a different sense. A single source of truth has been used to describe an oracle of oracles, an ultimate specification on which total reliance can be placed. The implications worry me, especially for financial systems, which is my background.
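
To make that narrow, database-design sense concrete, here is a minimal sketch in Python with SQLite. The customer and policy tables, the column names and the address are all invented for illustration; the point is simply that the address is held exactly once, and every policy refers to it by key, so no second copy exists to drift out of step.

    import sqlite3

    # A toy, normalised schema: the address is held once, in one place,
    # and policies reference it by key rather than carrying their own copy.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,
            address     TEXT NOT NULL          -- stored exactly once
        );
        CREATE TABLE policy (
            policy_id   INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
            -- no address column here, so an inconsistent duplicate cannot exist
        );
    """)
    db.execute("INSERT INTO customer VALUES (1, '1 High Street')")
    db.execute("INSERT INTO policy VALUES (101, 1)")

    # Any query for a policy's address goes back to the single stored value.
    row = db.execute("""
        SELECT c.address FROM policy p
        JOIN customer c ON c.customer_id = p.customer_id
        WHERE p.policy_id = 101
    """).fetchone()
    print(row[0])   # 1 High Street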

I’m not comfortable with a single source of truth, especially when it applies to things like bank balances, profit and loss figures, or indeed any non-trivial result of calculations. What might make more sense is to talk of a single statement of truth, and that statement could, and should, have multiple sources so the statement is transparent and can be validated. However, I still wouldn’t want to talk about truth in financial statements. For an insurance premium there are various different measures, which have different uses to different people at different times. When people start talking about a single, true, premium figure they are closing off their minds to reality and trying to redefine it to suit their limited vision.

All of these competing measures could be regarded as true in the right context, but there are other measures which are less defensible and which an expert would consider wrong, or misleading, in any context (e.g. lumping Insurance Premium Tax into the premium figure). That’s all quite aside from the question of whether these measures are accurate on their own terms.

A “single source of truth” reminds me of arguments I’d have with application designers. Sometimes the problem would be that they wanted to eliminate any redundancy in the design. That could make reconciliation and error detection much harder because the opportunities to spot errors would be reduced. If a calculation was wrong it might stay wrong because no-one would know. A different source of friction was the age-old problem of analysts and designers determined to stick rigidly to the requirements without questioning them, or even really thinking about the implications. I suspect I was regarded as a pedantic nuisance, creating problems in places the designers were determined no problems could ever exist – or ever be visible.

Accounting for truth

Conventional financial accounting is based on double entry book-keeping, which requires every transaction to be entered twice, in different places so that the accounts as a whole remain in balance. There may be a single, definitive statement of profit, but that is distilled from multiple sources, with an intricate web of balances and documented, supporting assumptions. The whole thing is therefore verifiable, or auditable. But it’s not truth. It’s more a matter of saying “given these assumptions this set of transactions produces the following profit figure”. Vary the assumptions and you have a different and perhaps equally valid figure – so it’s not truth.
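
A minimal sketch of that double posting, with invented account names and amounts; it illustrates the principle rather than any real ledger system. Every transaction lands in two places, and the reconciliation check is simply that the whole ledger still sums to zero.

    # Each transaction is posted twice - a debit and a matching credit -
    # so a posting that only lands once shows up as an imbalance rather
    # than passing silently.
    ledger = []

    def post(debit_account, credit_account, amount):
        ledger.append((debit_account, amount))     # debit entry
        ledger.append((credit_account, -amount))   # matching credit entry

    post("bank", "premium income", 600)   # premium received
    post("claims paid", "bank", 250)      # claim settled

    # The books as a whole must balance.
    assert sum(amount for _, amount in ledger) == 0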

For many years academic accountants, e.g. Christopher Napier, have been doing fascinating work that strays over into philosophy. What is this reality that we are trying to understand? That’s ontology. What can we know about it, and what reliance can we put on that knowledge when we try to report it? That’s epistemology. Why are we doing it? That’s teleology.

The most interesting subject I ever studied in accountancy at university was the problem of inflation accounting. £6-£5=£1 might be a crude profit calculation for an item whose inputs cost you £5 and which you sold for £6. But what if the £5 was a cost incurred 11 months ago? You then buy replacement inputs, which now cost £7, but you’d still only be able to sell the finished product for £6. What does it mean to say you made a profit of £1? Who does that help? Couldn’t you also argue that you made a loss of £1?
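
The same worked example as a tiny sketch, using the figures above; it simply shows how the choice of cost basis flips the answer from a profit to a loss.

    # Historical cost versus replacement cost for the same sale.
    historical_cost = 5    # what the inputs cost 11 months ago (£)
    replacement_cost = 7   # what the same inputs cost today (£)
    selling_price = 6      # what the finished product sells for (£)

    print(selling_price - historical_cost)    #  1 -> "a profit of £1"
    print(selling_price - replacement_cost)   # -1 -> "a loss of £1"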

What does it mean to add money together when the different elements were captured at dates when the purchasing power equivalent of that money was different? You’re adding apples and oranges. The value of money is dependent on what it can buy. Setting aside short-term speculation, that is what dictates currency exchange rates. £1 is more valuable than €1 because it buys more. It is meaningless to add £1 + €1 and get 2. An individual currency has different values over time, so is it any more meaningful to add different monetary figures without considering what their value was at the time the data was captured?

The academics pointed out all the problems inflation caused and came up with possible, complicated solutions. However, the profession eventually decided it was all just too difficult and pretty much gave up, except for an international standard for accounting in countries experiencing hyper-inflation (defined as greater than 100% over three years, i.e. a persisting annual rate of at least 26%). As at the end of 2014 the qualifying countries are Belarus, Venezuela, Sudan, Iran and Syria (which has rather more to worry about than financial accounting). For the rest of the world, if you want to add 5 apples and 6 oranges, that’s fine. You’ve now got 11 pieces of fruit. Stop worrying and just do the job.

I’m the treasurer for a church, and I’m often asked how much money we’ve got. I never bother going to the online bank statement, because I know that what people really want to know is how much money is available. So I use the church accounts, which factor in the income and payments that haven’t been cleared, and the money we’re due imminently, and the outgoings to which we’re already committed. These different figures all mesh together and provide a figure that we find useful, but which is different from the bank’s view of our balance. Our own accounts never rely on a single source of truth. There are multiple reconciliation checks to try and flag up errors. The hope is that inputting an incorrect amount will generate a visible error. We’re not reporting truth. All we can say is, so far as we know this is as useful and honest a statement of our finances as we can produce for our purposes, for the Church of Scotland, the Office of the Scottish Charity Regulator and the other stakeholders.
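
As a rough sketch of that calculation, with invented figures rather than our real accounts: the useful answer is assembled from several sources, not read off a single statement.

    # "How much money have we got?" - the figure people actually want is
    # the money available, not the bank's view of the balance.
    bank_balance        = 4200   # what the online statement shows
    uncleared_income    = 350    # income due or banked but not yet cleared
    uncleared_payments  = 120    # cheques written but not yet cashed
    committed_outgoings = 900    # bills we already know we must pay

    available = (bank_balance + uncleared_income
                 - uncleared_payments - committed_outgoings)
    print(available)   # 3530 - different from the bank balance, and more useful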

It’s messy and complex – deal with it

What’s it all got to do with testing? If your vision of testing is checking whether the apparent functionality is consistent with the specification as represented in the test script then this sort of messy complexity is a tedious distraction. It’s so much easier to pretend you can confirm the truth using a test script.

However, testing is (or should be) a difficult and intellectually demanding process of teasing out the implications of the application for the stakeholders. If you accept that, then you are far more likely to do something valuable if you stop thinking about any single source of truth. You should be thinking instead about possible sources of insight to help you shed light on the various “truths” that the various stakeholders are seeking. Understanding these different needs, and all the nuances that arise from them, is essential for testers.

Assuming that there is a single truth that we can attest to with a simple, binary yes/no answer reduces testing to the level of the accountants who have tried to treat accountancy as a simple arithmetical exercise. Five oranges and six apples add up to eleven pieces of fruit; and so do eleven grapes, and eleven melons. So what? That is a useless and misleading piece of information, like the unqualified statement that the product is sound because we found what the script told us to look for. Testers, accountants and auditors all pick up good money because they are required to provide valuable information to people who need it. They should be expected to deal with messy, complex reality. They should not be allowed to get away with trying to redefine reality so it’s easier to handle.

They can’t handle the truth

Have you ever had to deal with managers or users who were sceptical about the time and effort a piece of work would take? Have you ever complained in vain about a project that was clearly doomed to fail right from the start? Have you ever felt that a project was being planned on the basis of totally unjustified optimism?

If you’ve been in IT for a while there’s a good chance you’ve answered “yes” to at least one of these questions. Over the years I grew wearily familiar with the pattern of wilful refusal to consider anything but the happy path to a smooth, speedy delivery of everything on the wish list, within a comical budget that is challenging, I admit, but realistic if we all pull together.

Over time I gradually came to realise that many senior managers and stakeholders didn’t want the truth. They wanted the fiction; they wanted to be lied to, because knowing the truth would make them responsible for dealing with it. In their world it is better to be deceived and then firefight a failing project than to deal honestly with likely problems and uncertainty. Above all, they can’t bring themselves to deal with the truth of uncertainty. It is far more comfortable to pretend that uncertainty is evidence of lack of competence, that problems can be anticipated, that risks can be ignored or managed out of existence, that complexity can be eliminated by planning and documentation (and by standards).

Telling the truth – a brave act in an unfair world

Perhaps the toughest roles in IT are those that are senior enough to be accountable for the results, but too junior to beat uncomfortable truths into the brains of those who really don’t want to know.

These budding fall guys have the nous and experience to see what is going to happen. One of the rarely acknowledged skills of these battle-scarred veterans is the ability to judge the right moment and the right way to start shouting the truth loudly. Reveal all too early and they can be written off as negative, defeatist, “not a team player”. Reveal it too late and they will be castigated for covering up imminent failure, and for failing to comply with some standard or process. Everyone fails to comply. Not everyone is going to be kicked for it, but late deliverers of bad news are dead meat.

Of course that’s not fair, but that’s hardly the point. Fairness isn’t relevant if the culture is one where rationality, prudence and pragmatism all lead to crazy behaviour because that is what is rewarded. People rationally adapt to the requirement to stop thinking when they see others being punished for honesty and insight.

What is an estimate?

So what’s the answer? The easy one is, “run, and run fast”. Get out and find a healthier culture. However, if you’re staying then you have to deal with the problem of handling senior people who can’t handle the truth.

It is important to be clear in your own mind about what you are being asked for when you have to estimate. Is it a quote? Is there an implied instruction that something must be delivered by a certain date? Are there certain deliverables that are needed by that date, and others that can wait? Could it be a starting point for negotiation? See this article I wrote a few years ago.

Honesty is non-negotiable

It’s a personal stance, but honesty about uncertainty and the likelihood of serious but unforeseeable problems is non-negotiable. I know others have thought I have a rather casual attitude towards job security and contract renewal! However, I can’t stomach the idea of lingering for years in an unhealthy culture. And it’s not as if honesty means telling the senior guys who don’t want the truth that they are morons (even if they are).

Honesty requires clear thinking, and careful explanation of doubt and uncertainty. It means being a good communicator, so that the guys who take the big decisions have a better understanding that your problems will quickly become their problems. It requires careful gathering of relevant information if you are ordered into a project death march so that you can present a compelling case for a rethink when there might still be time for the senior managers and stakeholders to save face. Having the savvy to help the deliberately ignorant to handle the truth really is a valuable skill. Perhaps Jack Nicholson’s character from “A Few Good Men” isn’t such a great role model, however. His honesty in that memorable scene resulted in him being arrested!

Why do you need the report?

Have you ever wondered what the purpose of a report was, whether it was a status report that you had to complete, or a report generated by an application? You may have wondered if there was any real need for the report, and whether anyone would miss it if no-one bothered to produce it.

I have come across countless examples of reports that seemed pointless. What was worse, their existence shaped the job we had to do. The reports did not help people to do the job. They dictated how we worked; production, checking and filing of the reports for future inspection were a fundamental part of the job. In any review of the project, or of our performance, they were key evidence.

My concern, and cynicism, were sharpened by an experience as an auditor when I saw at first hand how a set of reports was defined for a large insurance company. To misquote Otto von Bismarck’s comment on the creation of laws: reports are like sausages, it is best not to see them being made.

The company was developing a new access control system, to allow managers to assign access rights and privileges to staff who were using the various underwriting, claims and accounts applications. As an auditor I was a stakeholder, helping to shape the requirements and advising on the controls that might be needed and on possible weaknesses that should be avoided.

One day I was approached by the project manager and a user from the department that defined the working practices at the hundred or so branch offices around the UK and Republic of Ireland. “What control reports should the access control system provide?” was their question.

I said that was not my decision. The reports could not be treated as a bolt on addition to the system. They should not be specified by auditors. The application should provide managers with the information they needed to do their jobs, and if it wasn’t feasible to do that in real time, then reports should be run off to help them. It all depended on what managers needed, and that depended on their responsibilities for managing access. The others were unconvinced by my answer.

A few weeks later the request for me to specify a suite of reports was repeated. Again I declined. This time the matter was escalated. The manager of the branch operations department sat in on the meeting. He made it clear that a suite of reports must be defined and coded by the end of the month, ready for the application to go live.

He was incredulous that I, as an auditor, would not specify the reports. His reasoning was that when auditors visited branches they would presumably check to see whether the reports had been signed and filed. I explained that it was the job of his department to define the jobs and responsibilities of the branch managers, and to decide what reports these managers would need in order to fulfil their responsibilities and do their job.

The manager said that was easy; it was the responsibility of the branch managers to look at the reports, take action if necessary, then sign the reports and file them. That was absurd. I tried to explain that this was all back to front. At the risk of stating the obvious, I pointed out that reports were required only if there was a need for them. That need had to be identified so that the right reports could be produced.

I was dismissed as a troublesome timewaster. The project manager was ordered to produce a suite of reports, “whatever you think would be useful”. The resulting reports were simply clones of the reports that came out from an older access control system, designed for a different technical and office environment, with quite different working practices.

The branch managers were then ordered to check them and file them. The branch operations manager had taken decisive action. The deadline was met. Everyone was happy, except the poor branch managers who had to wade through useless reports, and of course the auditors. We were dismayed at the inefficiency and sheer pointlessness of producing reports without any thought about what their purpose was.

That highlighted one of the weaknesses of auditors. People invariably listened to us if we pointed out that something important wasn’t being done. When we said that something pointless was being done there was usually reluctance to stop it.

Anything that people have got used to doing, even if it is wasteful, ineffective and inefficient, acquires its own justification over time. The corporate mindset can be “this is what we do, this is how we do it”. The purpose of the corporate bureaucracy becomes the smooth running of the bureaucracy. Checking reports was a part of a branch manager’s job. It required a mental leap to shift to a position where you have to think whether reports are required, and what useful reporting might comprise. It’s so much easier to snap, “just give us something useful” and move on. That’s decisive management. That’s what’s rewarded. Thinking? Sadly, that can be regarded as a self-indulgent waste of time.

However, few things are more genuinely wasteful of the valuable time of well paid employees than reporting that has no intrinsic value. Reporting that forces us to adapt our work to fit the preconceptions of the report designer gobbles up huge amounts of time and stops us doing work that could be genuinely valuable. The preconceptions that underpin many reports and metrics may once have been justified, and have fitted in with contemporary working practices. However, these preconceptions need to be constantly challenged and re-assessed. Reports and metrics do shape the way we work, and the way we are assessed. So we need to keep asking, “just why do you need the report?”

DRE: changing reality so we can count it

It’s usually true that our attitudes and beliefs are shaped by our early experiences. That applies to my views on software development and testing. My first experience of real responsibility in development and testing was with insurance financial systems. What I learned and experienced will always remain with me. I have always struggled with some of the tenets of traditional testing, and in particular the metrics that are often used.

There has been some recent discussion on Twitter about Defect Removal Efficiency. It was John Stephenson’s blog that set me thinking once again about DRE, a metric I’d long since consigned to my mental dustbin.

If you’re unfamiliar with the metric, it is the number of defects found before implementation, expressed as a percentage of all the defects discovered, i.e. the development defects plus the live defects reported within a certain period after going live. The cut-off is usually 90 days from implementation. So the more defects reported in testing and the fewer in live running, the higher the percentage, and the higher the quality (supposedly). A perfect application would have no live defects and therefore a DRE score of 100%; all defects were found in testing.
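
As a rough illustration of the arithmetic, and only of the arithmetic, here is a minimal sketch with invented defect counts. It also hints at how easily the score moves: log more trivial findings as formal defects during testing and the figure rises without the product changing at all.

    # DRE = defects found before implementation as a percentage of all
    # defects, where "all" means development defects plus live defects
    # reported within the cut-off period (usually 90 days).
    def defect_removal_efficiency(found_before_live, found_live_in_cutoff):
        total = found_before_live + found_live_in_cutoff
        return 100.0 * found_before_live / total

    print(defect_removal_efficiency(190, 10))   # 95.0 -> "95% DRE"
    print(defect_removal_efficiency(380, 10))   # ~97.4 - same product, more paperwork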

John’s point was essentially that DRE can be gamed so easily that it is worthless. I agree. However, even if testers and developers tried not to manipulate DRE, even if it couldn’t be gamed at all it would still be an unhelpful and misleading metric. It’s important to understand why so we can exercise due scepticism about other dodgy metrics, and flawed approaches to software development and testing.

DRE is based on a view of software development, testing and quality that I don’t accept. I don’t see a world in which such a metric might be useful, and it contradicts everything I learned in my early days as a team leader, project manager and test manager.

Here are the four reasons I can’t accept DRE as a valid metric. There are other reasons, but these are the ones that matter most to me.

Software development is not a predictable, sequential manufacturing activity

DRE implicitly assumes that development is like manufacturing, that it’s a predictable exercise in building a well understood and defined artefact. At each stage of the process defects should be progressively eliminated, till the object is completed and DRE should have reached 95% (or whatever).

You can see this sequential mindset clearly in this article by Capers Jones, “Measuring Defect Potentials and Defect Removal Efficiency” (PDF, opens in new tab) from QA Journal in 2008.

“In order to achieve a cumulative defect removal efficiency of 95%, it will be necessary to use approximately the following sequence of at least eight defect removal activities:

• Design inspections
• Code inspections
• Unit test
• New function test
• Regression test
• Performance test
• System test
• External Beta test

To go above 95%, additional removal stages will be needed. For example requirements inspections, test case inspections, and specialized forms of testing such as human factors testing, performance testing, and security testing add to defect removal efficiency levels.”

Working through sequential “removal stages” is not software development or testing as I recognise them. When I was working on these insurance finance systems there was no neat sequence through development with defects being progressively removed. Much of the early development work could have been called proof of concept. It wasn’t a matter of coding to a specification and then unit testing against that spec. We were discovering more about the problem and experimenting to see what would work for our users, and inevitably many of those experiments didn’t work out.

Each of these “failures” was a precious nugget of extra information about the problem we were trying to solve. The idea that we would have improved quality by recording everything that didn’t work and calling it a defect would have been laughable. Yet this is the implication of another statement by Capers Jones in a paper on the International Function Point Users Group website (December 2012), “Software Defect Origins and Removal Methods” (PDF, opens in new tab).

“Omitting bugs found in requirements, design, and by unit testing are common quality omissions.”

So experimenting to learn more about the problem without treating the results as formal defects is a quality omission? Tying up developers and testers in bureaucracy by extending formal defect management into unit testing is the way to better quality? I don’t think so.

Once we start to change the way people work simply so that we can gather data for metrics, we are not merely encouraging them to game the system. It is worse than that. We are trying to change reality to fit our ability to describe it. We are pretending we can change the territory to fit the map.

Quality is not an absence of something

My second objection to DRE in principle is quite simple. It misrepresents quality. “Quality is value to some person”, as Jerry Weinberg famously said in his book “Quality Software Management: Systems Thinking”.

The insurance applications we were developing were intended to help our users understand the business and products better so that they could take better decisions. The quality of the applications was a matter of how well they helped our users to do that. These users were very smart and had a very clear idea of what they were doing and what they needed. They would have bluntly and correctly told us we were stupid and trying to confuse matters by treating quality as an absence of defects. That takes me on to my next objection to DRE.

Defects are not interchangeable objects

A defect is not an object. It possesses no qualities except those we choose to grant it in specific circumstances. In the case of my insurance applications a defect was simply something we didn’t understand that required investigation. It might be a problem with the application, or it might be some feature of the real world that we hadn’t known about and which would require us to change the application to handle it.

We never counted defects. What is the point of adding up things I don’t understand or don’t know about? I don’t understand quantum physics and I don’t know off hand what colour socks my wife is wearing today. Adding the two pieces of ignorance together to get two is not helpful.

Our acceptance criteria never mentioned defect numbers. The criteria were expressed in accuracy targets against specific oracles, e.g. we would have to reconcile our figures to within 5% of the general ledger. What was the basis for the 5% figure? Our users knew from experience that 95% accuracy was good enough to let them take significantly better decisions than they could without the application. 100% was an ideal, but the users knew that the increase in development time to try and reach that level of accuracy would impose a significant business cost because crucial decisions would have had to be taken blindfolded while we tried to polish up a perfect application.
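
A minimal sketch of that style of acceptance check, with invented totals and a hypothetical helper; the point is that the criterion is an accuracy tolerance against an oracle, the general ledger, rather than a count of defects.

    # Reconcile the application's total against the general ledger and
    # flag anything outside the agreed 5% tolerance.
    def within_tolerance(app_total, ledger_total, tolerance=0.05):
        return abs(app_total - ledger_total) <= tolerance * ledger_total

    print(within_tolerance(972000, 1000000))   # True  - within 5%, acceptable
    print(within_tolerance(930000, 1000000))   # False - investigate as a priority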

If there was time we would investigate discrepancies even within the 5% tolerance. If we went above 5% in testing or live running then that was a big deal and we would have to respond accordingly.

You may think that this was a special case. Well yes, but every project has its own business context and user needs. DRE assumes a standard world in which 95% DRE is necessarily better than 90%. The additional cost and delay of chasing that extra 5% could mean the value of the application to the business is greatly reduced. It all depends. Using DRE to compare the quality of different developments assumes that a universal, absolute standard is more relevant than the needs of our users.

Put simply, when we developed these insurance applications, counting defects added nothing to our understanding of what we were doing or our knowledge about the quality of the software. We didn’t count test cases either!

DRE has a simplistic, standardised notion of time

This problem is perhaps related to my earlier objection that DRE assumes developers are manufacturing a product, like a car. Once it rolls off the production line it should be largely defect free. The car then enters its active life and most defects should be revealed fairly quickly.

That analogy made no sense for insurance applications, which are highly date sensitive. Insurance contracts might be paid for up front, or in instalments, but they earn money on a daily basis. At the end of the contract period, typically a year, they have to be renewed. The applications consist of different elements performing distinct roles according to different timetables.

DRE requires an arbitrary cut-off beyond which you stop counting the live defects and declare a result. It’s usually 90 days. Applying a 90-day cut-off for calculating DRE and using that as a measure of quality would have been ridiculous for us. Worse, if that had been a measure for which we were held accountable it would have distorted important decisions about implementation. With new insurance applications you might convert all the data from the old application when you implement the new one. Or you might convert policies as they come up for renewal.

Choosing the right tactics for conversion and implementation was a tricky exercise balancing different factors. If DRE with a 90 day threshold were applied then different tactics would give different DRE scores. The team would have a strong incentive to choose the approach that would produce the highest DRE score, and not necessarily the one that was best for the company.
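
To illustrate the point, here is a rough sketch with entirely invented numbers. Assume the same application with the same underlying faults. A big-bang conversion exercises most policies against the new code early, while converting on renewal spreads that exposure across the year, so many of the same live defects only surface after the 90-day cut-off, and the DRE score looks better for reasons that have nothing to do with quality.

    found_in_testing = 180

    # Live defects, listed by the day (after go-live) on which they are found.
    big_bang_discoveries   = [5, 12, 20, 33, 41, 60, 75, 88, 95, 130]
    on_renewal_discoveries = [40, 85, 120, 150, 190, 230, 260, 300, 330, 360]

    def dre(testing_defects, live_discovery_days, cutoff=90):
        live_counted = sum(1 for day in live_discovery_days if day <= cutoff)
        return 100.0 * testing_defects / (testing_defects + live_counted)

    print(round(dre(found_in_testing, big_bang_discoveries), 1))    # 95.7
    print(round(dre(found_in_testing, on_renewal_discoveries), 1))  # 98.9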

Now of course you could tailor the way DRE is calculated to take account of individual projects, but the whole point of DRE is that people who should know better want to make comparisons across different projects, organisations and industries and decide which produces greater quality. Once you start allowing for all these pesky differences you undermine that whole mindset that wants to see development as a manufacturing process that can be standardised.

DRE matters – for the wrong reasons

DRE might be flawed beyond redemption but metrics like that matter to important people for all the wrong reasons. The logic is circular. Development is like manufacturing, therefore a measure that is appropriate for manufacturing should be adopted. Once it is being used to beat up development shops that score poorly, they have an incentive to distort their processes to fit the measure. You have to buy in the consultancy support to adapt the way you work. The flawed metric then justifies the flawed assumptions that underpin the metric. It might be logical nonsense, but there is money to be made there.

So DRE is meaningless because it can be gamed? Yes, indeed, but any serious analysis of the way DRE works reveals that it would be a lousy measure, even if everyone tries to apply it responsibly. Even if it were impossible to game it would still suck. It’s trying to redefine reality so we can count it.