Risk mitigation versus optimism – Brexit & Y2K

The continuing Brexit shambles reminds me of a row in the approach to Y2K at the large insurer where I was working for IBM. Should a business critical back office system on which the company accounts depended be replaced or made Y2K compliant? I was brought in to review the problem and report with a recommendation.

One camp insisted that as an insurer they had to manage risk, so Y2K compliance with a more leisurely replacement was the only responsible option. The opposing camp consisted of business managers who had been assigned responsibility for managing new programmes. They would be responsible for a replacement and they insisted they could deliver a new system on time, even though they had no experience of delivering such an application. My investigation showed me they had no grasp of the business or technical complexities, but they firmly believed that waterfall projects could be forced through successfully by charismatic management. All the previous failures were down to “weak management” and “bad luck”. Making the old system compliant would be an insult to their competence.

My report pointed out the relative risks & costs of the options. I sold Y2K compliance to the UK Accountant, sketching out the implications of the various options on a flipchart in a 30 minute chat so I had agreement before I’d even finished the report. The charismatic crew were furious, but silenced. The old system was Y2K compliant in time. The proposed new one could not have been delivered when it was needed. It would have been sunk by problems with upstream dependencies I was aware of but the charismatics refused to acknowledge as being relevant.

If the charismatics’ solution had been chosen the company would have lost the use of a business critical application in late 1999. No contingency arrangements would have been possible and the company would have been unable to produce credible reserves, vital for an insurance company’s accounts. The external auditors would have been unable to pass the accounts. The share price would have collapsed and the company would have been sunk. I’m sure the charismatics would have blamed bad luck, and other people. “It was those dependencies, not us. We were let down”.

That was a large, public limited company. If my advice had been rejected the people who wanted the old system to be made Y2K compliant would have brought in the internal auditors, who in turn would have escalated their concern to the board’s audit committee if necessary. If there had still been no action they would have taken the matter to the external auditors.

That’s how things should work in a big corporation. Of course they often don’t and the auditors can lose their nerve, or choose to hope that things will work out well. There is at least a mechanism that can be followed if people decide to perform their job responsibly. With Brexit there is a cavalier unwillingness to think about risk and complexity that is reminiscent of those irresponsibly optimistic managers. We are supposed to trust politicians who can offer us nothing more impressive than “trust me” and “it’s their fault” and who are offering no clear contingency arrangements if their cheery optimism proves unfounded. There is a mechanism to hold them to account. That is the responsibility of Parliament. Will the House of Commons step up to the job? We’ll see.


“Fix on failure” – a failure to understand failure

Wikipedia is a source that should always be treated with extreme scepticism and the article on the “Year 2000 problem” is a good example. It is now being widely quoted on the subject, even though it contains some assertions that are either clearly wrong, or implausible, and lacking any supporting evidence.

Since I wrote about “Y2K – why I know it was a real problem” last week I’ve been doing more reading around the subject. I’ve been struck by how often I’ve come across arguments, or rather assertions, that a “fix on failure” response would have been the best response. Those who argue that Y2K was a big scare and a scam usually offer a rewording of this gem from the Wikipedia article.

“Others have claimed that there were no, or very few, critical problems to begin with, and that correcting the few minor mistakes as they occurred, the “fix on failure” approach, would have been the most efficient and cost-effective way to solve the problem.”

There is nothing to back up these remarkable claims, but Wikipedia now seems to be regarded as an authoritative source on Y2K. The first objection to these assertions is that the problems that occurred tell us nothing about those that were prevented. At the site where I worked as a test manager we triaged our work so that important problems were fixed in advance and trivial ones were left to be fixed when they occurred. So using the problems that did occur to justify “fix on failure” for all problems is a facile argument at best.

However, my objection to “fix on failure” runs deeper than that. The assertion that “fix on failure” was the right approach for everything is infantile. Infantile? Yes, I use the word carefully. It ignores big practical problems that would have been obvious to anyone with experience of developing and supporting large, complex applications. Perhaps worse, it betrays a dangerously naive understanding of “failure”, a misunderstanding that it shares with powerful people in software testing nowadays. OK, I’m talking about the standards lobby there.

“Fix on failure” – deliberate negligence

Firstly, “fix on failure” doesn’t allow for the seriousness of the failure. As Larry Burkett wrote:

“It is the same mindset that believes it is better to put an ambulance at the bottom of a cliff rather than a guardrail at the top”.

“Fix on failure” could have been justified only if the problems were few and minor. That is a contentious assumption that has to be justified. However, the only justification on offer is that those problems which occurred would have been suitable for “fix on failure”. It is a circular argument lacking evidence or credibility, and crucially ignores all the serious problems that were prevented.

Once one acknowledges that there were a huge number of problems to be fixed one has to deal with the practical consequences of “fix on failure”. That approach does not allow for the difficulty of managing masses of simultaneous failures. These failures might not have been individually serious, but the accumulation might have been crippling. It would have been impossible to fix them all within acceptable timescales. There would have been insufficient staff to do the work in time.

Release and configuration management would have posed massive problems. If anyone tells you Y2K was a scam ask them how they would have handled configuration and release management when many interfacing applications were experiencing simultaneous problems. If they don’t know what you are talking about then they don’t know what they are talking about.

Of course not all Y2K problems would have occurred on 1st January 2000. Financial applications in particular would have been affected at various points in 1999 and even earlier. That doesn’t affect my point, however. There might have been a range of critical dates across the whole economy, but for any individual organisation there would have been relatively few, each of which would have brought a massive, urgent workload.

Attempting to treat Y2K problems as if they were run-of-the-mill, “business as usual” problems, as advocated by sceptics, betrays appalling ignorance of how a big IT shop works. Such shops are staffed and prepared to cope with a relatively modest level of errors and enhancements in their applications. The developers who support applications aren’t readily interchangeable. They’re not fungible burger flippers. Supporting a large complex application requires extensive experience with that application. Staff have to be rotated in and out carefully and piecemeal so that a core of deep experience remains.

IT installations couldn’t have coped with Y2K problems in the normal course of events any more than garages could cope if all cars started to have problems. The Ford workshops would be overwhelmed when the Fords started breaking down, the Toyota dealers would seize up when the Toyotas suffered.

The idea that “fix on failure” was a generally feasible and responsible approach simply doesn’t withstand scrutiny. Code that wasn’t Y2K-compliant could be spotted at a glance. It was then possible to predict the type of error that might arise, if not always the exact consequences. Why on earth would anyone wait to see if one could detect obscure, but potentially serious distortions? Why would anyone wait to let unfortunate citizens suffer or angry customers complain?
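To make that concrete, here is a minimal Python sketch of the kind of two-digit year arithmetic that could be spotted on sight, and of the errors it predictably produces. The function names are hypothetical, and the pivot year of 50 in the windowing fix is purely an assumption for illustration. Windowing was one common remediation technique, but nothing here is taken from any particular system.

```python
def years_elapsed_2digit(start_yy: int, end_yy: int) -> int:
    """Non-compliant: subtracts two-digit years exactly as stored."""
    return end_yy - start_yy


def expand_year(yy: int, pivot: int = 50) -> int:
    """Windowing: read years below the (assumed) pivot as 20xx, others as 19xx."""
    return 2000 + yy if yy < pivot else 1900 + yy


def years_elapsed_windowed(start_yy: int, end_yy: int, pivot: int = 50) -> int:
    """Compliant variant: expand both years to four digits before subtracting."""
    return expand_year(end_yy, pivot) - expand_year(start_yy, pivot)


# A policy that started in 1995, evaluated in 2000.
broken = years_elapsed_2digit(95, 0)    # -95: a negative duration
fixed = years_elapsed_windowed(95, 0)   # 5

# Sorting on the raw two-digit year puts 2000 and 2001 before 1999.
raw_order = sorted([99, 0, 1])          # [0, 1, 99]
true_order = sorted([99, 0, 1], key=expand_year)  # [99, 0, 1]
```

Both error types, the negative interval and the scrambled sort order, can be predicted from reading the code. What cannot always be predicted is what downstream programs will do with a duration of -95 years.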

The Y2K sceptics argue that organisations took expensive pre-emptive action because they were scared of being sued. Well, yes, that’s true, and it was responsible. The sceptics were advocating a policy of conscious, deliberate negligence. The legal consequences would quite rightly have been appalling. “Fix on failure” was never a serious contribution to the debate.

“Fix on failure” – a childlike view of failure

The practical objections to a “fix on failure” strategy were all hugely significant. However, I have a deeper, fundamental objection. “Fix on failure” is a wholly misguided notion for anything but simple applications. It is based on a childlike, binary view of failure. We are supposed to believe an application is either right or wrong; it is working or it is broken; that if there is a Y2K problem then the application obligingly falls over. Really? That is not my experience.

With complex financial applications an honest and constructive answer to the question “is the application correct?” would be some variant on “what do you mean by correct?”, or “I don’t know. It depends”. It might be possible to say the application is definitely not correct if it is producing obvious garbage. But the real difficulty is distinguishing between the seriously inaccurate, but plausible, and the acceptably inaccurate that is good enough to be useful. Discussion of accuracy requires understanding of critical assumptions, acceptable margins of error, confidence levels, the nature and availability of oracles, and the business context of the application.

I’ve never seen any discussion of Y2K by one of the “sceptical” conspiracy theorists that showed any awareness of these factors. There is just the naïve assumption that a “failed” application is like a patient in a doctor’s surgery, saying “I’m sick, and here are my symptoms”.

Complex applications have to be nursed and constantly monitored to detect whether some new, extraneous factor, or long hidden bug, is skewing the figures. A failing application might appear to be working as normal, but it would be gradually introducing distortions.
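A small, deliberately artificial Python sketch shows how such a silent distortion can arise. Everything in it is an assumption made for illustration: the `Policy` record, the 5% simple interest rate, the defensive clamp in the legacy calculation and the pivot year of 50 used by the independent check. The point is only that nothing falls over. The legacy total is positive and plausible, and it takes comparison against an independent oracle to show that it is wrong.

```python
from dataclasses import dataclass

RATE_PERCENT = 5  # assumed simple annual interest rate, illustration only


@dataclass
class Policy:
    premium: int   # whole currency units, to keep the arithmetic exact
    start_yy: int  # two-digit year, as stored in the hypothetical record
    end_yy: int


def legacy_interest(p: Policy) -> int:
    # A defensive clamp turns negative durations into zero, so a policy
    # that crosses the century quietly earns nothing instead of crashing.
    years = max(0, p.end_yy - p.start_yy)
    return p.premium * RATE_PERCENT * years // 100


def oracle_interest(p: Policy, pivot: int = 50) -> int:
    # Independent check: expand to four-digit years before subtracting.
    def full(yy: int) -> int:
        return 2000 + yy if yy < pivot else 1900 + yy

    years = full(p.end_yy) - full(p.start_yy)
    return p.premium * RATE_PERCENT * years // 100


portfolio = [
    Policy(1000, 90, 98),  # 1990 to 1998
    Policy(1000, 92, 99),  # 1992 to 1999
    Policy(1000, 96, 1),   # 1996 to 2001, crosses the century
]

legacy_total = sum(legacy_interest(p) for p in portfolio)  # 750
oracle_total = sum(oracle_interest(p) for p in portfolio)  # 1000

# A naive "did it run and give a positive number?" check passes.
naive_check_passes = legacy_total > 0
# Only the comparison against the oracle exposes the shortfall.
shortfall = oracle_total - legacy_total                    # 250
```

In this toy portfolio the figure is understated by a quarter, yet no program has “failed” in the sense the sceptics assume. In a real application the oracle is rarely this cheap to build, which is exactly why the distortion can go unnoticed.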

Testing complex or highly complicated applications is not a simple, binary exercise of determining “pass or fail”. Testing has to be a process of learning about the application and offering an informed opinion about what it is, and what it does. That is very different from checking it against our preconceptions, which might have been seriously flawed. Determining accuracy is more a matter of judgement than inspection.

Throughout my career I have seen failures and problems of all types, with many different causes. However, if there is a single common underlying theme then the best candidate would be the illusion that development is like manufacturing, with a predictable end product that can be checked. The whole development and testing process is then distorted to try and fit the illusion.

The advocates of Y2K “fix on failure” had much in common with the ISO 29119 standards lobby. Both shared that “manufacturing” mindset, that unwillingness to recognise the complexity of development, and the difficulty of performing good, effective testing. Both looked for certainty and simplicity where it was not available.

Good testers know that an application is not necessarily “correct” just because it has passed the checks on the test script. Likewise failure is not an absolute concept. Ignoring these truths is ignoring reality, trying to redefine it so we can adopt practices that seem more efficient and effective. I suspect the mantra that “fix on failure would have been more effective and efficient” has its roots with economists, like the Australian Quiggin, who wanted to assume complexity away. See this poor paper.

Doing the wrong thing is never effective. Negligence is rarely efficient. Reality is uncomfortable. We have to understand that and know what we are talking about before coming up with simplistic, snake-oil solutions that assume simplicity where the reality is complexity.