“Fix on failure” – a failure to understand failure

Wikipedia is a source that should always be treated with extreme scepticism and the article on the “Year 2000 problem” is a good example. It is now being widely quoted on the subject, even though it contains some assertions that are either clearly wrong, or implausible, and lacking any supporting evidence.

Since I wrote about ”Y2K – why I know it was a real problem” last week I’ve been doing more reading around the subject. I’ve been struck by how often I’ve come across arguments, or rather assertions, that a “fix on failure” response would have been the best response. Those who argue that Y2K was a big scare and a scam usually offer a rewording of this gem from the Wikipedia article.

”Others have claimed that there were no, or very few, critical problems to begin with, and that correcting the few minor mistakes as they occurred, the “fix on failure” approach, would have been the most efficient and cost-effective way to solve the problem.”

There is nothing to back up these remarkable claims, but Wikipedia now seems to be regarded as an authoritative source on Y2K. The first objection to these assertions is that the problems that occurred tell us nothing about those that were prevented. At the site where I worked as a test manager we triaged our work so that important problems were fixed in advance and trivial ones were left to be fixed when they occurred. So using the problems that did occur to justify “fix on failure” for all problems is a facile argument at best.

However, my objection to “fix on failure” runs deeper than that. The assertion that “fix on failure was the right approach for everything is infantile. Infantile? Yes, I use the word carefully. It ignores big practical problems that would have been obvious to anyone with experience of developing and supporting large, complicated applications. Perhaps worse, it betrays a dangerously naive understanding of “failure”, a misunderstanding that it shares with powerful people in software testing nowadays. Ok, I’m talking about the standards lobby there.

”Fix on failure” – deliberate negligence

Firstly, “fix on failure” doesn’t allow for the seriousness of the failure. As Larry Burkett wrote;

“It is the same mindset that believes it is better to put an ambulance at the bottom of a cliff rather than a guardrail at the top”.

“Fix on failure” could have been justified only if the problems were few and minor. That is a contentious assumption that has to be justified. However, the only justification on offer is that those problems which occurred would have been suitable for “fix on failure”. It is a circular argument lacking evidence or credibility, and crucially ignores all the serious problems that were prevented.

Once one acknowledges that there were a huge number of problems to be fixed one has to deal with the practical consequences of “fix on failure”. That approach does not allow for the difficulty of managing masses of simultaneous failures. These failures might not have been individually serious, but the accumulation might have been crippling. It would have been impossible to fix them all within acceptable timescales. There would have been insufficient staff to do the work in time.

Release and configuration management would have posed massive problems. If anyone tells you Y2K was a scam ask them how they would have handled configuration and release management when many interfacing applications were experiencing simultaneous problems. If they don’t know what you are talking about then they don’t know what they are talking about.

Of course not all Y2K problems would have occurred on 1st January 2000. Financial applications in particular would have been affected at various points in 1999 and even earlier. That doesn’t affect my point, however. There might have been a range of critical dates across the whole economy, but for any individual organisation there would have been relatively few, each of which would have brought a massive, urgent workload.

Attempting to treat Y2K problems as if they were run of the mill, “business as usual” problems, as advocated by sceptics, betrays appalling ignorance of how a big IT shop works. They are staffed and prepared to cope with a relatively modest level of errors and enhancements in their applications. The developers who support applications aren’t readily inter-changeable. They’re not fungible burger flippers. Supporting a big complicated application requires extensive experience with that application. Staff have to be rotated in and out carefully and piecemeal so that a core of deep experience remains.

IT installations couldn’t have coped with Y2K problems in the normal course of events any more than garages could cope if all cars started to have problems. The Ford workshops would be overwhelmed when the Fords started breaking down, the Toyota dealers would seize up when the Toyotas suffered.

The idea that “fix on failure” was a generally feasible and responsible approach simply doesn’t withstand scrutiny. Code that wasn’t Y2K-compliant could be spotted at a glance. It was then possible to predict the type of error that might arise, if not always the exact consequences. Why on earth would anyone wait to see if one could detect obscure, but potentially serious distortions? Why would anyone wait to let unfortunate citizens suffer or angry customers complain?

The Y2K sceptics argue that organisations took expensive pre-emptive action because they were scared of being sued. Well, yes, that’s true, and it was responsible. The sceptics were advocating a policy of conscious, deliberate negligence. The legal consequences would quite rightly have been appalling. “Fix on failure” was never a serious contribution to the debate.

”Fix on failure” – a childlike view of failure

The practical objections to a “fix on failure” strategy were all hugely significant. However, I have a deeper, fundamental objection. “Fix on failure” is a wholly misguided notion for anything but simple applications. It is based on a childlike, binary view of failure. We are supposed to believe an application is either right or wrong; it is working or it is broken; that if there is a Y2K problem then the application obligingly falls over. Really? That is not my experience.

With complicated financial applications an honest and constructive answer to the question “is the application correct?” would be some variant on “what do you mean by correct?”, or “I don’t know. It depends”. It might be possible to say the application is definitely not correct if it is producing obvious garbage. But the real difficulty is distinguishing between the seriously inaccurate, but plausible, and the acceptably accurate. Discussion of accuracy requires understanding of critical assumptions, acceptable margins of error, confidence levels, the nature and availability of oracles, and the business context of the application.

I’ve never seen any discussion of Y2K by one of the “sceptical” conspiracy theorists that showed any awareness of these factors. There is just the naïve assumption that a “failed” application is like a patient in a doctor’s surgery, saying “I’m sick, and here are my symptons”.

Complicated applications have to be nursed and constantly monitored to detect whether some new, extraneous factor, or long hidden bug, is skewing the figures. A failing application might appear to be working as normal, but it would be gradually introducing distortions.

Testing highly complicated applications is not a simple, binary exercise of determining “pass or fail”. Testing has to be a process of learning about the application and offering an informed opinion about what it is, and what it does. That is very different from checking it against our preconceptions, which might have been seriously flawed. Determining accuracy is more a matter of judgement than inspection.

Throughout my career I have seen failures and problems of all types, with many different causes. However, if there is a single common underlying theme then the best candidate would be the illusion that development is like manufacturing, with a predictable end product that can be checked. The whole development and testing process is then distorted to try and fit the illusion.

The advocates of Y2K “fix on failure” had much in common with the ISO 29119 standards lobby. Both shared that “manufacturing” mindset, that unwillingness to recognise the complexity of development, and the difficulty of performing good, effective testing. Both looked for certainty and simplicity where it was not available.

Good testers know that an application is not necessarily “correct” just because it has passed the checks on the test script. Likewise failure is not an absolute concept. Ignoring these truths is ignoring reality, trying to redefine it so we can adopt practices that seem more efficient and effective. I suspect the mantra that “fix on failure would have been more effective and efficient” has its roots with economists, like the Australian Quiggin, who wanted to assume complexity away. See this poor paper (PDF, opens in a new tab).

Doing the wrong thing is never effective. Negligence is rarely efficient. Reality is uncomfortable. We have to understand that and know what we are talking about before coming up with simplistic, snake-oil solutions that assume simplicity where the reality is complexity.


4 thoughts on ““Fix on failure” – a failure to understand failure

  1. I love the line in the “poor paper” you cite above: “Most of this (work on the Y2K issue) can be seen, in retrospect, to have been unproductive or, at least, misdirected.” The key words here include can be seen, and in retrospect. Can be seen by whom, precisely? Based what information or (zounds!) experience? Is there any other way it can be seen? “In retrospect” is a phrase that Thanksgiving diners can apply to their understanding of the world, but that turkeys cannot.

    Here’s an example of the “fix on failure” approach in action: http://www.pearsoneducation.nl/Laudon_9/pdf/E%20Case%20h9.pdf. What this article doesn’t mention was the cost of that failure: $20 million, if memory serves.

    Someone should read The Black Swan to the author of the poor paper.

    —Michael B.

    • In fairness to the author of that paper it wasn’t just hindsight. He was saying the same things in 1999. Then, in the aftermath, he went looking for evidence that would back up his predictions, while ignoring inconvenient evidence. It’s always the same. The “sceptics” draw their sample from the tiddlers that were never going to be seriously affected, while ignoring the big players who would have been crippled.

      Quiggin’s paper could be dissected line by line for errors and unsubstantiated claims. To be honest he totally lost me when he introduced the acronym TEOTWAWKI (the end of the world as we know it) right at the start. That’s not a serious academic paper.

      • In the Wikipedia article on Y2K Quiggin has another article cited, from the Australian Financial Review in 1999.

        In it he argues that a justification for “fix on failure” was “if it ain’t broke, don’t fix it”. I’m not kidding. We shouldn’t have fixed and tested applications. We should just have assumed “it ain’t broke” and adopted a “wait and see approach to the problem”. What? Even for business critical applications with non-compliant code that a trainee programmer could spot?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s