It’s confession time. I was a Y2K test manager for IBM. As far as some people are concerned that means I was party to a huge scam that allowed IT companies to make billions out of spooking poor deluded politicians and the public at large. However, my role in Y2K means I know what I am talking about, so when I saw some recent comment that it was all nothing more than hype I felt the need to set down my first hand experience. At the time, and in the immediate aftermath of Y2K, we were constrained by client confidentiality from explaining what we did, but 15 years on I feel comfortable about speaking out.
Was there a huge amount of hype? Unquestionably.
Was money wasted? Certainly, but show me huge IT programmes where that hasn’t happened.
Would it have been better to do nothing and adopt a “fix on failure” approach? No, emphatically not as a general rule and I will explain why.
There has been a remarkable lack of studies of Y2K and the effectiveness of the actions that were taken to mitigate the problem. The field has been left to those who saw few serious incidents and concluded that this must mean there could have been no serious problem to start with.
The logic runs as follows. Action was taken in an attempt to turn outcome X into outcome Y. The outcome was Y. Therefore X would not have happened anyway and the action was pointless. The fallacy is so obvious it hardly needs pointing out. If action was pointless then the critics have to demonstrate why the action that was taken had no impact and why outcome Y would have happened regardless. In all the years since 2000 I have seen only unsubstantiated assertion and reference to those countries, industries and sectors where Y2K was not going to be a signficant problem anyway. The critics always ignore the sectors where there would have been massive damage.
An academic’s flawed perspective
This quote from Anthony Finkelstein, professor of software systems engineering at University College London, on the BBC website, is typical of the critics’ reasoning.
”The reaction to what happened was that of a tiger repellent salesman in Golders Green High Street,” says Finkelstein. ‘No-one who bought my tiger repellent has been hurt. Had it not been for my foresight, they would have.’ “
The analogy is presumably flippant and it is entirely fatuous. There were no tigers roaming the streets of suburban London. There were very significant problems with computer systems. Professor Finkelstein also used the analogy back in 2000 (PDF, opens in new tab).
In that paper he made a point that revealed he had little understanding of how dates were being processed in commercial systems.
”In the period leading up to January 1st those who had made dire predictions of catastrophe proved amazingly unwilling to adjust their views in the face of what was actually happening. A good example of this was September 9th 1999 (9/9/99). On this date data marked “never to expire” (realised as expiry 9999) would be deleted bringing major problems. This was supposed to be a pre-shock that would prepare the way for the disaster of January 1st. Nothing happened. Now, if you regarded the problem as a serious threat in the first place, this should surely have acted as a spur to some serious rethinking. It did not.”
I have never seen a date stored in the way Finkelstein describes, 9th September 1999 being held as 9999. If that were done there would be no way to distinguish 1st December 2014 from 11th February 2014. Both would be 1122014. Dates are held either in the form 090999, with leading zeroes so the dates can be interpreted correctly, or with days, months and years in separate sub-fields for simpler processing. Programmers who flooded date fields with the integer 9 would have created 99/99/99, which could obviously not be interpreted as 9th September 1999.
Anyway, the main language of affected applications was Cobol, and the convention was for programmers to move “high values”, i.e. the highest possible value the compiler could handle, into the field rather than nines. “High values” doesn’t translate into any date. Why doesn’t Finkelstein know this sort of basic thing if he’s setting himself up as a Y2K expert? I never heard any concern about 9/9/99 at the time, and it certainly never featured in our planning or work. It is a straw man, quite irrelevant to the main issue.
In the same paper from 2000 Finkelstein made another claim that revealed his lack of understanding of what had actually been happening.
September 9th 1999 is only an example. Similar signs should have been evident on January 1st 1999, the beginning of the financial year 99-00, December 1st, and so on. Indeed assuming, as was frequently stated, poor progress had been made on Y2K compliance programmes we would have anticipated that such early problems would be common and severe. I see no reason to suppose that problems should not have been more frequent (or at any rate as frequent) in the period leading up to December 31st 1999 than afterwards given that transactions started in 1999 may complete in 2000, while after January 1st new transactions start and finish in the new millennium.
Finkelstein is entirely correct that the problem would not have suddenly manifested itself in January 2000, but he writes as if this is an insight the practitioners lacked at the front line. At General Accident the first critical date that we had to hit was the middle of October 1998, when renewal invitations for the first annual insurance contracts extending past December 1999 would be issued. At various points over the next 18 months until the spring of 2000 all the other applications would hit their trigger dates. Everything of significance had been fixed, tested and re-implemented by September 1999.
We knew that timetable because it was our job to know it. We were in trouble not because time was running out till 31/12/1999, but because we had little time before 15/10/1998. We made sure we did the right work at the right time so that all of the business critical applications were fixed in time. Finkelstein seems unaware of what was happening. A massed army of technical staff were dealing with a succession of large waves sweeping towards them over a long period, rather than a single tsunami at the millennium.
Academics like Finkelstein have a deep understanding of the technology and how it can, and should be used, but this is a different matter from knowing how it is being applied by practitioners acting under extreme pressure in messy and complex environments. These practitioners aren’t doing a bad job because of difficult conditions, lack of knowledge and insufficient expertise. They are usually doing a good job, despite those difficult conditions, drawing on vast experience and deep technical knowledge.
Comments such as those of Professor Finkelstein betray a lack of respect for practitioners, as if the only worthwhile knowledge is that possessed by academics.
What I did in the great Y2K “scare”
Let me tell you why I was recruited as a Y2K test manager by IBM. I had worked as a computer auditor for General Accident. A vital aspect of that role had been to understand how all the different business critical applications fitted together, so that we could provide an overview to the business. We could advise on the implications and risks of amending applications, or building new ones to interface with the existing applications.
Shortly before General Accident’s Y2K programme kicked off I was transferred to IBM under an outsourcing deal. General Accident wanted a review performed of a vital back office insurance claims system. The review had to establish whether the application should be replaced before Y2K, or converted. Senior management asked IBM that I should perform the review because I was considered the person with the deepest understanding of the business and technical issues. The review was extremely urgent, but it was delayed by a month till I had finished my previous project.
I explained in the review exactly why the system was business critical and how it was vital to the company’s reserving, and therefore the production of the company accounts. I explained how the processing was all date dependent, and showed how and when it would fail. If the system was unavailable then the accountants and premium setters would be flying blind, and the external auditors would be unable to sign off the company accounts. The risks involved in trying to replace the application in the available time were unacceptable. The best option was therefore to make the application Y2K compliant. This advice was accepted.
As soon as I’d completed the review IBM moved me into a test management position on Y2K, precisely because I had all the business and technical experience to understand how eveything fitted together, and what the implications of Y2K would be.
I looked at the plans drawn up by the Y2K architects for my area, which included the complex finance systems on which I had been working. The plans were a hopelessly misleading over-simplification. There were only three broad systems defined, covering 1,175 modules. I explained that it was nonsense, but I couldn’t say for sure what the right answer was, just that it was a lot more.
Working over a weekend I wrote a suite of SAS programs to crawl through the production libraries, schedules, datasets and access control records to establish all the dependencies, relationships and outputs. I drew up an overview that identified 20 separate interfacing applications with 3,000 modules. This was for just one of the two areas I was responsible for, though the other one was simpler, neater, and more clearly defined.
For the first time we had a good understanding of the inventory, and which assets depended on each other. However, this was a shock to management because it had already been accepted that there would not be enough time to test the lower number thoroughly. Although I was nominally only the test manager I drew up the conversion strategy and timetable for all the applications within my remit, based on my accumulated experience and the new knowledge we’d derived from the inventory.
An insurance company’s processing is heavily date dependent. Premiums are earned on a daily basis, with the appropriate proportion being refunded if a policy is cancelled mid-term. Claims are paid only if the appropriate cover is in place on the date that the incident occurred. Income and expenditure might be paid on a certain date, but then spread over many years. If the date processing doesn’t work then the company can’t take in money, or pay it out. It cannot survive. The processing is so complex that individual errors in production often require lengthy investigation and fixing, and then careful testing. The notion that a “fix on failure” response to Y2K would have worked is risible.
We fixed the applications, taking a careful, triaged risk-based approach. The most date sensitive programs within the most critical applications received the most attention. Some applications were triaged out of sight. For these, “fix on failure” was appropriate.
We started by running unconverted applications with 1995 and 1996 dates to provide a baseline for comparison. 1996 was the last leap year of the 20th century. 2000 was also a leap year, and this was also a Y2K concern. There was a widespread belief in IT that 2000 was an exception. Century years, like 1900, are not leap years, but there is an exception to the exception. Century years that are divisible by 400 are leap years, so this had to be thoroughly checked and tested.
We rolled the input files forward four and eight years, then ran the converted applications across the end of 1999, into 2000 and again for 2003 to 2004. The dates in the output files were then set back four or eight years respectively and the results should have been identical to the 1996 output. The management of all this was complicated by the need for continuing “business as usual” changes to these applications while we were doing the Y2K work. Co-ordination of all these different versions of the same programs was a huge challenge. Outsiders have no idea of the complexities of configuration and release management in a big IT operation.
These simulations exposed many more problems not just with our code, but also with obscure elements of the operating systems and all the utility and housekeeping routines and tools. Investigating the causes of the discrepancies was time consuming and required deep technical expertise. Was the discrepancy a genuine error that needed to be fixed, or a complicaation we had not allowed for in our testing?
In these test runs we overrode the mainframe system date within the test runs. In the final stage of testing we went a step further. We booted up a mainframe LPAR (logical partition) to run with the future dates. I managed this exercise. We had a corner of the office with a sign saying “you are now entering 2000”, and everything was done with future dates. This exercise flagged up further problems with code that we had been confident would run smoothly and also flushed out problems with the way our systems interacted with the operating system. It exposed how much we didn’t know that we didn’t know!Y2K was a fascinating time in my career because I was at a point that I now recognise as a sweet spot. I was still sufficiently technically skilled to do anything that my team members could do, even being called on to fix overnight production problems. However, I was sufficiently confident, experienced and senior to be able to give presentations to the most senior managers explaining problems and what the appropriate solutions would be.
For these reasons I know what I’m talking about when I write that Y2K was a huge problem that had to be tackled. The UK’s financial sector would have suffered a massive blow if we had not fixed the problem. I can’t say how widespread the damage might have been, but I do know it would have been appalling.
My personal millennium experience
When I finished with Y2K in September 1999, at the end of the future mainframe exercise, at the end of a hugely pressurised 30 months, I negotiated seven weeks leave and took off to Peru. IBM could be a great employer at times! My job was done, and I knew that General Accident, or CGU as it had evolved into by then, would be okay. There would inevitably be a few glitches, but then there always are in IT.I was so relaxed about Y2K that on my return from Peru it was the least of my concerns. There was much more interesting stuff going on in my life. I got engaged in December 1999, and on 31st December Mary and I bought our engagement and wedding rings. That night we were at a wonderful party with our friends, and after midnight we were on Perth’s North Inch to watch the most spectacular fireworks display I’ve ever seen. 1st January 2000? It was a great day that I’ll always remember happily. It was far from being a disaster, and that was thanks to people like me.
PS – I have written a follow up article explaining why “fix on failure” was based on an infantile view of software failure. We were less concerned about the possibility of business critical systems crashing – we were confident we could stop that – than the very real danger that they would keep running but producing inaccurate financial output that might be hard to detect.