Y2K – why I know it was a real problem

It’s confession time. I was a Y2K test manager for IBM. As far as some people are concerned that means I was party to a huge scam that allowed IT companies to make billions out of spooking poor deluded politicians and the public at large. However, my role in Y2K means I know what I am talking about, so when I saw recent comments that it was all nothing more than hype I felt the need to set down my first-hand experience. At the time, and in the immediate aftermath of Y2K, we were constrained by client confidentiality from explaining what we did, but 15 years on I feel comfortable about speaking out.

Was there a huge amount of hype? Unquestionably.

Was money wasted? Certainly, but show me huge IT programmes where that hasn’t happened.

Would it have been better to do nothing and adopt a “fix on failure” approach? No, emphatically not as a general rule, and I will explain why.

There has been a remarkable lack of studies of Y2K and the effectiveness of the actions that were taken to mitigate the problem. The field has been left to those who saw few serious incidents and concluded that this must mean there could have been no serious problem to start with.

The logic runs as follows. Action was taken in an attempt to turn outcome X into outcome Y. The outcome was Y. Therefore X would not have happened anyway and the action was pointless. The fallacy is so obvious it hardly needs pointing out. If the action was pointless then the critics have to demonstrate why the action that was taken had no impact and why outcome Y would have happened regardless. In all the years since 2000 I have seen only unsubstantiated assertion and reference to those countries, industries and sectors where Y2K was not going to be a significant problem anyway. The critics always ignore the sectors where there would have been massive damage.

An academic’s flawed perspective

This quote from Anthony Finkelstein, professor of software systems engineering at University College London, on the BBC website, is typical of the critics’ reasoning.

“The reaction to what happened was that of a tiger repellent salesman in Golders Green High Street,” says Finkelstein. “No-one who bought my tiger repellent has been hurt. Had it not been for my foresight, they would have.”

The analogy is presumably flippant and it is entirely fatuous. There were no tigers roaming the streets of suburban London. There were very significant problems with computer systems. Professor Finkelstein also used the analogy back in 2000 (PDF, opens in new tab).

In that paper he made a point that revealed he had little understanding of how dates were being processed in commercial systems.

“In the period leading up to January 1st those who had made dire predictions of catastrophe proved amazingly unwilling to adjust their views in the face of what was actually happening. A good example of this was September 9th 1999 (9/9/99). On this date data marked “never to expire” (realised as expiry 9999) would be deleted bringing major problems. This was supposed to be a pre-shock that would prepare the way for the disaster of January 1st. Nothing happened. Now, if you regarded the problem as a serious threat in the first place, this should surely have acted as a spur to some serious rethinking. It did not.”

I have never seen a date stored in the way Finkelstein describes, 9th September 1999 being held as 9999. If that were done there would be no way to distinguish 1st December 2014 from 11th February 2014. Both would be 1122014. Dates are held either in the form 090999, with leading zeroes so the dates can be interpreted correctly, or with days, months and years in separate sub-fields for simpler processing. Programmers who flooded date fields with the integer 9 would have created 99/99/99, which could obviously not be interpreted as 9th September 1999.
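
To make the ambiguity concrete, here is a minimal Python sketch (purely illustrative, nothing to do with any real system I worked on) showing why dates packed without leading zeroes cannot be decoded reliably, and why a field flooded with nines is a sentinel rather than 9th September 1999.

    # Illustrative only: why fixed-width date fields matter.

    def pack_no_padding(day: int, month: int, year: int) -> str:
        """Pack a date without leading zeroes -- ambiguous."""
        return f"{day}{month}{year}"

    def pack_fixed_width(day: int, month: int, year: int) -> str:
        """Pack a date as DDMMYYYY with leading zeroes -- unambiguous."""
        return f"{day:02d}{month:02d}{year:04d}"

    # 1st December 2014 and 11th February 2014 collide without padding...
    assert pack_no_padding(1, 12, 2014) == pack_no_padding(11, 2, 2014) == "1122014"

    # ...but stay distinct in a fixed-width field.
    assert pack_fixed_width(1, 12, 2014) == "01122014"
    assert pack_fixed_width(11, 2, 2014) == "11022014"

    # A field flooded with nines ("999999", or COBOL HIGH-VALUES) is a
    # "never expires" sentinel; it cannot be read back as 9th September 1999.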

Anyway, the main language of affected applications was Cobol, and the convention was for programmers to move “high values”, i.e. the highest possible value the compiler could handle, into the field rather than nines. “High values” doesn’t translate into any date. Why doesn’t Finkelstein know this sort of basic thing if he’s setting himself up as a Y2K expert? I never heard any concern about 9/9/99 at the time, and it certainly never featured in our planning or work. It is a straw man, quite irrelevant to the main issue.

In the same paper from 2000 Finkelstein made another claim that revealed his lack of understanding of what had actually been happening.

“September 9th 1999 is only an example. Similar signs should have been evident on January 1st 1999, the beginning of the financial year 99-00, December 1st, and so on. Indeed assuming, as was frequently stated, poor progress had been made on Y2K compliance programmes we would have anticipated that such early problems would be common and severe. I see no reason to suppose that problems should not have been more frequent (or at any rate as frequent) in the period leading up to December 31st 1999 than afterwards given that transactions started in 1999 may complete in 2000, while after January 1st new transactions start and finish in the new millennium.”

Finkelstein is entirely correct that the problem would not have suddenly manifested itself in January 2000, but he writes as if this is an insight the practitioners lacked at the front line. At General Accident the first critical date that we had to hit was the middle of October 1998, when renewal invitations for the first annual insurance contracts extending past December 1999 would be issued. At various points over the next 18 months until the spring of 2000 all the other applications would hit their trigger dates. Everything of significance had been fixed, tested and re-implemented by September 1999.

We knew that timetable because it was our job to know it. We were in trouble not because time was running out till 31/12/1999, but because we had little time before 15/10/1998. We made sure we did the right work at the right time so that all of the business critical applications were fixed in time. Finkelstein seems unaware of what was happening. A massed army of technical staff were dealing with a succession of large waves sweeping towards them over a long period, rather than a single tsunami at the millennium.
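
To make the arithmetic behind that first trigger date concrete, here is a small illustrative Python sketch; the 11-week lead time for producing renewal invitations is an assumption chosen for the example, not a figure from the actual programme.

    # Hedged illustration: how a year-2000 expiry date becomes an
    # October 1998 deadline. The 11-week lead time is assumed.
    from datetime import date, timedelta

    TERM = timedelta(days=365)             # annual policy
    INVITATION_LEAD = timedelta(weeks=11)  # assumed lead time for renewal invitations

    # The earliest annual policies whose cover runs past 31 December 1999
    # incept at the start of January 1999...
    inception = date(1999, 1, 1)
    assert inception + TERM > date(1999, 12, 31)

    # ...and their renewal invitations are produced well before inception.
    print(inception - INVITATION_LEAD)     # 1998-10-16, i.e. mid-October 1998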

Academics like Finkelstein have a deep understanding of the technology and how it can, and should, be used, but that is a different matter from knowing how it is actually applied by practitioners working under extreme pressure in messy, complex environments. Those practitioners are not doing a bad job because of difficult conditions, lack of knowledge or insufficient expertise. They are usually doing a good job despite those difficult conditions, drawing on vast experience and deep technical knowledge.

Comments such as those of Professor Finkelstein betray a lack of respect for practitioners, as if the only worthwhile knowledge is that possessed by academics.

What I did in the great Y2K “scare”

Let me tell you why I was recruited as a Y2K test manager by IBM. I had worked as a computer auditor for General Accident. A vital aspect of that role had been to understand how all the different business critical applications fitted together, so that we could provide an overview to the business. We could advise on the implications and risks of amending applications, or building new ones to interface with the existing applications.

A primary source – my report explaining the problem with a business critical application

Shortly before General Accident’s Y2K programme kicked off I was transferred to IBM under an outsourcing deal. General Accident wanted a review performed of a vital back office insurance claims system. The review had to establish whether the application should be replaced before Y2K, or converted. Senior management asked IBM to assign the review to me because I was considered the person with the deepest understanding of the business and technical issues. The review was extremely urgent, but it was delayed by a month until I had finished my previous project.

I explained in the review exactly why the system was business critical and how it was vital to the company’s reserving, and therefore the production of the company accounts. I explained how the processing was all date dependent, and showed how and when it would fail. If the system was unavailable then the accountants and premium setters would be flying blind, and the external auditors would be unable to sign off the company accounts. The risks involved in trying to replace the application in the available time were unacceptable. The best option was therefore to make the application Y2K compliant. This advice was accepted.

As soon as I’d completed the review IBM moved me into a test management position on Y2K, precisely because I had all the business and technical experience to understand how everything fitted together, and what the implications of Y2K would be. The first thing I did was to write a suite of SAS programs that crawled through the production code libraries, job schedules and job control language libraries to track the relationships between programs, data and schedules. For the first time we had a good understanding of the inventory, and which assets depended on each other. Although I was nominally only the test manager I drew up the conversion strategy and timetable for all the applications within my remit, based on my accumulated experience and the new knowledge we’d derived from the inventory.
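
The SAS programs themselves were specific to the mainframe libraries and are long gone, but the idea is easy to sketch. The hypothetical Python below shows the general shape of that kind of inventory crawl: scan exported source and JCL members for references to programs and datasets, and build a dependency map. The file layout and patterns are invented for illustration.

    # Hypothetical sketch of a dependency crawl over exported source and JCL
    # members; the real inventory work was done in SAS against mainframe libraries.
    import re
    from collections import defaultdict
    from pathlib import Path

    CALL_PATTERN = re.compile(r"\bCALL\s+'?([A-Z0-9]+)'?", re.IGNORECASE)  # COBOL calls
    EXEC_PATTERN = re.compile(r"\bEXEC\s+PGM=([A-Z0-9]+)", re.IGNORECASE)  # JCL job steps
    DSN_PATTERN = re.compile(r"\bDSN=([A-Z0-9.]+)", re.IGNORECASE)         # datasets

    def crawl(library: Path) -> dict:
        """Map each member to the programs and datasets it references."""
        depends_on = defaultdict(set)
        for member in library.glob("*.txt"):  # one exported member per file
            text = member.read_text(errors="ignore")
            for pattern in (CALL_PATTERN, EXEC_PATTERN, DSN_PATTERN):
                depends_on[member.stem].update(pattern.findall(text))
        return depends_on

    if __name__ == "__main__":
        for member, refs in sorted(crawl(Path("exported_libraries")).items()):
            print(member, "->", ", ".join(sorted(refs)))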

An insurance company’s processing is heavily date dependent. Premiums are earned on a daily basis, with the appropriate proportion being refunded if a policy is cancelled mid-term. Claims are paid only if the appropriate cover is in place on the date that the incident occurred. Income and expenditure might be paid on a certain date, but then spread over many years. If the date processing doesn’t work then the company can’t take in money, or pay it out. It cannot survive. The processing is so complex that individual errors in production often require lengthy investigation and fixing, and then careful testing. The notion that a “fix on failure” response to Y2K would have worked is risible.
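
As a purely illustrative sketch of how the dates run through everything (not actual insurance code), the Python below accrues premium daily and works out the refund on a mid-term cancellation. With two-digit years, a policy written in 1999 whose cover runs into 2000 looks as though it ends before it starts, and every one of these figures collapses.

    # Illustrative pro-rata premium calculation; not actual insurance code.
    from datetime import date

    def earned_premium(premium: float, start: date, end: date, as_of: date) -> float:
        """Premium earned so far, accrued evenly per day of cover."""
        total_days = (end - start).days
        days_on_risk = min(max((as_of - start).days, 0), total_days)
        return premium * days_on_risk / total_days

    def cancellation_refund(premium: float, start: date, end: date, cancelled: date) -> float:
        """Refund the unearned portion if the policy is cancelled mid-term."""
        return premium - earned_premium(premium, start, end, cancelled)

    # A policy written in 1999 whose cover runs into 2000:
    start, end = date(1999, 7, 1), date(2000, 6, 30)
    print(cancellation_refund(365.0, start, end, date(2000, 1, 31)))  # 151.0

    # With two-digit years the same policy reads as 01/07/99 to 30/06/00, an
    # end date apparently *before* the start date, and the arithmetic breaks.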

We fixed the applications, taking a careful, triaged risk-based approach. The most date sensitive programs within the most critical applications received the most attention. Some applications were triaged out of sight. For these, “fix on failure” was appropriate.

We tested the converted applications in simulated runs across the end of 1999, in 2000 and again in 2004. These simulations exposed many more problems, not just with our own code but also with the utility and housekeeping routines and tools. In these test runs we overrode the mainframe system date.
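
The date shifting was done with mainframe facilities, but the principle is simple enough to sketch. The toy Python below (invented for illustration, not taken from the actual test harness) injects a simulated “today” so that the same business rule can be exercised at each of the critical dates.

    # Toy sketch of date injection for time-shifted testing; the real runs
    # overrode the mainframe system date rather than passing dates around.
    from datetime import date

    def renewal_due(policy_expiry: date, today: date) -> bool:
        """Example business rule under test: invite renewal 60 days before expiry."""
        return 0 <= (policy_expiry - today).days <= 60

    # Exercise the same rule "as at" each critical simulation date,
    # including the leap days that also caught systems out.
    simulated_todays = [date(1999, 12, 31), date(2000, 1, 1),
                        date(2000, 2, 29), date(2004, 2, 29)]
    expiry = date(2000, 2, 15)
    for today in simulated_todays:
        print(today, "->", renewal_due(expiry, today))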

In the final stage of testing we went a step further. We booted up a mainframe LPAR (logical partition) to run with the future dates. I managed this exercise. We had a corner of the office with a sign saying “you are now entering 2000”, and everything was done with future dates. This exercise flagged up further problems with code that we had been confident would run smoothly.

December 19th 1999, Mary, her brother Malcolm & I in the snow. Not panicking much about Y2K.

Y2K was a fascinating time in my career because I was at a point that I now recognise as a sweet spot. I was still sufficiently technically skilled to do anything that my team members could do, even being called on to fix overnight production problems. However, I was sufficiently confident, experienced and senior to be able to give presentations to the most senior managers explaining problems and what the appropriate solutions would be.

For these reasons I know what I’m talking about when I write that Y2K was a huge problem that had to be tackled. The UK’s financial sector would have suffered a massive blow if we had not fixed the problem. I can’t say how widespread the damage might have been, but I do know it would have been appalling.

My personal millennium experience

When I finished with Y2K in September 1999, at the end of the future-dated mainframe exercise and of a hugely pressurised 30 months, I negotiated seven weeks’ leave and took off to Peru. IBM could be a great employer at times! My job was done, and I knew that General Accident, or CGU as it had evolved into by then, would be okay. There would inevitably be a few glitches, but then there always are in IT.

What was on my mind on 31st December 1999

I was so relaxed about Y2K that on my return from Peru it was the least of my concerns. There was much more interesting stuff going on in my life. I got engaged in December 1999, and on 31st December Mary and I bought our engagement and wedding rings. That night we were at a wonderful party with our friends, and after midnight we were on Perth’s North Inch to watch the most spectacular fireworks display I’ve ever seen. 1st January 2000? It was a great day that I’ll always remember happily. It was far from being a disaster, and that was thanks to people like me.

PS – I have written a follow up article explaining why “fix on failure” was based on an infantile view of software failure.

14 thoughts on “Y2K – why I know it was a real problem”

  1. Thank you Pilar and Jesper. I was warned by Mary not to turn it into an arrogant rant! I didn’t expect a medal. I was well paid and treated respectfully by IBM and General Accident. However, it was disappointing that so many people sneered at our efforts from a position of ignorance.

  2. People such as Finkelstein make me really, really angry. For perhaps the first time in history the IT profession undertakes a large-scale project successfully and ignorant people claim it’s a scam. I think a lot of them, particularly the press, really did want to see planes fall out of the sky and nuclear reactors explode.

    I ran Nokia’s Y2K project for embedded systems in Pacific and South East Asia – basically all the date-aware electronics that keep their buildings running, such as access control systems, fire alarms, lift controllers, HVAC systems, telephone systems, building management systems etc.

    Between my team and my colleagues in Europe we found more than 300 systems that needed to be upgraded or replaced. Often the impact on business would be substantial – no one is going to climb 30 flights of stairs to get to the office if the lift doesn’t work. And what do you do if the access control system won’t let you in when you get there (every one of those systems failed our tests such that you could not get in)?

    The highest impact failure that we preempted was the climate control system for a mobile phone factory that used to run 24/7 so any lost production could never be made up. The 3 months of downtime would have cost around $30M if we had just waited for January 2000 and hoped for the best.

    • Thanks Steve. That’s a very interesting contribution from a very different environment from the one in which I was working. I later worked with Nokia and I enjoyed the experience. They were very demanding, but they were very clear about what they wanted and were good clients to deal with. That was information security, however, not testing.

  3. It’s astonishing that anyone with Prof Finkelstein’s purported credentials could be so ignorant about real systems that have a real impact on people’s lives.

    I wasn’t a Y2K test manager. Instead, I spent a year as manager of QA in the corporate program office for Y2K at Ontario Hydro – the largest electricity provider in North America. I did the usual QA stuff, like writing a quality manual. Most importantly, I set up an assessment program and led audits of the Y2K projects going on all over the company and at the power plants.

    We knew our work was essential. If we didn’t know it already, we had the Great Ice Storm of January 1998 to tell us what happens when power fails across a huge area in winter in a cold country. I don’t recall how many people died in the ice storm – it was 4 or 6, I think. But the numbers are irrelevant, as is the fact that most of those deaths were from carbon monoxide poisoning: people trying to keep warm and cook their food. People died.

    I don’t know, in the end, how many systems had to be fixed. Some, for sure, including some critical to infrastructure and safe delivery of electricity. Should we have let them fail and then rushed in and fixed them? I don’t think so.

    Perhaps the good professor doesn’t know just how pervasive software is in the infrastructures we all depend on – and it already was, even in 2000. Everything can be at risk: hospitals, water, government, everything. When electricity fails, emergency generators kick in. Until they run out of gas. Gas has to be pumped, and pumps need power.

    The professor needs to get out of the academy.

    • Thanks Fiona. I’ve just been reading about the Great Ice Storm. The idea that anyone could responsibly have adopted a fix on failure Y2K strategy after that experience is utterly laughable. But it would have been no laughing matter if anyone had listened to these “experts” who pretended to be wise after the event.

  4. Thanks for the link, James. It was instructive to see that I grotesquely under-remembered the number of Ice Storm deaths. The effects were spread over a much larger area, and affected far more people, than I recalled.

  5. A good read.

    May I reply from my perspective – as a support technician drafted in for auditing purposes in the summer of ’98.

    I took up a 3+ month contract for a government agency. The role was for auditing desktop equipment at the agency’s central England offices. All hardware and locally installed software had to be catalogued so that potential issues with Y2K could be identified.

    I was part of a small team. We were kitted out with a laptop, got paid silly money (although it was the going rate for roles we could have had at the time) and were put up in decent hotels. But this did not even last 6 weeks as the whole budget was near enough blown. At the end the hotels were reduced to B&Bs, the hours cut significantly and the team size was reduced too.

    The problem was that the project managers wanted too much information. Rather than just record something like this:

    PC Compaq Presario 2200
    SERIAL NO: XYZ123456
    CPU: Cyrix 180Mhz
    RAM: 16MB
    DISK: 1.6GB Bigfoot HD
    CD: Teac CD-532EA
    O/S: Windows 95
    SERIAL NO:
    SOFTWARE: Microsoft Office …, Harvard Graphics, versions, serial nos.
    NAME: Mr Jones
    DEPT: Accounts
    DIVISION: Construction
    OFFICE: 236
    PHONE: 555-123456

    which did not take too much time to collate, we had to revisit the sites and gather information such as the CD-ROM serial numbers, the dates, and what was printed on the memory sticks (even though the software may have reported the sizes and timings).

    The project managers wanted everything. If they did not know everything they could not cover their behinds if anything went wrong. This recording of everything took just too much time to process as desktops had to be stripped down in order to gather the information. This ultimately led to the budget running out much sooner than estimated.

    From my experience (the desktop side) this was indeed hype. It caused departments to throw stupid money at a problem that was exaggerated.

    Yes, some critical stuff had to be tested, but there was no need for such an in-depth audit on the desktop. If anything, 10% of devices should have had an in-depth audit, with assessments made from that sample.

    Y2K Hype Caused an Over-Supply in the Jobs Market

    I remember talking to so many people who got into IT purely because of ‘the Millennium Bug’. They took up training (no doubt government funded) in order to get careers pre- and post-Y2K. In my experience this totally decimated the IT helpdesk / support market.

    There were just too many permanent and contract staff around in the new Millennium. Rates halved. In 2001 I was earning less than I was in 1991. Then IR35 came in and that totally finished it for me. I gave up contracting.

    • Thanks Stephen. You’ve raised a couple of interesting points I recognise. One of my biggest problems was that some of my client managers had been thoroughly spooked by the whole Y2K thing. They wanted to cover their backsides and they were trying to enforce stricter standards for the changes to business critical systems than they would have applied to “business as usual” work. They wanted every line of code in every application run with realistic volumes of data in three different timeframes to satisfy themselves that it would all be ok. That would have committed us to a totally unworkable schedule. We simply couldn’t have done the crucial work in time. I had to reassure them and talk them into something that would be achievable. That was a big challenge. I had to do just enough of their unworkable strategy to show them proof it couldn’t work, then switch smoothly (and in time) to a workable strategy while providing results that would reassure them. I was successful both in persuading them, and in delivering.

      I guess that a big difference between our Y2K programme and yours is that our programme & project managers adopted a more hands-off approach. They allowed the technical experts, like me, to do the risk assessments, set the strategy and the parameters for the programme, then they built the programme around that.

      Where I was working there was a big shift in balance from permanent to contract staff. Senior management made efforts to keep key permies happy, and happily waved goodbye to less valued people who went off contracting. Meanwhile they shipped in a whole load of contractors (most of them excellent) from Australia, NZ, South Africa, Sri Lanka and India. One of the senior managers was quite candid when he said Y2K had allowed him to lose a load of people he was glad to see the back of. “They’ll struggle to get contracts after Y2K, and they certainly won’t get back here on the permanent payroll”.

  6. I ran the international Y2K service line for Deloitte Consulting for a few years in the 1990s and I can confirm that Y2K certainly was a serious problem. I’m now Professor of IT at Gresham College: https://www.gresham.ac.uk/professorships/it-professorship/ and on April 4 2017 I shall lecture on What Really Happened in Y2K at the Museum of London (free – no booking, 6pm to 7pm, all welcome). The video of my lecture and a long paper and the slides will then be put online here: https://www.gresham.ac.uk/lectures-and-events/what-really-happened-in-y2k.

    We haven’t learnt the right lessons from the successes and failures of Y2K and the current cybersecurity crisis is one consequence.

    • Thanks for that Martyn. I can’t get to your lecture but I will definitely look out for the video and paper. I’ll be particularly interested in your views on the lessons we failed to learn. My personal view is that too many organisations failed to maintain the knowledge they’d painfully acquired building an inventory of their IT assets. That’s just personal and anecdotal, however.
