“A one-off human error”?

This news story in the Guardian grabbed my attention. The Nationwide Building Society mistakenly processed 704,426 payments a second time.

The Independent carried a less detailed report on the story, but it contained a quote that irked me (my emphases).

The society instead blamed an “inputting” error by an operative at its Swindon HQ. The phantom transactions were removed from customers’ accounts overnight, the bank said.

Jenny Groves, divisional director for customer experience said: “Nationwide wishes to apologise to those customers affected by an issue which has affected some of our debit card customers.”

She said those put into the red would have all charges “refunded in full and any costs associated with this error will be reimbursed in full. None of our customers will suffer financial loss as a result of this one-off error”.

Wow! This is 2012, and a big bank is making excuses that didn’t wash 30 years ago. I’ve worked extensively with big, batch financial systems. Here are some basic, utterly fundamental precepts that were well known by developers before I even knew what a computer was.

  • People screw up. Sometimes they do it in ways you expect, often in ways that surprise you. The only certainty is that they screw up.

  • You process every payment accurately, no exceptions.

  • You never, ever, process payments twice. It is a big deal. It’s not just about keeping your job, or staying out of jail. It’s about self-respect. It’s about going to sleep at night knowing you’re a competent professional, not an irresponsible cowboy who gets it right only some of the time.

  • The user requirements will not state every requirement that is absolute and non-negotiable. Some requirements are so fundamental and essential that the users will assume that “they go without saying”. If such requirements do not appear in specifications then you will look stupid if you subsequently pretend that they didn’t matter, or that you believe the users did not really require them. Processing payments accurately, once, and once only falls into this category.

  • It is the system designers’ responsibility to build these unstated, fundamental requirements into the application, even if the business analysts missed them.

  • It is the testers’ responsibility to test the application against these unstated, fundamental requirements.

All this means that financial applications need carefully designed controls to ensure that the right things always happen and the wrong things never do. It means that the application needs built-in checks to detect these “one-off human errors”. The techniques are ancient, at least in computing, and maybe that’s part of the problem. They’re boring, pedantic old-school stuff.

The main techniques are control files to keep track of files as they are being processed, hash counts and record counts to show that all records have been processed, and file version numbers so that the application can check that the right files are being processed and being processed only once.
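As a minimal sketch of how these controls fit together, here is a toy version in Python. Everything in it is an assumption made for illustration: the names `PaymentBatch`, `ControlFile` and `process_batch`, the header and trailer layout, and the choice of account numbers for the hash total. It is not how Nationwide or any other bank actually does it.

```python
from dataclasses import dataclass, field


class ControlError(Exception):
    """Raised when a batch fails a control check and must not be applied."""


@dataclass
class PaymentBatch:
    version: int        # file version number from the header (hypothetical layout)
    payments: list      # (account_number, amount_in_pence) tuples
    record_count: int   # record count declared in the trailer
    hash_total: int     # hash total declared in the trailer


@dataclass
class ControlFile:
    last_version: int = 0                        # version of the last file applied
    applied: list = field(default_factory=list)  # stand-in for the real ledger update


def compute_hash_total(payments):
    # Sum of account numbers: meaningless to the business, but it will almost
    # always change if a record is dropped or duplicated, or an account
    # number is altered.
    return sum(account for account, _amount in payments)


def process_batch(batch, control):
    # File version check: only the next expected file may be applied, so the
    # same file cannot be processed twice and no file can be skipped.
    if batch.version != control.last_version + 1:
        raise ControlError(
            f"expected file version {control.last_version + 1}, got {batch.version}"
        )
    # Record count and hash total checks: every record in the trailer's
    # totals must actually be present before anything is applied.
    if len(batch.payments) != batch.record_count:
        raise ControlError("record count does not match trailer")
    if compute_hash_total(batch.payments) != batch.hash_total:
        raise ControlError("hash total does not match trailer")
    # All checks passed: apply the payments and advance the control file.
    control.applied.extend(batch.payments)
    control.last_version = batch.version
```

With this in place, an operator who submits the same file a second time gets a `ControlError` from the version check instead of a second set of payments. A real system would also have to update the ledger and the control file atomically, which this sketch ignores.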

These techniques are boring and fiddly, but they work. Unfortunately they frequently trip up test runs. Control files and version numbers have to be reset after a run is halted. It’s easy to lose track and have to explain that the failure was an embarrassing test setup problem, rather than a genuine defect.

It’s much simpler to forget about these controls, or to switch them off for testing, or even switch them off in live running (it happens) when they complicate restarts after problems.

I said earlier that testers have a responsibility to test unstated, fundamental requirements. Actually, that was a slightly tricky one. Of course it is perfectly true, but sadly some project managers, and even whole organisations, prefer to put pressure on testers to script tests only against written requirements.

If you are testing a financial payment application and you’re not testing to see if every payment is processed accurately, once, and only once, then you’re not really testing. Such “testing” is an embarrassment to the testing profession.

Organisations that skimp on effective testing, that don’t understand the value of thoughtful, risk-based controls, that blame “human error” when there is a management or systemic failure are placing their customers and reputation at risk. They are inviting humiliating press coverage and they deserve it.

I had to get that off my chest. People screw up. Human error is inevitable. Testers have to show how it can happen. It’s so much less embarrassing to read it in a test report than a national newspaper. That’s all.

10 thoughts on “A one-off human error”?

  1. As I said, testing like that is an embarrassment to our profession. It’s a disgrace. What makes it worse is that the sort of management that insists on it likes to pretend that it’s a professional approach. It’s a con, fake testing as James Bach (I think) called it.

    It’s certainly interesting that we’re seeing more of these shambles. Every time I read one of these stories I wonder about what was really going on, and the root cause is rarely what the press reports.

  2. I absolutely agree: shameful.
    Though designers and testers are usually constrained (or consider that they are) by dates, resources, costs, etc., they cannot use such restrictions as excuses to fail in introducing basic (and traditional) controls in applications.

  3. I agree about the constraints. Testers should make sure that management knows about the consequences if these constraints prevent the testers from testing as thoroughly as they would like. It is never the testers’ responsibility to decide if an application is fit to go live, and management must be presented with a full picture so that they can make the decision in the knowledge of the risk they are running. You are right about constraints not being an excuse for failing to run basic tests. The most important requirements might not be documented for the simple reason that they are so obvious.

  4. You allude to, but don’t state, a very basic implicit requirement that teams sometimes miss:

    Except in extraordinary circumstances, batch jobs must be restartable. Duh!

    Of course, then you have to restart them when needed. A few years ago, one of the big Canadian banks hit the news with an error that sounds similar to the one you’re ranting about. Confronted with a failed batch job in the middle of the night, an operator reran the job instead of restarting it. All transactions to that point were doubled — debits and credits.

    Mayhem ensued, not least because the business decided that if IT could make a mistake like that, then IT couldn’t be trusted to fix the problem with a program. So they hired 200+ temps to make all the fixes over a 2-week period. No mistakes there, right?

  5. Great example Fiona, and that was a really weird response to the problem!

    The ability to restart jobs is exactly the sort of thing that is never likely to be specified in the requirements. Why would anyone expect users to think along the right lines to specify what would seem an utterly obscure technical point? It’s a question of professionalism. IT professionals have to get it right.

    A danger when companies see such disasters is that they fall back on rigid processes and standards that actually make it more likely that testers will test only against the documented requirements. But we’ve been down that particular road, and I’m going to keep banging on about it, just not now!

    • Hi James,
      I identified the following problems with implicit requirements:
      1. They come in rather great numbers.
      2. My team of testers does system (integration) tests. Some of them are automated. Some of the implicit requirements are hard to test in integration tests. They are perfect candidates for unit tests.
      3. That brings us to developers. In the context of our organisation, we test implicit requirements by discussing them with our developers. We go through the list of implicit requirements, and the developers then code any that were missed into our product. We do not have regression tests for those implicit requirements. Why? Because in the context of our organisation that way of working has proved itself in the field: when implicit requirements are pointed out to developers, they are very good at implementing them.
      4. The other question is why developers do not consider them in the first place.

      Regards, Karlo.

      • I think your third point is very important. Testers must raise these implicit requirements with the developers. In fact they should be discussing them at the earliest possible stage. Testers should challenge requirements, and these implicit requirements should be raised with the business analysts and the users.

    • Fiona can add her own perspective, but to me a restart means starting the processing at the point where it failed. A rerun means going back to the start of the job, so all files have to be exactly as they were originally at the start of the job. Any updates that were applied in the job before it failed have to be backed out, or whole files have to be restored. A rerun is therefore messy and you have to be very careful, or you can get the sort of mess that Fiona mentioned.

      You need to be clear-headed about the design of batch jobs, and equally clear-headed about what you are doing when you are fixing problems and either restarting or rerunning. It doesn’t help that “restart” is often used when a partial rerun is being carried out. In fact, I’m guilty of that in my blog!

      I didn’t really want to get into the detailed intricacies of batch job design. My point was that it is the responsibility of everyone involved in developing applications to ensure that ridiculous errors, like payments being applied twice, do not happen. These errors are not inevitable. Nor are they the result of error by individuals. They are the result of a sloppy approach to building and running applications, and that is ultimately a management problem.
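The restart and rerun distinction discussed in this thread can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the function names, the in-memory `ledger` dictionary, the `checkpoint` dictionary and the `fail_at` parameter are all hypothetical, and a real batch system would commit the checkpoint atomically with each update.

```python
def run_job(payments, ledger, checkpoint, fail_at=None):
    """Apply payments from the checkpoint onwards; `fail_at` simulates a crash."""
    for i in range(checkpoint["next"], len(payments)):
        if i == fail_at:
            raise RuntimeError(f"simulated failure at record {i}")
        account, amount = payments[i]
        ledger[account] = ledger.get(account, 0) + amount
        # Record progress so that a restart knows where to resume.
        checkpoint["next"] = i + 1


def restart(payments, ledger, checkpoint):
    # Restart: carry on from the point of failure, leaving applied updates alone.
    run_job(payments, ledger, checkpoint)


def rerun(payments, ledger, checkpoint, backup):
    # Rerun: restore the ledger to its state before the job, then start again.
    ledger.clear()
    ledger.update(backup)
    checkpoint["next"] = 0
    run_job(payments, ledger, checkpoint)
```

The mistake in the Canadian example corresponds to resetting `checkpoint["next"]` to 0 and calling `run_job` without restoring the backup: every payment applied before the failure is then applied a second time.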
