My adventures with Big Data

The current enthusiasm for Big Data is intriguing, almost as fascinating as the subject itself. I wish there had been a similar level of interest back in the mid and late 1990s when I worked with huge insurance management information (MI) data warehouses as a development team lead, project manager and test manager.

This was highly complex and demanding work, and life would have been easier if more people in IT had had a clearer idea of what we were up to. The trendy work then was all real time database systems and the early web applications. The attitude to our data warehouse work was summed up by a newly arrived manager who was given a briefing about what we were doing; he said, “so, you’re working on batch legacy systems?”

Well, the work was batch, but in financial services that’s often where the really complex, intellectually demanding IT work is done. And yes, we were dealing with old applications, but this was a strategic programme to extend the old applications to add vital new functionality.

“It’s not enough to know we lost money – we have to know why!”

Our mission was to standardise the various sources of MI within the company, pulling them together into a system that could be used both by insurance managers and the statisticians who monitored profitability and set the premiums. This required many new interface applications to take raw data from the source underwriting and claims systems into MI data warehouses for subsequent processing by a new front-end system that the insurance managers would use.

The statisticians would crawl all over the new data warehouses building a detailed understanding of what risks we were facing, and how they should be priced. The managers would look at the results of the predefined analyses that reduced the vast amounts of messy data to clear and simple analyses of profitability.

It is vital for an insurance company that it understands its portfolio so that it not only knows which customers are profitable, but also why. Otherwise the insurer will gradually lose the profitable business to rivals who are better informed and can set appropriate rates. The remaining customers will be the bad risks. Accurate and timely management information is therefore a matter of business survival. “It’s not enough to know we lost money – we have to know why!”

Making the bricks for the data warehouse

All of our insurance systems were designed for processing underwriting and claims. The data was therefore not held in a form suitable for MI. Converting historical transaction data into the right form was a surprisingly difficult and complex job. Basically the reformatting entailed matching the premium income and claims payments with the factors that earned or lost the money; e.g. for a given package of cover we sold to a customer the company earned £x. This information could then be used as the basic building brick for the sophisticated analyses required by the business.

We had to reformat the historical data and also set up feeds that would take the ongoing processing data and convert it.

My first draft of this blog got bogged down in the technicalities of insurance finance. Once you get sucked into trying to explain the significance of the differences between written and earned premiums, and between incurred and occurred claims, it’s hard to know where to stop. Trying to keep it simple just leaves the reader baffled and doesn’t convey the massive practical problems involved in converting the data. Explaining the issues precisely is a boring turn-off.

So I’ve ditched the financial detail and I’ll try to concentrate on the bigger, more interesting issues.

Big Data = big problems

Firstly, and most obviously, Big Data meant big problems. When we started working with files that were 10, 50, even 100 times bigger than the files we were used to, it became clear that the old ways wouldn’t work. Run times and disk space allocations had to be carefully calculated. Batch suites had to be very carefully designed. Important though this was, it wasn’t our toughest challenge. Our biggest problem by far was testing.

Traditional linear techniques suck

This was the time that I really ran smack into the fact that traditional, linear techniques suck. They suck particularly badly when you’re dealing with a highly uncertain situation. Uncertainty is the reality in software development, and that simple truth was a factor I underestimated massively when I planned and led the first of my data warehousing projects. “Build it, then test it” was a plan for disaster.

The whole point of our development was to provide the business with information that was not otherwise available. If the information could have been provided more easily, by some alternative means, then it would already have been done. There was therefore no readily available oracle against which we could test.

Traditional test scripts were irrelevant. How could we sensibly draw up scripts with predicted results based on our input when we had no real idea of the potential problems? We didn’t know what we didn’t know! I planned the project based on what the source systems should have been doing, what the source data should have been, and I allowed for the problems that we should have been able to foresee. How naïve!

Across time, not just at a point in time

We built the system and only then did we start seriously testing it. Sure, the programmers had done careful unit testing. But what we hadn’t allowed for was that in building a data warehouse that covered a decade of processing we needed accuracy and consistency across time, not just at a particular point in time.

Successive versions of a motor policy might be entirely accurate and consistent with accounts and claims data at a particular point in time. That didn’t necessarily mean that these successive versions were consistent with each other, at least not to the level of detail and accuracy that we required.

Numerous changes had been made to the source systems, none of which had affected the integrity of processing, but all of which had subtle, but cumulatively massive, effects on the integrity of the MI that the data could provide. Also, trivial bugs that might have been ignored, or not even noticed, in the processing system could have a much more significant impact on the potential MI.

We’d always known that accuracy and consistency were crucial, but we hadn’t grasped just how much more complicated and difficult the problem would be when we introduced the extra dimension of time.

The big lesson I learned was that traditional techniques condemned us to building the application in order to find out why it wouldn’t work!

We managed to dig ourselves out of that hole with numerous coding changes, some frantic data cleansing and a ruthlessly dramatic redesign that entailed axing half of the system and replacing it with a cloned, and then adapted, version of the surviving part.

That approach was clearly unacceptable. So for the following MI developments I adopted a more practical, efficient and effective approach. There could be no artificial distinction between the build and the testing. What was required was a form of test-driven development. There were two main strands to that.

Lesson 1 – tester, know your data!

Firstly, before the development could start we had to explore the source system and its data. We had to do it thoroughly. I mean really, obsessively thoroughly, not just quick scans to try and reassure ourselves that our optimistic assumptions were valid.

We would crawl through the source data to understand it, to identify patterns and relationships that we could exploit in testing, and to spot problems that would later screw up the statistical analyses. We had to find the patterns that existed not just horizontally across all the data at a particular moment in time, but also the patterns that unfolded over time.

It was amazing how often the data failed to match the way the system was assumed to work, and how the patterns would appear then evolve over the years. This knowledge was obviously vital for the build work, but it was also priceless for testing.

The lack of readily available test oracles meant that any relationships that held true over time, or over a large number of records, gave us something to hook our testing on. E.g. for a given policy the written premium on an individual transaction bore no necessary relation to the earned premium. It could even be negative. But over the full length of an insurance contract the sum of the written premiums must equal the premium that was earned.
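To make the written versus earned premium example concrete, here is a minimal sketch in Python of that kind of check. It is an illustration, not the code we ran at the time. The names (`find_premium_mismatches`, `transactions`, `earned_by_policy`) are hypothetical, and it assumes that an earned premium total for each completed contract is available from some independent calculation.

```python
from collections import defaultdict
from decimal import Decimal


def find_premium_mismatches(transactions, earned_by_policy, tolerance=Decimal("0.00")):
    """Return {policy_id: (written_total, earned_total)} where the totals disagree.

    transactions: iterable of (policy_id, written_premium) pairs; individual
        amounts may be negative (e.g. a mid-term refund).
    earned_by_policy: dict of policy_id -> earned premium over the whole contract
        (assumed to come from an independent calculation).
    """
    written = defaultdict(Decimal)  # Decimal() is zero, so totals start at 0
    for policy_id, amount in transactions:
        written[policy_id] += amount

    mismatches = {}
    # Check every policy seen on either side, so a missing policy also shows up.
    for policy_id in written.keys() | earned_by_policy.keys():
        written_total = written.get(policy_id, Decimal("0"))
        earned_total = earned_by_policy.get(policy_id, Decimal("0"))
        if abs(written_total - earned_total) > tolerance:
            mismatches[policy_id] = (written_total, earned_total)
    return mismatches


# Hypothetical data: P1 nets out correctly despite a negative transaction, P2 does not.
example_transactions = [
    ("P1", Decimal("120.00")),
    ("P1", Decimal("-20.00")),
    ("P2", Decimal("300.00")),
]
example_earned = {"P1": Decimal("100.00"), "P2": Decimal("280.00")}
example_mismatches = find_premium_mismatches(example_transactions, example_earned)
```

With this made-up data only P2 is reported. No single transaction has to look right on its own. Only the relationship over the whole contract is tested, which is what made such rules usable as an oracle.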

We’d go round in circles learning more and more about the data, applying new insights, trying out new ideas till we had a load of relationships and rules. These rules were a mixture of business rules, rules that could logically be inferred from the data, and possibly quite arbitrary rules imposed by the design of the source systems. Such rules might have been arbitrary and of no business significance, but breaching them would mean we’d done something to the data that we’d not meant to and didn’t understand. We could get guidance, not requirements, from the users to get us started. However, that guidance consisted of what ought to be happening in the present, and was therefore of limited value.

We’d then build these rules into the processing. Basically we’d design the processing around them. In live running these checks would flag up any deviations. Serious discrepancies meant the run would stop and some poor soul would get a phone call in the middle of the night.
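As a rough sketch of what building the rules into the processing could look like, the Python below runs every record past a list of rules, collects minor deviations as warnings and halts the run on a serious one. The `Rule` and `RunHalted` names and the two example rules are invented for illustration. The real rules came out of the data exploration described above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True when the record satisfies the rule
    fatal: bool = False            # a breach of a fatal rule stops the run


class RunHalted(Exception):
    """Raised when a serious discrepancy means the batch run must stop."""


def apply_rules(records, rules):
    """Check every record against every rule.

    Returns a list of (rule_name, record) warnings for non-fatal breaches.
    Raises RunHalted on the first breach of a fatal rule.
    """
    warnings = []
    for record in records:
        for rule in rules:
            if rule.check(record):
                continue
            if rule.fatal:
                raise RunHalted(f"rule {rule.name!r} breached by record {record!r}")
            warnings.append((rule.name, record))
    return warnings


# Hypothetical rules, purely to show the shape of the thing.
EXAMPLE_RULES = [
    Rule("policy_id_present", lambda r: bool(r.get("policy_id")), fatal=True),
    Rule("sum_insured_positive", lambda r: r.get("sum_insured", 0) > 0),
]
```

In this sketch the warnings would go to a report for someone to look at in the morning, while `RunHalted` is the middle-of-the-night phone call.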

Lesson 2 – build it so you can test it

The second strand to the development testing was also tied into the design. It was important that the batch suites were broken up into discrete stages that could be run in isolation with meaningful, testable results at the end of each stage. We could then step slowly through a whole suite, testing the results at each stage. The processing would have been far more efficient if we’d lumped more into each stage, ideally processing each record only once, and doing everything necessary with a single access.

We had our fingers badly burnt when we took that efficient approach with the design of the first application I was talking about. It meant that significant defects could be a nightmare to debug. There was a trade-off between the strain we were imposing on the batch processing window on the one hand, and the significant cost of testing, fixing, retesting and even redesigning on the other. Efficiency was important, but obsessing about it was a false economy. Testability had to be the most important factor dictating our designs.
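The idea of discrete, independently testable stages can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `Stage` structure, the `stop_after` parameter and the two example stages are hypothetical, and the checks shown (record counts and a control total reconciled between a stage's input and output) are a generic batch technique rather than a description of our actual suites.

```python
from typing import Callable, NamedTuple


class Stage(NamedTuple):
    name: str
    transform: Callable   # takes the previous stage's output, returns new output
    check: Callable       # takes (input, output), returns a list of problems


def run_suite(stages, data, stop_after=None):
    """Run stages in order, checking each one's output before moving on.

    stop_after lets a tester run the suite only as far as a named stage
    and inspect the intermediate result.
    """
    for stage in stages:
        output = stage.transform(data)
        problems = stage.check(data, output)
        if problems:
            raise ValueError(f"stage {stage.name!r} failed checks: {problems}")
        data = output
        if stage.name == stop_after:
            break
    return data


def reformat(rows):
    # Turn raw (policy_id, amount_in_pence) tuples into records.
    return [{"policy_id": p, "amount": a} for p, a in rows]


def check_reformat(before, after):
    return [] if len(before) == len(after) else ["record count changed"]


def aggregate(records):
    totals = {}
    for r in records:
        totals[r["policy_id"]] = totals.get(r["policy_id"], 0) + r["amount"]
    return totals


def check_aggregate(before, after):
    same = sum(r["amount"] for r in before) == sum(after.values())
    return [] if same else ["control total changed"]


EXAMPLE_STAGES = [
    Stage("reformat", reformat, check_reformat),
    Stage("aggregate", aggregate, check_aggregate),
]
example_rows = [("P1", 12000), ("P1", -2000), ("P2", 30000)]
```

Calling `run_suite(EXAMPLE_STAGES, example_rows, stop_after="reformat")` returns the reformatted records without aggregating them, which is the stepping-through behaviour described above. The price is an extra pass over the data at each stage, which is exactly the efficiency we chose to give up.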

Not real testing?

At the time I didn’t consider that what we were doing was real testing. It was what we had to do in the circumstances. Real testing was all about scripts and test cases, and that was very much the view of the testing specialists at the company. When I actually moved into test management and thought more deeply about what testing meant I realised how wrong I’d been to dismiss our work as “not real testing”, but how right I’d been to insist that we should do what fitted the problem, not what fitted the development and testing standards.

I’ve been leafing through performance appraisals and post-implementation reviews from the period. One appraisal said “the project required considerable business analysis work where James displayed a special aptitude to get to the bottom of complex situations”.

Real testing? I think so!

I’ve written a follow-up to this, talking about my experiences investigating frauds when I worked as a computer auditor. This involved crawling through huge datasets, trying to make sense of suspected frauds.
