This is the third post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This will be the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).
The first post was a reflection, based on personal experience, on the corporate preference for building bureaucracy rather than dealing with complex reality, “The dragons of the unknown; part 1 – corporate bureaucracies”. The second post was about the nature of complex systems, “part 2 – crucial features of complex systems”. This one follows on from part 2, which talked about the impossibility of knowing exactly how complex socio-technical systems will behave, with the result that it is impossible to specify them precisely.
The starting point for a system audit
When we audited a live system the specifications of the requirements and design didn’t matter – they really didn’t. This was because:
- specs were a partial and flawed picture of what was required at the time that the system was built,
- they were not necessarily relevant to the business risks and problems facing the company at the time of the audit,
- the system’s compliance, or failure to comply, with the specs told us nothing useful about what the system was doing or should be doing (we genuinely didn’t care about “compliance”),
- we never thought it was credible that the specs would have been updated to reflect subsequent changes,
- we were interested in the real behaviour of the people using the system, not what the analysts and designers thought they would or should be doing.
It would therefore have been a complete waste of time, in a tightly time-boxed audit, to wade through the specs. Context-driven testers have been fascinated when I’ve explained that we started with a blank sheet of paper. The flows we were interested in were the things that mattered to the people that mattered.
We would identify a key person and ask them to talk us through the business context of the system, sketching out how it fitted into its environment. The interfaces were where we always expected things to go wrong. The scope of the audit was dictated by the sketches of the people who mattered, not the system documentation.
We might have started with a blank sheet but we were highly rigorous. We used a structured methods modelling technique called IDEF0 to make sense of what we were learning, and to communicate that understanding back to the auditees to confirm that it made sense.
We were constantly asking, “How do you know that the system will do what we want? How will you get the outcomes you need? What must never happen? How does the system prevent that? What must always happen? How does the system ensure that?” It’s a similar approach to the safety critical idea of always events and never events, which is particularly popular in medical safety circles.
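To make the “always” and “never” questions concrete, here is a minimal sketch of how they might be written down as checks over payment records. Everything in it is hypothetical and only for illustration: the `Payment` fields, the two rules and the `violations` helper are my assumptions, not a description of any system we audited.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Payment:
    # Hypothetical record layout, assumed for this sketch only.
    payee: str
    amount_pence: int
    entered_by: str
    approved_by: str


Rule = Callable[[Payment], bool]

# "What must always happen?" Each rule must hold for every payment.
ALWAYS: Dict[str, Rule] = {
    "payment has an approver": lambda p: bool(p.approved_by),
}

# "What must never happen?" Each rule must hold for no payment.
NEVER: Dict[str, Rule] = {
    "payment approved by the person who entered it":
        lambda p: p.approved_by == p.entered_by,
}


def violations(payments: List[Payment]) -> List[Tuple[Payment, str]]:
    """Return every payment that breaks an always or never rule."""
    found = []
    for p in payments:
        for name, rule in ALWAYS.items():
            if not rule(p):
                found.append((p, "always: " + name))
        for name, rule in NEVER.items():
            if rule(p):
                found.append((p, "never: " + name))
    return found
```

Writing the rules is the easy part. The hard part, as the rest of this post argues, is finding out from the people who matter what belongs in those two lists.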
We were dealing with financial systems. Our concern could be summarised as: how do we know that the processing is complete, accurate, authorised and timely? It was almost a mantra: complete, accurate, authorised and timely.
These are all constrained by each other, and informed by the context, i.e. sufficiently accurate for business objectives given the need to provide the information within an acceptable time. We had to understand the current context. Context was everything.
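As a rough illustration of the mantra, the four questions can be sketched as checks on a batch of transactions. This is a simplified sketch under my own assumptions: the `Txn` record, the `check_batch` function and its parameters are hypothetical. The tolerance and the acceptable delay are parameters rather than constants because, as described above, what counts as sufficiently accurate and timely depends on the context.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Set


@dataclass(frozen=True)
class Txn:
    # Hypothetical processed-transaction record, assumed for this sketch.
    txn_id: str
    amount_pence: int
    authorised_by: str
    received: datetime
    processed: datetime


def check_batch(
    source_amounts: Dict[str, int],  # txn_id -> amount in the source system
    processed: List[Txn],
    authorisers: Set[str],           # people allowed to authorise
    tolerance_pence: int,            # "sufficiently accurate" for the context
    max_delay: timedelta,            # "acceptable time" for the context
) -> Dict[str, List[str]]:
    """Return the transaction ids that fail each of the four checks."""
    by_id = {t.txn_id: t for t in processed}
    return {
        # Complete: everything in the source was processed.
        "complete": [i for i in source_amounts if i not in by_id],
        # Accurate: processed amount agrees with the source, within tolerance.
        "accurate": [
            i for i, amount in source_amounts.items()
            if i in by_id
            and abs(by_id[i].amount_pence - amount) > tolerance_pence
        ],
        # Authorised: approved by someone entitled to approve.
        "authorised": [
            t.txn_id for t in processed if t.authorised_by not in authorisers
        ],
        # Timely: processed within the acceptable delay.
        "timely": [
            t.txn_id for t in processed if t.processed - t.received > max_delay
        ],
    }
```

An empty list under each heading would mean that this particular batch passed these particular checks. It would say nothing about what users were doing outside them.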
Once we had a good idea of the processes, the outputs, the key risks and the controls that were needed, we would attack the system to see if we could force it to do what it shouldn’t, or prevent it doing what it was required to do. We would try to approach the testing with the mindset of a dishonest or irresponsible user. At that time I had never heard of exploratory testing. Training in that would have been invaluable.
We would also speak to ordinary users and watch them in action. Our interviews, observations, and our own testing told us far more about the system and how it was being used than the formal system documentation could. It also told us more than we could learn from the developers who looked after the systems. They would often be taken by surprise by what we discovered about how users were really working with their systems.
We were always asking questions to help us identify the controls that would give us the right outcomes. This is very similar to the way experts look at safety critical systems. Safety is a control problem, a question of ensuring there are mechanisms or practices in place that will keep the system and its users from straying into dangerous territory. System developers cannot know how their systems will be used as part of a complex socio-technical system. They might think they do, but users will always take the system into unknown territory.
“System as imagined” versus “system as found”
The safety critical community makes an important distinction between the system as imagined and the system as found. The imagined system is neat and tidy. It is orderly, without noise, confusion and distraction. Real people are absent, or not meaningfully represented.
A user who is working with a system for several hours a day for years on end will know all about the short cuts, hacks and vulnerabilities that are available. They make the system do things that the designers never imagined. They will understand the gaps in the system, the gaps in the designers’ understanding. To work around those gaps, users have to rely on their own ingenuity. These user variations are usually beneficial and help the system work as the business requires. They can also be harmful and provide opportunities for fraud, a big concern in an insurance company. Large insurers receive and pay out millions of pounds a day, with nothing tangible changing hands. They have always been vulnerable to fraud, both by employees and outsiders.
I investigated one careful user who stole over a million pounds, slice by slice, several thousand pounds a week, year after year, all without attracting any attention. He was exposed only by an anonymous tip off. It was always a real buzz working on those cases trying to work out exactly what the culprit had done and how they’d exploited the systems (note the plural – the best opportunities usually exploited loopholes between interfacing systems).
What shocked me about that particular case was that the fraudster hadn’t grabbed the money and run. He had settled in for a long term career looting systems we had thought were essentially sound and free of significant bugs. He was confident that he would never be caught. After piecing together the evidence I knew that he was right. There was nothing in the systems to stop him or to reveal what he had done, unless we happened to investigate him in detail.
Without the anonymous tip from someone he had double crossed he would certainly have got away with it. That forced me to realise that I had very little idea what was happening out in the wild, in the system as found.
The system as found is messy. People are distracted and working under pressure. What matters is the problems and the environment the people are dealing with, and the way they have to respond and adapt to make the system work in the mess.
There are three things you really shouldn’t say to IT auditors. Here they are, in ascending facepalm order.
“But we thought audit would expect …”.
“But the requirements didn’t specify…”.
“But users should never do that”.
The last was the one that really riled me. Developers never know what users will do. They think they do, but they can’t know with any certainty. Developers don’t have the right mindset to think about what real users will do. Our (very unscientific and unevidenced) rule of thumb was as follows. 10% of people will never steal, regardless of the temptation. 10% will always try to steal, so systems must produce and retain the evidence to ensure they will be caught. The other 80% will be honest so long as we don’t put temptation in their way, so we have to explore the application to find the points of weakness that will tempt users.
Aside from their naivety, in auditors’ eyes, regarding fraud and deliberate abuse of the system, developers, and even business analysts, don’t understand the everyday pressures users will be under when they are working with complex socio-technical systems. Nobody knows how these systems really work. It’s nothing to be ashamed of. It’s the reality and we have to be honest about that.
One of the reasons I was attracted to working in audit and testing, and lost my enthusiasm for working in information security, was that audit and testing required me to think about what was really going on. How is this bafflingly complex organisation working? We can’t know for sure. It’s not a predictable, deterministic machine. All we can say confidently is that certain factors are more likely to produce good outcomes and others are more likely to give us bad outcomes.
If anyone claims to fully understand a complex socio-technical system, then one of the following applies.
- They’re bullshitting, which is all too common, and are happy to appear more confident than they have any right to be. Sadly it’s a good look in many organisations.
- They’re deluded and genuinely have no idea of the true complexity.
- They understand only part of the system – probably one of the less complex parts, and they’re ignoring the rest. In fairness, they might have made a conscious decision to focus only on the part that they can understand. However, other people might not appreciate the significance of that qualification, and no-one might spot that the self-professed expert has defined the problem in a way that is understandable but not realistic.
- They did have a good understanding of the system once upon a time, when it was simpler, before it evolved into a complex beast.
It is widely believed that mapmakers in the Middle Ages would fill in the empty spaces with dragons. It’s not true. It’s just a myth, but it is a nice idea. It is a neat analogy because the unknown is scary and potentially dangerous. That’s been picked up by people working with safety critical systems, specifically the resilience engineering community. They use phrases like “jousting with dragons” and “facing the dragons at the borderlands”.
Safety critical experts use this metaphor of dangerous dragons for reasons I have been outlining in this series. Safety critical systems are complex socio-technical systems. Nobody can specify how these systems will behave, what people will have to do to keep them running, running safely. The users will inevitably take these systems into unknown, and therefore dangerous, territory. That has huge implications for safety critical systems. I want to look at how the safety community has responded to the problem of trying to understand why systems can have bad outcomes when they can’t even know how systems are supposed to behave. I will pick that up in later posts in this series.
In the next post I will talk about the mental models we use to try and understand failures and accidents, “part 4 – a brief history of accident models”.