An abdication of managerial responsibility?

The two recent Boeing 737 MAX crashes have been grimly absorbing for software developers and testers. It seems that the crashes were caused by MCAS, a system intended to prevent a stall, responding to false data from a sensor by forcing the planes into steep dives despite the pilots’ attempts to make them climb. The MCAS problem may have been a necessary condition for disaster, but it clearly was not sufficient. There were many other factors involved. Most strikingly, it seems that MCAS itself may have been working as specified, but there were problems in the original design and in the way it interfaced with the sensor and the crew.

I have no wish to go into all this in serious detail (yet), but I read an article on the Bloomberg website, “Boeing’s 737 Max software outsourced to $9-an-hour engineers” which contained many sentences and phrases that jumped off the screen at me. These snippets all point towards issues that concern me, that I’ve been talking and writing about recently, or that I’ve been long aware of. I’d like to run through them. I’ll use a brief quote from the Bloomberg article in each section before discussing the implications. All software designers and testers should reflect on these issues.

The commoditization of software development and testing

“Boeing has also expanded a design center in Moscow. At a meeting with a chief 787 engineer in 2008, one staffer complained about sending drawings back to a team in Russia 18 times before they understood that the smoke detectors needed to be connected to the electrical system, said Cynthia Cole, a former Boeing engineer who headed the engineers’ union from 2006 to 2010.

‘Engineering started becoming a commodity’, said Vance Hilderman, who co-founded a company called TekSci that supplied aerospace contract engineers and began losing work to overseas competitors in the early 2000s.”

The threat of testing becoming a commodity has been a long-standing concern amongst testers. To a large extent we’re already there. However, I’d assumed, naively perhaps, that this was a route chosen by organisations that could get away with poor testing, in the short term at least. I was deeply concerned to see it happening in a safety critical industry.

To summarise the problem, if software development and testing are seen as commodities, bought and sold on the basis of price, then commercial pressures will push quality downwards. The inevitable pressure sends cost and prices spiralling down to the level set by the lowest cost supplier, regardless of value. Testing is particularly vulnerable. When the value of the testing is low then whatever cost does remain becomes more visible and harder to justify.

There is pressure to keep reducing costs, and if you’re getting little value from testing just about any cost-cutting measure is going to look attractive. If you head down the route of outsourcing, offshoring and increasing commoditization, losing sight of value, you will lock yourself into a vicious circle of poor quality.

Iain McCowatt’s EuroSTAR webinar on “The commoditization of testing” is worth watching.

ETTO – the efficiency-thoroughness trade-off

“…the planemakers say global design teams add efficiency as they work around the clock.”

Ah! There we have it! Efficiency. Isn’t that a good thing? Of course it is. But there is an inescapable trade-off, and organisations must understand what they are doing. There is a tension between the need to deliver a safe, reliable product or service, and the pressure to do so at the lowest cost possible. The idea of ETTO, the efficiency-thoroughness trade-off, was popularised by Erik Hollnagel.

Making the organisation more efficient means it is less likely to achieve its other important goals; pursuing vital goals, such as safety, comes at the expense of efficiency. Conversely, the relentless pursuit of efficiency eliminates margins of error and engineering redundancy, with potentially dangerous results. This is well recognised in safety critical industries, obviously including air transport. I’ve discussed this further in my blog, “The dragons of the unknown; part 6 – Safety II, a new way of looking at safety”.

Drift into failure

“’Boeing was doing all kinds of things, everything you can imagine, to reduce cost, including moving work from Puget Sound, because we’d become very expensive here,’ said Rick Ludtke, a former Boeing flight controls engineer laid off in 2017. ‘All that’s very understandable if you think of it from a business perspective. Slowly over time it appears that’s eroded the ability for Puget Sound designers to design.’”

“Slowly over time”. That’s the crucial phrase. Organisations drift gradually into failure. People are working under pressure, constantly making the trade-off between efficiency and thoroughness. They keep the show on the road, but the pressure never eases. So margins are increasingly shaved. The organisation finds new and apparently smarter ways of working. Redundancy is eliminated. The workers adapt the official processes. The organisation seems efficient, profitable and safe. Then BANG! Suddenly it isn’t. The factors that had made it successful turn out to be responsible for disaster.
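The dynamic is easy to caricature in a few lines of code. This is purely a toy model of my own, not anything from Dekker’s work: after each quiet period the organisation trims a little more of its safety margin, because nothing has gone wrong recently, until a perfectly ordinary disturbance finally exceeds what is left.

```python
import random

random.seed(1)

margin = 10.0        # arbitrary units of spare capacity / engineering redundancy
TRIM = 0.2           # how much margin is shaved after each quiet period
quiet_periods = 0

while True:
    disturbance = random.expovariate(1.0)   # routine variation in load, demand, conditions
    if disturbance > margin:
        print(f"Failure after {quiet_periods} quiet periods; "
              f"margin had drifted down to {margin:.1f}")
        break
    # Nothing went wrong, so the pressure to be "more efficient" wins again.
    margin -= TRIM
    quiet_periods += 1
```

Every individual trim looks rational at the time. Only the trajectory is dangerous, and nobody is looking at the trajectory.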

“Drifting into failure” is an important concept to understand for anyone working with complex systems that people will have to use, and for anyone trying to make sense of how big organisations should work, and really do work. See my blog “The dragons of the unknown; part 4 – a brief history of accident models” for a quick introduction to the drift into failure. The idea was developed by Sidney Dekker. Check out his work.

Conway’s Law

“But outsourcing has long been a sore point for some Boeing engineers, who, in addition to fearing job losses say it has led to communications issues and mistakes.”

This takes me to one of my favourites, Conway’s Law. In essence it states that the design of systems corresponds to the design of the organisation. It’s not a normative rule, saying that this should (or shouldn’t) happen. It merely says that it generally does happen. Traditionally the organisation’s design shaped the technology. Nowadays the causation might be reversed, with the technology shaping the organisation. Conway’s Law was intended as a sound heuristic, never a hard and fast rule.

[Slide from one of my courses: Conway’s Law]

Perhaps it is less widely applicable today, but for large, long-established corporations I think it still generally holds true.

I’m going to let you in on a little trade secret of IT auditors. Conway’s Law was a huge influence on the way we audited systems and development projects.

[Slide from one of my courses: a corollary to Conway’s Law]

Audits were always strictly time-boxed. We had to be selective in how we used our time and what we looked at. Modern internal auditing is risk-based, meaning we would focus on the risks that posed the greatest threat to the organisation, concentrating on the areas most exposed to risk and looking for assurance that the risks were being managed effectively.

Conway’s Law guided the auditors towards low hanging fruit. We knew that we were most likely to find problems at the interfaces, and these were likely to be particularly serious. This was also my experience as a test manager. In both jobs I saw the same pattern unfold when different development teams, or different companies worked on different parts of a project.

Development teams would be locked into their delivery schedule before the high-level requirements were clear and complete, or even mutually consistent. The different teams, especially if they were in different companies, based their estimates on assumptions that were flawed, or inconsistent with other teams’ assumptions. Under pressure to reduce estimates and deliver quickly, each team might assume they’d be able to do the minimum necessary, especially at the interfaces; other teams would pick up the trickier stuff.

This would create gaps at the interfaces, and cries of “but I thought you were going to do that – we can’t possibly cope in time”. Or the data that was passed from one suite couldn’t be processed by the next one. Both might have been built correctly to their separate specs, but they weren’t consistent. The result would be last minute solutions, hastily lashed together, with inevitable bugs and unforeseen problems down the line – ready to be exposed by the auditors.
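Here is a contrived sketch, with made-up team names and fields, of the kind of interface gap I mean. Each side is correct against its own spec; the defect exists only in the space between them.

```python
from datetime import datetime

# Team A's spec: publish flight events with the timestamp as a Unix epoch in seconds.
def team_a_publish_event(flight_id: str, epoch_seconds: int) -> dict:
    return {"flight": flight_id, "timestamp": epoch_seconds}

# Team B's spec: consume flight events whose timestamp is an ISO 8601 string.
def team_b_consume_event(event: dict) -> datetime:
    return datetime.fromisoformat(event["timestamp"])

# Each function satisfies its own team's spec and passes its own unit tests.
event = team_a_publish_event("XY123", 1554076800)
team_b_consume_event(event)   # TypeError at integration time - nobody owned the interface
```

Neither team’s unit tests could ever have caught it; only testing across the interface, against a shared understanding of the requirement, stood a chance.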

Splitting the work across continents and suppliers always creates big management problems. You have to be prepared for these. The additional co-ordination, chasing, reporting and monitoring takes a lot of effort. This all poses big problems for test managers, who have to be strong, perceptive and persuasive to ensure that the testing is planned consistently across the whole solution, rather than being segmented and bolted together at a late stage for cursory integration testing.

Outsourcing and global teams don’t provide a quick fix. Without strong management and a keen awareness of the risks it’s a sure way to let serious problems slip through into production. Surely safety critical industries would be smarter, more responsible? I learned all this back in the 1990s. It’s not new, and when I read Bloomberg’s account of Boeing’s engineering practices I swore, quietly and angrily.

Consequences

“During the crashes of Lion Air and Ethiopian Airlines planes that killed 346 people, investigators suspect, the MCAS system pushed the planes into uncontrollable dives because of bad data from a single sensor.

“That design violated basic principles of redundancy for generations of Boeing engineers, and the company apparently never tested to see how the software would respond, Lemme said. ‘It was a stunning fail,’ he said. ‘A lot of people should have thought of this problem – not one person – and asked about it.’”

So the consequences of commoditization, ETTO, the drift into failure and complacency about developing and testing complex, safety critical systems with global teams all came together disastrously in the Lion Air and Ethiopian Airlines crashes.
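I have no inside knowledge of the real MCAS code, so the following is nothing more than a hypothetical sketch of the redundancy principle Lemme describes, with invented thresholds. With a single sensor there is no way to tell a faulty reading from a real stall; cross-checking independent sensors at least lets the system recognise a disagreement and hand control back to the crew.

```python
def command_from_single_sensor(angle_of_attack: float) -> str:
    # One sensor, no cross-check: a faulty reading is indistinguishable from a real stall.
    return "push nose down" if angle_of_attack > 15.0 else "no action"

def command_from_redundant_sensors(aoa_left: float, aoa_right: float) -> str:
    # Two independent sensors: if they disagree badly, distrust both and defer to the crew.
    if abs(aoa_left - aoa_right) > 5.0:
        return "sensor disagreement - disable automatic trim, alert crew"
    average = (aoa_left + aoa_right) / 2.0
    return "push nose down" if average > 15.0 else "no action"

# A failed vane reporting a wildly high angle of attack while the other reads normally.
print(command_from_single_sensor(74.5))            # push nose down
print(command_from_redundant_sensors(74.5, 2.0))   # sensor disagreement - alert crew
```

The numbers are made up; the point is that the second design has somewhere safe to go when a sensor lies, and the first does not.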

A lot of people should certainly have thought of this problem. As a former IT auditor I thought of this passage by Norman Marks, a distinguished commentator on auditing. Writing about risk-based auditing, he said:

“A jaw-dropping moment happened when I explained my risk assessment and audit plan to the audit committee of the oil company where I was CAE (Tosco Corp.). The CEO asked whether I had considered risks relating to the blending of gasoline, diesel, and jet fuel.

“As it happened, I had — but it was not considered high risk; it was more a compliance issue than anything else. But, when I talked to the company’s executives I heard that when Exxon performed an enterprise-wide risk assessment, this area had been identified as their #1 risk!

“Poorly-blended jet fuel could lead to Boeing 747s dropping out of the sky into densely-packed urban areas — with the potential to bankrupt the largest (at that time) company in the world. A few years later, I saw the effect of poor blending of diesel fuel when Southern California drivers had major problems and fingers were pointed at us as well as a few other oil companies.”

In training courses, when I’ve been talking about the big risks that keep top management awake at night, I’ve used this very example: planes crashing. In big corporations it’s easy for busy people to obsess about the smaller risks, those that delay projects, waste money, or disrupt day-to-day work. These problems hit us all the time. Disasters happen rarely and we can lose sight of the way the organisation is drifting into catastrophic failure.

That’s where auditors, and I believe testers too, come in. They should be thinking about these big risks. In the case of Boeing the engineers, developers and testers should have spoken out about the problems. The internal auditors should certainly have been looking out for it, and these are the people who have the organisational independence and power to object. They have to be listened to.

An abdication of managerial responsibility?

“Boeing also has disclosed that it learned soon after Max deliveries began in 2017 that a warning light that might have alerted crews to the issue with the sensor wasn’t installed correctly in the flight-display software. A Boeing statement in May, explaining why the company didn’t inform regulators at the time, said engineers had determined it wasn’t a safety issue.

‘Senior company leadership,’ the statement added, ‘was not involved in the review.’”

Senior management was not involved in the review. Doubtless there are a host of reasons why they were not involved. The bottom line, however, is that it was their responsibility. I spent six years as an IT auditor. In that time only one of my audits led to the group’s chief auditor using that nuclear phrase, which incidentally was not directed at IT management. A very senior executive was accused of “abdicating managerial responsibility”. The result was a spectacular display of bad temper and attempted intimidation of the auditors. We didn’t back down. That controversy related to shady behaviour at a subsidiary where the IT systems were being abused and frauds had become routine. It hardly compared to a management culture that led to hundreds of avoidable deaths.

One of the core tenets of Safety II, the new way of looking at safety, is that there is never a single, root cause for failure in complex systems. There are always multiple causes, all of them necessary, but none of them sufficient, on their own, for disaster. The Boeing 737-MAX case bears that out. No one person was responsible. No single act led to disaster. The fault lies with the corporate culture as a whole, with a culture of leadership that abdicated responsibility, that “wasn’t involved”.

The dragons of the unknown; part 6 – Safety II, a new way of looking at safety

Introduction

This is the sixth post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This was the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).

The first post was a reflection, based on personal experience, on the corporate preference for building bureaucracy rather than dealing with complex reality, “Facing the dragons part 1 – corporate bureaucracies”. The second post was about the nature of complex systems, “part 2 – crucial features of complex systems”. The third followed on from part 2, and talked about the impossibility of knowing exactly how complex socio-technical systems will behave with the result that it is impossible to specify them precisely, “part 3 – I don’t know what’s going on”.

The fourth post, “part 4 – a brief history of accident models”, looks at accident models, i.e. the way that safety experts mentally frame accidents when they try to work out what caused them.

The fifth post, “part 5 – accident investigations and treating people fairly”, looks at weaknesses of the way that we have traditionally investigated accidents and failures, assuming neat linearity with clear cause and effect. In particular, our use of root cause analysis, and willingness to blame people for accidents is hard to justify.

This post looks at the response of the safety critical community to such problems and the necessary trade offs that a practical response requires. The result, Safety II, is intriguing and has important lessons for software testers.

More safety means less feedback

[Graphic: 2017 – the safest year in aviation history]

In 2017 nobody was killed on a scheduled passenger flight (sadly that won’t be the case in 2018). That prompted the South China Morning Post to produce this striking graphic, which I’ve reproduced here in butchered form. Please, please look at the original. My version is just a crude taster.

Increasing safety is obviously good news, but it poses a problem for safety professionals. If you rely on accidents for feedback then reducing accidents will choke off the feedback you need to keep improving, to keep safe. The safer that systems become the less data is available. Remember what William Langewiesche said (see part 4).

“What can go wrong usually goes right – and then people draw the wrong conclusions.”

If accidents have become rare, but are extremely serious when they do occur, then it will be highly counter-productive if investigators pick out people’s actions that deviated from, or adapted, the procedures that management or designers assumed were being followed.

These deviations are always present in complex socio-technical systems that are running successfully and it is misleading to focus on them as if they were a necessary and sufficient cause when there is an accident. The deviations may have been a necessary cause of that particular accident, but in a complex system they were almost certainly not sufficient. These very deviations may have previously ensured the overall system would work. Removing the deviation will not necessarily make the system safer.

There might be fewer opportunities to learn from things going wrong, but there’s a huge amount to learn from all the cases that go right, provided we look. We need to try and understand the patterns, the constraints and the factors that are likely to amplify desired emergent behaviour and those that will dampen the undesirable or dangerous. In order to create a better understanding of how complex socio-technical systems can work safely we have to look at how people are using them when everything works, not just when there are accidents.

Safety II – learning from what goes right

Complex systems and accidents might be beyond our comprehension but that doesn’t mean we should just accept that “shit happens”. That is too flippant and fatalistic, two words that you could never apply to the safety critical community.

Safety I is shorthand for the old safety world view, which focused on failure. Its utility has been hindered by the relative lack of feedback from things going wrong, and the danger that paying insufficient attention to how and why things normally go right will lead to the wrong lessons being learned from the failures that do occur.

[Diagram: Safety I]

Safety I assumed linear cause and effect with root causes (see part 5). It was therefore prone to reaching a dangerously simplistic verdict of human error.

This diagram illustrates the focus of Safety I on the unusual, on the bad outcomes. I have copied, and slightly adapted, the Safety I and Safety II diagrams from a document produced by Eurocontrol (the European Organisation for the Safety of Air Navigation), “From Safety-I to Safety-II: A White Paper” (PDF, opens in new tab).

Incidentally, I don’t know why Safety I and Safety II are routinely illustrated using a normal distribution with the Safety I focus kicking in at two standard deviations. I haven’t been able to find a satisfactory explanation for that. I assume that this is simply for illustrative purposes.
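If it helps to put a number on that assumption, a two standard deviation cut-off on a normal distribution simply amounts to “the worst couple of percent of outcomes”; a quick check with the Python standard library:

```python
from math import erf, sqrt

def standard_normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

tail = 1.0 - standard_normal_cdf(2.0)
print(f"{tail:.3%} of outcomes lie beyond two standard deviations in one tail")
# roughly 2.275% per tail, about 4.6% across both tails
```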

[Diagram: Safety II]

If Safety I wants to prevent bad outcomes, Safety II, in contrast, looks at how good outcomes are reached. Safety II is rooted in a more realistic understanding of complex systems than Safety I and extends the focus to what goes right in systems. That entails a detailed examination of what people are doing with the system in the real world to keep it running. Instead of people being regarded as a weakness and a source of danger, Safety II assumes that people, and the adaptations they introduce to systems and processes, are the very reasons we usually get good, safe outcomes.

If we’ve been involved in the development of the system we might think that we have a good understanding of how the system should be working, but users will always, and rightly, be introducing variations that designers and testers had never envisaged. The old, Safety I, way of thinking regarded these variations as mistakes, but they are needed to keep the systems safe and efficient. We expect systems to be both, which leads on to the next point.

There’s a principle in safety critical systems called ETTO, the Efficiency-Thoroughness Trade-Off. It was devised by Erik Hollnagel, though it might be more accurate to say he made it explicit and popularised the idea. The idea should be very familiar to people who have worked with complex systems. Hollnagel argues that it is impossible to maximise both efficiency and thoroughness. I’m usually reluctant to cite Wikipedia as a source, but its article on ETTO explains it more succinctly than Hollnagel himself did.

“There is a trade-off between efficiency or effectiveness on one hand, and thoroughness (such as safety assurance and human reliability) on the other. In accordance with this principle, demands for productivity tend to reduce thoroughness while demands for safety reduce efficiency.”

Making the system more efficient makes it less likely that it will achieve its important goals. Chasing these goals comes at the expense of efficiency. That has huge implications for safety critical systems. Safety requires some redundancy, duplication and fallbacks. These are inefficient. Efficiencies eliminate margins of error, with potentially dangerous results.
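A deliberately crude toy model of my own (not Hollnagel’s) puts numbers on the tension: treat thoroughness as the fraction of work that gets a second check, and watch throughput fall as escaped defects fall.

```python
WORK_ITEMS = 1000      # pieces of work arriving in a period
DEFECT_RATE = 0.02     # fraction of items carrying a latent problem
CHECK_COST = 0.5       # a thorough second check costs half as much again as the work itself

print("thoroughness  throughput  escaped defects")
for checked_fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    effort_per_item = 1.0 + CHECK_COST * checked_fraction
    throughput = WORK_ITEMS / effort_per_item
    escaped = WORK_ITEMS * DEFECT_RATE * (1.0 - checked_fraction)
    print(f"{checked_fraction:>12.0%}  {throughput:>10.0f}  {escaped:>15.1f}")
```

Neither column can be improved without worsening the other; the interesting question is who, in practice, gets to choose the row.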

ETTO recognises the tension between organisations’ need to deliver a safe, reliable product or service, and the pressure to do so at the lowest cost possible. In practice, the conflict in goals is usually fully resolved only at the sharp end, where people do the real work and run the systems.

[Image: airline job ad]

As an example, an airline might offer a punctuality bonus to staff. For an airline safety obviously has the highest priority, but if it were an absolute priority, the only consideration, then it could not contemplate any incentive that would encourage crews to speed up turnarounds on the ground, or to persevere with a landing when prudence would dictate a “go around”. In truth, if safety were an absolute priority, with no element of risk being tolerated, would planes ever take off?

People are under pressure to make the systems efficient, but they are expected to keep the system safe, which inevitably introduces inefficiencies. This tension results in a constant, shifting, pattern of trade-offs and compromises. The danger, as “drift into failure” predicts (see part 4), is that this can lead to a gradual erosion of safety margins.

The old view of safety was to constrain people, reducing variability in the way they use systems. Variability was a human weakness. In Safety II variability in the way that people use the system is seen as a way to ensure the system adapts to stay effective. Humans aren’t seen as a potential weakness, but as a source of flexibility and resilience. Instead of saying “they didn’t follow the set process therefore that caused the accident”, the Safety II approach means asking “why would that have seemed like the right thing to do at the time? Was that normally a safe action?”. Investigations need to learn through asking questions, not making judgments – a lesson it was vital I learned as an inexperienced auditor.

Emergence means that the behaviour of a complex system can’t be predicted from the behaviour of its components. Testers therefore have to think very carefully about when we should apply simple pass or fail criteria. The safety critical community explicitly reject the idea of pass/fail, or the bimodal principle as they call it (see part 4). A flawed component can still be useful. A component working exactly as the designers, and even the users, intended can still contribute to disaster. It all depends on the context, what is happening elsewhere in the system, and testers need to explore the relationships between components and try to learn how people will respond.
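A hypothetical illustration, nothing to do with avionics or any real system: two components, each behaving exactly as its own specification says it should, whose combination still turns a routine blip into a failure.

```python
class Server:
    """Working to its spec: a transient fault clears after three requests, and a
    rate limiter rejects any client making more than three calls in one tick."""
    RATE_LIMIT = 3
    FAULT_CLEARS_AFTER = 3

    def __init__(self):
        self.calls_this_tick = 0

    def handle(self) -> str:
        self.calls_this_tick += 1
        if self.calls_this_tick > self.RATE_LIMIT:
            return "rejected: rate limit exceeded"
        if self.calls_this_tick <= self.FAULT_CLEARS_AFTER:
            return "transient error"
        return "ok"


def call_with_retries(server: Server, max_retries: int = 5) -> str:
    """Client working to its spec: on any failure, retry immediately, up to max_retries times."""
    response = server.handle()
    retries = 0
    while response != "ok" and retries < max_retries:
        retries += 1
        response = server.handle()
    return response


# Each side meets its own specification and would pass its own tests. Together, an
# ordinary blip becomes a lockout: the immediate retries trip the rate limiter
# before the transient fault has had a chance to clear.
print(call_with_retries(Server()))   # rejected: rate limit exceeded
```

Pass or fail verdicts on each component in isolation tell us nothing about that combined behaviour; it only shows up when we look at the interaction in context.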

Safety is an emergent property of the system. It’s not possible to design it into a system, to build it in, or to implement it. The system’s rules, controls and constraints cannot create safety; at best they enable it to emerge, and if they are badly conceived they can prevent it emerging. They can create the potential for people to keep the system safe, but they cannot guarantee it. Safety depends on user responses and adaptations.

Adaptation means the system is constantly changing as the problem changes, as the environment changes, and as the operators respond to change with their own variations. People manage safety with countless small adjustments.

There is a popular internet meme, “we don’t make mistakes – we do variations”. It is particularly relevant to the safety critical community, who have picked up on it because it neatly encapsulates their thinking, e.g. this article by Steven Shorrock, “Safety-II and Just Culture: Where Now?”. Shorrock, in line with others in the safety critical community, argues that if the corporate culture is to be just and treat people fairly then it is important that the variations that users introduce are understood, rather than being used as evidence to punish them when there is an accident. Pinning the blame on people is not only an abdication of responsibility, it is unjust. As I’ve already argued (see part 5), it’s an ethical issue.

Operator adjustments are vital to keep systems working and safe, which brings us to the idea of trust. A well-designed system has to trust the users to adapt appropriately as the problem changes. The designers and testers can’t know the problems the users will face in the wild. They have to confront the fact that dangerous dragons are lurking in the unknown, and the system has to trust the users with the freedom to stay in the safe zone, clear of the dragons, and out of the disastrous tail of the bell curve that illustrates Safety II.

Safety II and Cynefin

If you’re familiar with Cynefin then you might wonder about Safety II moving away from a focus on the tail of the distribution. Cynefin helps us understand that the tail is where we can find opportunities as well as threats. It’s worth stressing that Safety II does encompass Safety I and the dangerous tail of the distribution. It must not be a binary choice of focusing on either the tail or the body. We have to try to understand not only what happens in the tail, how people and systems can inadvertently end up there, but also what operators do to keep out of the tail.

The Cynefin framework and Safety II share a similar perspective on complexity and the need to allow for, and encourage, variation. I have written about Cynefin elsewhere, e.g. in two articles I wrote for the Association for Software Testing, and there isn’t room to repeat that here. However, I do strongly recommend that testers familiarise themselves with the framework.

To sum it up very briefly, Cynefin helps us to make sense of problems by assigning them to one of four categories: the obvious, the complicated (these two being related, in that their problems have predictable causes and resolutions), the complex and the chaotic. Depending on the category, different approaches are required. In the case of software development the challenge is to learn more about the problem in order to turn it from a complex activity into a complicated one that we can manage more easily.

Applying Cynefin would result in more emphasis on what’s happening in the tails of the distribution, because that’s where we will find the threats to be avoided and the opportunities to be exploited. Nevertheless, Cynefin isn’t like the old Safety I just because they both focus on the tails. They embody totally different worldviews.

Safety II is an alternative way of looking at accidents, failure and safety. It is not THE definitive way, that renders all others dated, false and heretical. The Safety I approach still has its place, but it’s important to remember its limitations.

Everything flows and nothing abides

Thinking about linear cause and effect, and decomposing components are still vital in helping us understand how different parts of the system work, but they offer only a very limited and incomplete view of what we should be trying to learn. They provide a way of starting to build our understanding, but we mustn’t stop there.

We also have to venture out into the realms of the unknown and often unknowable, to try to understand more about what might happen when the components combine with each other and with humans in complex socio-technical systems. This is when objects become processes, when static elements become part of a flow that is apparent only when we zoom out to take in a bigger picture in time and space.

The idea of understanding objects by stepping back and looking at how they flow and mutate over time has a long philosophical and scientific history. 2,500 years ago Heraclitus wrote:

“Everything flows and nothing abides. Everything gives way and nothing stays fixed.”

Professor Michael McIntyre (Professor of Atmospheric Dynamics, Cambridge University) put it well in a fascinating BBC documentary, “The secret life of waves”.

“If we want to understand things in depth we usually need to think of them both as objects and as dynamic processes and see how it all fits together. Understanding means being able to see something from more than one viewpoint.”

In my next post, “part 7 – resilience requires people”, I will discuss some of the implications for software testing of the issues I have raised here, in particular how people keep systems going, and dealing with the inevitability of failure. That will lead us to resilience engineering.