The seductive and dangerous V model (2008)

Testing Experience

This is an expanded version of an article I wrote for the December 2008 edition of Testing Experience, a magazine which is no longer published. I’m moving the article onto my blog from my website, which will be decommissioned soon.

Inevitably the article is dated in some respects, especially where I discuss possible ways for testers to ameliorate the V Model if they are forced to use it. There’s no mention of exploratory testing. That’s simply because my aim in writing the article was to help people understand the flaws of the V Model and how they can work around them in traditional environments that try to apply the model. A comparison of exploratory and traditional scripted testing techniques was far too big a topic to be shoe-horned in here.

However, I think the article still has value in helping to explain why software development and testing took the paths they followed. For the article I drew heavily on reading and research I carried out for my MSc dissertation, an exercise that taught me a huge amount about the history of software development, and why we ended up where we did.

The references in the article were all structured for a paper magazine. There are no hyperlinks and I have not tried to recreate them and check out whether they still work.

The article

The seductive and dangerous V Model

The Project Manager finishes the meeting by firing a question at the Programme Test Manager, and myself, the Project Test Manager.

“The steering board’s asking questions about quality. Are we following a formal testing model?”.

The Programme Test Manager doesn’t hesitate. “We always use the V Model. It’s the industry standard”.

The Project Manager scribbles a note. “Good”.

Fast forward four months, and I’m with the Programme Manager and the test analysts. It’s a round table session, to boost morale and assure everyone that their hard work is appreciated.

The Programme Manager is upbeat. “Of course you all know about the problems we’ve been having, but when you look at the big picture, it’s not so bad. We’re only about 10% behind where we should have been. We’re still on target to hit the implementation date”.

This is a red rag to a bullish test manager, so I jump in. “Yes, but that 10% is all at the expense of the testing window. We’ve lost half our time”.

The Programme Manager doesn’t take offence, and laughs. “Oh, come on! You’re not going to tell me you’ve not seen that before? If you’re unhappy with that then maybe you’re in the wrong job!”.

I’m not in the mood to let it drop. “That doesn’t make it right. There’s always a readiness to accept testing getting squeezed, as if the test window is contingency. If project schedules and budgets overrun, project managers can’t get away with saying ‘oh – that always happens’”.

He smiles and moves smoothly on, whilst the test analysts try to look deadpan, and no doubt some decide that career progression to test management isn’t quite as attractive as they used to think.

Test windows do get squeezed

The Programme Manager was quite right. That’s what happens on Waterfall projects, and saying that you’re using the V Model does nothing to stop that almost routine squeeze.

In “Testing Experience” issue 2, Dieter Arnouts [1] outlined how test management fits into different software development lifecycle models. I want to expand on the V Model, its problems and what testers can do in response.

In the earlier meeting the Programme Test Manager had given an honest and truthful answer, but I wonder if he was actually wholly misleading. Yes, we were using the V Model, and unquestionably it would be the “test model” of choice for most testing professionals, in the UK at least.

However, I question whether it truly qualifies as a “test model”, and whether its status as best practice, or industry standard, is deserved.

A useful model has to represent the way that testing can and should be carried out. A model must therefore be coherent, reasonably precisely defined and grounded in the realities of software development.

Coherence at the expense of precision

If the V Model, as practised in the UK, has coherence it is at the expense of precision. Worldwide there are several different versions and interpretations. At one extreme is the German V-Model [2], the official project management methodology of the German government. It is roughly equivalent to PRINCE2, but more directly relevant to software development. Unquestionably this is coherent and rigorously defined.

Of course, it’s not the V Model at all, not in the sense that UK testers understand it. For one thing the V stands not for the familiar V shaped lifecycle model, but for “vorgehens”, German for “going forwards”. However, this model does promote a V shaped lifecycle, and some seem to think that this is the real V Model, in its pure form.

The US also has a government standard V Model [3], which dates back about 20 years, like its German counterpart. Its scope is rather narrower, being a systems development lifecycle model, but it is still far more detailed and more rigorous than what most UK practitioners would understand by the V Model.

The understanding that most of us in the UK have of the V Model is probably based on the V Model as it’s taught in the ISEB Foundation Certificate in Software Testing [4]. The syllabus merely describes the Model without providing an illustration, and does not even name the development levels on the left hand side of the V. Nor does it prescribe a set number of levels of testing. Four levels is “common”, but there could be more or fewer on any project.

This makes sense to an experienced practitioner. It is certainly coherent, in that it is intuitive. It is easy for testers to understand, but it is far from precise. Novices have no trouble understanding the model to the level required for a straightforward multiple choice exam, but they can struggle to get to grips with what exactly the V Model is. If they search on the internet their confusion will deepen. Wikipedia is hopelessly confused on the matter, with two separate articles on the V Model, which fail to make a clear distinction between the German model [5] and the descriptive lifecycle model [6] familiar to testers. Unsurprisingly there are many queries on internet forums asking “what is the V Model?”

There are so many variations that in practice the V Model can mean just about whatever you want it to mean. This is a typical, and typically vague, illustration of the model, or rather, of a version of the V Model.

V model 2

It is interesting to do a Google images search on “V Model”. One can see a huge range of images illustrating the model. All share the V shape, and all show an arrow or line linking equivalent stages in each leg. However, the nature of the link is vague and inconsistent. Is it static testing of deliverables? Simple review and sign-off of specifications? Early test planning? Even the direction of the arrow varies. The Wikipedia article on the German V Model has the arrows going in different directions in different diagrams. There is no explanation for this. It simply reflects the confusion in the profession about what exactly the V Model is.

Boiled down to its most basic form, the V Model can mean simply that test planning should start early on, and the test plans should be based on the equivalent level of the left hand development leg. In practice, I fear that that is all it usually does mean.

The trouble is that this really isn’t a worthwhile advance on the Waterfall Model. The V Model is no more than a testing variant of the Waterfall. At best, it is a form of damage limitation, mitigating some of the damage that the Waterfall has on testing. To understand the significance of this for the quality of applications we need to look at the history and defects of the Waterfall Model itself.

Why the Waterfall is bad for testing and quality

In 1970 Royce wrote the famous paper [7] depicting the Waterfall Model. It is a strange example of how perceptive and practical advice can be distorted and misapplied. Using the analogy of a waterfall, Royce set up a straw man model to describe how software developments had been managed in the 1960s. He then demolished the model. Unfortunately posterity has credited him with inventing the model.

The Waterfall assumes that a project can be broken up into separate, linear phases and that the output of each phase provides the input for the next phase, thus implying that each would be largely complete before the next one would start. Although Royce did not make the point explicitly, his Waterfall assumes that requirements can, and must, be defined accurately at the first pass, before any significant progress has been made with the design. The problems with the Waterfall, as seen by Royce, are its inability to handle changes, mainly changes forced by test defects, and the inflexibility of the model in dealing with iteration. Further flaws were later identified, but Royce’s criticisms are still valid.

Royce makes the point that prior to testing, all the stages of the project have predictable outputs, i.e. the results of rigorous analysis. The output of the testing stage is inherently unpredictable in the Waterfall, yet this is near the end of the project. Therefore, if the outcome of testing is not what was predicted or wanted, then reworks to earlier stages can wreck the project. The assumption that iteration can be safely restricted to successive stages doesn’t hold. Such reworking could extend right back to the early stages of design, and is not a matter of polishing up the later steps of the previous phase.

Although Royce did not dwell on the difficulty of establishing the requirements accurately at the first pass, this problem did trouble other, later writers. They concluded that not only is it impossible in practice to define the requirements accurately before construction starts, it is wrong in principle because the process of developing the application changes the users’ understanding of what is desirable and possible and thus changes the requirements. Furthermore, there is widespread agreement that the Waterfall is particularly damaging for interactive applications.

The problems of the Waterfall are therefore not because the analysts and developers have screwed up. The problems are inherent in the model. They are the result of following it properly, not badly!

The Waterfall and project management

So if the Waterfall was originally depicted as a straw man in 1970, and if it has been consistently savaged by academics, why is there still debate about it? Respected contemporary textbooks still defend it, among them Hallows’ “Information Systems Project Management” [8], a valuable book that is frequently used for university courses. The title gives the game away. The Waterfall was shaped by project management requirements, and it therefore facilitates project management. Along with the V Model it is neater, and much easier to manage, plan and control than an iterative approach, which looks messy and unpredictable to the project manager.

Raccoon in 1997 [9] used a revealing phrase, describing the Waterfall.

“If the Waterfall model were wrong, we would stop arguing over it. Though the Waterfall model may not describe the whole truth, it describes an interesting structure that occurs in many well-defined projects and it will continue to describe this truth for a long time to come. I expect the Waterfall model will live on for the next one hundred years and more”.

Note the key phrase, “an interesting structure that occurs in many well-defined projects”.

The Waterfall allows project managers to structure projects neatly, and it is good for project plans, not applications!

It is often argued that in practice the Waterfall is not applied in its pure form; that there is always some limited iteration and that there is invariably some degree of overlap between the phases. Defenders of the model therefore argue that critics don’t understand how it can be used effectively.

Hallows argues that changes are time-consuming and expensive regardless of the model followed. He dismisses the criticism that the Waterfall doesn’t allow a return to an earlier stage by saying that this would be a case of an individual project manager being too inflexible, rather than a problem with the model.

This is naive. The problem is not solved by better project management, because project management itself has contributed towards the problem; or rather, the response of rational project managers to the pressures facing them has.

The dangerous influence of project management had been recognised by the early 1980s. Practitioners had always been contorting development practices to fit their project management model, a point argued forcibly by McCracken & Jackson in 1982 [10].

“Any form of life cycle is a project management structure imposed on system development. To contend that any life cycle scheme, even with variations, can be applied to all system development is either to fly in the face of reality or to assume a life cycle so rudimentary as to be vacuous.”

In hindsight the tone of McCracken & Jackson’s paper, and the lack of response to it for years, are reminiscent of Old Testament prophets raging in the wilderness. They were right, but ignored.

The symbiosis between project management and the Waterfall has meant that practitioners have frequently been compelled to employ methods that they knew were ineffective. This is most graphically illustrated by the UK government’s mandated use of the PRINCE2 project management method and SSADM development methodology. These two go hand in hand.

They are not necessarily flawed, and this article does not have room for their merits and problems, but they are associated with a traditional approach such as the Waterfall.

The UK’s National Audit Office stated in 2003 [11] that “PRINCE requires a project to be organised into a number of discrete stages, each of which is expected to deliver end products which meet defined quality requirements. Progress to the next stage of the project depends on successfully meeting the delivery targets for the current stage. The methodology fits particularly well with the ‘waterfall’ approach.”

The NAO says in the same paper that “the waterfall … remains the preferred approach for developing systems where it is very important to get the specification exactly right (for example, in processes which are computationally complex or safety critical)”. This is current advice. The public sector tends to be more risk averse than the private sector. If auditors say an approach is “preferred” then it would take a bold and confident project manager to reject that advice.

This official advice is offered in spite of the persistent criticism that it is never possible to define the requirements precisely in advance in the style assumed by the Waterfall model, and that attempting to do so is possible only if one is prepared to steamroller the users into accepting a system that doesn’t satisfy their goals.

The UK government is therefore promoting the use of a project management method partly because it fits well with a lifecycle that is fundamentally flawed because it has been shaped by project management needs rather than those of software development.

The USA and commercial procurement

The history of the Waterfall in the USA illustrates its durability and provides a further insight into why it will survive for some time to come; its neat fit with commercial procurement practices.

The US military was the main customer for large-scale software development contracts in the 1970s and insisted on formal and rigorous development methods. The US Department of Defense (DoD) did not explicitly mandate the Waterfall, but their insistence in Standard DOD-STD-2167 [12] on a staged development approach, with a heavy up-front documentation overhead, and formal review and sign-off of all deliverables at the end of each stage, effectively ruled out any alternative. The reasons for this were quite explicitly to help the DoD to keep control of procurement.

In the 80s the DoD relaxed its requirements and allowed iterative developments. However, it did not forbid the Waterfall, and the Standard’s continued insistence on formal reviews and audits that were clearly consistent with the Waterfall gave developers and contractors every incentive to stick with that model.

The damaging effects of this were clearly identified to the DoD at the time. A report of the Defense Science Board Task Force [13] criticised the effects of the former Standard, and complained that the reformed version did not go nearly far enough.

However, the Task Force had to acknowledge that “evolutionary development plays havoc with the customary forms of competitive procurement, … and they with it.”

The Task Force contained such illustrious experts as Frederick Brooks, Vic Basili and Barry Boehm. These were reputable insiders, part of a DoD Task Force, not iconoclastic academic rebels. They knew the problems caused by the Waterfall and they understood that the rigid structure of the model provided comfort and reassurance that large projects involving massive amounts of public money were under control. They therefore recommended appropriate remedies, involving early prototyping and staged awarding of contracts. They were ignored. Such was the grip of the Waterfall nearly 20 years after it had first been criticised by Royce.

The DoD did not finally abandon the Waterfall till Military Standard 498 (MIL-STD-498) seven years later in 1994, by which time the Waterfall was embedded in the very soul of the IT profession.

Even now the traditional procurement practices referred to by the Task Force, which fit much more comfortably with the Waterfall and the V Model, are being followed because they facilitate control, not quality. It is surely significant that the V Model is the only testing model that students of the Association of Chartered Certified Accountants learn about. It is the model for accountants and project managers, not developers or testers. The contractual relationship between client and supplier reinforces the rigid project management style of development.

Major George Newberry, a US Air Force officer specialising in software acquisition and responsible for collating USAF input to the defence standards, complained in 1995 [14] about the need to deliver mind-numbing amounts of documentation in US defence projects because of the existing standards.

“DOD-STD-2167A imposes formal reviews and audits that emphasize the Waterfall Model and are often nonproductive ‘dog and pony shows’. The developer spends thousands of staff-hours preparing special materials for these meetings, and the acquirer is then swamped by information overload.”

This is a scenario familiar to any IT professional who has worked on external contracts, especially in the public sector. Payments are tied to the formal reviews and dependent on the production of satisfactory documentation. The danger is that supplier staff become fixated on the production of the material that pays their wages, rather than the real substance of the project.

Nevertheless, as noted earlier, the UK public sector in particular, and large parts of the private sector are still wedded to the Waterfall method and this must surely bias contracts against a commitment to quality.

V for veneer?

What is seductive and damaging about the V Model is that it gives the Waterfall approach credibility. It has given a veneer of respectability to a process that almost guarantees shoddy quality. The most damaging aspect is perhaps the effect on usability.

The V Model discourages active user involvement in evaluating the design, and especially the interface, before the formal user acceptance testing stage. By then it is too late to make significant changes to the design. Usability problems can be dismissed as “cosmetic” and the users are pressured to accept a system that doesn’t meet their needs. This is bad if it is an application for internal corporate users. It is potentially disastrous if it is a web application for customers.

None of this is new to academics or test consultants who’ve had to deal with usability problems. However, what practitioners do in the field can often lag many years behind what academics and specialist consultants know to be good practice. Many organisations are a long, long way from the leading edge.

Rex Black provided a fine metaphor for this quality problem in 2002 [15]. After correctly identifying that V Model projects are driven by cost and schedule constraints, rather than quality, Black argues that the resultant fixing of the implementation date effectively locks the top of the right leg of the V in place, while the pivot point at the bottom slips further to the right, thus creating Black’s “ski slope and cliff”.

The project glides down the ski slope, then crashes into the “quality cliff” of the test execution stages that have been compressed into an impossible timescale.
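The arithmetic behind the squeeze is simple enough to sketch. What follows is an illustration only, not Black's own formulation. The function name and the figures (a 100-day project with the last 20 days reserved for test execution) are assumptions, chosen so that they reproduce the 10% slip and the halved test window from the round table conversation earlier.

```python
def remaining_test_window(total_days, test_days, slip_days):
    """Return (test days left, fraction of the test window lost) when the
    implementation date is fixed and the earlier stages overrun."""
    if not 0 < test_days <= total_days:
        raise ValueError("test_days must be between 1 and total_days")
    if slip_days < 0:
        raise ValueError("slip_days cannot be negative")
    # The end date does not move, so every day of slip comes out of testing.
    left = max(test_days - slip_days, 0)
    lost_fraction = (test_days - left) / test_days
    return left, lost_fraction


# Hypothetical figures, not taken from any real project plan.
total, test, slip = 100, 20, 10
left, lost = remaining_test_window(total, test, slip)
print(f"Slip as share of whole project: {slip / total:.0%}")  # 10%
print(f"Test days remaining: {left}")                         # 10
print(f"Share of test window lost: {lost:.0%}")               # 50%
```

A slip that looks modest against the whole plan is severe when measured against the only stage that absorbs it.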

The Waterfall may have resulted in bad systems, but its massive saving grace for companies and governments alike was that they were developed in projects that offered at least the illusion of being manageable! This suggests, as Raccoon stated [9], that the Waterfall may yet survive another hundred years.

The V Model’s great attractions were that it fitted beautifully into the structure of the Waterfall, it didn’t challenge that model, and it just looks right; comfortable and reassuring.

What can testers do to limit the damage?

I believe strongly that iterative development techniques must be used wherever possible. However, such techniques are beyond the scope of this article. Here I am interested only in explaining why the V Model is defective, why it has such a grip on our profession, and what testers can do to limit its potential damage.

The key question is therefore; how can we improve matters when we find we have to use it? As so often in life just asking the question is half the battle. It’s crucial that testers shift their mindset from an assumption that the V Model will guide them through to a successful implementation, and instead regard it as a flawed model with a succession of mantraps that must be avoided.

Testers must first accept the counter-intuitive truth that the Waterfall and V Model only work when their precepts are violated. This won’t come as any great surprise to experienced practitioners, though it is a tough lesson for novices to learn.

Developers and testers may follow models and development lifecycles in theory, but often it’s no more than lip service. When it comes to the crunch we all do whatever works and ditch the theory. So why not adopt techniques that work and stop pretending that the V Model does?

In particular, iteration happens anyway! Embrace it. The choice is between planning for iteration and frantically winging it, trying to squeeze in fixes and reworking of specifications.

Even the most rigid Waterfall project would allow some iteration during test execution. It is crucial that testers ensure there is no confusion between test cycles exercising different parts of the solution, and reruns of previous tests to see whether fixes have been applied. Testers must press for realistic allowances for reruns. One cycle to reveal defects and another to retest is planning for failure.
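A rough sketch shows why a single retest cycle is optimistic. The model here is my own simplifying assumption, not an established estimating technique: each rerun closes a fixed percentage of the defects still open, because some fixes fail or break something else. The function name, the defect count and the 70% figure are all hypothetical.

```python
def rerun_cycles_needed(open_defects, fix_success_percent, acceptable_open=0):
    """Count the rerun cycles needed before open defects fall to an
    acceptable level.

    Assumes, purely for illustration, that each cycle closes a fixed
    percentage of the defects still open (rounded down, but at least one).
    """
    if not 0 < fix_success_percent <= 100:
        raise ValueError("fix_success_percent must be between 1 and 100")
    cycles = 0
    while open_defects > acceptable_open:
        # Integer arithmetic avoids floating point rounding surprises.
        closed = max(open_defects * fix_success_percent // 100, 1)
        open_defects -= closed
        cycles += 1
    return cycles


# Hypothetical: the first test cycle revealed 40 defects.
print(rerun_cycles_needed(40, 100))  # every fix works first time: 1 rerun
print(rerun_cycles_needed(40, 70))   # 70% of fixes work each time: 5 reruns
```

The plan with one cycle to find defects and one to retest holds only in the first case, where every fix works first time.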

Once this debate has been held (and won!) with project management the tester should extend the argument. Make the point that test execution provides feedback about quality and risks. It cannot be right to defer feedback. Where it is possible to get feedback early it must be taken.

It’s not the testers’ job to get the quality right. It’s not the testers’ responsibility to decide if the application is fit to implement. It’s our responsibility to ensure that the right people get accurate feedback about quality at the right time. That means feedback to analysts and designers early enough to let them fix problems quickly and cheaply. This feedback and correction effectively introduces iteration. Acknowledging this allows us to plan for it.

Defenders of the V Model would argue that that is entirely consistent with the V Model. Indeed it is. That is the point.

However, what the V Model doesn’t do adequately is help testers to force the issue; to provide a statement of sound practice, an effective, practical model that will guide them to a happy conclusion. It is just too vague and wishy washy. In so far as the V Model means anything, it means to start test planning early, and to base your testing on documents from equivalent stages on the left hand side of the V.

Without these, the V Model is nothing. A fundamental flaw of the V Model is that it is not hooked into the construction stages in the way that its proponents blithely assume. Whoever heard of a development being delayed because a test manager had not been appointed?

“We’ll just crack on”, is the response to that problem. “Once we’ve got someone appointed they can catch up.”

Are requirements ever nailed down accurately and completely before coding starts? In practice, no. The requirements keep evolving, and the design is forced to change. The requirements and design keep changing even as the deadline for the end of coding nears.

“Well, you can push back the delivery dates for the system test plan and the acceptance test plan. Don’t say we’re not sympathetic to the testers!”

What is not acknowledged is that if test planning doesn’t start early, and if the solution is in a state of flux till the last moment, one is not left with a compromised version of the V Model. One is left with nothing whatsoever; no coherent test model.

Testing has become the frantic, last minute, ulcer inducing sprint it always was under the Waterfall and that the V Model is supposed to prevent.

It is therefore important that testers agitate for the adoption of a model that honours the good intentions of the V Model, but is better suited to the realities of development and the needs of the testers.

Herzlich’s W Model

An interesting extension of the V Model is Paul Herzlich’s W Model [16].

Herzlich W model

The W Model removes the vague and ambiguous lines linking the left and right legs of the V and replaces them with parallel testing activities, shadowing each of the development activities.

As the project moves down the left leg, the testers carry out static testing (i.e. inspections and walkthroughs) of the deliverables at each stage. Ideally prototyping and early usability testing would be included to test the system design of interactive systems at a time when it would be easy to solve problems. The emphasis would then switch to dynamic testing once the project moves into the integration leg.

There are several interesting aspects to the W Model. Firstly, it drops the arbitrary and unrealistic assumption that there should be a testing stage in the right leg for each development stage in the left leg. Each of the development stages has its testing shadow, within the same leg.

The illustration shows a typical example where there are the same number of stages in each leg, but it’s possible to vary the number and the nature of the testing stages as circumstances require without violating the principles of the model.

Also, it explicitly does not require the test plan for each dynamic test stage to be based on the specification produced in the twin stage on the left hand side. There is no twin stage of course, but this does address one of the undesirable by-products of a common but unthinking adoption of the V Model; a blind insistence that test plans should be generated from these equivalent documents, and only from those documents.

A crucial advantage of the W Model is that it encourages testers to define tests that can be built into the project plan, and on which development activity will be dependent, thus making it harder for test execution to be squeezed at the end of the project.

However, starting formal test execution in parallel with the start of development must not mean token reviews and sign-offs of the documentation at the end of each stage. Commonly under the V Model, and the Waterfall, test managers receive specifications with the request to review and sign off within a few days what the developers hope is a completed document. In such circumstances test managers who detect flaws can be seen as obstructive rather than constructive. Such last minute “reviews” do not count as early testing.

Morton’s Butterfly Model

Another variation on the V Model is the little known Butterfly Model [17] by Stephen Morton, which shares many features of the W Model.

Butterfly Model

The butterfly metaphor is based on the idea that clusters of testing are required throughout the development to tease out information about the requirements, design and build. These micro-iterations can be structured into the familiar testing stages, and the early static testing stages envisaged by the W Model.

In this model these micro-iterations explicitly shape the development during the progression down the development leg.

Butterfly Model micro-iterations

In essence, each micro-iteration can be represented by a butterfly; the left wing for test analysis, the right wing for specification and design, and the body is test execution, the muscle that links and drives the test, which might consist of more than one piece of analysis and design, hence the segmented wings. Sadly, this model does not seem to have been fully fleshed out, and in spite of its attractions it has almost vanished from sight.

Conclusion – the role of education

The beauty of the W and Butterfly Models is that they fully recognise the flaws of the V Model, but they can be overlaid on the V. That allows the devious and imaginative test manager to smuggle a more helpful and relevant testing model into a project committed to the V Model without giving the impression that he or she is doing anything radical or disruptive.

The V Model is so vague that a test manager could argue with a straight face that the essential features of the W or Butterfly are actually features of the V Model as the test manager believes it must be applied in practice. I would regard this as constructive diplomacy rather than spinning a line!

I present the W and Butterfly Models as interesting possibilities but what really matters is that test managers understand the need to force planned iteration into the project schedule, and to hook testing activities into the project plan so that “testing early” becomes meaningful rather than a comforting and irrelevant platitude. It is possible for test managers to do any of this provided they understand the flaws of the V Model and how to improve matters. This takes us onto the matter of education.

The V Model was the only “model” testers learned about when they took the old ISEB Foundation Certificate. Too many testers regarded that as the end of their education in testing. They were able to secure good jobs or contracts with their existing knowledge. Spending more time and money continuing their learning was not a priority.

As a result of this, and the pressures of project management and procurement, the V Model is unchallenged as the state of the art for testing in the minds of many testers, project managers and business managers.

The V Model will not disappear just because practitioners become aware of its problems. However, a keen understanding of its limitations will give them a chance to anticipate these problems and produce higher quality applications.

I don’t have a problem with testers attempting to extend their knowledge and skills through formal qualifications such as ISEB and ISTQB. However, it is desperately important that they don’t think that what they learn from these courses comprises All You Ever Need to Know About The Only Path to True Testing. They’re biased towards traditional techniques and don’t pay sufficient attention to exploratory testing.

Ultimately we are all responsible for our own knowledge and skills; for our own education. We’ve got to go out there and find out what is possible, and to understand what’s going on. Testers need to make sure they’re aware of the work of testing experts such as Cem Kaner, James Bach, Brian Marick, Lisa Crispin and Michael Bolton. These people have put a huge amount of priceless material out on the internet for free. Go and find it!


[1] Arnouts, D. (2008). “Test management in different Software development life cycle models”. “Testing Experience” magazine, issue 2, June 2008.

[2] IABG. “Das V-Modell”. This is in German, but there are links to English pages and a downloadable version of the full documentation in English.

[3] US Dept of Transportation, Federal Highway Administration. “Systems Engineering Guidebook for ITS”.

[4] BCS ISEB Foundation Certificate in Software Testing – Syllabus (NB PDF download).

[5] Wikipedia. “V-Model”.

[6] Wikipedia. “V-Model (software development)”.

[7] Royce, W. (1970). “Managing the Development of Large Software Systems”, IEEE Wescon, August 1970.

[8] Hallows, J. (2005). “Information Systems Project Management”. 2nd edition. AMACOM, New York.

[9] Raccoon, L. (1997). “Fifty Years of Progress in Software Engineering”. SIGSOFT Software Engineering Notes Vol 22, Issue 1 (Jan. 1997). pp88-104.

[10] McCracken, D., Jackson, M. (1982). “Life Cycle Concept Considered Harmful”, ACM SIGSOFT Software Engineering Notes, Vol 7 No 2, April 1982. Subscription required.

[11] National Audit Office. (2003). “Review of System Development – Overview”.

[12] Department of Defense Standard 2167 (DOD-STD-2167). (1985) “Defense System Software Development”, US Government defence standard.

[13] “Defense Science Board Task Force On Military Software – Report” (extracts), (1987). ACM SIGAda Ada Letters Volume VIII, Issue 4 (July/Aug. 1988) pp35-46. Subscription required.

[14] Newberry, G. (1995). “Changes from DOD-STD-2167A to MIL-STD-498”, Crosstalk – the Journal of Defense Software Engineering, April 1995.

[15] Black, R. (2002). “Managing the Testing Process”, p415. Wiley 2002.

[16] Herzlich, P. (1993). “The Politics of Testing”. Proceedings of 1st EuroSTAR conference, London, Oct. 25-28, 1993.

[17] Morton, S. (2001). “The Butterfly Model for Test Development”. Sticky Minds website.

The dragons of the unknown; part 8 – how we look at complex systems


This is the eighth post in a series about problems that fascinate me, that I think are important and interesting. The series draws on important work from the fields of safety critical systems and from the study of complexity, specifically complex socio-technical systems. This was the theme of my keynote at EuroSTAR in The Hague (November 12th-15th 2018).

The first post was a reflection, based on personal experience, on the corporate preference for building bureaucracy rather than dealing with complex reality, “facing the dragons part 1 – corporate bureaucracies”. Part 2 was about the nature of complex systems. The third followed on from part 2, and talked about the impossibility of knowing exactly how complex socio-technical systems will behave with the result that it is impossible to specify them precisely, “I don’t know what’s going on”.

Part 4, “a brief history of accident models”, looked at accident models, i.e. the way that safety experts mentally frame accidents when they try to work out what caused them.

The fifth post, “accident investigations and treating people fairly”, looked at weaknesses in the way that we have traditionally investigated accidents and failures, assuming neat linearity with clear cause and effect. In particular, our use of root cause analysis and our willingness to blame people for accidents are hard to justify.

Part six “Safety II, a new way of looking at safety” looks at the response of the safety critical community to such problems and the necessary trade offs that a practical response requires. The result, Safety II, is intriguing and has important lessons for software testers.

The seventh post “Resilience requires people” is about the importance of system resilience and the vital role that people play in keeping systems going.

This eighth post is about the way we choose to look at complex systems, the mental models that we build to try and understand them, and the relevance of Devops.

Choosing what we look at

The ideas I’ve been writing about resonated strongly with me when I first read about the safety and resilience engineering communities. What unites them is a serious, mature awareness of the importance of their work. Compared to these communities I sometimes feel as if normal software developers and testers are like children playing with cool toys while the safety critical engineers are the adults worrying about the real world.

The complex insurance finance systems I worked with were part of a wider system with correspondingly more baffling complexity. Remember the comments of Professor Michael McIntyre (in part six, “Safety II, a new way of looking at safety”).

“If we want to understand things in depth we usually need to think of them both as objects and as dynamic processes and see how it all fits together. Understanding means being able to see something from more than one viewpoint.”

If we zoom out for a wider perspective in both space and time we can see that objects which looked simple and static are part of a fluid, dynamic process. We can choose where we place the boundaries of the systems we want to learn about. We should make that decision based on where we can offer most value, not where the answers are easiest. We should not be restricting ourselves to problems that allow us to make definite, precise statements. We shouldn’t be looking only where the light is good, but also in the darkness. We should be peering out into the unknown where there may be dragons and dangers lurking.

[Image: drunkard looking under the streetlight]

Taking the wider perspective, the insurance finance systems for which I was responsible were essentially control mechanisms to allow statisticians to continually monitor and fine tune the rates, the price at which we sold insurance. They were continually searching for patterns, trying to understand how the different variables played off each other. We made constant small adjustments to keep the systems running effectively. We had to react to business problems that the systems revealed to our users, and to technical problems caused by all the upstream feeding applications. Nobody could predict the exact result of adjustments, but we learned to predict confidently the direction; good or bad.

The idea of testing these systems with a set of test cases, with precisely calculated expected results, was laughably naïve. These systems weren’t precise or accurate in a simple book-keeping sense, but they were extremely valuable. If we as developers and testers were to do a worthwhile job for our users we couldn’t restrict ourselves to focusing on whether the outputs from individual programs matched our expectations, which were no more likely to be “correct” (whatever that might mean in context) than the output.

Remember, these systems were performing massively complex calculations on huge volumes of data and thus producing answers that were not available any other way. We could predict how an individual record would be processed, but putting small numbers of records through the systems would tell us nothing worthwhile. Rounding errors would swamp any useful information. A change to a program that introduced a serious bug would probably produce a result that was indistinguishable from correct output with a small sample of data, but introduce serious and unacceptable error when we were dealing with the usual millions of records.
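
A rough sketch in Python may help to show why a small sample could hide a serious bug. It is not taken from the systems described here. The helper name, the 0.5% bias and the level of record-to-record variation are all assumptions chosen purely for illustration, and the sketch models the bug as a simple systematic shift in an average, which is a simplification.

```python
import math


def bias_in_standard_errors(n_records, bias, noise_sd):
    """How many standard errors a systematic bias in the mean stands above sampling noise.

    Hypothetical helper for illustration only.
    """
    # The standard error of a mean shrinks with the square root of the sample size.
    standard_error = noise_sd / math.sqrt(n_records)
    return bias / standard_error


# Assumed figures: a bug shifts every record by 0.5% of the typical value,
# while ordinary record-to-record variation is about 100% of it.
for n in (100, 1_000_000):
    print(n, round(bias_in_standard_errors(n, bias=0.005, noise_sd=1.0), 2))
```

Under these assumed figures the bias is about 0.05 standard errors with a hundred records, so it is lost in the noise, but about 5 standard errors with a million records, where it is a clear and systematic distortion.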

We couldn’t spot patterns from a hundred records using programs designed to tease out patterns from datasets with millions of records. We couldn’t specify expected outputs from systems that are intended to help us find out about unknown unknowns.

The only way to generate predictable output was to make unrealistic assumptions about the input data, to distort it so it would fit what we thought we knew. We might do that in unit testing but it was pointless in more rigorous later testing. We had to lift our eyes and understand the wider context, the commercial need to compete in the insurance marketplace with rates that were set with reasonable confidence in the accuracy of the pricing of the risks, rather than being set by guesswork, as had traditionally been the case. We were totally reliant on the expertise of our users, who in turn were totally reliant on our technical skills and experience.

I had one difficult, but enlightening, discussion with a very experienced and thoughtful user. I asked her to set acceptance criteria for a new system. She refused and continued to refuse even when I persisted. I eventually grasped why she wouldn’t set firm criteria. She couldn’t possibly articulate a full set of criteria. Inevitably she would only be able to state a small set of what might be relevant. It was misleading to think only in terms of a list of things that should be acceptable. She also had to think about the relationships between different factors. The permutations were endless, and spotting what might be wrong was a matter of experience and deep tacit knowledge.

This user was also concerned that setting a formal set of acceptance criteria would lead me and my team to focus on that list, which would mean treating the limited set of knowledge that was explicit as if it were the whole. We would be looking for confirmation of what we expected, rather than trying to help her shed light on the unknown.

Dealing with the wider context and becoming comfortable with the reality that we were dealing with the unknown was intellectually demanding and also rewarding. We had to work very closely with our users and we built strong, respectful and trusting relationships that ran deep and lasted long. When we hit serious problems, those good relations were vital. We could all work together, confident in each other’s abilities. These relationships have lasted many years, even though none of us still work for the same company.

We had to learn as much as possible from the users. This learning process was never ending. We were all learning, both users and developers, all the time. The more we learned about our systems the better we could understand the marketplace. The more we learned about how the business was working in the outside world the better our fine tuning of the systems.

Devops – a future reminiscent of my past?

With these complex insurance finance systems the need for constant learning dominated the whole development lifecycle to such an extent that we barely thought in terms of a testing phase. Some of our automated tests were built into the production system to monitor how it was running. We never talked of “testing in production”. That was a taboo phrase. Constant monitoring? Learning in production? These were far more diplomatic ways of putting it. However, the frontier between development and production was so blurred and arbitrary that we once, under extreme pressure of time, went to the lengths of using what were officially test runs to feed the annual high level business planning. This was possible only because of a degree of respect and trust between users, developers and operations staff that I’ve never seen before or since.

That close working relationship didn’t happen by chance. Our development team was pulled out of Information Services, the computing function, and assigned to the business, working side by side with the insurance statisticians. Our contact in operations wasn’t similarly seconded, but he was permanently available and was effectively part of the team.

The normal development standards and methods did not apply to our work. The company recognised that they were not appropriate and we were allowed to feel our way and come up with methods that would work for us. I wrote more about this a few years ago in “Adventures with Big Data”.

When Devops broke onto the scene I was fascinated. It is a response not only to the need for continuous delivery, but also to the problems posed by working with increasingly complex and intractable systems. I could identify with so much; the constant monitoring, learning about the system in production, breaking down traditional structures and barriers, different disciplines working more closely together. None of that seemed new to me. These had felt like a natural way to develop the deeply complicated insurance finance systems that would inevitably evolve into creatures as complex as the business environment in which they helped us to survive.

I’ve found Noah Sussman’s work very helpful. He has explicitly linked Devops with the ideas I have been discussing (in this whole series) that have emerged from the resilience engineering and safety critical communities. In particular, Sussman has picked up on an argument that Sidney Dekker has been making, notably in his book “Safety Differently”, that nobody can have a clear idea of how complex sociotechnical systems are working. There cannot be a single, definitive and authoritative (i.e. canonical) description of the system. The view that each expert has, as they try to make the system work, is valid but it is inevitably incomplete. Sussman put it as follows in his blog series “Software as Narrative”.

“At the heart of Devops is the admission that no single actor can ever obtain a ‘canonical view’ of an incident that took place during operations within an intractably complex sociotechnical system such as a software organization, hospital, airport or oil refinery (for instance).”

Dekker describes this as ontological relativism. The terminology from philosophy might seem intimidating, but anyone who has puzzled their way through a production problem in a complex system in the middle of the night should be able to identify with it. Brian Fay (in “Contemporary Philosophy of Social Science”) defines ontological relativism as meaning “reality itself is thought to be determined by the particular conceptual scheme of those living within it”.

If you’ve ever been alone in the deep of the night, trying to make sense of an intractable problem that has brought a complex system down, you’ll know what it feels like to be truly alone, to be dependent on your own skills and experience. The system documentation is of limited help. The insights of other people aren’t much use. They aren’t there, and the commentary they’ve offered in the past reflected their own understanding that they have constructed of how the system works. The reality that you have to deal with is what you are able to make sense of. What matters is your understanding, your own mental model.

I was introduced to this idea that we use mental models to help us gain traction with intractable systems by David Woods’ work. He (along with co-authors Paul Feltovich, Robert Hoffman and Axel Roesler) introduced me to the “envisaged worlds” that I mentioned in part one of this series. Woods expanded on this in “Behind Human Error” (chapter six), co-written with Sidney Dekker, Richard Cook, Leila Johannesen and Nadine Sarter.

These mental models are potentially dangerous, as I explained in part one. They are invariably oversimplified. They are partial, flawed and (to use the word favoured by Woods et al) they are “buggy”. But it is an oversimplification to dismiss them as useless because they are oversimplified; they are vitally important. We have to be aware of their limitations, and our own instinctive desire to make them too simple, but we need them to get anywhere when we work with complex systems.

Without these mental models we would be left bemused and helpless when confronted with deep complexity. Rather than thinking of these models as attempts to form precise representations of systems it is far more useful to treat them as heuristics. A heuristic is (as defined by James Bach, I think) a useful but fallible way to solve a problem or make a decision.

David Woods is a member of Snafucatchers, which describes itself as “a consortium of industry leaders and researchers united in the common cause of understanding and coping with the immense levels of complexity involved in the operation of critical digital services.”

Snafucatchers produced an important report in 2017, “STELLA – Report from the SNAFUcatchers Workshop on Coping With Complexity”. The workshop and report looked at how experts respond to anomalies and failures with complex systems. It’s well worth reading and reflecting on. The report discusses mental models and adds an interesting refinement, the line of representation.

[Image: the line of representation]

Above the line of representation we have the parts of the overall system that are visible; the people, their actions and interactions. The line itself has the facilities and tools that allow us to monitor and manage what is going on below the line. We build our mental models of how the system is working and use the information from the screens we see, and the controls available to us to operate the system. However, what we see and manipulate is not the system itself.

There is a mass of artifacts under the line that we can never directly see working. We see only the representation that is available to us at the level of the line. Everything else is out of sight and the representations that are available to us offer us only the chance to peer through a keyhole as we try to make sense of the system below. There has always been a large and invisible substructure in complex IT systems that was barely visible or understood. With internet systems this has grown enormously.

The green line is the line of representation. It is composed of terminal display screens, keyboards, mice, trackpads, and other interfaces. The software and hardware (collectively, the technical artifacts) running below the line cannot be seen or controlled directly. Instead, every interaction crossing the line is mediated by a representation. This is true as well for people in the using world who interact via representations on their computer screens and send keystrokes and mouse movements.

A somewhat startling consequence of this is that what is below the line is inferred from people’s mental models of The System.

And those models of the system are based on the partial representation that is visible to us above the line.

An important consequence of this is that people interacting with the system are critically dependent on their mental models of that system – models that are sure to be incomplete, buggy (see Woods et al above, “Behind Human Error”), and quickly become stale. When a technical system surprises us, it is most often because our mental models of that system are flawed.

This has important implications for teams working with complex systems. The system is constantly adapting and evolving. The mental models that people use must also constantly be revised and refined if they are to remain useful. Each of these individual models represents the reality that each operator understands. All the models are different, but all are equally valid, as ontological relativism tells us. As each team member has a different, valid model it is important that they work together closely, sharing their models so they can co-operate effectively.

This is a world in which traditional corporate bureaucracy with clear, fixed lines of command and control, with detailed and prescriptive processes, is redundant. It offers little of value – only an illusion of control for those at the top, and it hinders the people who are doing the most valuable work (see “part 1 – corporate bureaucracies”).

For those who work with complex, sociotechnical systems the flexibility, the co-operative teamwork, the swifter movement and, above all, the realism of Devops offer greater promise. My experience with deeply complex systems has persuaded me that such an approach is worthwhile. But just as these complex systems will constantly change so must the way we respond. There is no magic, definitive solution that will always work for us. We will always have to adapt, to learn and change if we are to remain relevant.

It is important that developers and testers pay close attention to the work of people like the Snafucatchers. They are offering the insights, the evidence and the arguments that will help us to adapt in a world that will never stop adapting.

In the final part of this series, part 9 “Learning to live with the unknowable” I will try to draw all these strands together and present some thoughts about the future of testing as we are increasingly confronted with complex systems that are beyond our ability to comprehend.

Frozen in time – grammar and testing standards

This recent tweet by Tyler Hayes caught my eye. “If you build software you’re an anthropologist whether you like it or not.”

It’s an interesting point, and it’s relevant on more than one level. By and large software is developed by people and for people. That is a statement of the obvious, but developers and testers have generally been reluctant to take on board the full implications. This isn’t a simple point about usability. The software we build is shaped by many assumptions about the users, and how they live and work. In turn, the software can reinforce existing structures and practices. Testers should think about these issues if they’re to provide useful findings to the people who matter. You can’t learn everything you need to know from a requirements specification. This takes us deep into anthropological territory.

What is anthropology?

Social anthropology is defined by University College London as follows.

Social Anthropology is the comparative study of the ways in which people live in different social and cultural settings across the globe. Societies vary enormously in how they organise themselves, the cultural practices in which they engage, as well as their religious, political and economic arrangements.

We build software in a social, economic and cultural context that is shaped by myriad factors, which aren’t necessarily conducive to good software, or a happy experience for the developers and testers, never mind the users. I’ve touched on this before in “Teddy Bear Methods”.

There is much that we can learn from anthropology, and not just to help us understand what we see when we look out at the users and the wider world. I’ve long thought that the software development and testing community would make a fascinating subject for anthropologists.

Bureaucracy, grammar and deference to authority

I recently read “The Utopia of Rules – On Technology, Stupidity, and the Secret Joys of Bureaucracy” by the anthropologist David Graeber.
Graeber has many fascinating insights and arguments about how organisations work, and why people are drawn to bureaucracy. One of his arguments is that regulation is imposed and formalised to try and remove arbitrary, random behaviour in organisations. That’s a huge simplification, but there’s not room here to do Graeber’s argument justice. One passage in particular caught my eye.

People do not invent languages by writing grammars, they write grammars — at least, the first grammars to be written for any given language — by observing the tacit, largely unconscious, rules that people seem to be applying when they speak. Yet once a book exists, and especially once it is employed in schoolrooms, people feel that the rules are not just descriptions of how people do talk, but prescriptions for how they should talk.

It’s easy to observe this phenomenon in places where grammars were only written recently. In many places in the world, the first grammars and dictionaries were created by Christian missionaries in the nineteenth or even twentieth century, intent on translating the Bible and other sacred texts into what had been unwritten languages. For instance, the first grammar for Malagasy, the language spoken in Madagascar, was written in the 1810s and ’20s. Of course, language is changing all the time, so the Malagasy spoken language — even its grammar — is in many ways quite different than it was two hundred years ago. However, since everyone learns the grammar in school, if you point this out, people will automatically say that speakers nowadays are simply making mistakes, not following the rules correctly. It never seems to occur to anyone — until you point it out — that had the missionaries came and written their books two hundred years later, current usages would be considered the only correct ones, and anyone speaking as they had two hundred years ago would themselves be assumed to be in error.

In fact, I found this attitude made it extremely difficult to learn how to speak colloquial Malagasy. Even when I hired native speakers, say, students at the university, to give me lessons, they would teach me how to speak nineteenth-century Malagasy as it was taught in school. As my proficiency improved, I began noticing that the way they talked to each other was nothing like the way they were teaching me to speak. But when I asked them about grammatical forms they used that weren’t in the books, they’d just shrug them off, and say, “Oh, that’s just slang, don’t say that.”

…The Malagasy attitudes towards rules of grammar clearly have… everything to do with a distaste for arbitrariness itself — a distaste which leads to an unthinking acceptance of authority in its most formal, institutional form.

Searching for the “correct” way to develop software

Graeber’s phrase “distaste for arbitrariness itself” reminded me of the history of software development. In the 1960s and 70s academics and theorists agonised over the nature of development, trying to discover and articulate what it should be. Their approach was fundamentally mistaken. There are dreadful ways, and there are better ways to develop software but there is no natural, correct way that results in perfect software. The researchers assumed that there was and went hunting for it. Instead of seeking understanding they carried their assumptions about what the answer might be into their studies and went looking for confirmation.

They were trying to understand how the organisational machine worked and looked for mechanical processes. I use the word “machine” carefully, not as a casual metaphor. There really was an assumption that organisations were, in effect, machines. They were regarded as first order cybernetic entities whose behaviour would not vary depending on whether they were being observed. To a former auditor like myself this is a ludicrous assumption. The act of auditing an organisation changes the way that people behave. Even the knowledge that an audit may occur will shape behaviour, and not necessarily for the better (see my article “Cynefin, testing and auditing”). You cannot do the job well without understanding that. Second order cybernetics does recognise this crucial problem and treats observers as participants in the system.

So linear, sequential development made sense. The different phases passing outputs along the production line fitted their conception of the organisation as a machine. Iterative, incremental development looked messy and immature; it was just wrong as far as the researchers were concerned. Feeling one’s way to a solution seemed random, unsystematic – arbitrary.

Development is a difficult and complex job; people will tend to follow methods that make the job feel easier. If managers are struggling with the complexities of managing large projects they are more likely to choose linear, sequential methods that make the job of management easier, or at least less stressful. So when researchers saw development being carried out that way they were observing human behaviour, not a machine operating.

Doubts about this approach were quashed by pointing out that if organisations weren’t quite the neat machine that they should be this would be solved by the rapid advance in the use of computers. This argument looks suspiciously circular because the conclusion that in future organisations would be fully machine-like rests on the unproven premise that software development is a mechanical process which is not subject to human variability when performed properly.

Eliminating “arbitrariness” and ignoring the human element

This might all have been no more than an interesting academic sideline, but it fed back into software development. By the 1970s, when these studies into the nature of development were being carried out, organisations were moving towards increasingly formalised development methods. There was increasing pressure to adopt such methods. Not only were they attractive to managers, the use of more formal methods provided a competitive advantage. ISO certification and CMMI accreditation were increasingly seen as a way to demonstrate that organisations produced high quality software. The evidence may have been weak, but it seemed a plausible claim. These initiatives required formal processes. The sellers of formal methods were happy to look for and cite any intellectual justification for their products. So formal linear methods were underpinned by academic work that assumed that formal linear methods were correct. This was the way that responsible, professional software development was performed. ISO standards were built on this assumption.

If you are trying to define the nature of development you must acknowledge that it is a human activity, carried out by and for humans. These studies about the nature of development were essentially anthropological exercises, but the researchers assumed they were observing and taking apart a machine.

As with the missionaries who were codifying grammar, the point in time when these researchers were working shaped the result. If they had carried out their studies earlier in the history of software development they might have struggled to find credible examples of formalised, linear development. In the 1950s software development was an esoteric activity in which the developers could call the shots. Twenty years later it was part of the corporate bureaucracy and iterative, incremental development was sidelined. If the studies had been carried out a few decades further on then it would have been impossible to ignore Agile.

As it transpired, formal methods, CMM/CMMI and the first ISO standards concerning development and testing were all creatures of that era when organisations and their activities were seriously regarded as mechanical. Like the early Malagasy grammar books they codified and fossilised a particular, flawed approach at a particular time for an activity that was changing rapidly. ISO 29119 is merely an updated version of that dated approach to testing. It is rooted in a yearning for bureaucratic certainty, a reluctance to accept that ultimately good testing is dependent not on documentation, but on that most irrational, variable and unpredictable of creatures – the human who is working in a culture shaped by humans. Anthropology has much to teach us.

Further reading

That is the end of the essay, but there is a vast amount of material you could read about attempts to understand and define the nature of software development and of organisations. Here is a small selection.

Brian Fitzgerald has written some very interesting articles about the history of development. I recommend in particular “The systems development dilemma: whether to adopt formalised systems development methodologies or not?”.

Agneta Olerup wrote this rather heavyweight study of what she calls the Langeforsian approach to information systems design. Börje Langefors was a highly influential advocate of the mechanical, scientific approach to software development. Langefors’ Wikipedia entry describes him as “one of those who made systems development a science”.

This paper gives a good, readable introduction to first and second order cybernetics, including a useful warning about the distinction between models and the entities that they attempt to represent.

All our knowledge of systems is mediated by our simplified representations—or models—of them, which necessarily ignore those aspects of the system which are irrelevant to the purposes for which the model is constructed. Thus the properties of the systems themselves must be distinguished from those of their models, which depend on us as their creators. An engineer working with a mechanical system, on the other hand, almost always know its internal structure and behavior to a high degree of accuracy, and therefore tends to de-emphasize the system/model distinction, acting as if the model is the system.

Moreover, such an engineer, scientist, or “first-order” cyberneticist, will study a system as if it were a passive, objectively given “thing”, that can be freely observed, manipulated, and taken apart. A second-order cyberneticist working with an organism or social system, on the other hand, recognizes that system as an agent in its own right, interacting with another agent, the observer.

Finally, I recommend a fascinating article in the IEEE’s Computer magazine by Craig Larman and Victor Basili, “Iterative and incremental development: a brief history”. Larman and Basili argue that iterative and incremental development is not a modern practice, but has been carried out since the 1950s, though they do acknowledge that it was subordinate to the linear Waterfall in the 1970s and 80s. There is a particularly interesting contribution from Gerald Weinberg, a personal communication to the authors, in which he describes how he and his colleagues developed software in the 1950s. The techniques they followed were “indistinguishable from XP”.