It’s usually true that our attitudes and beliefs are shaped by our early experiences. That applies to my views on software development and testing. My first experience of real responsibility in development and testing was with insurance financial systems. What I learned and experienced will always remain with me. I have always struggled with some of the tenets of traditional testing, and in particular the metrics that are often used.
There has been some recent discussion on Twitter about Defect Removal Efficiency. It was John Stephenson’s blog that set me thinking once again about DRE, a metric I’d long since consigned to my mental dustbin.
If you’re unfamiliar with the metric, it is the number of defects found before implementation, expressed as a percentage of all the defects discovered up to a set period after going live (i.e. live defects plus development defects). The cut-off is usually 90 days from implementation. So the more defects reported in testing and the fewer in live running, the higher the percentage, and the higher the quality (supposedly). A perfect application would have no live defects and therefore a DRE score of 100%; all defects were found in testing.
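The definition above can be written down as a minimal sketch. The function and parameter names are my own, chosen for illustration; they do not come from any standard tool.

```python
def defect_removal_efficiency(found_before_release: int, found_live_in_window: int) -> float:
    """Return DRE as a percentage.

    found_before_release: defects recorded before implementation.
    found_live_in_window: defects recorded in live running up to the
        cut-off (usually 90 days after implementation).
    """
    total = found_before_release + found_live_in_window
    if total == 0:
        # With no recorded defects at all, the ratio has no meaning.
        raise ValueError("DRE is undefined when no defects were recorded")
    return 100.0 * found_before_release / total


# 90 defects found in testing, 10 in the first 90 days of live running.
print(defect_removal_efficiency(90, 10))  # 90.0

# The supposedly perfect application: no live defects.
print(defect_removal_efficiency(50, 0))   # 100.0
```

Note that the result depends entirely on what gets recorded as a defect and when, which is where the trouble starts.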
John’s point was essentially that DRE can be gamed so easily that it is worthless. I agree. However, even if testers and developers tried not to manipulate DRE, even if it couldn’t be gamed at all, it would still be an unhelpful and misleading metric. It’s important to understand why, so that we can exercise due scepticism about other dodgy metrics and flawed approaches to software development and testing.
DRE is based on a view of software development, testing and quality that I don’t accept. I don’t see a world in which such a metric might be useful, and it contradicts everything I learned in my early days as a team leader, project manager and test manager.
Here are the four reasons I can’t accept DRE as a valid metric. There are other reasons, but these are the ones that matter most to me.
Software development is not a predictable, sequential manufacturing activity
DRE implicitly assumes that development is like manufacturing, that it’s a predictable exercise in building a well understood and defined artefact. At each stage of the process defects should be progressively eliminated, till the object is completed and DRE should have reached 95% (or whatever).
You can see this sequential mindset clearly in this article by Capers Jones, “Measuring Defect Potentials and Defect Removal Efficiency” (PDF), from QA Journal in 2008.
“In order to achieve a cumulative defect removal efficiency of 95%, it will be necessary to use approximately the following sequence of at least eight defect removal activities:
• Design inspections
• Code inspections
• Unit test
• New function test
• Regression test
• Performance test
• System test
• External Beta test
To go above 95%, additional removal stages will be needed. For example requirements inspections, test case inspections, and specialized forms of testing such as human factors testing, performance testing, and security testing add to defect removal efficiency levels.”
Working through sequential “removal stages” is not software development or testing as I recognise them. When I was working on these insurance finance systems there was no neat sequence through development with defects being progressively removed. Much of the early development work could have been called proof of concept. It wasn’t a matter of coding to a specification and then unit testing against that spec. We were discovering more about the problem and experimenting to see what would work for our users.
Many of those experiments did not work. Each of these “failures” was a precious nugget of extra information about the problem we were trying to solve. The idea that we would have improved quality by recording everything that didn’t work and calling it a defect would have been laughable. Yet this is the implication of another statement by Capers Jones in a paper on the International Function Point Users Group website (December 2012), “Software Defect Origins and Removal Methods” (PDF).
“Omitting bugs found in requirements, design, and by unit testing are common quality omissions.”
So experimenting to learn more about the problem without treating the results as formal defects is a quality omission? Tying up developers and testers in bureaucracy by extending formal defect management into unit testing is the way to better quality? I don’t think so.
Once we start to change the way people work simply so that we can gather data for metrics, we are not simply encouraging them to game the system. It is worse than that. We are trying to change reality to fit our ability to describe it. We are pretending we can change the territory to fit the map.
Quality is not an absence of something
My second objection to DRE in principle is quite simple. It misrepresents quality. “Quality is value to some person”, as Jerry Weinberg famously said in his book “Quality Software Management: Systems Thinking”.
The insurance applications we were developing were intended to help our users understand the business and products better so that they could take better decisions. The quality of the applications was a matter of how well they helped our users to do that. These users were very smart and had a very clear idea of what they were doing and what they needed. They would have bluntly and correctly told us we were stupid and trying to confuse matters by treating quality as an absence of defects. That takes me on to my next objection to DRE.
Defects are not interchangeable objects
A defect is not an object. It possesses no qualities except those we choose to grant it in specific circumstances. In the case of my insurance applications a defect was simply something we didn’t understand that required investigation. It might be a problem with the application, or it might be some feature of the real world that we hadn’t known about and which would require us to change the application to handle it.
We never counted defects. What is the point of adding up things I don’t understand or don’t know about? I don’t understand quantum physics and I don’t know offhand what colour socks my wife is wearing today. Adding the two pieces of ignorance together to get two is not helpful.
Our acceptance criteria never mentioned defect numbers. The criteria were expressed in accuracy targets against specific oracles, e.g. we would have to reconcile our figures to within 5% of the general ledger. What was the basis for the 5% figure? Our users knew from experience that 95% accuracy was good enough to let them take significantly better decisions than they could without the application. 100% was an ideal, but the users knew that the extra development time needed to try to reach that level of accuracy would impose a significant business cost, because crucial decisions would have had to be taken blindfolded while we tried to polish up a perfect application.
If there was time we would investigate discrepancies even within the 5% tolerance. If we went above 5% in testing or live running then that was a big deal and we would have to respond accordingly.
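An acceptance criterion of this kind is easy to state precisely. The following is only a sketch of the idea, not the code we actually ran; the function names and the figures in the example are invented for illustration.

```python
def relative_discrepancy(app_total: float, ledger_total: float) -> float:
    """Return the gap between the application figure and the ledger,
    as a fraction of the ledger figure."""
    if ledger_total == 0:
        raise ValueError("cannot reconcile against a zero ledger total")
    return abs(app_total - ledger_total) / abs(ledger_total)


def within_tolerance(app_total: float, ledger_total: float, tolerance: float = 0.05) -> bool:
    """True if the application figure reconciles to the ledger within
    the agreed tolerance (5% by default, as in the text)."""
    return relative_discrepancy(app_total, ledger_total) <= tolerance


# Hypothetical figures: a ledger total of 1,000,000.
print(within_tolerance(960_000, 1_000_000))  # True: 4% out, investigate if time allows
print(within_tolerance(940_000, 1_000_000))  # False: 6% out, a big deal
```

The check says nothing about how many defects exist. It asks whether the application is accurate enough for the users to rely on.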
You may think that this was a special case. Well yes, but every project has its own business context and user needs. DRE assumes a standard world in which 95% DRE is necessarily better than 90%. The additional cost and delay of chasing that extra 5% could mean the value of the application to the business is greatly reduced. It all depends. Using DRE to compare the quality of different developments assumes that a universal, absolute standard is more relevant than the needs of our users.
Put simply, when we developed these insurance applications, counting defects added nothing to our understanding of what we were doing or our knowledge about the quality of the software. We didn’t count test cases either!
DRE has a simplistic, standardised notion of time
This problem is perhaps related to my earlier objection that DRE assumes developers are manufacturing a product, like a car. Once it rolls off the production line it should be largely defect free. The car then enters its active life and most defects should be revealed fairly quickly.
That analogy made no sense for insurance applications, which are highly date sensitive. Insurance contracts might be paid for up front, or in instalments, but they earn money on a daily basis. At the end of the contract period, typically a year, they have to be renewed. The applications consist of different elements performing distinct roles according to different timetables.
DRE requires an arbitrary cut-off beyond which you stop counting the live defects and declare a result. It’s usually 90 days. Applying a 90-day cut-off for calculating DRE and using that as a measure of quality would have been ridiculous for us. Worse, if that had been a measure for which we were held accountable it would have distorted important decisions about implementation. With new insurance applications you might convert all the data from the old application when you implement the new one. Or you might convert policies as they come up for renewal.
Choosing the right tactics for conversion and implementation was a tricky exercise in balancing different factors. If DRE with a 90-day threshold were applied, then different tactics would give different DRE scores. The team would have a strong incentive to choose the approach that would produce the highest DRE score, and not necessarily the one that was best for the company.
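To make that concrete, here is a small sketch with entirely made-up numbers. Assume the same application, with the same 36 defects found before release and the same four latent problems, implemented under two different conversion tactics. The only thing that changes is the day on which each live problem happens to surface.

```python
def dre_at_cutoff(found_before_release: int, live_defect_days: list[int], cutoff_days: int = 90) -> float:
    """Return DRE as a percentage, counting only the live defects that
    surfaced on or before the cut-off day."""
    counted = sum(1 for day in live_defect_days if day <= cutoff_days)
    total = found_before_release + counted
    if total == 0:
        raise ValueError("DRE is undefined when no defects were recorded")
    return 100.0 * found_before_release / total


# Hypothetical days after implementation on which live defects surfaced.
big_bang_conversion = [5, 20, 40, 80]       # all data converted up front
renewal_conversion = [30, 120, 200, 300]    # policies converted as they renew

print(round(dre_at_cutoff(36, big_bang_conversion), 1))  # 90.0
print(round(dre_at_cutoff(36, renewal_conversion), 1))   # 97.3
```

In this invented example the software is identical in both cases, yet the second tactic scores better simply because three of the four problems surface after the counting stops.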
Now of course you could tailor the way DRE is calculated to take account of individual projects, but the whole point of DRE is that people who should know better want to make comparisons across different projects, organisations and industries and decide which produces greater quality. Once you start allowing for all these pesky differences you undermine that whole mindset that wants to see development as a manufacturing process that can be standardised.
DRE matters – for the wrong reasons
DRE might be flawed beyond redemption, but metrics like that matter to important people for all the wrong reasons. The logic is circular. Development is like manufacturing, therefore a measure that is appropriate for manufacturing should be adopted. Once it is being used to beat up development shops that score poorly, they have an incentive to distort their processes to fit the measure. They have to buy in consultancy support to adapt the way they work. The flawed metric then justifies the flawed assumptions that underpin the metric. It might be logical nonsense, but there is money to be made there.
So DRE is meaningless because it can be gamed? Yes, indeed, but any serious analysis of the way DRE works reveals that it would be a lousy measure even if everyone tried to apply it responsibly. Even if it were impossible to game, it would still suck. It’s trying to redefine reality so we can count it.