How overjoyed I was when I was approached by a senior manager and his team to help develop a true MBF. They were interested in the end-to-end quality of the products they test.
Having gone out to their customers and gotten VOC, they had heard directly what quality meant to those customers. Though there are many things one might consider part of quality code - maintainability, stability, etc. - our customers overwhelmingly told us one thing: NO DEFECTS!
Great, we knew what our customers didn’t want. Now, what’s the opportunity to have a defect? Oh… well… um… unfortunately that question hasn’t been well answered by the software industry.
I do want to digress and point you towards a presentation I saw Gary Gack give today on measuring software productivity, which I thought was very interesting. Productivity and quality both need that same “opportunity” measure to make them meaningful. A good measure of the quality of a product is (defects / opportunities) and a good measure of productivity is (effort / opportunity). In this case, opportunity might mean lines of code or function points or who knows what. Mr. Gack presented the standard case against these types of measures: challenges abound. He argues that instead of measuring the opportunity (the size of the work), you should measure how leanly you do the work. Of course, that fixes productivity but does nothing for the quality of the code. We’ll come back to that in my next post perhaps.
Anyway, long digression… the problem was that, just like everyone else, this QA department didn’t know what the opportunity was. So they chose one. GASP! Horrors, you say!?!? I disagree for one simple reason. Rather than avoiding the measurement because they didn’t know exactly what an “opportunity for a defect” might be, they chose to acknowledge that whatever operational definition they came up with would be imperfect AND that they would work to counteract its imperfections with additional measures to balance the scorecard.
By balancing, in this case, I don’t mean having measures for cost, quality and speed, but instead having measures that counteract or watch over the potential gaming that could be done to the main measurement.
Up to this point, I am a happy boy. Nay, I am practically shedding tears of joy. Never before in this organization have I seen some senior leader actually take this kind of initiative to try and at least get us in the ballpark when it came to where our quality stood.
So, let’s dissect the proposed measurement and see where it went awry. I’m happy to say up front that they performed a proper analysis of where they ended up and corrected it before moving on.
So, back to the original question… what is the opportunity for a defect? Most people choose lines of code (LOC) or function points. This team started with something they felt confident they could measure – test cases.
I know, I know. Test cases are no measure of opportunity… or are they? In our world, they well may be. We perform testing looking for good coverage over all the code created, so QA examines the requirements, writes scenarios and test cases from those requirements and executes them all. Since QA does not test selectively (i.e., does not perform risk-based testing), the number of test cases written probably correlates pretty well with the amount of opportunity. Yes, it is true, a test case is an opportunity to find a defect, not an opportunity to create a defect. It’s a subtle distinction, but important.
Of course, the team took it further. What about the complexity of the project and what about the number of people involved? Aren’t these things important indicators of how much opportunity there is for a defect as well? After all, just as you probably reacted at first, test cases seem like a bad way to measure opportunity. So they added those in to create their opportunity measure called “Weighted Test Cases (WTC)”. WTC is Test Cases * Project Size * Project Complexity. Ignore for a moment how size and complexity are figured out. Ultimately, their output measure of quality for a project would be (Defects / WTC).
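To make the arithmetic concrete, here is a minimal sketch of that first version of the measure. The post doesn't say how size and complexity were scored, so the numeric ratings below are my own assumption, and the project figures are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    defects: int
    test_cases: int
    size: float        # hypothetical size rating (assumed scale, e.g. 1 to 3)
    complexity: float  # hypothetical complexity rating (assumed scale, e.g. 1 to 3)


def weighted_test_cases(project: Project) -> float:
    """WTC = Test Cases * Project Size * Project Complexity."""
    return project.test_cases * project.size * project.complexity


def defects_per_wtc(project: Project) -> float:
    """Quality measure as first proposed: Defects / WTC."""
    wtc = weighted_test_cases(project)
    if wtc == 0:
        raise ValueError("WTC is zero; the measure is undefined")
    return project.defects / wtc


# Invented numbers, not data from the team described here.
example = Project("illustrative", defects=12, test_cases=200, size=2, complexity=3)
print(weighted_test_cases(example))  # 200 * 2 * 3
print(defects_per_wtc(example))      # 12 / 1200
```

Note how quickly the multipliers swamp the test case count: a project rated 3 on both scales gets nine times the opportunity of a project rated 1 with the same number of test cases.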
Here’s my sanity check for “do I have the right opportunity.” Do a correlation test. That’s the whole thinking behind having the opportunity as part of your measure in the first place, right? If I have more opportunities, then having more defects doesn’t necessarily mean I’m doing worse. Fewer opportunities, fewer defects. If defect counts don’t rise and fall with your opportunity measure, it isn’t doing its job as a denominator.
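If you want to try this sanity check yourself, a rough sketch follows. I'm assuming a plain Pearson correlation here (the post doesn't depend on which test you pick), and the per-project numbers are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-project data: candidate opportunity measure vs. defects found.
test_cases = [50, 120, 200, 310, 400]
defects = [4, 9, 15, 22, 31]

r = pearson_r(test_cases, defects)
print(f"correlation between test cases and defects: {r:.3f}")
```

A value near 1 says the candidate moves with defects and is a plausible denominator; a value near 0 says it tells you little about opportunity. With only a handful of projects, treat the number as a hint rather than proof.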
After they got their data together for their first go at it, we did some analysis. First we looked at the WTC denominator of this equation. Test Cases * Size * Complexity. Hmm, size… test cases… might these two things be related? I mean, after all, if you have more test cases you probably need to exert more effort to run all those tests.
Indeed they were, and quite strongly. Ok, so either size or test cases doesn’t belong in the denominator. Since test cases was our starting point, we dropped size. And for good measure we dropped complexity as well. Now we had Defects / Test Case. This seemed better.
Next, we went after the numerator. Are all defects created equal? No, I’m afraid not. Some defects have a high severity, some a medium and some a low. So maybe weighted defects (defect * severity) is closer to what we want instead?
Sure enough, when we compared the correlations of Defects to Test Cases and Weighted Defects to Test Cases, the Weighted Defects came out with a stronger relationship. Interesting! So it appears that more test cases doesn’t just mean that you’ll get more defects, but you’ll also get more defects of greater severity. This makes sense intuitively. Bigger projects are more complicated and have more chance to make big mistakes. (By the way, that same Gary Gack presentation alluded to something similar in productivity. Larger projects have lower productivity on average than smaller ones. There’s a nonlinear relationship between the two.)
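For the curious, here is a sketch of how that comparison could be set up. The severity weights (3, 2, 1) are my own assumption, since all that is specified above is defect * severity, and the project data is invented, so don't read the printed numbers as reproducing our result.

```python
import math

# Hypothetical severity weights; only "defect * severity" comes from the post.
SEVERITY_WEIGHTS = {"high": 3, "medium": 2, "low": 1}


def weighted_defects(counts_by_severity):
    """Sum of defect counts, each multiplied by its severity weight."""
    return sum(SEVERITY_WEIGHTS[sev] * n for sev, n in counts_by_severity.items())


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up projects: test case counts and defects broken down by severity.
projects = [
    {"test_cases": 50, "defects": {"high": 0, "medium": 1, "low": 3}},
    {"test_cases": 120, "defects": {"high": 1, "medium": 3, "low": 5}},
    {"test_cases": 200, "defects": {"high": 3, "medium": 5, "low": 6}},
    {"test_cases": 310, "defects": {"high": 6, "medium": 8, "low": 7}},
]

test_cases = [p["test_cases"] for p in projects]
raw = [sum(p["defects"].values()) for p in projects]
weighted = [weighted_defects(p["defects"]) for p in projects]

print("raw defects vs test cases:     ", round(pearson_r(test_cases, raw), 3))
print("weighted defects vs test cases:", round(pearson_r(test_cases, weighted), 3))
```

Whichever numerator correlates more strongly with your denominator is the better candidate. Run it on your own project history before trusting either one.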
Finally, we arrived at a simpler solution. From Defects / (Test Cases * Complexity * Size) to Weighted Defects / Test Cases.
And lastly, the team added some countermeasures. Why? Well, “test cases” is a proxy for opportunity, but only if the testing process remains stable. If testing gets better (i.e., more defects found per test case) then the quality of the code looks worse even though it may not be. If testing gets worse (i.e., “hey, we can make the denominator bigger if we write lots of teensy test cases”) then quality would look artificially better. So, the team added another measure – defect containment rate (DCR) – a Capers Jones favorite. By having DCR alongside the quality measure, if containment went up in proportion to the change in quality, then we’d know that quality remained the same while the appraisal process (testing) had improved.
And on the other side we decided to measure effort / test case executed. Since smaller test cases would take less effort to execute, if we saw a drop off in the effort per case we would know that people were trying to increase the denominator of our main measurement artificially.
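Here is a rough sketch of what those two watchdog measures might look like side by side. I'm assuming DCR means defects caught before release divided by all defects eventually found, which is a common definition but not one spelled out above. The 25% drop threshold and all the figures are invented for illustration.

```python
def defect_containment_rate(found_before_release, found_after_release):
    """Share of all known defects that testing caught before release (assumed definition)."""
    total = found_before_release + found_after_release
    if total == 0:
        raise ValueError("no defects recorded; containment rate is undefined")
    return found_before_release / total


def effort_per_test_case(effort_hours, test_cases_executed):
    """Average effort spent executing one test case."""
    if test_cases_executed <= 0:
        raise ValueError("need at least one executed test case")
    return effort_hours / test_cases_executed


def relative_change(previous, current):
    """Fractional change from the previous period to the current one."""
    if previous == 0:
        raise ValueError("previous value is zero; relative change is undefined")
    return (current - previous) / previous


# Invented figures for two reporting periods.
dcr = defect_containment_rate(found_before_release=90, found_after_release=10)
effort_then = effort_per_test_case(effort_hours=400, test_cases_executed=200)
effort_now = effort_per_test_case(effort_hours=300, test_cases_executed=300)
change = relative_change(effort_then, effort_now)

SUSPICIOUS_DROP = -0.25  # hypothetical threshold; tune it to your own history
print(f"DCR: {dcr:.0%}")
print(f"effort per test case changed by {change:.0%}")
if change < SUSPICIOUS_DROP:
    print("Effort per case fell sharply: check whether test cases are being split up.")
```

Neither number means much alone. Read them next to the main quality measure, which is the whole point of balancing the scorecard.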
Alas, those last two paragraphs have little to do with my takeaway lesson. Ready for this one? Start simple. Even though you can imagine why something so basic as “test cases” isn’t a good proxy for the opportunity to create a defect, you could be wrong.