I’ve been working on an interesting project at work. It isn’t the normal “understand this process, help us improve it” kind of work, however. I might call it “using statistics for evil” except that the intent isn’t really evil. It’s just that people don’t like what the data is telling them.
The back story is that a new project, call it X (hmm, that connotes a really secret project, which I don’t think this is, but let’s go with the name anyway), is being worked on. The project was going along fine until it got to testing. For some reason the test team could not get through the test cases.
The claim they made was that the test environments were unstable and that’s why they couldn’t get work done. To prove their point, they collected detailed start and end times of each outage. The numbers sure looked impressive. Hundreds of hours of outages were reported.
One might stop there and say that the recording of outages is proof in itself that work was impeded. But remember the claim was made only to justify that they couldn’t get their testing done.
Anyway, something seemed fishy about the story to a senior manager, who asked me to do some digging. Drawing on my newfound admiration for Karl Popper, I decided the best way to approach this problem was to test the hypothesis by looking for negative evidence.
The first thing I did was extract the number of test cases executed per day. Then I lined them up with the number of hours of outage reported per day. Assuming test cases are executed at a fairly constant rate, which they are, days with outages should naturally yield fewer test cases being run. Seems fair, right?
It was a simple correlation test, and the result was a Pearson correlation of 0.151. Essentially, there’s no evidence of a relationship between outages and less work being done; the weak correlation that did show up was positive, the opposite direction from what the claim predicts. I will fully admit that all I can find is a lack of evidence to support the hypothesis. Statistics is unfriendly that way, in that if you find solid evidence you know it’s there, but if you don’t find evidence you don’t know whether the relationship doesn’t exist or you just couldn’t observe it.
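For anyone who wants to try the same check, here is a minimal sketch of the correlation step in plain Python. The daily numbers are invented for illustration (they are not the project’s data), and `pearson_r` is just a name I made up for the helper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: one entry per day (NOT the real project numbers).
outage_hours = [0, 4, 1, 6, 0, 3, 2, 5]
cases_run = [41, 44, 38, 43, 40, 39, 45, 42]

r = pearson_r(outage_hours, cases_run)
print(f"Pearson r = {r:.3f}")
```

If outages really were blocking work, you would expect a clearly negative r here: more outage hours, fewer cases run.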
Anyway, I decided the right thing to do was to keep looking for possible proof or disproof of the hypothesis. I looked at whether outages resulted in more test case failures or blockages. They didn’t. I looked at whether outages resulted in more defect tickets being cancelled during the period. They didn’t. My thinking on that one was that maybe people were confusing code defects with outages, in which case we’d see a rise in cancelled defect tickets during outage periods.
Then I said, well let’s look at it from a test case duration perspective. Maybe I can see that test cases take longer when they overlap with outages than if they don’t. I separated the cases into two populations, those with outages and those without outages occurring during their run.
The data had both unequal variances and significant skewness, which unfortunately left me without a good hypothesis test to use. Mann-Whitney assumes equal variances in the data. A 2-sample t-test can handle unequal variances but assumes a normal distribution. It can handle some non-normality, but not as much as I was seeing.
Still, just for fun I tried a 2-sample t-test assuming unequal variances. It resulted in a p-value reported as 0.000, a huge amount of certainty that the population of test cases with outages took longer on average than test cases without outages. Visually, looking at the histograms, it didn’t look like much to me, but still, it got me a little worked up.
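For the curious, the unequal-variance version of the 2-sample t-test is usually called Welch’s t-test. Below is a rough sketch of the statistic and its approximate degrees of freedom using only the standard library. The durations are made up, `welch_t` is a hypothetical helper name, and I stop short of the p-value, which you would get from a t-distribution table or a stats package.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least 2 observations")
    # Squared standard error contributed by each sample (sample variance / n).
    se2_a = variance(a) / na
    se2_b = variance(b) / nb
    if se2_a + se2_b == 0:
        raise ValueError("t is undefined when both samples are constant")
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


# Hypothetical test-case durations in minutes (NOT the real project data).
with_outage = [95, 120, 240, 180, 310, 150, 400]
without_outage = [12, 30, 25, 45, 18, 60, 22, 35]

t, df = welch_t(with_outage, without_outage)
print(f"t = {t:.2f}, df = {df:.1f}")
```

Keep in mind that this test still leans on rough normality, which is exactly the assumption my skewed data was straining.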
See, I had a report out before the latest experiment and I had shared with the team my lack of evidence for their hypothesis. That really raised the ire of the managers who had set out their position as “it’s not my fault we couldn’t get the job done, it’s the environment.” And I was telling them, and their development partners at the same time, that wasn’t true (or at least I could find no evidence of it).
So, to have data (the result of the 2 sample T test) which supported their view after the fact would mean admitting it. I’m big enough to admit I’m wrong, but something just didn’t add up. How could it be that the rate at which test cases were marked complete stayed constant but the duration of test cases with outages was significantly longer? Then it hit me… I had been momentarily fooled by the same false causation that they had!
Starting out on a sunny morning, imagine that I take a 5 minute drive down to the local grocery store. By the time I get there, it is still sunny and it hasn’t rained. By comparison, let’s imagine that I take a drive across the country. Rather than the trip taking 5 minutes, like it does to the grocery store, it takes a few days. Now, even if the weather was perfect the entire trip, it’d still take a few days to drive across the country. It’s just big. It takes a long time. But my chances of encountering a rainstorm while driving down to the grocery store, having started out on a sunny morning, are much, much lower than if I start out on a multi-day trip across the country. Because driving across the country takes longer, weather regardless, there are more opportunities for it to rain on me.
And that’s what the team was really seeing. Tests that are naturally long running, and therefore hard to complete in the small window of time they had to actually do the testing, were more likely to overlap in timing with a reported environment outage. The environment outage didn’t do anything to really impede progress, but you just weren’t going to see an outage on a short test. The test would be over and done with too soon. So, when I separated test cases into those where there were and were not outages, the short test cases mostly ended up in the “no outages” bucket and the naturally longer test cases mostly ended up in the “overlapped with a reported outage” bucket. It looked as if cases were longer because of outages, but in reality, they would have been longer running regardless. The relationship is incidental.
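You can convince yourself of this with a quick simulation. In the sketch below, outages are scattered at random and have no effect at all on how long a test takes, yet the tests that happen to overlap an outage still come out longer on average. Every number (the time horizon, the outage count and length, the mean test duration) and the function name `simulate` are assumptions I picked for illustration, not anything measured on the project.

```python
import random
from statistics import mean


def simulate(n_tests=5000, horizon=1000.0, n_outages=50,
             outage_len=2.0, mean_duration=3.0, seed=0):
    """Split simulated test durations by whether they overlapped an outage.

    Outages never change a test's duration; they only determine which
    bucket the test lands in.
    """
    rng = random.Random(seed)
    # Random outage windows as (start, end) pairs, in hours.
    outages = []
    for _ in range(n_outages):
        start = rng.uniform(0, horizon)
        outages.append((start, start + outage_len))

    overlapped, clean = [], []
    for _ in range(n_tests):
        duration = rng.expovariate(1 / mean_duration)  # skewed, like real data
        start = rng.uniform(0, horizon)
        end = start + duration
        # Two intervals overlap when each one starts before the other ends.
        hit = any(o_start < end and start < o_end for o_start, o_end in outages)
        (overlapped if hit else clean).append(duration)
    return overlapped, clean


overlapped, clean = simulate()
print(f"overlapped an outage: n={len(overlapped)}, mean={mean(overlapped):.2f} h")
print(f"no outage overlap:    n={len(clean)}, mean={mean(clean):.2f} h")
```

The gap between the two means comes entirely from the sorting: a longer test simply has more chances to run into an outage window, just like the cross-country drive and the rain.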
Anyway, satisfied that I had explained away outages, I was prepared to stop this fool’s errand, but they insisted that I keep looking. One manager said “I was there with them, I experienced it first hand.” And all I can say in response is that the data doesn’t support the claims you are making. But she kept on saying it. I’m pretty sure she was determined to go down with the ship, so to speak.
I realize it’s hard to revise your view of the world in the face of conflicting evidence. Yet there are studies that show that those who are willing to change their world view in light of new data are more successful. Indeed clinging to old unsupported views is bordering on the definition of insanity. Say it all you want, it doesn’t make it true.
Posted by ProcessRants