Why measurement is necessary

November 17, 2009

This evening, my daughter, in an attempt to be helpful, offered to put our dog’s food into his bowl for dinner. She does this every now and again, and usually the mess is kept to a minimum. Regardless, I can hear the difference from across the room when his food goes into his metal bowl or onto the floor.

It isn’t atypical to hear the clatter of a few pieces of his food hit the wood floor on the way and this night was no exception. As my daughter raced back to me with the now empty food scoop in hand, she says to me “I spilled some.” Now, I don’t know what some means to you, but it doesn’t mean the vast majority of the food, right?

Apparently “some” meant exactly that to my daughter. Certainly, as I looked down at my dog’s bowl there was an amount of food in the bowl and another amount outside the bowl on the floor. But “some” is not the word I would have used to describe the amount on the floor. “Most,” if I have to be inexact about it, is the word that I would have used.

I know it’s a dumb example, but this is exactly why we need to measure things. All the words we have to describe portions are inexact. What does “a few” mean? More than 2 certainly, but are a “few” deaths resulting from the millions of products you sold 3 people or 300 people or 3000 people? Compared to the whole, even 3000 out of a million (0.3%) might be a few!

The “majority” suffers this issue in news reporting all the time. When the majority of people approve of the President’s performance, it means some number greater than 50% approve. 50.0000001% is a majority. It’s not an overwhelming majority, but it’s a majority technically. And I’ve seen “most” used to mean a simple majority as well, which is crazy, since most clearly means something higher than that. Is 75% most? 80%? 90%? Who knows? The definition is variable.

What about “some”? Officially, it’s just a number greater than 0. Some of the food was outside the bowl. Indeed, not all of the food was outside the bowl, so some is a fair statement. But my version of some and my daughter’s version are really different in this case.

And it’s for this reason, that English is an inexact language, that true measurement is needed. A proportion would have told a much better story. Not that I would have expected my daughter to say “daddy, I spilled seventy-five percent of the food,” but I can expect that from an adult.

Let’s talk real numbers in business. Put a scale to it – a proportion that is a problem, a count that is a problem, but some real measure of how much is really wrong. “We have some issues with code quality,” after seeing my daughter’s definition of “some” tonight, has a whole new meaning for me.


The difference between knowing and thinking you know

November 8, 2009

I was talking with one of my managers the other day when she presented an interesting point that I thought I’d share. There is a difference between knowing something and thinking you know something. Sure, there is pure ignorance, but it’s not the type of issue we were talking about.

The type of intellectual leap you shouldn’t make is taking incomplete knowledge and representing it as complete knowledge. It’s the overextension of a specific situation into the general problem. For example, we know that some defects were introduced when the code was written years ago and are just appearing now, but that doesn’t mean that our problem is all latent bugs. This is an inductive fallacy. All we’ve seen (for those that we can confirm at all) is that some bugs have existed since the creation of the code, but this does not make the general statement “our issue is latent bugs” necessarily true.

The list of bad extensions is long, and yet easy to remedy. We should be talking about odds and probabilities and uncertainty and recognizing that it exists with much of what we deal with. For example, Stephen Kan, a researcher from IBM, recognized that test execution doesn’t happen linearly but instead follows an S-shaped curve. When we explored our patterns of test execution we discovered the same thing held true of us. And so, knowing that, we could set forth a realistic pattern of what would have to happen (for a given set of tests and duration to complete the testing) in order for us to be done on time. We tried it out, and found out that we don’t always perfectly follow the curve.

It wasn’t about perfectly following the curve. It was about more-or-less following it. There are essentially bands, plus or minus, around this idealized curve which represent the uncertain area which is normal variation from the perfect shape. As long as you stay within those, we feel comfortable that things are on track. But we represent the uncertainty, not ignore it, because it is there. The knowledge says “testing follows an S-shape”, but we don’t make that an absolute truth, just a general pattern that includes room for representing that we don’t understand everything that may cause it to deviate from that path.
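To make the idea of bands concrete, here is a rough sketch in Python. It assumes a logistic function for the idealized S-curve and a flat plus-or-minus 10% band; both are my own illustrative choices, not Kan’s model or our real numbers, and the function names, the steepness parameter and the sample project are all made up.

```python
import math

def expected_fraction(day, total_days, steepness=10.0):
    """Idealized S-curve: fraction of test cases expected to be done by `day`.

    Assumption: a logistic curve centered on the schedule midpoint,
    rescaled so it is 0 at day 0 and 1 at total_days.
    """
    def logistic(t):
        return 1.0 / (1.0 + math.exp(-steepness * (t - 0.5)))

    lo, hi = logistic(0.0), logistic(1.0)
    return (logistic(day / total_days) - lo) / (hi - lo)

def on_track(day, total_days, executed, total_cases, band=0.10):
    """True if actual progress is within +/- `band` of the idealized curve."""
    actual = executed / total_cases
    return abs(actual - expected_fraction(day, total_days)) <= band

# Hypothetical project: 400 test cases scheduled over 20 days.
for day, executed in [(5, 30), (10, 210), (15, 300)]:
    expected = expected_fraction(day, 20)
    print(day, round(expected, 2), on_track(day, 20, executed, 400))
```

In this made-up project the day 5 and day 10 readings sit inside the band, while day 15 (75% done against roughly 93% expected) falls outside it and is worth a conversation. How wide the band should be is something to work out from your own history, not from this sketch.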

"A little learning is a dangerous thing; drink deep, or taste not the Pierian spring: there shallow draughts intoxicate the brain, and drinking largely sobers us again."

- Alexander Pope, An Essay on Criticism

His quote is not just a warning that knowing a little causes you to do bad things, but that it is necessary for us to drink deeply, acknowledge that we do not know everything and seek to figure out what it is we do not know.


100 metrics you mostly shouldn’t bother with

October 27, 2009

Because I tend to focus my comments on software development processes, I subscribe to a number of newsletters just to keep abreast of stuff that other people are writing. Generally, stuff is a rehash of what we’ve already heard – see my commentary on why it seems nobody is doing primary research anymore. Today I got an interesting one which was “100 IT Performance Metrics”.

I’ve decided to rename it to “100 metrics you mostly shouldn’t bother with.” Don’t get me wrong, I love metrics and data but there are just too many issues with this proposal from Mr. Spanos to ignore.

  1. 100 is way too much information. It violates the idea of the critical few things you ought to control. And there are a critical few things which really make most of the difference.
  2. Wrong statistic. Mr. Spanos suggests the average in a lot of places where the median would likely be more appropriate. See numbers 3, 4, 9, 10, 11, etc.
  3. Rather than reporting one useful metric, he recommends many more than necessary. Boxplots, in many cases, would display the data better than 3-4 graphs. For example, numbers 3 and 6 (mean and max resolution time) could be built into a single graph which also shows interquartile range and outliers and provides an easy to read month-over-month comparison. Just doing this could cut the number of proposed metrics in half or more.
  4. Stratification hell – by type, by severity. Enough said. If you solve a problem and have fewer incidents the whole number will go down. It is highly unlikely that the resolution of high severity incidents will be replaced in equal volumes by medium severity incidents. Just measure the overall number, once.
  5. Lack of scale. All these measures lack any sense of scale. Unlike a factory, the amount of work an IT shop is doing varies greatly from month to month. Without a normalizing factor, increases (or decreases) in any of these measures might entirely be due to changes in work volume.
  6. Irrelevance. Average contractor cost? Here’s a place where median is a necessity, since one expensive contractor will blow this number out of the water. Also, who cares? Are you in the business of measuring what rate the market will bear or the rate of inflation year over year? If you’ve chosen to use contractors, you’re going to have to pay for them.
  7. Unmeasurable. Change success rate. What the heck is a successful change? One that makes it to production? One that makes it to production with no bugs? One that yields no bugs once in production?
  8. Lagging. All these metrics are after-the-fact. If you live by these metrics you have to wait for bad things to happen to you before you realize it. Find some leading indicators of when you’ll be off budget or off schedule or off quality targets and use those instead.
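To show what I mean in points 2, 3 and 5, here is a small Python sketch. The resolution times and the work-volume figures are invented for illustration, and the function names are my own; the only library used is Python’s standard statistics module.

```python
import statistics

def summarize(values):
    """Median, quartiles and range (the numbers a boxplot draws), plus the mean."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(values),
        "mean": statistics.mean(values),
    }

def per_unit_of_work(count, work_volume):
    """Normalize a raw count by how much work was done in the period."""
    return count / work_volume

# Hypothetical incident resolution times in hours; one ticket took three weeks.
resolution_hours = [2, 3, 3, 4, 4, 5, 6, 8, 504]
summary = summarize(resolution_hours)
print(summary)

# Hypothetical months: more incidents, but far more work shipped.
print(per_unit_of_work(40, 10), per_unit_of_work(60, 20))
```

One three-week ticket drags the mean to about 60 hours while the median stays at 4, which is the whole argument for the median. And the second month has more incidents in total but fewer per unit of work, which is the argument for scale.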

I could go on, but why bother? This is just too easy to pick apart. If you think implementing 100 metrics is the right thing to do, you’re off your rocker. 50 is probably too many. 25 is too. Think critical few things that make your business tick. I’d guess the number is probably 3-4 output metrics and 3-4 input metrics for each output – somewhere in the range of 12 to 16 measures should get you most of the way there.


The measurement of success

September 14, 2009

How does your organization measure success? Is it your ever increasing profits? Your growing bottom line? Market share? Customer satisfaction?

Have you ever considered measuring your success as being the organization who has failed the least? Not the organization that succeeds the most, but just the one that isn’t last. As someone once said “you don’t have to be able to outrun the bear. You just have to be able to outrun the slowest guy being chased by the bear.” Still, I sat through a meeting the other day, where the praises of our “not sucking as much as everyone else” were being sung. Yes, really.

With the exception of the bear analogy, I don’t think this kind of measurement is a helpful one. Even if right now you are failing, but at the head of the pack of failures (i.e. failing the least out of all possible options), that doesn’t necessarily make you a winner. What if everyone running the marathon was measured by how far they made it before quitting rather than crossing the finish line? Nobody reaches the goal and yet we have a winner?

It just doesn’t make sense. If what you are doing isn’t meeting your customer needs, then being the closest to meeting their needs doesn’t mean your customer is happy. The space between your level of performance and what your customer actually desires, even if currently filled by no competitor in the market, is a space that can potentially be filled. And when someone fills that space, you will no longer be the guy outrunning all the rest – you’ll be the bear’s dinner.


Does anyone do primary research anymore?

September 9, 2009

I’ve been reading a lot of research lately about measurement in software development and they’re beginning to sound like a broken record. For a while I couldn’t quite put my finger on why it seemed like I was reading the same thing over and over and over again.

Finally, I figured it out. Nobody was doing any new primary research. These research papers sounded like each other because they all quoted a handful of papers that all quoted another handful of papers that all ultimately led back to just a few key pieces of research. Essentially, everything appears to have come from Capers Jones, Barry Boehm, Lister & DeMarco or a handful of other places.

Everyone else’s stuff was just permutations of that. They were adding new opinions on top of the data, but no new data was being collected and analyzed. These papers lacked any statistical testing (or data for that matter) because no new research was being done. Now, maybe I’m just getting my hands on the wrong papers, but it seems far easier to find theoretical works rather than empirical works when it comes to computer science.

I find that strange, since I’m of the belief that computer science is just that, a science. It’s a science as much as any other engineering is a science. There are lots of ways to do it, but it turns out there are some best practices that just work better than others. And even though every software program we write is different, there are commonalities that allow us to generalize about situations. Somehow, unlike physics or other sciences, when someone comes out with research, nobody bothers to do anything to try and reproduce it. We just accept it at face value.

How is computer science ever supposed to advance as a science if nobody is willing to study it? We’re forever doomed to it being an art as long as we continue to treat it as one.

Get out there people and study your organizations! Build up knowledge about the science at your company because by golly we can’t count on someone else to do it for us – they’re just not doing it for some reason. Do primary research instead of relying on secondary papers which are all theoretical.


Independent Verifiability

August 31, 2009

As it is wont to do, the conversation about productivity has come up around the office again. Like many development shops, we fail to measure any form of productivity at all. And the primary reason we don’t measure productivity is not because we don’t track where our employees’ time goes, but because we don’t track how much work they’re doing.

See, unlike a factory which produces, say, nuts and bolts, it is difficult to count a delivered unit of functionality in software. The conventional options seem to fail us. Lines of code (LOC) is troublesome because even in an unwatched environment, the person-to-person variability in how many lines of code it takes to get something done is quite large. Once you begin monitoring people for how many lines of code they write it encourages them to write extra lines of code, whether they are necessary for the functionality or not, in order to inflate their productivity.

The other measure commonly used, but still problematic, is Function Points (FPs). FPs don’t get heavy adoption because they require expertise to implement (plus ongoing costs) AND it turns out that developers can actually overcomplicate their design to increase the number of function points delivered – an undesirable result.

Maybe this is self-evident, but the problem with both of these measures is that they aren’t independently verifiable. Instead, the counting of productivity (whether it be LOC or FPs) depends heavily on the person whose productivity you are measuring. By comparison, if you were at a factory of any kind, a layperson can count how many units come out the door. How much you produce can be verified independent of having any assistance in figuring it out.

So, what does this mean? If our existing measures of productivity can be gamed, what are we left with? Here’s my idea: we use one team in the company to measure the other. For example, we measure developer productivity in the number of test cases needed to verify all the functionality they deliver. The (perhaps big) assumption is that the developer delivers no more functionality than requested by the business, and thus the cases meant to verify that said functionality was delivered act as an independent verification of the developer. Now, since the tester is acting as the counter, you cannot measure the tester’s productivity as cost / test case executed, since that would encourage the person acting as your measurement system to needlessly increase the number of test cases. Indeed, we should measure testers not on cases run but on valid defects detected, since it is their job to appraise the system and writing unnecessary test cases doesn’t add to the appraisal process.

In this way, the measurement system balances itself. The test organization measures the developers’ productivity via developer effort / test case needed to verify delivered functionality. The development organization watches over the test team because they can cancel invalid defects. Each team acts as the independent verifier of the other, thus escaping the issue of putting both the work and the reporting of how much work got done in the hands of a single team.
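Put as arithmetic, the two measures might look like the Python sketch below. The function names and the sample release numbers are hypothetical; the point is only that each input is counted by the other team.

```python
def developer_effort_per_test_case(effort_hours, test_cases_needed):
    """Developer measure: effort spent per test case needed to verify delivery.

    The test case count is supplied by the test team, not by the developers.
    """
    if test_cases_needed <= 0:
        raise ValueError("need at least one test case to verify delivery")
    return effort_hours / test_cases_needed

def valid_defects(defects_raised, defects_cancelled_by_dev):
    """Tester measure: defects raised minus those development cancelled as invalid."""
    if defects_cancelled_by_dev > defects_raised:
        raise ValueError("cannot cancel more defects than were raised")
    return defects_raised - defects_cancelled_by_dev

# Hypothetical release: 320 developer hours, 80 test cases needed,
# 25 defects raised by test, 4 of them cancelled by development.
print(developer_effort_per_test_case(320, 80))
print(valid_defects(25, 4))
```

Neither team supplies the number it is judged on: testers count the cases, developers cancel the invalid defects.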


Not one of the critical few

July 17, 2009

Ingrained in the Six Sigma school of thought is the critical few – the 80/20 rule. It is an important rule. In practice, there are a handful of things which often allow you to make big leaps from an incapable process to a capable one. There are more subtle characteristics of the process which can be refined to continually improve the performance, but this isn’t step change, it is refinement. And then there’s a class of things that just don’t matter.

So as I sat today through a long, long meeting trying to define a process, I spent a lot of time thinking about the things that don’t matter. That may have been because that’s all anyone spent their time talking about. And as facilitators, we were enablers of this dragging on. Having been instructed to drive to a single standard process and toolset, we discussed every little one-off thing that people wanted to allow for in the process to see if we could squeeze them out. A day’s worth of 25 people’s time to design a process spent talking about the equivalent of the carpet color.

We wanted perfect compliance to the standard, and that meant a standard which was not necessarily all-inclusive (because some of these one-off requests were truly ridiculous by any standard). This is where I believe we got off track with process work. Process design is about controlling the critical few things which will make the difference in process performance.

But that is not what we were discussing. We were discussing nuances, oddball cases, odd uses of the process, and data elements that some teams wanted and others didn’t. We talked about the 1% and largely ignored the 99%. We talked about things that weren’t going to make the difference, whether they were one way or another.

To begin with, we didn’t know what was going to make the difference. We hadn’t studied the existing processes to understand what made them work – what really mattered and what didn’t. This created unnecessary room for debate because we were unable to bring adequate materials to the table to help the team work through their differences. We had little to no information on what mattered and what didn’t.

Instead of define-measure-analyze-improve-control we just went right into improve. And there we got bogged down discussing every little quirk, because we didn’t know what else we ought to be talking about. Or more importantly, what we shouldn’t be talking about.

Instead of a conversation that was “do we really need that? How many of our teams use that process step?” we could have instead said “sure, it doesn’t matter to me if you allow for that.” And we’d be saying that not because we didn’t care but because we actually knew what did matter. Everything else, the little things that we debated with the teams could have instead been bargaining chips that we could dole out in heaps and have given up basically nothing that really mattered. We could have had a strong position, not because we won all the arguments but because we knew which battles were worth fighting and which were worth conceding.

Had we known what things were not one of the critical few things, we could have appeared very agreeable and allowed the teams as much “leeway” in the process as they claimed they needed. All along we’d be giving up nothing. Nothing that really mattered anyway.

It’s a reminder of why a thorough measurement and analysis of a process is important. It isn’t just discovering what the current state is (measurement), but also understanding why it works (analysis). And from there, narrowing down the bits of process that really do matter, and just letting the rest go. Some things just don’t matter.


10 Signs that your X is good/bad/etc.

June 20, 2009

Note:  My apologies.  I wrote this post from a Mac, and it seems to have lost lots of content in the middle plus screwed up the title.  I have no idea why, but I guess that’s the last time I’ll do that.  I’ve done my best to fix it.

My father in law is a psychologist.  I find it fascinating, since he has so many great stories about treating couples.  I love a good medical story.  So it was a strange coincidence that we (my wife, daughter and I) were visiting them for the weekend AND at the same time I was helping my brother develop a new website.

As I was showing off the website design to my father in law, he said “I’ve got this great site which I think is a great example of good design.”  And with that he sent me off to relational-coaching.com where I stumbled upon this article. 10 signs of a great relationship, huh?

I’m not qualified to decide whether she’s right or wrong, or whether 10 things is the right number of things, but it did get me thinking.  All the time we run into top N lists.  They appear to come in certain flavors:

  1. Top Ten lists
  2. Top Five lists
  3. Top Seven lists (thanks very much Stephen Covey).  Though I think the association with lucky 7 might have something to do with it
  4. Top 50
  5. Top 100
  6. Plus 1 lists (i.e. 101 top things instead of 100)

Sometimes a list of 3, but it’s odd to see a list of 27, for example.  Whatever has caused us to gravitate to lists of N items, and why 3, 5, 7, 10, 50 and 100 have become those lengths, I don’t understand.

I know, I know, so what?  Well, in our efforts to make nice “round” number lists, are we missing something important?

I see it all the time at work.  For example, the other day I was given a list of the Top 5 projects.  But why not the top 6 or top 7?  Someone chose 5, but was it a logical break point?  No, as a matter of fact, it wasn’t.  There can be 300 or more projects active at any moment, but a scant few of them are really large (say greater than $1M in spend).  And $1M in spend would be arbitrary too, but there is a break point.  Generally, we see little projects, those under $.5M, almost nothing in the $.5M to $1M space, and then projects >$1M.  So the >$1M break point makes some sense.  Anyway, there were, oh I don’t remember, 32 projects >$1M.

But still we do top 5.  The convenience of having settled on these numbers (3, 5, 10…) should not outweigh a look at your data to see if there’s a logical place to say “this group looks different from that group.”  If you want to talk about “the big stuff” then figure out what it really means to be “big” and set the criteria that way.

Had the situation been different and we had a continuous range of project costs from 0 through >$1M, you’d need to do some hard thinking about where to segregate the populations, if you should even segregate at all.
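One crude way to look for a break point is to sort the values and find the widest gap between neighbors.  Here is a sketch in Python; the spend figures are made up, the function name is mine, and the largest-gap rule is just one simple heuristic I’m assuming for illustration, not the method we used at work.

```python
def largest_gap_split(values):
    """Split values at the widest gap between neighbors in sorted order.

    Returns (below, above, gap). This always finds *some* gap, so compare
    it to the other gaps before trusting it as a real break point.
    """
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values to find a gap")
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    gap, i = max(gaps)
    return ordered[: i + 1], ordered[i + 1:], gap

# Hypothetical project spend in $M, with almost nothing between 0.5 and 1.
spend = [0.1, 0.2, 0.2, 0.3, 0.4, 0.45, 1.2, 1.3, 1.5, 1.8, 2.0]
below, above, gap = largest_gap_split(spend)
print(len(above), "projects above the break; gap of", round(gap, 2))
```

With data like this the split lands between $0.45M and $1.2M, so “the big stuff” is however many projects sit above it, whether that is 5 or 32.  If the spend were continuous, the widest gap would be no bigger than its neighbors, and that would be your hint that there is no natural group to report.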

Don’t pick the top N because it feels familiar.  Pick the top N only if it makes sense.  And don’t assume that N has to be one of 3, 5, 10 and so on.  If it’s 27 items, so be it.


Leaner today

June 17, 2009

My wife and I are attending a wedding in the next few weeks. That’s a generally unremarkable occurrence. We’re about the age where our friends are getting married and starting to think about kids. In fact, I generally dread the events since we’ve been to so many and each one costing us a gift, a hotel room for a night or two, a sitter for our daughter, yet another new dress for my wife, and the list goes on…

Going to weddings is expensive. At any rate, we had not actually been to a wedding in quite a while. I have recently lost some weight, so I decided to try on my suit and make sure it still fit. Alas, it did not, so adding to this wedding’s tab will be a tailoring of my suit.

At my lunch break I popped down to the local tailor (who is just the stereotype I imagined sitting in his little shop with his thick eastern European accent). He has me put on my suit and stand up on the little, what do you call it, soapbox, I guess, to have a look see. I mention in passing that I’ve lost some weight recently.

He’s kind of tugging here and there, getting a sense for it, mumbles something under his breath about it being a nice suit (which I appreciate). Eventually he looks up at me and says “you’ve lost 30 or 40 pounds, no?” I smile. Hey, just because I’m a guy doesn’t mean I can’t appreciate this compliment.

“No,” I reply, “I don’t think that much.”

“Ah, it must have been a little too large to begin with.”

“Hmph,” I think. Some nerve. First he compliments my weight loss (though I admit I had not lost 30 or 40 pounds – or at least I don’t think so) and then suddenly it’s a suit that was always too large…

Too large to begin with, eh?  That reminds me of something. Measurements! Recently at work, someone asked me how we compared to our competitors in regards to development efficiency. Setting aside the fact that nobody can agree on exactly how we should measure efficiency, I reply “what does it matter?”

“Well don’t you want to know how we’re doing?”

“Is our customer happy with our performance?” I ask.

“No.”

“Then it doesn’t matter how our competitors are doing.  We are not doing well enough.”

It’s like my suit. Sure, I’m too small for my suit to begin with. Compared to my suit, I am leaner than the suit would hold, but am I happy with my weight? I guess I’m ok with it, but I could stand to lose a few more pounds.

By comparison, we may be better than our competitors when it comes to development efficiency. The suit sized for our competitors is too big for us, so to speak. Alas, it doesn’t matter, because the customer (or in the case of my suit, my wife) doesn’t really care that you are good in comparison. Sure, I guess maybe my wife is glad that I’m not the thousand pound or five hundred pound or even two hundred pound man coming down the street, but I still could be a bit leaner than I am. And so can your company. It’s not success just because you beat everyone else. If it isn’t good enough for your customer, it isn’t good enough.


Excited about nothing

May 22, 2009

Lately I’ve been doing a lot of work with the Quality Assurance department to try and lean out their operations. If I haven’t said so in prior posts, I thoroughly believe that the entire QA department (at every company, not just the one I’m at) should be figuring out how to get rid of itself. No matter how you slice it, QA’s job is to find defects and that means rework. Just don’t create the defects in the first place! Ok, I know that’s not realistic, but it at least means that you have to do whatever is in your power to minimize the necessity of slow and relatively ineffective testing.

At any rate, I’ve been working on measuring the organizational efficiency and I was comparing their test case execution patterns to that of a known sample from a whitepaper I had gotten from IBM. IBM had recognized that there is an S-shaped curve to the cumulative execution of cases. That is, you start off slow, ramp up, and then as you reach the end, the last few cases take a longer time to get done. I don’t know why this is particularly, but I wondered if the same applied here.

And that reminded me of a story about a college professor. Professor Reid was a geology professor at my college, and the way my college curriculum worked, even if you weren’t majoring in the natural sciences you still had to either take a certain number of courses or do a project in the natural sciences. I opted to do a project, though I had no idea what that project was going to be. Fortunately, someone lined me up with Professor Reid.

Professor Reid told me that he had taken a bunch of high school students (on some sort of outreach program) to Shapiro Brook, a generally unremarkable brook which ran down the side of a nearby mountain. At the top of the mountain where the brook sprang from the ground was a quarry.

Now, I’m probably going to get this wrong, so if you are a science buff, I apologize. If you are a science student looking for information on conductivity or pH, this is NOT the place you want to look. You’ve been warned.

Anyway, apparently, the behavior of “normal” brooks is that when the water springs from the ground it has relatively high pH and low conductivity. This is due to there being lots of free H+ ions in the water. As the brook travels over the surface, the free H+ ions are bound by Potassium (K) and Sodium (Na). As a result, this causes the water to become more neutral in pH (pH drops) and more conductive (conductivity rises). As I said, that’s the “normal” behavior.

What Dr. Reid and his students found was the exact opposite. For some reason, pH rose and conductivity dropped. He found this fascinating and wanted me to repeat the experiment, bring back results and finally even put some of that stuff through a Plasma Mass Spectrometer. The Plasma Mass Spectrometer is the kind of equipment that GRAD students wait in line to use, so I was super excited to have the opportunity. Dr. Reid thought, by the way, that the active quarry at the mountaintop was somehow impacting the pH and conductivity of the brook, though he wasn’t sure what the mechanism was exactly.

Anyway, early that fall, I walked up the mountain with a conductivity meter and about 40 little plastic vials which I had properly cleaned with DI Water… blah, blah, blah I won’t bore you with the details of my proper experiment preparations. Every 50 yards or so I took a vial of water and a conductivity reading. When I got back to the bottom of the mountain, I pulled out my map that I had been given. I don’t know why I did this AFTER, but I did. And that’s when I realized I had walked the WRONG brook. Now, I was a college student who was just trying to complete a coursework requirement. I could’ve just used the data I had, forgetting whether the results were honest or not. But, no, I felt guilty doing such a thing, though it crossed my mind, so I went back to the lab, cleaned 40 more vials and trudged back up the mountain this time with my map out in the first place.

Again, I went down the mountain collecting samples every 50 yards or so. Once winter fell, I returned to the same brook to repeat the experiment. We did this to make sure that little feeder streams weren’t influencing the main brook. Of course, this time instead of walking down some of the mountainside, I fell and tore up my hand and wrist pretty good. Determined to not have to go back and make yet another trip, I ripped off some of my shirt, wrapped my hand and wrist (that was probably melodramatic of me), and proceeded to complete my measurements.

When I got back to the lab, I carefully tested the pH of every vial and recorded the data. Then, I brought all my results and readings back to Dr. Reid. I couldn’t really make heads or tails of it, but he could. He literally started bouncing up and down in his chair with excitement. Not in some sort of ridiculous way, but just a little more spring as he talked to me, and his eyes lit up, and a big smile came to his face.

“NOTHING! Shapiro Brook behaves just as it should!” he exclaimed.

I was heartbroken. How was I supposed to write a college paper on nothing? Dr. Reid was undeterred. He proceeded to tell me how great this is, to disprove that there was anything special about Shapiro Brook at all. To in fact find that the world worked exactly as we would expect it to work was, to him, joyful. “You could be a science guy,” he said to me, “have you ever considered switching concentrations?”

And that stuck with me through all these years. When Dr. Reid passed away in the early 2000s, it was this story that first came to mind, and the story that came to mind when I pulled together my data for Quality Assurance.

Sure enough, our QA teams experience the same patterns of progress that IBM had observed. The S-shaped curve wasn’t just some IBM myth. I’m not a QA person, just as I wasn’t a “science guy” back in college, so maybe all QA people know this, but I didn’t. There was excitement discovering that we were just like everyone else, so I sent an email titled “so cool!!!” with the details of my findings to a good friend who I knew would appreciate it. There is satisfaction in finding out that we are not special or different, that despite what people believe, things that the outside world experiences can be applied to us. It gives us hope that what we learn elsewhere is transferrable knowledge.