Measures for a non-existent process?

November 20, 2008

“Hey look, I’ve collected all the dimensions for a car!” my friend said to me.

“That’s great, too bad you’re trying to build a bicycle” I replied.

OK, not really, but simply replace car with “old process” and bicycle with “new process” and you do have the exchange we had.  My friend was working with a team to create a new process to plan resource usage.  If you’re at a large company, this is a dreadful undertaking no matter what, so figuring out how to make it run more smoothly would be great.

Consider that you essentially have to line up two things.  First, you have a population of resources (employees) who have varying skill sets.  Depending on the level of detail you want to get to, you have managers, project leads, developers, architects, quality assurance leads, quality assurance engineers, release managers… and the list goes on and on.

Then, on the other side you have projects which have demand for those resources.  Ah, were it only so simple to just start plopping available resources into the demand slots.  Unfortunately, you’re somewhat constrained by other factors like the fact that two developers aren’t necessarily the same.  You might need a C programmer and a Java programmer.  If all you’ve got is a COBOL programmer, you’re in trouble.  And there are more subtle variations: you need a C programmer, and you have a C programmer, but not one as experienced as you want.

And it gets worse.  Demand isn’t a set in stone thing.  Instead, demand fluctuates.  You think the project is approved, but then it gets cancelled or it never gets off the ground.  How do you account for the fact that demand is more like a definite maybe?  How do you plan an entire year on a maybe?

It’s all very frustrating.  I’m sure there are people who are good at it for hundreds of people, but I frankly have no interest in figuring out how to get several hundred people sitting in the right chairs.  But my friend did; after all, it was his project.

Anyway, his boss asked him to develop measures for the bicycle, er, new process, I mean.  We, reasonably, needed to know if the process was working and what leading indicators would indicate corrective action was needed.  And where did they start?  Well, they brainstormed off how the old process worked.  Does something seem amiss to you?

You said the old thing was a car, you are building something that looks sort of similar – a bicycle.  Sure, they both have wheels, at least one seat and some way to steer it, but it’s a hard comparison to take much further.  So, would you try and measure the spark plug gap on a bicycle?  Of course not, bicycles don’t have spark plugs!

And the same is true of your old and new process.  You can’t take measurements from the old process and just use them on the new one.  Sure, some things are the same, but the right place to look for ideas on what to measure about the new process would be, well, the new process map you created.

Or, in my friend’s case, since they hadn’t actually defined the new process map yet, it was probably a bit premature to be talking about measures.  Defining your future on the basis of a past you just threw away doesn’t make a lot of sense.


The Pareto Chart of Blame

November 17, 2008

I get requests for data from people a lot.  I suppose it’s because I have access to most of the process data that we do gather.  One of the things people always ask for when it comes to data is who did this or that.  I’m constantly getting people the name of the person who opened the ticket or started the process or closed the ticket or whatever.

And I started to wonder what people were doing with this data.  Was the name Mike highly correlated to defects?  Was the number of syllables in their surname directly related to how long it took to fix a defect?  Alas, no, they were creating my newly dubbed “Pareto of Blame” (insert dramatic chord here).

[Figure: people-pareto, a Pareto chart of ticket defects by person]

What in heck would you do with this Pareto chart?!?!  No wait, let me guess.  Since Mike, Bob and Tim are collectively responsible for 80% of the defects in tickets, we should do which of the following?  A. Fire them, B. yell at them about their ticket quality, C. send them through painful retraining, or D. None of the Above.  The correct answer is D.

This is a stupid graphic.  If you’ve found yourself EVER making this graphic, do us all a favor and apply option A above to yourself.  Here’s why this chart should never be built.  Even if you “fix” Mike – and oh you’ll fix him right good, won’t you? – there will simply be someone back to take his place. 

Charts like this ignore problems and replace them with blame.  Mike, Bob and Tim might be your most productive people.  They submit so many more tickets than everyone else that there is simply a much bigger opportunity for them to create a defect.  And chances are, all three of them make very similar mistakes.  If there’s a vague place in the ticket, you can bet they’re all getting it wrong.  In fact, it may be the same reasons that Joe, Fred, Sam, Ed, John and Rich all create defective tickets as well.

If you focus your efforts just on Mike, or Bob or Tim then Joe, Fred, Sam and so on never benefit from any of the improvements you make.  And what improvements exactly will you be making to Mike!?  Will you be replacing the faulty section of his brain with a computer chip?  No, unfortunately, other than verbal abuse and painful retraining, there isn’t a whole lot you can do to “fix” Mike.  Verbal abuse isn’t effective, although potentially fun.  Most likely you’ll take a major productivity hit from Mike and chances are you’ll get malicious compliance from here on out.

You could retrain him, but training is largely ineffective.  Mike forgets.  He’s old and has been around a long time.  You’ll have to forgive him that.  Even if your “Mike” is young and sprightly, you can bet that being trained on the intricacies of entering a ticket is not high on the list of his ambitions.  He’s probably not going to pay that much attention.  And what do you have to look forward to?  If Mike leaves the company, some other person is simply going to take his place as the most defective.

As you can see, anything you’d do to remedy Mike isn’t worth doing.  Don’t make this chart!  Stop asking for this data!  It is not the fault of the people that they cannot follow your process properly.  The blame falls on the process and as such the way you look at the problems you have with the process needn’t involve anyone’s name at all.
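If you still want a Pareto chart, here is a rough sketch of one that leaves the names out entirely and counts what went wrong instead.  The defect categories and counts below are invented for illustration; they are not from our real ticket data, and the `pareto` helper is just a hypothetical name.

```python
from collections import Counter


def pareto(causes):
    """Return (cause, count, cumulative_percent) rows, largest cause first."""
    counts = Counter(causes)
    total = sum(counts.values())
    rows = []
    running = 0
    for cause, count in counts.most_common():
        running += count
        rows.append((cause, count, round(100 * running / total, 1)))
    return rows


# Hypothetical defect records: only the part of the ticket that went wrong
# is recorded, never who filled it in.
defects = (
    ["vague description"] * 5
    + ["wrong category"] * 3
    + ["missing server name"] * 2
)

for cause, count, cumulative in pareto(defects):
    print(f"{cause:<22} {count:>3} {cumulative:>6}%")
```

The tallest bar now points at a part of the ticket you can actually change, which is something you can't say about Mike.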


Workflow management isn’t about online forms

November 16, 2008

Putting a process on line isn’t just putting the forms on a website.  This is how most people treat workflow management software, though, so it’s no surprise that they hate the software.  We have such a system at work, and I can see why people hate it.

For example, let’s say that I want to get support for a production server.  I have to submit a support ticket, but not just any support ticket will do.  There are about 30, yes 30, types of support tickets I can choose from!  Why are there so many?  Because each group who supports some system wanted their own form.  And so the very first thing I have to do in the process of getting support is to make a decision – which ticket is the right ticket to open?  Get it wrong and I have to start the process all over again!  In reality, first I have to fill in a long form, effectively get the whole thing wrong and then do it again once my form is rejected and they tell me which form was actually the correct form.  Does it start to make you feel like you’re in a ridiculous government bureaucracy?  It does to me!

First off, I think it’s dumb that I have to make a decision about who gets the ticket in the first place.  Whether I had a single form or 30 is irrelevant.  Why do I have to tell people who gets the form?  I have no idea about the way support has their departments set up, and frankly I don’t care!  Really and truly, I don’t.  Secondly, if you are going to make me choose who gets the ticket, why do I get punished so heavily when I choose incorrectly?  I have to completely re-enter the same data on a different form.

It’s a great example of how people mistreat workflow management.  The process, for me, includes deciding which group gets the ticket.  Why is that critical step, the very first step I can so easily screw up, not automated!?!?  Why is it that the magical decision tree about which form is right is a) nowhere to be found and b) not done for me?

The second example I have seen is very similar.  Every time someone closes a defect tracking ticket they must assign it a root cause.  The root cause data helps us figure out where we want to improve the process.  In fact, the first improvement would be to improve the root cause collection system.  As it is, people are given a list of choices such as “requirements issue”, “design issue”, “coding issue”, etc.  We don’t provide the definitions of what each one means, although it might seem obvious.  That is, until you get to the next step.  In the next step, once you’ve chosen a root cause, you must choose a slightly more detailed cause.  For coding, for example, there’s “internal error” or “vendor error.”  It all seems well and good until you get to this ambiguous one “not coded according to requirements.”

I have always read this to mean “the requirement was clear but the developer didn’t code it that way anyway.”  And yet, the other day someone said “no, it means that the requirement was ambiguous so that the developer didn’t know what to code.”  Really!?!?  I didn’t get that from the description.  Again, here’s a great example of leaving the process offline while having the form on line.

In order to arrive at the root cause, there is a series of questions you have to ask yourself.  It forms a decision tree.  Assuming you follow the decision tree, you get the right root cause.  If you don’t, and make a guess, you get what we have above which is a misunderstanding over what it means to have something “not coded to requirements.”  Instead of asking someone what the root cause is, ask them the questions from the decision tree.  Because the questions are simple yes/no questions, by the time they get to the end of the decision tree, it’s already been decided what the root cause is.

Was there a requirement for this defect?  If no, it’s a missing requirement.  If yes, continue.  Was the requirement clear?  If no, it’s a vague or incomplete requirement.  If yes, continue.  Did the design take the requirement into consideration?  If no, it was a design flaw.  If yes, continue.  Did the developer code the requirement as written?  If no, it is a requirement not coded as written.  And so on… you see how it goes.  But don’t just give me a drop down for the root cause.  Ask the questions.
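For what it’s worth, the questions above translate almost directly into code.  This is only a minimal sketch: the `QUESTIONS` table and the `classify` function are names I made up, and the tree stops where the list above stops, since the remaining questions aren’t spelled out here.

```python
# Each entry pairs a yes/no question with the root cause to record
# when the answer is "no". The order matters: it is the decision tree.
QUESTIONS = [
    ("Was there a requirement for this defect?",
     "missing requirement"),
    ("Was the requirement clear?",
     "vague or incomplete requirement"),
    ("Did the design take the requirement into consideration?",
     "design flaw"),
    ("Did the developer code the requirement as written?",
     "requirement not coded as written"),
]


def classify(answer):
    """Walk the questions in order and return the first matching root cause.

    `answer` is any callable that takes the question text and returns
    True for yes and False for no (a web form, a prompt, a test stub).
    Returns None when every answer is yes, meaning the tree has to
    continue with questions that are not listed in this sketch.
    """
    for question, cause_if_no in QUESTIONS:
        if not answer(question):
            return cause_if_no
    return None


def ask_on_console(question):
    """One possible `answer` callable: prompt the person closing the ticket."""
    return input(question + " [y/n] ").strip().lower().startswith("y")
```

Calling `classify(ask_on_console)` walks the person through the questions one at a time, so nobody ever has to interpret a label like “not coded according to requirements” on their own.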

The point is simple.  If you only have half the process on line, for example, just the forms, then the rest of the process is occurring without any errorproofing.  What’s the point of using workflow management for half the job?


No raindrop believes it is to blame for the flood

November 10, 2008

If you had a thousand defects in your product, how would you make them go away?  You’d fix them, of course, wouldn’t you?  Wouldn’t you!?!?!

And yet, here we are, staring that exact situation in the face.  A thousand open defects and growing.  Are they all show stoppers?  No, but they’re all valid defects.  Alas, the team was told to get the open inventory down.  And how could you do that? 

Well, you could fix the bugs.  This would be a reasonable conclusion if the queue wasn’t growing out of control.  If the bugs are coming in faster than they’re going out, you’re just treading water by fixing bugs.  The right thing to do would be to fix the upstream process.  If you stopped making so many bugs, the queue would stop growing and then you could fix the problems and the queue would go away.  That’s the RIGHT answer.

I didn’t realize there was a more wrong answer than trying to tackle the queue without solving for the root cause.  But there is.  Today I was told that rather than fixing the root cause of bugs and rather than fixing the bugs in the queue that we’d simply close things in the queue.  Unfixed!  Unresolved! 

You heard me right, just close them!  The reasoning is, some bugs just aren’t worth fixing.  I could grant you that this appears true on the surface.  If the bug’s potential financial impact is $100 each time it occurs and it costs $100,000 to fix it, then the ROI doesn’t seem to play out.  You’d have to have that bug pop up 1000 times.  So you skip that bug.  Seems fair.  Well what about the next bug that costs you $100, or the next one, or the next one?  Do you just skip them all?

We’re on a real roll now.  Suddenly bugs with a 5 or 10 year ROI aren’t getting done.  Maybe bugs that have a 3 year ROI are being skipped.  It really isn’t about that one bug you skipped.  It’s about the backwards philosophy that the issue can be looked at one bug at a time.  If this one isn’t so bad, then the next one just like it isn’t so bad.
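The arithmetic behind that reasoning is simple enough to sketch.  The $100 impact and $100,000 fix cost come from the example above; the occurrence rate used for the payback period is a number I invented purely for illustration, as are the function names.

```python
def break_even_occurrences(fix_cost, impact_per_occurrence):
    """How many times the bug must occur before fixing it pays for itself."""
    return fix_cost / impact_per_occurrence


def payback_years(fix_cost, impact_per_occurrence, occurrences_per_year):
    """Years until the avoided impact equals the cost of the fix."""
    return fix_cost / (impact_per_occurrence * occurrences_per_year)


# Figures from the example: $100 per occurrence, $100,000 to fix.
print(break_even_occurrences(100_000, 100))

# Assumed rate of 200 occurrences a year (hypothetical).
print(payback_years(100_000, 100, 200))
```

Notice that both functions look at exactly one bug.  Nothing in them knows how many other bugs were skipped with the same reasoning, which is the whole problem.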

Consider my phone company.  I use Vonage.  They’re not the greatest, I’ll admit, and I put up with a lot of poor phone quality.  The thing is, every once in a while, even with Verizon, you get a crappy connection.  You can’t hear them or they can’t hear you.  Whatever it is, you hang up frustrated and call back and it’s better.  But you’d be annoyed if at first it was a poor call, and then the phone wasn’t ringing and then the call got dropped and then your caller id worked intermittently, and so on.  Maybe one thing by itself is just annoying, but put them all together…

Well, what’s so different about bugs?  Sometimes you get bug A, which is annoying, but it is transient or you work around it, or whatever.  And then there’s bug B hiding right behind it.  And bug C and D and so on and so on.

Sure, by itself I can live with one minor but annoying bug.  Two, and I’m getting frustrated.  Three or more…  You see how it goes.  Dealing with a bug in isolation of all the others out there misses the point.  It’s not that I can’t live with this bug or that bug, but I can’t live with all the bugs.  So closing them surely gets them out of the queue, but it doesn’t fix them or make them go away!  They’re still there.  Waiting to come back.  Waiting to be reported again.  Waiting in the wings to annoy someone and bring down the perception of your product’s quality.

Now, maybe you might conclude that this one bug is not worth fixing, but when you’re making that decision, consider this:  no raindrop believes it is to blame for the flood.  And yet, people drown in floods all the time.


Please refrain from poking the jellyfish

November 5, 2008

I feel like I’ve written this entry before, but I can’t find it when I look through my posts.  I guess that says two things – one, I don’t tag my posts adequately enough and two, I’ve written enough posts that it’s now hard to remember what I have and haven’t written about.  :)

About a month ago my boss asked me if I could do a review of all the root cause analysis we had done to date to determine if it was adequate.  It’s sort of an odd question in that I wasn’t sure if he wanted to know if we’d done ‘root cause analysis’, which to me means the measure and analyze phases of DMAIC or if he wanted to know if we’d done enough to address the root cause of issues.  The first is “did we do our homework” and the second is “if we did our homework, did we actually do anything with the information.”

My first, and I admit failed, attempt at this problem was to look at what work was going on in the entire development space.  I asked around, pulled from what I knew and essentially built up a map of our development process and overlaid it with all the initiatives to identify and fix the reasons that part of the process map was having problems.  For requirements, for example, I found a project which redesigned that phase.  There were similar projects for design, coding and testing.  I found evidence of ongoing management of bugs.  We have a process in place to locate where the most bugs are coming from.

In short, I concluded that we were indeed doing enough, but it was hard to tell if what we were doing was going to be effective or not.  After all, most of these initiatives didn’t follow any sort of structured thought process.  They simply popped forth from someone’s head like Athena from Zeus (I think I recall that myth correctly).  And most of them were currently in-flight.

My boss didn’t really like that, since my conclusion was basically “do nothing, we’re doing enough.”  So after some more discussion, he asked if we shouldn’t have some review of a sample of defects every month to see what the root cause is of each of those.  To which I said, “no, we shouldn’t, because you’re fire-fighting rather than solving why you get bugs in the first place.  If you get the process under control and at a level you can live with, then there’s no need to understand the root cause of every defect anymore.”

Essentially, my point was if you’ve fixed the process, and you have tons of initiatives to try and fix it, then why are you still looking for problems?  Again, that didn’t go over well, so I backed off and said I’d do some additional research.  Hey, I know how to play the game, even if I am stubborn sometimes.  I can’t just fold like a house of cards; I do want to go home with my integrity intact.

I pulled defects for the last two years and sliced them every which way to look for special cause variation in the number of defects over time.  Hooray C charts!  I also looked at defects per unit of work (very coarsely) by using defects per person month of effort.  All in all, I concluded that I could find no evidence that anything in the process had changed for the better or worse.  The story was a mixed bag.  The good news was we weren’t getting worse and the bad news was we weren’t getting any better either.  But, there’s something to be said for stability.  At least you know where you stand.

Anyway.  As I was driving home tonight a thought popped into my head about all this.  You can’t take an iterative approach, at least not like my boss was proposing, to fixing the process.  Now don’t get me wrong.  You absolutely, positively CAN make some fixes to the process, measure the process again, look for new data and continually improve it.  However, if you have, say, 40 things in flight to change the process and none have been given time to work, I don’t think it’s appropriate to be doing a root cause analysis of a sample of bugs coming in and making more changes.  And, as an alternative, if you’re not going to act on all this root cause work, what’s the point?  I can measure improvement simply with a control chart of defects per unit of work per month.  If we get special cause variation for the better there, then gosh darn it all, our changes are working.  I don’t need all that extra detail to tell me that.
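To make the control chart idea concrete, here is a rough sketch using a u chart, which is one common choice for defects per unit of work when the amount of work varies from month to month.  I’m not claiming it’s the exact chart I built at work; the function name and the monthly figures are made up, and the only test applied is the basic “point outside the 3-sigma limits” rule.

```python
import math


def u_chart(defects, effort):
    """Compute u chart points for monthly defect counts.

    defects[i] is the number of defects found in month i and effort[i]
    is the person-months of work in that month. The centre line is the
    overall defects per person-month; limits are 3 sigma either side.
    """
    u_bar = sum(defects) / sum(effort)
    points = []
    for d, n in zip(defects, effort):
        sigma = math.sqrt(u_bar / n)
        lcl = max(0.0, u_bar - 3 * sigma)  # a defect rate cannot go below zero
        ucl = u_bar + 3 * sigma
        u = d / n
        points.append({
            "u": u,
            "lcl": lcl,
            "ucl": ucl,
            "special_cause": u < lcl or u > ucl,
        })
    return points


# Hypothetical data: four months, ten person-months of effort each.
monthly_defects = [50, 48, 52, 20]
monthly_effort = [10, 10, 10, 10]

for month, p in enumerate(u_chart(monthly_defects, monthly_effort), start=1):
    flag = "special cause" if p["special_cause"] else "common cause"
    print(f"month {month}: u={p['u']:.2f} "
          f"limits=({p['lcl']:.2f}, {p['ucl']:.2f}) {flag}")
```

In this invented data only the last month falls below the lower limit, which is the kind of signal that would tell you a change is working without reviewing a single individual defect.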

If in-flight fix A hasn’t completed and would solve the problem you just reviewed, then you simply wait for A to go into effect and you’re done.  If instead, you simply make another change to react to what you are seeing now, then you’ll never know if A made the difference or if the thing you just did made the difference.  And while it’s great that the problem got fixed, your only focus cannot be on quality.

I can get you quality.  It’ll cost you limitless amounts of money, but you can have it.  We’ll do all kinds of quality activities, many of which don’t pay dividends.  But it doesn’t make sense!

If you have measured the process and you know it is stable then you can feel confident that a good random sample of the problems you are seeing today WILL be representative of the problems you will see in the future.  THIS IS THE BIG SECRET OF GOOD SAMPLING!  You CAN learn about what will happen in the future by studying the past if the process is stable.  And if you bothered to get a good random sample, then when you make fixes to the process based on your analysis, you will be preventing the problems that actually are likely to occur again.  On the other hand, if you just react to the bugs you have today, and today’s the day that some idiot new developer introduced a whole bunch of NULL pointer exceptions, you’re going to pay for a process change that isn’t likely to happen again.
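Drawing that kind of sample doesn’t take much.  A minimal sketch, assuming you have a list of defect IDs for the whole stable period rather than just today’s arrivals (the IDs and the function name here are hypothetical):

```python
import random


def sample_defects(defect_ids, k, seed=None):
    """Draw a simple random sample of k defects without replacement.

    Passing a seed makes the draw repeatable, so someone else can
    check which defects were selected for review.
    """
    rng = random.Random(seed)
    population = list(defect_ids)
    return rng.sample(population, min(k, len(population)))


# Hypothetical IDs covering the whole period, not just today's bugs.
all_defects = [f"DEF-{n}" for n in range(1, 1001)]
to_review = sample_defects(all_defects, 30, seed=2008)
```

Every defect in the period has the same chance of being picked, so one bad day of NULL pointer exceptions can’t dominate the review.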

Combine that with an in-flux process and as Steve McConnell said in Code Complete “As popular as this practice is, it isn’t effective.  Making changes to code randomly is like poking a jellyfish with a stick to see if it moves.  You’re not learning anything; you’re just goofing around.”  The same applies to process work.  STOP POKING THE JELLYFISH!