Many a time I have had to explain to folks why they should not be converting data with more resolution into data with less resolution. On the spectrum of “resolution” you have attribute data (categories or characteristics you can assign to a given item), count data and continuous data. I’m going to assume, having refreshed your memory on the categories, that I needn’t explain them further.
One of the most common conversions I see is count data turned into attribute data on surveys. People will create a survey question like “how many days on average do you work from home in a week?” and, rather than just leave a blank spot for someone to answer, they create arbitrary buckets like “2 or less”, “3-4”, and “5 or more.” I’ve always wondered, when someone does this, if they realize the impact of their choice. The “2 or less” bucket includes 0, which to me is distinctly different from working from home 1-2 days a week. On the upside, at least the upper boundary of this data is known – you can’t work from home more than 7 days a week.
Which brings me to my particular funny experience with attribute data. One of my Green Belts was working on a project and she wanted to develop a survey to collect some data. The question she wanted to ask was “what was the size of the project you were working on?” The answer she proposed had three buckets – small, medium or large. Her question to a bunch of us was: what’s the current definition of a small, medium or large project? Suddenly I realized that the problem here was that, when you convert continuous data this way, the “large” bucket has no upper bound.
For example, let’s arbitrarily define a small project to be $0 – $250,000, a medium to be $250,001 – $500,000 and a large to be $500,001 – what? Suddenly the buckets don’t represent equally sized groups. Large has no upper boundary, because there is no next larger bucket to create one. (On a little tangent, one thing that drives me crazy is groups like $0 – $250,000 and $250,000 – $500,000. The upper boundary of the first category overlaps the lower boundary of the second category. So, if someone gave me a project that cost exactly $250,000, which bucket should I choose?)
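To make the boundary point concrete, here is a minimal sketch in Python of the three buckets as I defined them above. The function name and constants are my own for illustration, and I’m assuming costs are recorded in whole dollars. Because each cost is checked against one threshold at a time, a project that cost exactly $250,000 can only land in one bucket.

```python
# Thresholds taken from the arbitrary example above (whole dollars).
# The names here are hypothetical, chosen only for illustration.
SMALL_MAX = 250_000
MEDIUM_MAX = 500_000


def project_size(cost_dollars: int) -> str:
    """Assign a project cost to exactly one bucket."""
    if cost_dollars < 0:
        raise ValueError("cost cannot be negative")
    if cost_dollars <= SMALL_MAX:
        return "small"   # $0 - $250,000
    if cost_dollars <= MEDIUM_MAX:
        return "medium"  # $250,001 - $500,000
    return "large"       # $500,001 and up, with no upper bound


print(project_size(250_000))  # small
print(project_size(250_001))  # medium
```

Notice that the last branch has no test at all. That is the missing upper boundary showing up in code: anything above $500,000 falls through to “large.”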
If you are trying to make some determination about the nature of work being done using these small, medium and large characterizations, the large bucket contains everything up to and including the biggest project the company has ever done, but we stop making a distinction between large and very large, jumbo or humungous. And I wonder: if the difference between small, medium and large mattered, shouldn’t the various scales of large matter as well?
At any rate, it’s a good reason not to convert continuous data to attribute data. Unless you know the true boundaries of the continuous data or you’ve discerned from samples of data the logical groups that naturally occur, you probably should just ask the question “how much did the project cost?” instead of giving respondents just a few buckets. Since the small and medium buckets above are each $250,000 wide, I’d assume that large was between $500,001 and $750,000. But a $10,000,000 project (and we do quite a few of those) would end up in the same group with the $750,000 projects.
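Here is a rough sketch in Python of what gets lost. The project costs are made up for illustration (they are not real data), and the bucket function is the same arbitrary definition as above.

```python
from collections import Counter

# Made-up project costs in whole dollars, for illustration only.
costs = [120_000, 300_000, 750_000, 10_000_000]


def to_bucket(cost_dollars: int) -> str:
    """Same arbitrary small/medium/large definition as above."""
    if cost_dollars <= 250_000:
        return "small"
    if cost_dollars <= 500_000:
        return "medium"
    return "large"  # catch-all: no upper bound


# What the survey would have recorded if we only offered buckets.
bucket_counts = Counter(to_bucket(c) for c in costs)

# What the raw answers still tell us about the "large" group.
large_costs = [c for c in costs if to_bucket(c) == "large"]
large_spread = max(large_costs) - min(large_costs)

print(bucket_counts)
print(large_spread)
```

With buckets alone, all you know is that two projects were “large.” With the raw costs, you can see those two projects are $9,250,000 apart, which is many times the width of the small and medium buckets.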
If you’re concerned that the respondent might round off the answer, it’s really no worse than when you provide them arbitrarily sized buckets. You can always tell them in the survey that rounding to the closest $100,000 or whatever is fine or that exact answers are important. You lose a lot of information when the final bucket you have becomes a catch all for everything else.
Posted by ProcessRants