Dear lord, it’s a statistics entry! Run for your lives! No, no, I promise this will be as painless as possible for you all. I want to talk about log-transformed charts. Why? Well, we often log transform a chart axis in order to expose data. When data has an extreme range, subtle details in the lower range of the data get lost. So, we can make the data easier to see by applying a log transformation to one or both of the axes on a graph. Take, for example, the graphic below. There are actually 9 data points, but 8 of them are bunched down in the bottom left because of the extreme data point in the upper right.
Now, as we log-transform the data, we begin to see what we’re really looking at. First, on one axis, as below. Note, you can now see the data points but they seem to be all against the left edge and they look like they curve. When you log transform a single axis, linear data will take on a curved shape.
When you then log transform it on both axes, the curved shape is removed and we can see the data as it really is, linear, but much better exposed. Indeed, the data I created was based on a simple linear formula: Y = X * 100.
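If you'd rather check this with numbers than with pictures, here is a minimal sketch. The X values are made up (the post doesn't list its 9 points), and I'm assuming the single transformed axis is Y. It measures the slope between neighbouring points: a straight line has the same slope everywhere, a curve does not.

```python
import math

# Hypothetical X values with one extreme point; not the post's actual data.
xs = [1, 2, 5, 10, 20, 50, 100, 1000, 10000]
ys = [x * 100 for x in xs]  # the post's formula: Y = X * 100

def slopes(us, vs):
    """Slope of each segment between consecutive (u, v) points."""
    pts = list(zip(us, vs))
    return [(b[1] - a[1]) / (b[0] - a[0]) for a, b in zip(pts, pts[1:])]

log_x = [math.log10(x) for x in xs]
log_y = [math.log10(y) for y in ys]

semi_log = slopes(xs, log_y)     # only Y transformed (assumption): slopes keep shrinking
log_log = slopes(log_x, log_y)   # both transformed: every slope is 1, up to rounding

print("log Y only:", [round(s, 4) for s in semi_log])
print("log-log:   ", [round(s, 4) for s in log_log])
```

The shrinking slopes in the first list are the curve you see on the single-axis chart; the constant slope in the second list is the straight line you get back on log-log axes.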
So, rule of thumb number 1 is, log transform both axes if you are going to do it at all. This of course becomes a problem when there are 0 values in your data, since you can’t log transform 0, but I’m not getting into that.
The second thing to be aware of is that data with uniform variability at all levels of X will not appear that way once it is log transformed. Instead, as the values get larger, the data points will appear to converge. Here is a simple example. Below we have two charts of data. On the first chart, the two data sets are:
Y = X * 10 and Y = X * 200. If you were to draw these on untransformed axes, the lines formed by the data would clearly diverge. Where X = 1, Y would be 10 and 200 respectively (a gap of 190). Where X = 100, Y would be 1,000 and 20,000 (a gap of 19,000). However, once you log transform the data, the lines appear parallel, as in the first chart below. Visually, you might say that if these lines represented the boundaries of some real scatter plot, the data appears to be homoscedastic (having uniform variability), but it isn’t! It’s a visual trick. Don’t get fooled by it!
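The arithmetic behind the trick is short enough to sketch. This uses the two formulas above; the choice of X values (1, 10, 100) and base-10 logs is mine. On a log axis, the distance between two points is the difference of their logs, which is the log of their ratio, and the ratio here is always 200 / 10 = 20.

```python
import math

xs = [1, 10, 100]  # sample X values chosen for illustration

raw_gaps = []
log_gaps = []
for x in xs:
    low, high = x * 10, x * 200
    raw_gaps.append(high - low)                           # gap in original units
    log_gaps.append(math.log10(high) - math.log10(low))   # gap as drawn on a log axis

print(raw_gaps)                          # 190, 1900, 19000: the lines diverge
print([round(g, 3) for g in log_gaps])   # always log10(20), about 1.301: they look parallel
```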
By comparison, the second chart shows two lines which would be parallel on a chart with untransformed axes. One formula is Y = X + 10 while the other is Y = X + 500. At any level of X, the gap between the two data points is the same: 490 units. When you log transform this data on both axes, you can see how the lines appear to converge. You might think that data exhibiting this shape is heteroscedastic, but it isn’t! Again, you’d be fooled by the visual appearance of log-transformed data.
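The same kind of check works for the additive pair. Again, this is only a sketch with X values I picked myself: the gap in original units never moves off 490, but the gap as drawn on a log axis keeps shrinking, because 490 units matters less and less relative to the size of Y.

```python
import math

xs = [1, 10, 100, 1000, 10000]  # sample X values chosen for illustration

raw_gaps = []
log_gaps = []
for x in xs:
    low, high = x + 10, x + 500
    raw_gaps.append(high - low)                           # constant in original units
    log_gaps.append(math.log10(high) - math.log10(low))   # shrinks as X grows

print(raw_gaps)
print([round(g, 3) for g in log_gaps])
```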
Why does this matter? Once you log transform data, it is easy to forget what you are seeing. You may actually have a spray pattern and think it is a simple linear correlation. This is often the case in software data, where larger projects may be generally more expensive, more defect-laden, or more time-consuming than smaller projects, but the variability of the result is much, much higher. We get fooled into believing that we have a uniform linear relationship between whatever independent variable we have and the outcome we are looking for when we do not. Don’t let making charts readable make you forget that log transformations, while exposing the data, also change its shape in important ways.
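To see how a spray pattern can hide, here is a rough simulation. Everything in it is invented for illustration: the trend (100 * size), the noise level, and the two project sizes are assumptions, not real software data. The noise is multiplicative, so the spread in original units grows with project size, yet the spread on the log scale comes out about the same at both sizes.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable
SIGMA = 0.5             # hypothetical noise level (on the natural-log scale)
N = 2000                # simulated projects per size

def simulate(size):
    """Synthetic outcomes: a linear trend (100 * size) times random noise."""
    return [100 * size * math.exp(rng.gauss(0, SIGMA)) for _ in range(N)]

small = simulate(1)     # hypothetical small projects
large = simulate(1000)  # hypothetical large projects

raw_sd_small = statistics.stdev(small)
raw_sd_large = statistics.stdev(large)
log_sd_small = statistics.stdev([math.log10(y) for y in small])
log_sd_large = statistics.stdev([math.log10(y) for y in large])

print("spread in original units:", raw_sd_small, raw_sd_large)
print("spread on a log axis:    ", log_sd_small, log_sd_large)
```

On a log-log chart this simulated data would look like a tidy band of even width, even though the large projects vary far more than the small ones in real terms.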




Posted on March 29, 2010