explainlikeimfive

ELI5: Why/When/How do we take the natural log of data sets?

I am currently looking at water quality data over time for a well. We use a cumulative sum (CUSUM) model to determine when/if a significant shift in the average occurs for any of the minerals in the water.

For some of the minerals, it was determined that we needed to take the natural log of the data in order to achieve "normally distributed data".

I like math, and took all 4 years of calculus back in college, but statistics have always vexed me. How do I know when a data set should be log-transformed? Secondly, how do I handle/discuss the data on the other end of that transformation? Because from what I understand, it is now unitless.

https://www.reddit.com/r/explainlikeimfive/comments/1luyyih/eli5_whywhenhow_do_we_take_the_natural_log_of/

TenchuReddit

Let's say you're tracking something that is going down by one half every day. That measurement starts at 10, then goes to 5 the next day, then 2.5, 1.25, 0.625, 0.3125, etc. If you plot a distribution of the data on a regular (linear) scale, you'll find that most of the data points fall in the range between 0 and 1, while all the other ranges are very sparsely populated. Moreover, if you plot a graph of that data over time, you'll end up with a curve that doesn't tell you much, since it flattens out over time.

Hence it makes more sense to plot the log of that data over time. That plot will look linear, and it will show that, indeed, your data points are going down by 1/2 every day. That's easier to see than the previous graph, which shows a curve but makes it hard to tell what's going on over time.
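
A quick sketch of that comparison (assuming Python with numpy and matplotlib, which nothing in the thread requires - this is just one way to draw it):

```python
import numpy as np
import matplotlib.pyplot as plt

days = np.arange(10)
values = 10 * 0.5 ** days          # 10, 5, 2.5, ... halving every day

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Linear y-axis: the curve flattens out and the later points are hard to read.
ax1.plot(days, values, marker="o")
ax1.set_title("linear scale")

# Log y-axis: the same data fall on a straight line whose slope is ln(1/2) per day.
ax2.plot(days, values, marker="o")
ax2.set_yscale("log")
ax2.set_title("log scale")

plt.show()
```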

1 day ago
Matthew_Daly

More importantly, if it's something where you would roughly expect the measurement to be cut in half every day, the data will far more effectively illuminate where it is varying from that pattern when you do a log plot. Your brain is much more attuned to thinking "Wait, that isn't a straight line!" than it is to thinking "Wait, that isn't a natural exponential decay curve!"

1 day ago
majwilsonlion

To answer the first part of the question, you would want to look at the range of the data. If the data points are spread across a range with a similar number of digits, like 1 to 999, then a linear scale is typically fine. But if the dependent data spans orders of magnitude, like 1, 1,001, 10k, 1M (an exaggerated spread in my example), then you would want to use a log scale. Basically, if you are reporting data that has exponential growth, then a log scale is best for visualization.

A natural log (ln) may be more helpful for data that is based upon nature/natural information. But otherwise, a normal base-10 logarithmic (log) scale should suffice.
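
As a rough illustration of that "orders of magnitude" check (the cutoff below is a rule of thumb I'm assuming, not something from the comment; Python/numpy assumed):

```python
import numpy as np

data = np.array([1.0, 1_001.0, 10_000.0, 1_000_000.0])   # the exaggerated spread above

# Rule of thumb (an assumption, not a hard rule): if the data span more than a
# couple of orders of magnitude, a log scale usually shows it more clearly.
decades = np.log10(data.max()) - np.log10(data.min())
print(f"data span about {decades:.1f} orders of magnitude")
print("log scale suggested" if decades > 2 else "linear scale is probably fine")
```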

1 day ago
davideogameman

The base of the logarithm doesn't really matter; base 10 is just typically more convenient for people to understand. 

This is because of the change of base formula: log_b(a) = log_c(a) / log_c(b) for any positive a, b, c (with b, c ≠ 1). So all logarithmic graphs are related by a constant factor.
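
A quick numerical check of that identity (Python/numpy assumed):

```python
import numpy as np

x = np.array([0.5, 2.0, 10.0, 1234.5])

# Change of base: log10(x) = ln(x) / ln(10), so any two log plots differ only
# by a constant vertical scale factor of 1 / ln(10) ≈ 0.434.
print(np.allclose(np.log10(x), np.log(x) / np.log(10)))   # True
print(1 / np.log(10))                                     # ≈ 0.4343
```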

1 day ago
majwilsonlion

Thanks. I come from an EE background, and all the equations are in ln, which I attributed to all the "natural" device physics relationships. But if I were to plot, e.g., Disco Stu's sales of disco music in the 70s, I would choose base-10.

22 hours ago
GXWT

Your spread isn’t so exaggerated depending on what you’re doing and what field you’re in - I deal with data spanning more orders of magnitude than that in astrophysics.

22 hours ago
RosharanChicken OP

Thank you! That makes sense when I look at the parameters that they chose to natural log-transform.

1 day ago
Mr2-1782Man

Let's take a step back first. The normal distribution is used for a lot of statistical analysis; it's the bell curve you frequently see in statistical discussions. Some datasets aren't normally distributed. For minerals dissolved in water, you can't have a concentration below 0, while a normal distribution always assigns some probability to values below 0. Most of the common statistical tests don't give correct results unless the data is normally distributed.

Some data can be transformed to be normally distributed. This requires a lot of care to do correctly and isn't an ELI5 topic. For most fields there's a set of rules for how to do this in a statistically robust manner. When you apply a basic log transform, a value of 1 maps to 0 and values spread out to either side; values near 0 get stretched out toward negative infinity, which on the surface can make skewed, strictly positive data look closer to normally distributed. Whoever came up with the idea of log-transforming your data probably did a fair amount of work to make sure it actually does the job.
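
Purely as an illustration of what the transform does (not a recipe - see the next paragraph), here is a minimal sketch assuming Python with numpy/scipy and simulated log-normal values standing in for mineral concentrations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical concentrations: strictly positive and right-skewed,
# loosely mimicking dissolved-mineral measurements.
conc = rng.lognormal(mean=1.0, sigma=0.8, size=200)

# Shapiro-Wilk normality test: a small p-value means "does not look normal".
print("raw data    p =", stats.shapiro(conc).pvalue)          # tiny -> skewed
print("log of data p =", stats.shapiro(np.log(conc)).pvalue)  # large -> roughly normal
```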

You really shouldn't be figuring it out yourself. This is a case where you need to read up on what others have done or ask an expert in statistics to make sure it's right. My stats prof told us a story about a researcher who applied a statistical method to some data they had collected, assuming the method would apply because they had used it before. My stats prof told them the math was wrong, and the results had to be thrown out and the tests redone with input from a math expert.

1 day ago
Hot-Chemist1784

log transform is usually used when data is skewed or has outliers messing with normality.

after analysis, just exponentiate results back to original scale and interpret in original units.
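
One caveat worth knowing when back-transforming (a generic numpy sketch, not anything specific to CUSUM): exponentiating the mean of the logs gives the geometric mean, not the arithmetic mean, which is usually what you want for skewed data but is worth saying out loud when you report it.

```python
import numpy as np

conc = np.array([3.2, 4.1, 2.8, 19.5, 5.0, 3.7])   # made-up, right-skewed values

log_mean = np.log(conc).mean()
geo_mean = np.exp(log_mean)        # back in the original units

print(f"arithmetic mean: {conc.mean():.2f}")   # ≈ 6.38, dragged up by the 19.5
print(f"geometric mean : {geo_mean:.2f}")      # ≈ 4.86
```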

1 day ago
Ok-Hat-8711

Using a log of the raw data is a useful tool when you are concerned with relative changes of a value.

Suppose you have 3 substances, A, B, and C, whose concentrations you are checking. Each one increases by 2 parts per billion (ppb) in the latest measurement.

Substance A had an original value measuring in at 500 parts per billion. An increase of 2 ppb is probably not a significant factor.

Substance B started at 7 ppb. An increase of 2 is potentially worrying and you would want to keep an eye on it.

Substance C started at 0.1 ppb. An increase of 2 is quite significant for something with such a low baseline.

If you were viewing all of the trending data in some sort of chart with linear axes zoomed to fit substance A, you might not notice the changes on substances B and C, despite them being a bigger concern.

But if the chart has a log scale on the y-axis, then proportional changes are easier to notice. Substance A will show an insignificant bump. Substance B will rise noticeably. And substance C will leap upward dramatically, albeit from a lower starting point. The fact that the measurements have different orders of magnitude in absolute value will no longer distract from large changes in relative value.
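
A small sketch of that chart (Python/matplotlib assumed, with made-up trend lines for A, B, and C):

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(6)
# Made-up series: each substance gains about 2 ppb over the record.
a = 500 + np.linspace(0, 2, 6)    # 500 ppb baseline
b = 7 + np.linspace(0, 2, 6)      # 7 ppb baseline
c = 0.1 + np.linspace(0, 2, 6)    # 0.1 ppb baseline

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3), sharex=True)
for ax in (ax1, ax2):
    ax.plot(t, a, label="A (500 ppb)")
    ax.plot(t, b, label="B (7 ppb)")
    ax.plot(t, c, label="C (0.1 ppb)")

ax1.set_title("linear y-axis")    # B and C are squashed flat near zero
ax2.set_yscale("log")
ax2.set_title("log y-axis")       # C's roughly 20x jump now dominates
ax1.legend()
plt.show()
```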

If natural-log data is only being requested for some substances, then maybe the people requesting it are only concerned with whether those particular ones double or triple from their initial state, rather than whether they cross an arbitrary danger line.

1 day ago
SapphirePath

Some data 'behave' linearly - the change from 5 mg to 7 mg and the change from 20,000 mg to 20,002 mg are both treated equivalently as +2 mg.

Other data behave exponentially (proportionally) - the change from 5 mg to 7 mg is treated as +40%, which would be equivalent to a change from 20,000 mg to 28,000 mg.

After taking the logarithm of every data point, differences that were proportional (multiplicative) are represented by equal additive differences.
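
A quick numerical check of that (Python/numpy assumed):

```python
import numpy as np

# The same +40% change at very different scales ...
print(np.log(7) - np.log(5))             # ≈ 0.3365
print(np.log(28_000) - np.log(20_000))   # ≈ 0.3365, identical gaps after the log

# ... while the same +2 mg change is not:
print(np.log(20_002) - np.log(20_000))   # ≈ 0.0001, nearly invisible on a log scale
```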

Places this might make sense include financial data such as stock prices and biological data such as rabbit populations. In those cases, the growth (or decay) is proportionally dependent upon the current value.

(The process of becoming unitless happens like this: you first strip the data of its units by replacing each data point with its unitless ratio to a predetermined base value: 5 mg / 1 mg = 5, $183 / $1 = 183, etc. This happens prior to taking the log but is essentially invisible.)

If you like, you can use a semi-log graph to preserve units - the y-axis has equally spaced tick marks labeled 1 mg, 10 mg, 100 mg, etc. Another approach is to exponentiate back after you're finished to return to your original units.

1 day ago
SHOW_ME_UR_KITTY

Log-normal data pops up all the time in lab work. Let's say you are measuring trace amounts of lead in drinking water, and you take a year's worth of data, aggregate it, and see the average is 20 ppm with a standard deviation of 15 ppm. That means negative values are within 2 sigma of the mean, which can't happen. If you instead take the log of the data, compute the average and standard deviation, use that sigma to build a confidence interval, and then convert the endpoints back from log space to linear space, you get much more meaningful confidence intervals.
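
A minimal sketch of that workflow (Python/numpy assumed; the values are simulated log-normal data, since the 20/15 figures above are just an example):

```python
import numpy as np

rng = np.random.default_rng(1)
lead = rng.lognormal(mean=np.log(15), sigma=0.7, size=365)   # hypothetical year of measurements

# A naive interval on the raw data can dip below zero:
m, s = lead.mean(), lead.std()
print(f"raw scale: mean {m:.1f}, 2-sigma interval ({m - 2*s:.1f}, {m + 2*s:.1f})")

# An interval built in log space, then exponentiated, stays positive and asymmetric:
lm, ls = np.log(lead).mean(), np.log(lead).std()
print(f"log scale, back-transformed: ({np.exp(lm - 2*ls):.1f}, {np.exp(lm + 2*ls):.1f})")
```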

21 hours ago