Are your data normal? Hint: no.

One of the frequently-asked questions over at the statistics subreddit (reddit.com/r/statistics) is how to test whether a dataset is drawn from a particular distribution, most often the normal distribution.

There are standard tests for this sort of thing, many with double-barreled names like Anderson-Darling, Kolmogorov-Smirnov, Shapiro-Wilk, Ryan-Joiner, etc.

But these tests are almost never what you really want. When people ask these questions, what they really want to know (most of the time) is whether a particular distribution is a good model for a dataset. And that’s not a statistical test; it is a modeling decision.

All statistical analysis is based on models, and all models are based on simplifications. Models are only useful if they are simpler than the real world, which means you have to decide which aspects of the real world to include in the model, and which things you can leave out.

For example, the normal distribution is a good model for many physical quantities. The distribution of human height is approximately normal (see this previous blog post). But human heights are not normally distributed. For one thing, human heights are bounded within a narrow range, and the normal distribution goes to infinity in both directions. But even ignoring the non-physical tails (which have very low probability anyway), the distribution of human heights deviates in systematic ways from a normal distribution.

So if you collect a sample of human heights and ask whether they come from a normal distribution, the answer is no. And if you apply a statistical test, you will eventually (given enough data) reject the hypothesis that the data came from the distribution.

Instead of testing whether a dataset is drawn from a distribution, let’s ask what I think is the right question: how can you tell whether a distribution is a good model for a dataset?

The best approach is to create a visualization that compares the data and the model, and there are several ways to show this comparison.

Comparing to a known distribution

If you want to compare data to a known distribution, the simplest visualization is to plot the CDFs of the model and the data on the same axes. Here’s an example from an earlier blog post:

khan3

I used this figure to confirm that the data generated by my simulation matches a chi-squared distribution with three degrees of freedom.

Many people are more familiar with histograms than CDFs, so they sometimes try to compare histograms or PMFs. This is a bad idea. Use CDFs.

Comparing to an estimated distribution

If you know what family of distributions you want to use as a model, but don’t know the parameters, you can use the data to estimate the parameters, and then compare the estimated distribution to the data. Here’s an example from Think Stats that compares the distribution of birth weight (for live births in the U.S.) to a normal model:

nsfg_birthwgt_model

This figure shows that the normal model is a good match for the data, but there are more light babies than we would expect from a normal distribution. This model is probably good enough for many purposes, but probably not for research on premature babies, which account for the deviation from the normal model. In that case, a better model might be a mixture of two normal distribution.

Depending on the data, you might want to transform the axes. For example, the following plot compares the distribution of adult weight for a sample of Americans to a lognormal model:

brfss_weight_log

I plotted the x-axis on a log scale because under a log transform a lognormal distribution is a nice symmetric sigmoid where both tails are visible and easy to compare. On a linear scale the left tail is too compressed to see clearly.

Instead of plotting CDFs, a common alternative is a Q-Q plot, which you can generate like this:

data normal code ex 1
cdf and fit are Cdf objects as defined in thinkbayes.py.

The ps are the percentile ranks from the actual CDF. actualcontains the corresponding values from the CDF. fitted contains values generated by looping through the ps and, for each percentile rank, looking up the corresponding value in the fitted CDF.

The resulting plot shows fitted values on the x-axis and actual values on the y-axis. If the model matches the data, the plot should be the identity line (y=x).

Here’s an example from an earlier blog post:

kidney3

The blue line is the identity line. Clearly the model is a good match for the data, except for the last point. In this example, I reversed the axes and put the actual values on the x-axis. It seemed like a good idea at the time, but probably was not.

Comparing to a family of distributions

When you compare data to an estimated model, the quality of fit depends on the quality of your estimate. But there are ways to avoid this problem.

For some common families of distributions, there are simple mathematical operations that transform the CDF to a straight line. For example, if you plot the complementary CDF of an exponential distribution on a log-y scale, the result is a straight line (see this section of Think Stats).

Here’s an example that shows the distribution of time between deliveries (births) at a hospital in Australia:

interarrivals_logy

The result is approximately a straight line, so the exponential model is probably a reasonable choice.
There are similar transformations for the Weibull and Pareto distributions.

Comparing to a normal distribution

Sadly, things are not so easy for the normal distribution. But there is a good alternative: the normal probability plot.

There are two ways to generate normal probability plots. One is based on rankits, but there is a simpler method:

  1. From a normal distribution with µ = 0 and σ = 1, generate a sample with the same size as your dataset and sort it.
  2. Sort the values in the dataset.
  3. Plot the sorted values from your dataset versus the random values.

Here’s an example using the birthweight data from a previous figure:

nsfg_birthwgt_normal

A normal probability plot is basically a Q-Q plot, so you read it the same way. This figure shows that the data deviate from the model in the range from 1.5 to 2.5 standard deviations below the mean. In this range, the actual birth weights are lower than expected according to the model (I should have plotted the model as a straight line; since I didn’t, you have to imagine it).

There’s another example of a probability plot (including a fitted line) in this previous post.

The normal probability plot works because the family of normal distributions is closed under linear transformation, so it also works with other stable distributions (see http://en.wikipedia.org/wiki/Stable_distribution).

In summary: choosing an analytic model is not a statistical question; it is a modeling decision. No statistical test can tell you whether a particular distribution is a good model for your data. In general, modeling decisions are hard, but I think the visualizations in this article are some of the best tools to guide those decisions.

Related Resources

Editor’s Note: This post originally appeared in the Probably Overthinking It blog. It has been edited.

O’Reilly Strata Conference — Strata brings together the leading minds in data science and big data — decision makers and practitioners driving the future of their businesses and technologies. Get the skills, tools, and strategies you need to make data work.

Strata + Hadoop World: October 28-30 | New York, NY
Strata in London: November 15-17 | London, England
Strata in Santa Clara: February 11-13 | Santa Clara, CA
Strata Rx Health Data Conference: April 23-25 | Boston, MA

tags: , ,

Get the O’Reilly Data Newsletter

Stay informed. Receive weekly insight from industry insiders.