This article explores techniques that can be used to assess how well a parametric model fits an empirical dataset. Specifically, we’ll demonstrate how to produce the following visualizations:

  • Q-Q Plot: Compares two probability distributions by plotting their quantiles against each other.
  • P-P Plot: Compares two cumulative distribution functions against each other.
  • Histogram: Plots a density histogram with a parametric distribution overlay.

And run the following tests:

  • Kolmogorov-Smirnov Test: Tests the equality of continuous, one-dimensional probability distributions.
  • Anderson-Darling Test: Tests whether a given sample is drawn from a given probability distribution.
  • Shapiro-Wilk Test: Tests the null hypothesis that the data are drawn from a normal distribution.

For ease of exposition, the same dataset will be used in each example, which is given as a Python list below:

dat_init = [
    62.55976, -14.71019, -20.67025, -35.43758, -10.65457,  21.55292, 
    41.26359,   0.33537, -14.43599, -40.66612,   6.45701, -40.39694, 
     55.1221,  24.50901,   6.61822, -29.10305,   6.21494,  15.25862,  
    13.54446,   2.48212,  -2.34573, -21.47846,   -5.0777,  26.48881, 
    -8.68764,  -5.49631,  42.58039,  -6.59111, -23.08169,  19.09755, 
   -21.35046,   0.24064,  -3.16365, -37.43091,  24.48556,    2.6263,  
    31.14471,   5.75287,  -46.8529, -14.26814,   8.41045,  18.11071, 
   -30.46438,  12.22195, -31.83203,  -8.09629,  52.06456, -24.30986, 
   -25.62359,   2.86882,  15.77073,  31.17838, -22.04998
    ]

The task is to assess how well our data fits a normal distribution parameterized with mean and variance computed as:

$$ \begin{align*} \bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_{i} \\ s^{2} &= \frac{1}{n-1}\sum_{i=1}^{n} (x_{i} - \bar{x})^2 \end{align*} $$
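Both quantities map directly onto numpy; a minimal sketch using the dat_init list above:

import numpy as np

dat  = np.array(dat_init)
xbar = dat.mean()         # sample mean
s2   = dat.var(ddof=1)    # sample variance with the 1/(n-1) normalization
s    = dat.std(ddof=1)    # sample standard deviation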

Keep in mind that although we’re testing how well our data can be approximated by a normal distribution, the techniques highlighted in the forthcoming examples (with the exception of Shapiro-Wilk) can assess the quality of fit for many other parametric models as well, as sketched below.
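For example, nothing stops us from pointing the same machinery at a different candidate model. A minimal sketch, fitting a Laplace distribution to dat and testing the fit with Kolmogorov-Smirnov (the Laplace choice is purely illustrative, not part of the analysis that follows):

from scipy import stats

# Illustrative only: fit a Laplace model to dat, then test the fit with K-S.
loc, scale = stats.laplace.fit(dat)
D, p = stats.kstest(dat, stats.laplace(loc=loc, scale=scale).cdf)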

We start with assessing goodness-of-fit visually.

Q-Q Plot

The Q-Q plot compares two probability distributions by plotting their quantiles against each other. We compare standard normal quantiles (x-axis) against the empirical quantiles from dat (y-axis). If the two distributions being compared are similar, the points in the Q-Q plot will approximately lie on a straight line. There isn’t a hard and fast rule to determine how much deviation from the straight line is too much, but if the distributions under comparison are very different, it will be obvious in the Q-Q plot. We can construct a Q-Q plot from scratch using matplotlib as follows:

"""
Q-Q plot construction - Compare theoretical quantiles (x-axis)
against empirical quantiles (y-axis). 
"""
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

np.set_printoptions(
    suppress=True, nanstr='NaN', infstr='Inf', precision=8
    )

# Apply the style first so the rcParams overrides below aren't clobbered.
plt.style.use("ggplot")
plt.rcParams["axes.edgecolor"] = "#000000"
plt.rcParams["axes.linewidth"] = 1


dat_init = [
    62.55976, -14.71019, -20.67025, -35.43758, -10.65457,  21.55292, 
    41.26359,   0.33537, -14.43599, -40.66612,   6.45701, -40.39694, 
     55.1221,  24.50901,   6.61822, -29.10305,   6.21494,  15.25862,  
    13.54446,   2.48212,  -2.34573, -21.47846,   -5.0777,  26.48881, 
    -8.68764,  -5.49631,  42.58039,  -6.59111, -23.08169,  19.09755, 
   -21.35046,   0.24064,  -3.16365, -37.43091,  24.48556,    2.6263,  
    31.14471,   5.75287,  -46.8529, -14.26814,   8.41045,  18.11071, 
   -30.46438,  12.22195, -31.83203,  -8.09629,  52.06456, -24.30986, 
   -25.62359,   2.86882,  15.77073,  31.17838, -22.04998
    ]


dat      = np.array(dat_init, dtype=float); dat.sort()
datcdf   = np.arange(1, dat.size + 1) / dat.size
dat_mean = dat.mean()
dat_stdv = dat.std(ddof=1)
ndist    = stats.norm(loc=0, scale=1)
theo     = ndist.ppf(datcdf)

# Remove pairs containing np.inf (ppf(1.0) is infinite).
pairs = zip(theo, dat)
pairs = [i for i in pairs if np.inf not in i]
x, y  = zip(*pairs)

# Generate Q-Q plot.
fig = plt.figure()
fig.set_size_inches(11, 8)
ax  = fig.add_subplot(111)
ax.scatter(x, y, color="#FFFFFF", s=61, edgecolor="#000000", marker="o", linewidths=1.03)
ax.set_title("Q-Q Plot: Empricial Data vs. Standard Normal Distribution", 
             fontsize=16,loc="left",color="#FF6666",weight="bold")
ax.set_ylabel("Empirical Quantiles", fontsize=14,color="#000000")
ax.set_xlabel("Standard Normal Quantiles", fontsize=14, color="#000000")
ax.grid(True)
ax.set_facecolor(None) # reset to the default background color
fig.tight_layout()
plt.show()

Running the code produces the following scatterplot:


[Figure: Q-Q plot of empirical quantiles vs. standard normal quantiles]



The points mostly follow a straight line, although some observations deviate from strict linearity. Still, nothing here disqualifies our dataset from being modeled with a normal distribution.
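As a cross-check, scipy can produce the same style of plot directly. A minimal sketch using stats.probplot, which computes the theoretical quantiles and a least-squares reference line for us (reusing dat and the imports from the example above):

# Q-Q plot via scipy; equivalent to the manual version up to styling.
fig, ax = plt.subplots(figsize=(11, 8))
stats.probplot(dat, dist="norm", plot=ax)
plt.show()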

P-P Plot

The P-P plot compares two cumulative distribution functions against each other. To produce a P-P plot, we plot the theoretical percentiles (x-axis) against the empirical percentiles (y-axis), so each axis ranges from 0 to 1. The comparison line is the 45-degree line from (0,0) to (1,1): the distributions are equal if and only if the plot falls on this line, and any deviation indicates a difference between the distributions. The code to produce a P-P plot is given below (assume the same library imports and configuration settings as in the Q-Q plot example):

"""
P-P Plot of empirical data vs. standard normal distribution.
"""
# Standardize dat.
dat_mean = dat.mean()
dat_stdv = dat.std(ddof=1)
sdat     = (dat - dat_mean) / dat_stdv
sdatcdf  = np.arange(1, sdat.size + 1) / sdat.size
ndist    = stats.norm(loc=0, scale=1)
theocdf  = ndist.cdf(sdat)

# Remove any pairs containing np.inf.
pairs = zip(theocdf, sdatcdf)
pairs = [i for i in pairs if np.inf not in i]
x, y  = zip(*pairs)

# Generate P-P plot.
fig = plt.figure()
ax  = fig.add_subplot(111)
ax.scatter(x, y, color="#FFFFFF", s=61, edgecolor="#000000", marker="o", linewidths=1.03)
ax.plot([0, 1], [0, 1], color="#F64D54", linewidth=2.5, alpha=.85)
ax.set_title("P-P Plot: Empricial CDF vs. Standard Normal CDF", 
             fontsize=16, loc="left", color="#FF6666", weight="bold")
ax.set_ylabel("Empirical Cumulative Distribution", fontsize=14,color="#000000")
ax.set_xlabel("Standard Normal Distribution", fontsize=14,color="#000000")
ax.grid(True)
ax.set_facecolor(None) # reset to the default background color
ax.set_xlim(-0.1, 1.1)
ax.set_ylim(-0.1, 1.1)
fig.tight_layout()
plt.show()

The resulting visualization:


[Figure: P-P plot of empirical CDF vs. standard normal CDF]



There is some deviation from \(y=x\), but again, nothing that conclusively discounts the possibility of our data arising from a normal population; a certain amount of scatter around the comparison line is expected in this diagnostic.

Histogram with Parametric Model Overlay

This visualization is exactly as the section title suggests: a histogram representing the density of the empirical data, overlaid with a parameterized normal distribution. Depending on the number of observations in your dataset, you might need to experiment with the number of bins to use. Also, be sure to set density=True in ax.hist:

"""
Plot histogram with best-fit normal distribution overlay.
"""
endpoint = np.max([np.abs(dat.min()), np.abs(dat.max())])
xvals    = np.linspace(-endpoint - 10, endpoint + 10, 1000)
dat_mean = dat.mean()
dat_stdv = dat.std(ddof=1)
fitnorm  = stats.norm(loc=dat_mean, scale=dat_stdv)
yvals    = fitnorm.pdf(xvals)

# Remove any pairs containing np.inf.
pairs = zip(xvals, yvals)
pairs = [i for i in pairs if np.inf not in i]
x, y  = zip(*pairs)

# Generate histogram with parametric model overlay.
fig = plt.figure()
fig.set_size_inches(9, 6)
ax  = fig.add_subplot(111)
ax.hist(dat, 18, density=True, alpha=.995, color="#FFFFFF", edgecolor="black", linewidth=1.1)
ax.plot(x, y, color="#F64D54", linewidth=3.)
ax.set_title("Distribution of Empirical Data", fontsize=16, 
             loc="left", color="#FF6666", weight="bold")
ax.set_ylabel("Density", fontsize=14, color="#000000")
ax.grid(True)
ax.set_facecolor(None)
fig.tight_layout()
plt.show()

This code produces:


[Figure: density histogram of dat with fitted normal overlay]



We see that the parametric overlay approximately agrees with the density histogram, save for one larger-than-expected bin between x-axis values 0 and 20. This may be an artifact of the number of bins selected (18 in this case), but it is worth keeping in mind as we move on to quantitative goodness-of-fit tests.
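If hand-tuning the bin count feels arbitrary, numpy's built-in bin-selection rules offer a data-driven alternative. A small sketch substituting the Freedman-Diaconis rule for the hard-coded 18 in the ax.hist call above:

# Let numpy choose bin edges via the Freedman-Diaconis rule.
edges = np.histogram_bin_edges(dat, bins="fd")
ax.hist(dat, bins=edges, density=True, alpha=.995,
        color="#FFFFFF", edgecolor="black", linewidth=1.1)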

Kolmogorov-Smirnov Test

Suppose that we have a set of empirical data that we assume originates from a distribution \(F\). The Kolmogorov-Smirnov statistic is used to test:

\(H_{0}\): the samples come from \(F\).

against:

\(H_{1}\): the samples do not come from \(F\).

We want to compare the empirical distribution function of the data, \(F_{obs}\), with the cumulative distribution function associated with the null hypothesis, \(F_{exp}\) (the expected CDF).

The Kolmogorov-Smirnov statistic is given by

$$ D_{n} = \max_{x}|F_{exp}(x) - F_{obs}(x)|. $$

Assuming the data are ordered, with \(x_{1}\) the smallest and \(x_{n}\) the largest, we represent the empirical CDF as:

$$ F_{obs}(x_{i}) = \frac{i}{n}, $$

where \(n\) is the number of empirical data observations.

For each observation, compute the absolute difference between \(F_{exp}(x)\) and \(F_{obs}(x)\). The Kolmogorov-Smirnov statistic \(D_{n}\) is the maximum value in the array of absolute differences, and represents the maximum absolute distance between the expected and observed distribution functions. \(D_{n}\) is then compared to a table of critical values to assess whether to reject or fail to reject \(H_{0}\).
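As a point of reference, when the null distribution is fully specified in advance, the large-sample critical value of \(D_{n}\) at the 5% significance level is approximately \(1.358/\sqrt{n}\). A quick sketch (note that because we estimate the mean and variance from the data, the standard table is only approximate here):

import math

n = 53                         # number of observations in dat
d_crit = 1.358 / math.sqrt(n)  # roughly 0.187 at the 5% level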

The Kolmogorov-Smirnov test can be visualized by plotting the expected CDF vs. the empirical CDF as a step function:

"""
K-S plot of empirical CDF vs. theoretical distribution function. The x-axis represents
standard normal quantiles, the y-axis cumulative probability on [0, 1].
"""
dat_mean  = dat.mean()
dat_std   = dat.std(ddof=1, axis=0)
sdat      = (dat - dat_mean) / dat_std # standardized empirical data
sdatcdf   = np.arange(1, sdat.size + 1) / sdat.size
ndist     = stats.norm(loc=0, scale=1)

# Remove any pairs containing np.inf; generate the theoretical CDF.
emprpairs = zip(sdat, sdatcdf)
emprpairs = [i for i in emprpairs if np.inf not in i]
sdatfin, sdatcdffin = zip(*emprpairs)
theocdf  = ndist.cdf(sdatfin)
endpoint = np.max([np.abs(sdat.min()), np.abs(sdat.max())])

# Generate Kolmogorov-Smirnov comparison plot.
fig = plt.figure()
fig.set_size_inches(9, 6)
ax = fig.add_subplot(111)
ax.step(sdatfin, sdatcdffin, color="#F64D54", linewidth=2.5, where="pre", label="Empirical CDF")
ax.plot(sdatfin, theocdf, color="#FFFFFF", linewidth=2.5, alpha=.975, label="Std. Normal CDF")
ax.set_title("Kolmogorov–Smirnov Test Illustration", fontsize=16, loc="left", 
             color="#FF6666", weight="bold")
ax.set_xlabel("Standard Normal Quantiles", fontsize=14, color="#000000")
ax.set_ylabel("Cumulative Distribution", fontsize=14, color="#000000")
ax.grid(True)
ax.set_facecolor(None) # reset to the default background color
ax.set_xlim(-endpoint, endpoint + .5)
ax.set_ylim(-0.1, 1.1)

# Legend handler
legend = plt.legend(frameon=1, loc=4, fontsize="large", fancybox=True, framealpha=1)
frame  = legend.get_frame()
frame.set_color("#E0E0E0")
frame.set_facecolor("#DCDCDC")
frame.set_edgecolor("#000000")

fig.tight_layout()
plt.show()

The rendered comparison:


[Figure: empirical CDF step function vs. standard normal CDF]



The Kolmogorov-Smirnov statistic is simply the greatest absolute distance between the empirical and expected CDFs. We can easily implement logic to perform this comparison from scratch:

"""
Computing the Kolmogorov-Smirnov statistic from scratch.
"""
import numpy as np
from scipy import stats


dat_init = [
    62.55976, -14.71019, -20.67025, -35.43758, -10.65457,  21.55292, 
    41.26359,   0.33537, -14.43599, -40.66612,   6.45701, -40.39694, 
     55.1221,  24.50901,   6.61822, -29.10305,   6.21494,  15.25862,  
    13.54446,   2.48212,  -2.34573, -21.47846,   -5.0777,  26.48881, 
    -8.68764,  -5.49631,  42.58039,  -6.59111, -23.08169,  19.09755, 
   -21.35046,   0.24064,  -3.16365, -37.43091,  24.48556,    2.6263,  
    31.14471,   5.75287,  -46.8529, -14.26814,   8.41045,  18.11071, 
   -30.46438,  12.22195, -31.83203,  -8.09629,  52.06456, -24.30986, 
   -25.62359,   2.86882,  15.77073,  31.17838, -22.04998
    ]

dat = np.array(dat_init, dtype=float)
dat.sort()
datcdf = np.arange(1, dat.size + 1) / dat.size
dat_mean = dat.mean()
dat_stdv = dat.std()  # note: ddof=0 here; ddof=1 would match the sample variance defined earlier

# Parameterized expected normal distribution.
expnorm = stats.norm(loc=dat_mean, scale=dat_stdv)
expcdf  = expnorm.cdf(dat)

# Compute difference between datcdf and expcdf.
absdiffs = np.abs(datcdf - expcdf)
D0 = absdiffs.max()
# D0: 0.0741554221610411

We can compare our value of D with the value obtained from scipy.stats.kstest, which takes as arguments the empirical dataset and a callable representing the CDF of the expected distribution, and returns the D-statistic along with the p-value associated with the computed D-statistic (note that critical values depend on the number of observations). Our computed result is given by D0, the result returned from scipy.stats.kstest by D1:

>>> D1, p1 = stats.kstest(dat, expnorm.cdf)
>>> print(f"Manual D: {D0}")
Manual D: 0.0741554221610411
>>> print(f"kstest D: {D1}")
kstest D: 0.0741554221610411

The results are identical.

The p-value (the second element of the 2-tuple returned by scipy.stats.kstest) is 0.9326339000931222. How should this result be interpreted?

For the Kolmogorov-Smirnov test, the null hypothesis is that the distributions are the same. Thus, the lower the p-value, the greater the statistical evidence for rejecting the null hypothesis and concluding the distributions are different. The test only lets us speak to our confidence that the distributions are different, not that they are the same, since the test controls only the probability of Type I error. Therefore, if \(D\) is less than the critical value, we fail to reject the null hypothesis (corresponding to a large p-value); if \(D\) is greater than the critical value, we reject the null hypothesis (corresponding to a small p-value).

Given our p-value of 0.9326339000931222, we do not have sufficient evidence to reject the null hypothesis that the distributions are the same.
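In code, the decision at a chosen significance level reduces to a single comparison; a small sketch using the p-value from above, with alpha = 0.05 as a conventional (and arbitrary) choice:

alpha = 0.05
if p1 < alpha:
    print("Reject H0: evidence the distributions differ.")
else:
    print("Fail to reject H0: no evidence against normality.")
# With p1 near 0.93, the 'Fail to reject' branch prints.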


Anderson-Darling Test

We briefly describe the Anderson-Darling test and how to run it with scipy.stats.

The Anderson-Darling test evaluates the null hypothesis that a sample is drawn from a population that follows a particular distribution [1]. It makes use of the fact that, when the hypothesized distribution is correct, the values of the cumulative distribution function (CDF) applied to the data follow a uniform distribution. The statistic itself can be expressed as:

$$ \begin{align*} A^{2} &= -n - S, \quad \text{where} \\ S &= \sum_{i=1}^{n} \frac{2i-1}{n} \Big[\ln F(y_{i}) + \ln\big(1 - F(y_{n+1-i})\big) \Big] \end{align*} $$

The function scipy.stats.anderson takes as arguments the empirical dataset and a distribution to test against (one of “norm”, “expon”, “logistic”, “gumbel”, “gumbel_l”, or “gumbel_r”), and returns the Anderson-Darling test statistic, the critical values for the specified distribution, and the significance levels corresponding to those critical values. For example, to test whether dat follows a normal distribution, we run the following:

"""
Demonstration of Anderson-Darling Test in scipy.stats.
"""
import numpy as np
from scipy import stats

dat_init = [
    62.55976, -14.71019, -20.67025, -35.43758, -10.65457,  21.55292, 
    41.26359,   0.33537, -14.43599, -40.66612,   6.45701, -40.39694, 
     55.1221,  24.50901,   6.61822, -29.10305,   6.21494,  15.25862,  
    13.54446,   2.48212,  -2.34573, -21.47846,   -5.0777,  26.48881, 
    -8.68764,  -5.49631,  42.58039,  -6.59111, -23.08169,  19.09755, 
   -21.35046,   0.24064,  -3.16365, -37.43091,  24.48556,    2.6263,  
    31.14471,   5.75287,  -46.8529, -14.26814,   8.41045,  18.11071, 
   -30.46438,  12.22195, -31.83203,  -8.09629,  52.06456, -24.30986, 
   -25.62359,   2.86882,  15.77073,  31.17838, -22.04998
    ]

dat = np.array(dat_init, dtype=float)
dat.sort()

# Perform Anderson-Darling test.
A, crit, sig = stats.anderson(dat, dist="norm")

# A: 0.22442637404651578
# crit: [0.54  0.615 0.738 0.861 1.024]
# sig: [15.  10.   5.   2.5  1. ]

According to the stats.anderson documentation, if the returned statistic is larger than the critical values, then for the corresponding significance level the null hypothesis (that the data come from the chosen distribution) can be rejected. Since our statistic 0.22442637404651578 is smaller than all critical values, we do not have sufficient evidence to reject the null hypothesis that the data come from a normal distribution.
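Because stats.anderson returns the critical values and significance levels as parallel arrays, the comparison can be spelled out in a short loop:

# Compare the A-D statistic against each critical value.
for cv, sl in zip(crit, sig):
    decision = "reject" if A > cv else "fail to reject"
    print(f"{sl:5.1f}% significance: critical value {cv:.3f} -> {decision} H0")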


Shapiro–Wilk Test

The Shapiro–Wilk test’s null hypothesis is that the sample is drawn from a normally distributed population. The function scipy.stats.shapiro takes the empirical dataset as its sole argument and, like scipy.stats.kstest, returns a 2-tuple containing the test statistic and p-value.

"""
Demonstration of Shapiro-Wilk Test in scipy.stats.
"""
import numpy as np
from scipy import stats

dat_init = [
    62.55976, -14.71019, -20.67025, -35.43758, -10.65457,  21.55292, 
    41.26359,   0.33537, -14.43599, -40.66612,   6.45701, -40.39694, 
     55.1221,  24.50901,   6.61822, -29.10305,   6.21494,  15.25862,  
    13.54446,   2.48212,  -2.34573, -21.47846,   -5.0777,  26.48881, 
    -8.68764,  -5.49631,  42.58039,  -6.59111, -23.08169,  19.09755, 
   -21.35046,   0.24064,  -3.16365, -37.43091,  24.48556,    2.6263,  
    31.14471,   5.75287,  -46.8529, -14.26814,   8.41045,  18.11071, 
   -30.46438,  12.22195, -31.83203,  -8.09629,  52.06456, -24.30986, 
   -25.62359,   2.86882,  15.77073,  31.17838, -22.04998
    ]

dat = np.array(dat_init, dtype=float)
dat.sort()

W, p = stats.shapiro(dat)

# W: 0.9804515242576599
# p: 0.5324678421020508

If the p-value is less than the chosen alpha level, then the null hypothesis is rejected and there is evidence that the data tested are not normally distributed. If the p-value is greater than the chosen alpha level, then the null hypothesis that the data came from a normally distributed population cannot be rejected (e.g., for an alpha level of 0.05, a dataset with a p-value of less than 0.05 rejects the null hypothesis that the data are from a normally distributed population) [2].

The Shapiro-Wilk p-value computed on dat is 0.5324678421020508; therefore, we cannot reject the null hypothesis that the data came from a normally distributed population.


Conclusion

In this post we introduced multiple tests that can be used to assess how well empirical data fit a parametric distribution. Keep in mind that even though none of the tests outright rejected the hypothesis that dat originated from a normal population, this is not confirmation that our dataset is actually normal. The difference is subtle, but important.

Until next time, happy coding!

  [1] https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson.html
  [2] https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test#Interpretation