This article explores techniques that can be used to assess how well a parametric model fits an empirical dataset. Specifically, we’ll demonstrate how to produce the following visualizations:
- Q-Q Plot: Compares two probability distributions by plotting their quantiles against each other.
- P-P Plot: Compares two cumulative distribution functions against each other.
- Histogram: Plots a density histogram with a parametric distribution overlay.
And run the following tests:
- Kolmogorov-Smirnov Test: Tests the equality of continuous, one-dimensional probability distributions.
- Anderson-Darling Test: Tests whether a given sample is drawn from a given probability distribution.
- Shapiro-Wilk Test: Tests the null hypothesis that the data are drawn from a normal distribution.
For ease of exposition, the same dataset will be used in each example, which is given as a Python list below:
dat_init = [
62.55976, -14.71019, -20.67025, -35.43758, -10.65457, 21.55292,
41.26359, 0.33537, -14.43599, -40.66612, 6.45701, -40.39694,
55.1221, 24.50901, 6.61822, -29.10305, 6.21494, 15.25862,
13.54446, 2.48212, -2.34573, -21.47846, -5.0777, 26.48881,
-8.68764, -5.49631, 42.58039, -6.59111, -23.08169, 19.09755,
-21.35046, 0.24064, -3.16365, -37.43091, 24.48556, 2.6263,
31.14471, 5.75287, -46.8529, -14.26814, 8.41045, 18.11071,
-30.46438, 12.22195, -31.83203, -8.09629, 52.06456, -24.30986,
-25.62359, 2.86882, 15.77073, 31.17838, -22.04998
]
The task is to assess how well our data fits a normal distribution parameterized with the sample mean and standard deviation computed as:

\(\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_{i}, \qquad \hat{\sigma} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_{i} - \hat{\mu}\right)^{2}}\)
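A small sketch of computing these estimates with NumPy is shown below; the same quantities are recomputed inside each of the later code examples, so this is purely for reference:
import numpy as np

dat = np.array(dat_init, dtype=float)
dat_mean = dat.mean()          # sample mean
dat_stdv = dat.std(ddof=1)     # sample standard deviation (ddof=1)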
Keep in mind that although we’re testing how well our data can be approximated by a normal distribution, many of the tests we highlight in the forthcoming examples (with the exception of Shapiro-Wilk) can assess the quality of fit for many different parametric models.
We start with assessing goodness-of-fit visually.
Q-Q Plot
The Q-Q plot compares two probability distributions by plotting their
quantiles against each other. We compare standard normal quantiles (x-axis)
against the empirical quantiles from dat
(y-axis). If the two distributions
being compared are similar, the points in the Q-Q plot will approximately lie
on a straight line. There isn’t a hard and fast rule to determine how much
deviation from the straight line is too much, but if the distributions under
comparison are very different, it will be obvious in the Q-Q plot. We can
construct a Q-Q plot from scratch using matplotlib as follows:
"""
Q-Q plot construction - Compare theoretical quantiles (x-axis)
against empirical quantiles (y-axis).
"""
import numpy as np
from scipy import stats
import matplotlib
import matplotlib.pyplot as plt
np.set_printoptions(
suppress=True, nanstr='NaN', infstr='Inf', precision=8
)
plt.rcParams["axes.edgecolor"] = "#000000"
plt.rcParams["axes.linewidth"] = 1
plt.style.use("ggplot")
dat_init = [
62.55976, -14.71019, -20.67025, -35.43758, -10.65457, 21.55292,
41.26359, 0.33537, -14.43599, -40.66612, 6.45701, -40.39694,
55.1221, 24.50901, 6.61822, -29.10305, 6.21494, 15.25862,
13.54446, 2.48212, -2.34573, -21.47846, -5.0777, 26.48881,
-8.68764, -5.49631, 42.58039, -6.59111, -23.08169, 19.09755,
-21.35046, 0.24064, -3.16365, -37.43091, 24.48556, 2.6263,
31.14471, 5.75287, -46.8529, -14.26814, 8.41045, 18.11071,
-30.46438, 12.22195, -31.83203, -8.09629, 52.06456, -24.30986,
-25.62359, 2.86882, 15.77073, 31.17838, -22.04998
]
dat = np.array(dat_init, dtype=float)
dat.sort()
datcdf = np.arange(1, dat.size + 1) / dat.size
dat_mean = dat.mean()
dat_stdv = dat.std(ddof=1)
ndist = stats.norm(loc=0, scale=1)
theo = ndist.ppf(datcdf)
# Remove observations containing np.inf.
pairs = zip(theo, dat)
pairs = [i for i in pairs if np.inf not in i]
x, y = zip(*pairs)
# Generate Q-Q plot.
fig = plt.figure()
fig.set_size_inches(11, 8)
ax = fig.add_subplot(111)
ax.scatter(x, y, color="#FFFFFF", s=61, edgecolor="#000000", marker="o", linewidths=1.03)
ax.set_title("Q-Q Plot: Empricial Data vs. Standard Normal Distribution",
fontsize=16,loc="left",color="#FF6666",weight="bold")
ax.set_ylabel("Empirical Quantiles", fontsize=14,color="#000000")
ax.set_xlabel("Standard Normal Quantiles", fontsize=14, color="#000000")
ax.grid(True)
ax.set_facecolor(None) # change background color
fig.tight_layout()
plt.show()
Running the code produces the following scatterplot:
The points mostly follow a straight line, though a few observations deviate from strict linearity. Nothing here disqualifies our dataset from being modeled with a normal distribution.
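As an aside, scipy.stats.probplot can generate a comparable Q-Q plot with far less code: passing a Matplotlib Axes via its plot argument draws the quantile pairs along with a least-squares reference line. A minimal sketch, reusing dat and the imports from the example above:
# Q-Q plot via scipy.stats.probplot (alternative to the manual construction above).
fig, ax = plt.subplots(figsize=(11, 8))
stats.probplot(dat, dist="norm", plot=ax)
ax.set_title("Q-Q Plot via scipy.stats.probplot", loc="left")
plt.show()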
P-P Plot
The P-P plot compares two cumulative distribution functions against each other. To produce a P-P plot, we plot the theoretical percentiles (x-axis) against the empirical percentiles (y-axis), so that each axis ranges from 0 to 1. The comparison line is the 45-degree line from (0, 0) to (1, 1): the distributions are equal if and only if the plotted points fall on this line, and any deviation indicates a difference between the distributions. The code to produce a P-P plot is given below (assume the same library imports and configuration settings as in the Q-Q plot example):
"""
P-P Plot of empirical data vs. standard normal distribution.
"""
# Standardize dat.
dat_mean = dat.mean()
dat_stdv = dat.std(ddof=1)
sdat = (dat - dat_mean) / dat_stdv
sdatcdf = np.arange(1, sdat.size + 1) / sdat.size
ndist = stats.norm(loc=0, scale=1)
theocdf = ndist.cdf(sdat)
# Remove any observations containing np.inf.
pairs = zip(theocdf, sdatcdf)
pairs = [i for i in pairs if np.inf not in i]
x, y = zip(*pairs)
# Generate P-P plot.
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(x, y, color="#FFFFFF", s=61, edgecolor="#000000", marker="o", linewidths=1.03)
ax.plot([0, 1], [0, 1], color="#F64D54", linewidth=2.5, alpha=.85)
ax.set_title("P-P Plot: Empricial CDF vs. Standard Normal CDF",
fontsize=16, loc="left", color="#FF6666", weight="bold")
ax.set_ylabel("Empirical Cumulative Distribution", fontsize=14,color="#000000")
ax.set_xlabel("Standard Normal Distribution", fontsize=14,color="#000000")
ax.grid(True)
ax.set_facecolor(None) # change background color
ax.set_xlim(-0.1, 1.1)
ax.set_ylim(-0.1, 1.1)
fig.tight_layout()
plt.show()
The resulting visualization:
There is some deviation from \(y=x\), but again, nothing that conclusively discounts the possibility of our data arising from a normal population; some departure from the expected normal percentiles is to be anticipated, and that is what we see in this diagnostic.
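If you prefer not to build the P-P plot by hand, the statsmodels package (not used elsewhere in this article, so treat this as an optional aside) provides a ProbPlot class whose ppplot method produces a similar figure:
# P-P plot via statsmodels' ProbPlot (optional alternative).
import statsmodels.api as sm

pp = sm.ProbPlot(dat, fit=True)   # fit=True estimates loc/scale from the data
fig = pp.ppplot(line="45")        # "45" adds the 45-degree reference line
plt.show()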
Histogram with Parametric Model Overlay
This visualization is exactly as the section title suggests: a histogram representing the density of the empirical data overlaid with a parameterized normal distribution. Depending on the number of observations in your dataset, you might need to experiment with the number of bins. Also, be sure to set density=True in ax.hist:
"""
Plot histogram with best-fit normal distribution overlay.
"""
endpoint = np.max([np.abs(dat.min()), np.abs(dat.max())])
xvals = np.linspace(-endpoint - 10, endpoint + 10, 1000)
dat_mean = dat.mean()
dat_stdv = dat.std(ddof=1)
fitnorm = stats.norm(loc=dat_mean, scale=dat_stdv)
yvals = fitnorm.pdf(xvals)
# Remove any observations containing np.inf.
pairs = zip(xvals, yvals)
pairs = [i for i in pairs if np.inf not in i]
x, y = zip(*pairs)
# Generate histogram with parametric model overlay.
fig = plt.figure()
fig.set_size_inches(9, 6)
ax = fig.add_subplot(111)
ax.hist(dat, 18, density=True, alpha=.995, color="#FFFFFF", edgecolor="black", linewidth=1.1)
ax.plot(x, y, color="#F64D54", linewidth=3.)
ax.set_title("Distribution of Empirical Data", fontsize=16,
loc="left", color="#FF6666", weight="bold")
ax.set_ylabel("Density", fontsize=14, color="#000000")
ax.grid(True)
ax.set_facecolor(None)
fig.tight_layout()
plt.show()
This code produces:
We see that the parametric overlay approximately agrees with the density histogram, save for one larger-than-expected bin between x-axis points (0, 20). This may be a function of the number of bins selected (18 in this case), but it is nonetheless worth keeping in mind as we move on to quantitative goodness-of-fit tests.
Kolmogorov-Smirnov Test
Suppose that we have a set of empirical data that we assume originates from a distribution \(F\). The Kolmogorov-Smirnov statistic is used to test:
\(H_{0}\): the samples come from \(F\).
against:
\(H_{1}\): the samples do not come from \(F\).
We want to compare the empirical distribution function of the data, \(F_{obs}\), with the cumulative distribution function associated with the null hypothesis, \(F_{exp}\) (the expected CDF).
The Kolmogorov-Smirnov statistic is given by:

\(D_{n} = \sup_{x} \left| F_{exp}(x) - F_{obs}(x) \right|\)
Assuming the data are ordered, \(x_{1}\) being the least and \(x_{n}\) being the largest, we represent the empirical CDF as:

\(F_{obs}(x_{i}) = \frac{i}{n}\)

where \(n\) is the number of empirical data observations.
For each observation, compute the absolute difference between \(F_{exp}(x)\) and \(F_{obs}(x)\). The Kolmogorov-Smirnov statistic \(D_{n}\) is the maximum value in the array of absolute differences. This value represents the maximum absolute distance between the expected and observed distribution functions. \(D_{n}\) is then compared to a table of critical values to assess whether to reject or fail to reject \(H_{0}\).
The Kolmogorov-Smirnov test can be visualized by plotting the expected CDF vs. the empirical CDF as a step function:
"""
K-S plot of empirical CDF vs. theoretical distribution function. The x-axis represents
standard normal quantiles, the y-axis cumulative probability ranging over [0, 1].
"""
dat_mean = dat.mean()
dat_std = dat.std(ddof=1, axis=0)
sdat = (dat - dat_mean) / dat_std # standardized empirical data
sdatcdf = np.arange(1, sdat.size + 1) / sdat.size
ndist = stats.norm(loc=0, scale=1)
# Remove any observations containing np.inf; generate theoretical cdf.
emprpairs = zip(sdat, sdatcdf)
emprpairs = [i for i in emprpairs if np.inf not in i]
sdatfin, sdatcdffin = zip(*emprpairs)
theocdf = ndist.cdf(sdatfin)
endpoint = np.max([np.abs(sdat.min()), np.abs(sdat.max())])
# Generate Kolmogorov-Smirnov comparison plot.
fig = plt.figure()
fig.set_size_inches(9, 6)
ax = fig.add_subplot(111)
ax.step(sdatfin, sdatcdffin, color="#F64D54", linewidth=2.5, where="pre", label="Empirical CDF")
ax.plot(sdatfin, theocdf, color="#FFFFFF", linewidth=2.5, alpha=.975, label="Std. Normal CDF")
ax.set_title("Kolmogorov–Smirnov Test Illustration", fontsize=16, loc="left",
color="#FF6666", weight="bold")
ax.set_xlabel("Standard Normal Quantiles", fontsize=14, color="#000000")
ax.set_ylabel("Cumulative Distribution", fontsize=14, color="#000000")
ax.grid(True)
ax.set_facecolor(None) # change background color
ax.set_xlim(-endpoint, endpoint + .5)
ax.set_ylim(-0.1, 1.1)
# Legend handler
legend = plt.legend(frameon=1, loc=4, fontsize="large", fancybox=True, framealpha=1)
frame = legend.get_frame()
frame.set_color("#E0E0E0")
frame.set_facecolor("#DCDCDC")
frame.set_edgecolor("#000000")
#ax.legend(loc=4)
fig.tight_layout()
plt.show()
The rendered comparison:
The Kolmogorov-Smirnov statistic is simply the greatest absolute distance between the empirical and expected CDFs. We can easily implement logic to perform this comparison from scratch:
"""
Computing the Kolmogorov-Smirnov statistic from scratch.
"""
import numpy as np
from scipy import stats
dat_init = [
62.55976, -14.71019, -20.67025, -35.43758, -10.65457, 21.55292,
41.26359, 0.33537, -14.43599, -40.66612, 6.45701, -40.39694,
55.1221, 24.50901, 6.61822, -29.10305, 6.21494, 15.25862,
13.54446, 2.48212, -2.34573, -21.47846, -5.0777, 26.48881,
-8.68764, -5.49631, 42.58039, -6.59111, -23.08169, 19.09755,
-21.35046, 0.24064, -3.16365, -37.43091, 24.48556, 2.6263,
31.14471, 5.75287, -46.8529, -14.26814, 8.41045, 18.11071,
-30.46438, 12.22195, -31.83203, -8.09629, 52.06456, -24.30986,
-25.62359, 2.86882, 15.77073, 31.17838, -22.04998
]
dat = np.array(dat_init, dtype=float)
dat.sort()
datcdf = np.arange(1, dat.size + 1) / dat.size
dat_mean = dat.mean()
dat_stdv = dat.std()  # note: population (ddof=0) standard deviation used here
# Parameterized expected normal distribution.
expnorm = stats.norm(loc=dat_mean, scale=dat_stdv)
expcdf = expnorm.cdf(dat)
# Compute difference between datcdf and expcdf.
absdiffs = np.abs(datcdf - expcdf)
D0 = absdiffs.max()
# D0: 0.0741554221610411
We can compare our value of \(D\) with the value obtained from scipy.stats.kstest, which takes as arguments the empirical dataset and a callable representing the CDF of the expected distribution, and returns the D-statistic as well as the p-value associated with the computed D-statistic (note that critical values depend on the number of observations). Our computed result is given by D0, the result returned from scipy.stats.kstest by D1:
>>> D1, p1 = stats.kstest(dat, expnorm.cdf)
>>> print(f"Manual D: {D0}")
Manual D: 0.0741554221610411
>>> print(f"kstest D: {D1}")
kstest D: 0.0741554221610411
The results are identical.
The p-value (the second element of the 2-tuple returned by scipy.stats.kstest) is 0.9326339000931222. How should this result be interpreted?
For the Kolmogorov-Smirnov test, the null hypothesis is that the distributions are the same. Thus, the lower the p-value, the greater the statistical evidence we have to reject the null hypothesis and conclude the distributions are different. The test only lets us speak to our confidence that the distributions are different, not that they are the same, since it is designed to control the probability of a Type I error. Therefore, if \(D\) is less than the critical value, we do not reject the null hypothesis (this corresponds to a large p-value); if \(D\) is greater than the critical value, we reject the null hypothesis (this corresponds to a small p-value).
Given our p-value of 0.9326339000931222, we do not have sufficient evidence to reject the null hypothesis that the distributions are the same.
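Expressed as code, the decision rule at an assumed significance level of 0.05 (the threshold is our choice, not something returned by the test) looks like the following, reusing p1 from the stats.kstest call above:
# Decide whether to reject H0 at an assumed alpha of 0.05.
alpha = 0.05
if p1 < alpha:
    print("Reject H0: data unlikely to come from the fitted normal distribution.")
else:
    print("Fail to reject H0.")   # our p-value of ~0.933 falls here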
A table of Kolmogorov-Smirnov critical values can be downloaded here.
The Anderson-Darling Test
We briefly describe the Anderson-Darling test and how to run it in Scipy.
The Anderson-Darling test assesses the null hypothesis that a sample is drawn from a population that follows a particular distribution [1]. It makes use of the fact that, when given a hypothesized underlying distribution and assuming the data do arise from this distribution, the cumulative distribution function (CDF) of the data can be assumed to follow a uniform distribution. The statistic itself can be expressed as:

\(A^{2} = -n - \sum_{i=1}^{n} \frac{2i - 1}{n}\left[\ln F(x_{i}) + \ln\left(1 - F(x_{n+1-i})\right)\right]\)

where \(F\) is the CDF of the hypothesized distribution and the \(x_{i}\) are the ordered observations.
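To make the formula concrete, here is a rough from-scratch sketch for the normal case, reusing the dat array from the earlier examples. It standardizes the sorted data by the sample mean and ddof=1 standard deviation (an assumption on our part for the normal case), and the result should closely agree with the statistic returned by scipy.stats.anderson below:
"""
Sketch: Anderson-Darling statistic from scratch (normal case).
"""
import numpy as np
from scipy import stats

x = np.sort(dat)                                   # ordered observations
n = x.size
z = (x - x.mean()) / x.std(ddof=1)                 # standardize with sample estimates
F = stats.norm.cdf(z)                              # hypothesized CDF at each x_i
i = np.arange(1, n + 1)
A2 = -n - np.sum((2 * i - 1) / n * (np.log(F) + np.log(1 - F[::-1])))
# A2 should closely match the statistic returned by stats.anderson(dat, dist="norm").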
The function scipy.stats.anderson takes as arguments the empirical dataset
and a distribution to test against (one of "norm", "expon", "logistic",
"gumbel", "gumbel_l" or "gumbel_r"), and returns the
Anderson-Darling test statistic, the critical values for the specified
distribution and the significance levels for the corresponding critical
values. For example, to test whether dat follows a normal distribution, we
run the following:
"""
Demonstration of Anderson-Darling Test in scipy.stats.
"""
import numpy as np
from scipy import stats
dat_init = [
62.55976, -14.71019, -20.67025, -35.43758, -10.65457, 21.55292,
41.26359, 0.33537, -14.43599, -40.66612, 6.45701, -40.39694,
55.1221, 24.50901, 6.61822, -29.10305, 6.21494, 15.25862,
13.54446, 2.48212, -2.34573, -21.47846, -5.0777, 26.48881,
-8.68764, -5.49631, 42.58039, -6.59111, -23.08169, 19.09755,
-21.35046, 0.24064, -3.16365, -37.43091, 24.48556, 2.6263,
31.14471, 5.75287, -46.8529, -14.26814, 8.41045, 18.11071,
-30.46438, 12.22195, -31.83203, -8.09629, 52.06456, -24.30986,
-25.62359, 2.86882, 15.77073, 31.17838, -22.04998
]
dat = np.array(dat_init, dtype=float)
dat.sort()
# Perform Anderson-Darling test.
A, crit, sig = stats.anderson(dat, dist="norm")
# A: 0.22442637404651578
# crit: [0.54 0.615 0.738 0.861 1.024]
# sig: [15. 10. 5. 2.5 1. ]
According to the stats.anderson
documentation,
if the returned statistic is larger than the critical values, then for the
corresponding significance level the null hypothesis (that the data come from
the chosen distribution) can be rejected. Since our statistic
0.22442637404651578 is smaller than all critical values, we do not have
sufficient evidence to reject the null hypothesis that the data come from a
normal distribution.
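This comparison can also be made explicit in code, pairing each significance level with its critical value and checking whether the statistic exceeds it:
# Compare the Anderson-Darling statistic against each critical value.
for level, cv in zip(sig, crit):
    decision = "reject H0" if A > cv else "fail to reject H0"
    print(f"significance {level:>4.1f}%: critical value {cv:.3f} -> {decision}")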
A table of Anderson-Darling critical values can be found here.
Shapiro–Wilk Test
The Shapiro–Wilk test null hypothesis is that the sample is drawn from a
normally distributed population. The function scipy.stats.shapiro
takes the
empirical dataset as it’s sole argument, and similar to scipy.stats.kstest
returns a 2-tuple containing the test statistic and p-value.
"""
Demonstration of Shapiro-Wilk Test in scipy.stats.
"""
import numpy as np
from scipy import stats
dat_init = [
62.55976, -14.71019, -20.67025, -35.43758, -10.65457, 21.55292,
41.26359, 0.33537, -14.43599, -40.66612, 6.45701, -40.39694,
55.1221, 24.50901, 6.61822, -29.10305, 6.21494, 15.25862,
13.54446, 2.48212, -2.34573, -21.47846, -5.0777, 26.48881,
-8.68764, -5.49631, 42.58039, -6.59111, -23.08169, 19.09755,
-21.35046, 0.24064, -3.16365, -37.43091, 24.48556, 2.6263,
31.14471, 5.75287, -46.8529, -14.26814, 8.41045, 18.11071,
-30.46438, 12.22195, -31.83203, -8.09629, 52.06456, -24.30986,
-25.62359, 2.86882, 15.77073, 31.17838, -22.04998
]
dat = np.array(dat_init, dtype=float)
dat.sort()
W, p = stats.shapiro(dat)
# W: 0.9804515242576599
# p: 0.5324678421020508
If the p-value is less than the chosen alpha level, then the null hypothesis is rejected and there is evidence that the data tested are not normally distributed. If the p-value is greater than the chosen alpha level, then the null hypothesis that the data came from a normally distributed population cannot be rejected (e.g., for an alpha level of 0.05, a dataset with a p-value of less than 0.05 rejects the null hypothesis that the data are from a normally distributed population) [2].
The Shapiro-Wilk p-value computed on dat
is 0.5324678421020508; therefore, we
cannot reject the null hypothesis that the data came from a normally
distributed population.
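For contrast, applying the same test to a sample drawn from a clearly non-normal distribution will typically yield a very small p-value. The sketch below uses an exponential sample of the same size; the seed and parameters are arbitrary choices for illustration, not part of the original analysis:
# Shapiro-Wilk applied to a deliberately non-normal (exponential) sample.
rng = np.random.default_rng(516)
expo = rng.exponential(scale=10, size=53)
W2, p2 = stats.shapiro(expo)
# p2 will typically fall well below 0.05, leading us to reject normality.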
A table of Shapiro-Wilk critical values can be downloaded here.
Conclusion
In this post we introduced multiple tests that can be used to assess how well
empirical data fit a parametric distribution. Keep in mind that even though
none of the tests outright rejected the hypothesis that dat originated from
a normal population, this is not confirmation that our dataset is actually
normal. The difference is subtle, but important.
Until next time, happy coding!