Histograms are used to estimate the probability distribution of a continuous random variable. They are frequently used as an exploratory data analysis starting point, and provide insight into the shape and variability of the data in question. One of the challenges in constructing histograms is selecting the optimal number of bins (or, analagously, the width of each bin). To help determine a reasonable bin width, we can leverage the Freedman-Diaconis rule, which was designed to minimize the difference between the area under the empirical probability distribution and the area under the theoretical probability distribution[1]. Formally, the rule takes as input the interquartile range \(IQR(x)\) and the number of observations \(n\) in the empirical dataset, and returns a bin width estimate. The rule can be expressed as:
The interquartile range is defined as the difference between the largest and
smallest values in the middle 50% of an empirical dataset. Within the context
of Scipy, \(IQR\) can be calculated using stats.iqr
, but it can readily be
calculated by hand.
For the remainder of the post, examples will be with respect to the following dataset:
data = [
62.55976, -14.71019, -20.67025, -35.43758, -10.65457, 21.55292,
41.26359, 0.33537, -14.43599, -40.66612, 6.45701, -40.39694,
55.1221, 24.50901, 6.61822, -29.10305, 6.21494, 15.25862,
13.54446, 2.48212, -2.34573, -21.47846, -5.0777, 26.48881,
-8.68764, -5.49631, 42.58039, -6.59111, -23.08169, 19.09755,
-21.35046, 0.24064, -3.16365, -37.43091, 24.48556, 2.6263,
31.14471, 5.75287, -46.8529, -14.26814, 8.41045, 18.11071,
-30.46438, 12.22195, -31.83203, -8.09629, 52.06456, -24.30986,
-25.62359, 2.86882, 15.77073, 31.17838, -22.04998
]
Using Scipy’s stats.iqr
, the IQR is computed as:
>>> from scipy import stats
>>> stats.iqr(data, rng=(25, 75), scale="raw", nan_policy="omit")
37.12119
Alternatively, we can leverage Numpy’s quantile
function directly, passing
in [.25, .75]
as the target percentiles, then take the difference of the
values to obtain the IQR:
>>> import numpy as np
>>> iqr_manual = np.quantile(data, q=[.25, .75])
>>> np.diff(iqr_manual)[0]
37.12119
Next we define a function that encapsulates the Freedman–Diaconis rule:
import numpy as np
from scipy import stats
def freedman_diaconis(data, returnas="width"):
"""
Use Freedman Diaconis rule to compute optimal histogram bin width.
``returnas`` can be one of "width" or "bins", indicating whether
the bin width or number of bins should be returned respectively.
Parameters
----------
data: np.ndarray
One-dimensional array.
returnas: {"width", "bins"}
If "width", return the estimated width for each histogram bin.
If "bins", return the number of bins suggested by rule.
"""
data = np.asarray(data, dtype=np.float_)
IQR = stats.iqr(data, rng=(25, 75), scale="raw", nan_policy="omit")
N = data.size
bw = (2 * IQR) / np.power(N, 1/3)
if returnas=="width":
result = bw
else:
datmin, datmax = data.min(), data.max()
datrng = datmax - datmin
result = int((datrng / bw) + 1))
return(result)
To demonstrate, we call the freedman_diaconis
function with each returnas
option:
>>> freedman_diaconis(data=data, returnas="width")
19.76483815603517
>>> freedman_diaconis(data=data, returnas="bins")
6
We can use the result to construct a histogram with the suggested number of
bins. Fisrt, using matplotlib’s hist
function:
import matplotlib as mpl
import matplotlib.pyplot as plt
# Use freedman_diaconis function with returnas="bins" to determine histogram bin width.
NBR_BINS = freedman_diaconis(data, returnas="bins")
fig, ax = plt.subplots(nrows=1, ncols=1)
fig.set_size_inches(11, 8)
ax.hist(data, NBR_BINS, density=True, color="#FFFFFF", edgecolor="black", linewidth=1.1)
ax.set_title("Bin Width Determination via Freedman Diaconis Rule", fontsize=14, loc="left", color="red")
ax.set_ylabel("Density", fontsize=12, color="#000000")
ax.grid(False)
ax.set_facecolor(None)
plt.tight_layout()
plt.show()
Running this code block generates the following visualization:
Alternatively, the seaborn library can be used to generate similar, but
perhaps more appealing visualizations. We demonstrate seaborn’s distplot
,
which can alternatively include and overlaid kernel density estimate:
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
# Use freedman_diaconis function with returnas="bins" to determine histogram bin width.
NBR_BINS = freedman_diaconis(data, returnas="bins")
fig, ax = plt.subplots(nrows=1, ncols=1)
fig.set_size_inches(11, 8)
hist_plot = sns.distplot(data, bins=NBR_BINS, kde=False, hist_kws=dict(edgecolor="#000000", normed=True, linewidth=1.0))
ax.set_title("Bin Width Determination via Freedman Diaconis Rule", fontsize=14, loc="left", color="red")
plt.tight_layout()
plt.show()
# Optionally save plot to file.
hist_plot.savefig("histogram.png")
First, without the density estimate overlay:
And with the density overlay (change
kde
parameter to True):
Remember that the Freedman Diaconis rule isn’t axiomatic: If the suggested bin width/number of bins seems too few or too great, use judgment to scale up or down as needed. More than anything, the rule serves as a starting point for your visualization, from which additional presentation layer modifications can be applied. Until next time, happy coding!
Footnotes:
- https://en.wikipedia.org/wiki/Freedman%E2%80%93Diaconis_rule