Kernel Density Estimation is a non-parametric method used primarily to estimate the probability density function of a collection of discrete data points. If \((x_{1}, x_{2}, \cdots, x_{n})\) represent an independent and identically distributed sample drawn from an unknown distribution, then the kernel density estimator estimates the shape of the density function from the unknown distribution.

The general form of a kernel-smoothed density function can be represented as:

$$ \hat f(x) =\sum_{x_{i}} k_{x_{i}}(x)f_{n}(x_{i}), $$

where \(f_{n}(x_{i})\) is the probability of point \(x_{i}\) in the empirical distribution (usually \(\frac{1}{n}\)).

Here are a few key definitions that will be useful throughout the remainder of the post:

  • \(K_{x_{i}}(x)\) represents the cumulative distribution function of the distribution used for point \(x_{i}\) and evaluated at \(x\).

  • \(k_{x_{i}}(x)\) represents the probability density function of the distribution used for point \(x_{i}\) and evaluated at \(x\).

  • \(\hat{F}(x) = \sum_{i=1}^{n}\Big(\frac{1}{n}\Big)K_{x_{i}}(x)\) represents the kernel-smoothed distribution function.

  • \(\hat{f}(x) = \sum_{i=1}^{n}\Big(\frac{1}{n}\Big)k_{x_{i}}(x)\) represents the kernel-smoothed density function.

Kernel smoothing consists of selecting a distribution to use for each point \(x_{i}\) in \((x_{1}, x_{2}, \cdots, x_{n})\). Then the kernel-smoothed density estimator is an equally-weighted mixture of the \(n\) densities.

The kernel (or basis function) is the distribution for which \(k_{x_{i}}(x)\) is the density function and \(K_{x_{i}}(x)\) is the distribution function. The kernel can be one of a wide range of distributions, but in this post, discussion will be limited to the Gaussian kernel. The Gaussian kernel has support over the entire real line, and as the name suggests is related to the normal distribution.

A required input for kernel density estimation is the bandwidth, which controls the level of smoothing for a kernel denisty estimate. Generally, as bandwidth increases, more nuanced detail tends to be smoothed over, obscuring much of the underlying structure. Later we’ll see how changing bandwidth affects the overall appearance of a kernel density estimate.

For Gaussian kernel functions, the optimal choice for bandwidth is:

$$ h = \left(\frac{4\hat{\sigma}^5}{3n}\right)^{\frac{1}{5}} \approx 1.06 \hat{\sigma} n^{-1/5} $$

where \({\displaystyle}{\hat{\sigma }}\) is the standard deviation of the samples. This approximation is termed the normal distribution approximation, Gaussian approximation, or Silverman’s rule of thumb.

Note that the most common method of assessing the optimality of a particular bandwidth selection is the mean integrated squared error. However, this post is focused more on implementing kernel density estimators rather than determining an optimal bandwidth selection. Therefore, the only bandwidth evaluation we will perform will be qualitative and visual in nature.

When using Gaussian kernel function, the bandwidth is interpreted as the standard deviation of a normal distribution with mean equal to the data point of interest. With this parameterization, the density estimate is continuous and has infinite support.

The Gaussian kernel density estimator is given by:

$$ k_{x_{i}}(x) = \begin{cases} \frac {1}{\sqrt {2 \pi b^{2}}}e^{\frac {(x - x_{i})^{2}}{2b^{2}}} & -\infty \lt x \lt +\infty \end{cases} $$

This can be easily implemented in Python as follows:

import numpy as np
from scipy import stats

def gaussian_pdf(x_i, bandwidth):
    Return Gaussian kernel density estimator.
    dist = stats.norm(loc=x_i, scale=bandwidth)
    def evaluate(x):
        """Evaluate x."""

def kde_pdf(data, kernel_func, bandwidth):
    Generate kernel density estimator over data.
    kernels = dict()
    n = len(data)
    for d in data:
        kernels[d] = kernel_func(d, bandwidth)
    def evaluate(x):
        """Evaluate `x` using kernels."""
        pdfs = list()
        for d in data: pdfs.append(kernels[d](x))
        return(sum(pdfs) / n)

This is a normal density with mean x_i and standard deviation bandwidth. Let’s visualize how the density estimate changes as a function of bandwidth:

import matplotlib
import matplotlib.pyplot as plt

yvals  = [5, 12, 15, 20]
xvals = np.arange(min(yvals), max(yvals), .01)

fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 6), tight_layout=True) 
plt_indices = [(0, 0), (0, 1), (1, 0), (1, 1)]

for (ii, jj), bw in zip(plt_indices, [1, 2, 3, 4]):
    ax[ii, jj].set_title(
        f"Gaussian KDE: Bandwidth={bw}", color="#000000", loc="center",
    dist = kde_pdf(data=yvals, kernel_func=gaussian_pdf, bandwidth=bw)
    yy1 = [dist(ii) for ii in xvals]
    ys1 = [dist(ii) for ii in yvals]
    ax[ii, jj].scatter(yvals, ys1)
    ax[ii, jj].plot(xvals, yy1)
    ax[ii, jj].tick_params(axis="x", which="major", labelsize=7)
    ax[ii, jj].tick_params(axis="y", which="major", labelsize=7)
    ax[ii, jj].xaxis.set_ticks_position("none")
    ax[ii, jj].yaxis.set_ticks_position("none")
    ax[ii, jj].grid(True)
    ax[ii, jj].set_axisbelow(True)

Running the code above produces the following:


It is worth metioning that scipy exposes a gaussian_kde object for kernel density estimates that use a Gaussian kernel. More information is available here.