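With discrete-valued data, percentiles may not be well-defined: the empirical distribution is a step function, so a target level \(p\) often falls between two observations. To perform percentile matching in that setting, we can use the smoothed empirical percentile, which interpolates linearly between adjacent order statistics. Let: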

  • \(p\): Target quantile

  • \(n\): Size of sample

  • \(x_{(k)}\): \(k^{th}\) order statistic of sample (the \(k^{th}\) smallest value)

  • \(\hat{\pi}_{p}\): Smoothed 100\(p^{th}\) percentile

  • \(a\): Integer floor of \((n+1)p\), \(\lfloor(n+1)p\rfloor\)

The smoothed empirical percentile is then given by:

$$ \hat{\pi}_{p} = \big((n+1)p -a \big)x_{(a+1)} + \big(a+1-(n+1)p \big)x_{(a)}. $$
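It may help to see the expression as a plain linear interpolation. Writing \(g = (n+1)p - a\) for the fractional part of \((n+1)p\) (a symbol introduced here only for convenience; it is not among the definitions above), the two weights become \(g\) and \(1-g\):

$$ \hat{\pi}_{p} = g\,x_{(a+1)} + (1-g)\,x_{(a)} = x_{(a)} + g\big(x_{(a+1)} - x_{(a)}\big), \qquad 0 \le g < 1. $$

When \((n+1)p\) is a whole number, \(g = 0\) and the estimate reduces to \(x_{(a)}\). The order statistics involved exist only for \(\frac{1}{n+1} \le p \le \frac{n}{n+1}\); outside that range the smoothed percentile is not defined by this formula.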

To illustrate, consider the following dataset:

$$ 25, 55, 60, 75, 85, 110, 135, 160, 165, 185 $$

To calculate the 65th percentile (\(p = 0.65\), \(n = 10\)), first sort the values in ascending order (here they already are). Then:

  • Compute \((n+1)p = 11 \times 0.65 = 7.15\)

  • \(a = \lfloor(n+1)p\rfloor = \lfloor7.15\rfloor = 7\)

  • \(x_{(a+1)} = x_{(8)} = 160\) and \(x_{(a)} = x_{(7)} = 135\)

Then substitute these values into the expression for the smoothed empirical percentile:

$$ \begin{aligned} \hat{\pi}_{0.65} & = \big((n+1)p -a \big)x_{(a+1)} + \big(a+1-(n+1)p \big)x_{(a)} \\ & = \big(7.15 - 7 \big) \times 160 + \big(8-7.15 \big) \times 135 \\ & = 24 + 114.75 \\ & = 138.75 \end{aligned} $$

Implementation

What follows is an implementation of the smoothed empirical percentile in Python:

import math

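def smoother(data, p):
    """
    Return the smoothed empirical 100p-th percentile of `data`,
    where `p` is a fraction such that 1 <= (n+1)*p <= n.
    """

    d = sorted(data)
    n = len(d)
    h = (n+1)*p            # fractional position among the order statistics
    a = math.floor(h)      # integer part of that position

    # Outside this range x_(a) or x_(a+1) does not exist.
    if h < 1 or h > n:
        raise ValueError("p must satisfy 1 <= (n+1)*p <= n")

    if h == a:
        # Position is a whole number: return that order statistic.
        sep = d[a-1]

    else:
        # Interpolate linearly between x_(a) and x_(a+1).
        sep = ((h-a)*d[a])+((a+1-h)*d[a-1])

    return sep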


testvals = [25, 55, 60, 75, 85, 110, 135, 160, 165, 185]
smoother(testvals, .65)
# 138.75
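As an independent check, Python's standard library appears to use the same \((n+1)p\) positioning in `statistics.quantiles` when called with `method="exclusive"`. The sketch below assumes Python 3.8 or later and relies on the fact that 20 equal-probability groups put the cut points at the 5th, 10th, ..., 95th percentiles; the names `cuts` and `p65` are illustrative only.

```python
import statistics

testvals = [25, 55, 60, 75, 85, 110, 135, 160, 165, 185]

# n=20 yields 19 cut points at the 5th, 10th, ..., 95th percentiles.
cuts = statistics.quantiles(testvals, n=20, method="exclusive")

# The 13th cut point (index 12) corresponds to p = 13/20 = 0.65.
p65 = cuts[12]
print(p65)
```

This should agree with the hand calculation of 138.75 above. It only covers levels that are multiples of \(1/n\) for the chosen number of groups, so a custom function remains useful for arbitrary \(p\).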