When working with discrete-valued datasets, a given percentile may not be well-defined. To perform percentile matching, we therefore use the smoothed empirical percentile [1].

Let:

  • \(p\) = the percentile of interest, expressed as a proportion (e.g. \(p = 0.65\) for the \(65^{th}\) percentile)
  • \(n\) = the size of the sample
  • \(x_{(k)}\) = the \(k^{th}\) order statistic of the sample
  • \(\hat{\pi}_{p}\) = the smoothed 100\(p^{th}\) percentile

Then the smoothed empirical percentile is given by:

$$ \hat{\pi}_{p} = \big((n+1)p -a \big)x_{(a+1)} + \big(a+1-(n+1)p \big)x_{(a)}, $$

where \(a = \lfloor(n+1)p\rfloor\), the greatest integer not exceeding \((n+1)p\).
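
It can help to read the formula as a linear interpolation. Writing \(w = (n+1)p - a\) for the fractional part of \((n+1)p\) (the symbol \(w\) is introduced here only for illustration), the two weights are \(w\) and \(1-w\):

$$ \hat{\pi}_{p} = w\,x_{(a+1)} + (1-w)\,x_{(a)} = x_{(a)} + w\big(x_{(a+1)} - x_{(a)}\big), \qquad 0 \le w < 1. $$

Both order statistics exist only when \(1 \le a \le n-1\). In the boundary case \((n+1)p = n\), \(w = 0\) and the expression reduces to \(x_{(n)}\).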

To illustrate, consider the following dataset:

$$ 25, 55, 60, 75, 85, 110, 135, 160, 165, 185 $$

To calculate the \(65^{th}\) percentile:

  • Sort the data in ascending order, if it is not already sorted
  • Calculate \((n+1)p = 11 \times 0.65 = 7.15\)
  • From \((n+1)p\) above, find \(a = \lfloor(n+1)p\rfloor = \lfloor7.15\rfloor = 7\)
  • Given \(a=7\), \(x_{(a+1)} = x_{(8)} = 160\) and \(x_{(a)} = x_{(7)} = 135\)

Then substitute these values into the expression for the smoothed empirical percentile:

$$ \begin{aligned} \hat{\pi}_{p} & = \big((n+1)p -a \big)x_{(a+1)} + \big(a+1-(n+1)p \big)x_{(a)} \\ & = \big(7.15 - 7 \big) \times 160 + \big(8-7.15 \big) \times 135 \\ & = 24 + 114.75 \\ & = 138.75 \end{aligned} $$
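
The arithmetic above can be traced step by step in Python. This is only a sketch of the worked example: it uses `fractions.Fraction` so that 0.65 is held exactly, and the variable names (`rank`, `weight`, `pi_hat`) are illustrative choices, not part of the source.

```python
from fractions import Fraction
import math

data = sorted([25, 55, 60, 75, 85, 110, 135, 160, 165, 185])
p = Fraction(65, 100)          # 0.65 held exactly

n = len(data)
rank = (n + 1) * p             # Fraction(143, 20), i.e. 7.15
a = math.floor(rank)           # 7
weight = rank - a              # Fraction(3, 20), i.e. 0.15

# x_(a) is data[a - 1] and x_(a+1) is data[a] because lists are 0-indexed.
pi_hat = weight * data[a] + (1 - weight) * data[a - 1]

print(float(pi_hat))           # 138.75
```

Exact rational arithmetic sidesteps any floating-point rounding in \((n+1)p\), which makes the intermediate values match the hand calculation digit for digit.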

Implementation

The following Python function implements the smoothed empirical percentile:

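import math

def smoother(data, p):
    """Return the smoothed empirical 100p-th percentile of `data`.

    `p` is a proportion and must satisfy 1 <= (n+1)*p <= n,
    where n = len(data); otherwise x_(a) or x_(a+1) does not exist.
    """
    d = sorted(data)
    n = len(d)
    pos = (n + 1) * p        # fractional rank (n+1)p
    if not 1 <= pos <= n:
        raise ValueError("p must lie between 1/(n+1) and n/(n+1)")

    a = math.floor(pos)      # integer part of the rank

    if pos == a:
        # Whole-number rank: the percentile is the order statistic x_(a).
        sep = d[a - 1]
    else:
        # Interpolate between x_(a) and x_(a+1); `d` is 0-indexed,
        # so x_(a) is d[a-1] and x_(a+1) is d[a].
        sep = (pos - a) * d[a] + (a + 1 - pos) * d[a - 1]

    return sep


# test `smoother` =>
testvals = [25, 55, 60, 75, 85, 110, 135, 160, 165, 185]

smoother(testvals, .65)
# returns 138.75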
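
As an independent check, the standard library's `statistics.quantiles` with `method="exclusive"` also interpolates at rank \((n+1)p\), so it should agree with the result above for this dataset. The sketch below assumes Python 3.8 or later; the variable names `cuts` and `p65` are illustrative.

```python
import statistics

testvals = [25, 55, 60, 75, 85, 110, 135, 160, 165, 185]

# n=20 requests the 19 cut points at 5%, 10%, ..., 95%.
cuts = statistics.quantiles(testvals, n=20, method="exclusive")

# The 65th percentile is the 13th cut point (13 * 5% = 65%).
p65 = cuts[12]
print(p65)
```

Because `statistics.quantiles` only returns evenly spaced cut points, this check works for percentiles that are a multiple of \(1/n\) for some convenient `n`; it is a cross-check rather than a replacement for `smoother`.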





Footnotes:

  1. Klugman, S.A., Panjer, H.H. and Willmot, G.E. Loss Models: From Data to Decisions, Third Edition (2008)