When working with discrete-valued datasets, percentiles may not be well-defined. In order to perform percentile matching, we can use the smoothed empirical percentile. Let:
- \(p\): Target quantile
- \(n\): Size of sample
- \(x_{(k)}\): \(k^{th}\) order statistic of sample
- \(\hat{\pi}_{p}\): Smoothed 100\(p^{th}\) percentile
- \(a\): Integer floor of \((n+1)p\), \(\lfloor(n+1)p\rfloor\)
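The smoothed empirical percentile is then given by:
\[\hat{\pi}_{p} = [(n+1)p - a]\,x_{(a+1)} + [a + 1 - (n+1)p]\,x_{(a)}\]
The two weights sum to 1, so \(\hat{\pi}_{p}\) is a linear interpolation between the adjacent order statistics \(x_{(a)}\) and \(x_{(a+1)}\). When \((n+1)p\) is an integer, it reduces to \(\hat{\pi}_{p} = x_{(a)}\).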
To illustrate, consider the following dataset of \(n = 10\) values:
25, 55, 60, 75, 85, 110, 135, 160, 165, 185
To calculate the 65th percentile, first sort values in ascending order. Then:
- Compute \((n+1)p = 11 \times 0.65 = 7.15\)
- \(a = \lfloor(n+1)p\rfloor = \lfloor 7.15\rfloor = 7\)
- \(x_{(a+1)} = x_{(8)} = 160\) and \(x_{(a)} = x_{(7)} = 135\)
Then substitute these values into the expression for the smoothed empirical percentile:
\[\hat{\pi}_{0.65} = (7.15 - 7)(160) + (7 + 1 - 7.15)(135) = 0.15(160) + 0.85(135) = 138.75\]
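The same arithmetic can be checked step by step in a few lines of Python. This is only a sketch of the worked example, not a general routine: the variable names (`pos`, `w_hi`, `w_lo`, `estimate`) are illustrative, and the result is rounded to sidestep floating-point noise.

```python
import math

values = sorted([25, 55, 60, 75, 85, 110, 135, 160, 165, 185])
n = len(values)
p = 0.65

pos = (n + 1) * p        # 11 * 0.65, approximately 7.15
a = math.floor(pos)      # 7

# Order statistics are 1-based; Python lists are 0-based.
x_a = values[a - 1]      # x_(7) = 135
x_a1 = values[a]         # x_(8) = 160

w_hi = pos - a           # weight on x_(a+1), approximately 0.15
w_lo = 1 - w_hi          # weight on x_(a), approximately 0.85

estimate = w_hi * x_a1 + w_lo * x_a
print(round(estimate, 2))  # 138.75
```

Writing the weights as `w_hi` and `w_lo` makes it explicit that the estimate sits 15% of the way from \(x_{(7)}\) to \(x_{(8)}\).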
Implementation
What follows is an implementation of the smoothed empirical percentile in Python:
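```python
import math

def smoother(data, p):
    """
    Determine the smoothed empirical percentile `p` (0 < p < 1)
    for dataset `data`.
    """
    d = sorted(data)
    n = len(d)
    pos = (n + 1) * p      # position among the order statistics
    a = math.floor(pos)    # integer part of the position
    # x_(a) and x_(a+1) must both exist unless pos lands exactly on x_(n).
    if not 1 <= a <= n or (a == n and pos != a):
        raise ValueError("p must lie in [1/(n+1), n/(n+1)]")
    if pos == a:
        # Position falls exactly on an order statistic.
        return d[a - 1]
    # Interpolate between x_(a) = d[a-1] and x_(a+1) = d[a] (0-based list).
    return (pos - a) * d[a] + (a + 1 - pos) * d[a - 1]

testvals = [25, 55, 60, 75, 85, 110, 135, 160, 165, 185]
smoother(testvals, .65)
# 138.75
```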
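As a cross-check, Python's standard library provides `statistics.quantiles`, whose `method="exclusive"` option also interpolates at position \((n+1)p\) between adjacent order statistics. The sketch below assumes Python 3.8 or later and reads the 65th percentile as the 13th of the 19 cut points produced when the data is split into 20 equal-probability groups; `cuts` and `p65` are illustrative names.

```python
import statistics

testvals = [25, 55, 60, 75, 85, 110, 135, 160, 165, 185]

# n=20 yields 19 cut points at p = 0.05, 0.10, ..., 0.95.
cuts = statistics.quantiles(testvals, n=20, method="exclusive")

# p = 0.65 is the 13th cut point, i.e. index 12 in the 0-based list.
p65 = cuts[12]
print(p65)  # 138.75
```

This only works directly for values of \(p\) that fall on an evenly spaced grid of cut points, so it is a sanity check rather than a replacement for a function that accepts an arbitrary \(p\).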