The Pleasure of Finding Things Out: A blog by James Triveri - Online Mean and Variance Update without Full Recalculation

Imagine you are responsible for maintaining the mean and variance of a dataset that is frequently updated. For small-to-moderately sized datasets, much thought might not be given to the method used for recalculation. However, with datasets consisting of hundreds of billions or trillions of observations, full recomputation of the mean and variance at each refresh may require significant computational resources that may not be available.

Fortunately it isn’t necessary to perform a full recalculation of mean and variance when accounting for new observations. Recall that for a sequence of \(n\) observations \(x_{1}, \dots x_{n}\) the sample mean \(\mu_{n}\) and variance \(\sigma_{n}^{2}\) are given by:

\[ \begin{align*} \mu_{n} &= \sum_{i=1}^{n} x_{i} \\ \sigma_{n}^{2} &= \frac{1}{n-1}\sum_{i=1}^{n} (x_{i} - \mu_{n})^{2} \end{align*} \]

A new observation \(x_{n+1}\) becomes available. To calculate the updated mean \(\mu_{n+1}\) and variance \(\sigma_{n+1}^{2}\) in light of this new observation without requiring full recalculation, we can use the following:

\[ \begin{align*} \mu_{n+1} &= \frac{1}{n+1}(n\mu_{n} + x_{n+1}) \\ \sigma_{n+1}^{2} &= \frac{n}{n+1}\sigma_{n}^{2} + \frac{1}{n}(x_{n+1} - \mu_{n+1})^{2} \end{align*} \]

Demonstration

Consider the following values:

\[ 1154\hspace{2mm}717\hspace{2mm}958\hspace{2mm}1476\hspace{2mm}889\hspace{2mm}1414\hspace{2mm}1364\hspace{2mm}1047 \]

The mean and variance for the observations:

\[ \begin{align*} \mu_{8} &\approx 1127.38 \\ \sigma_{8}^{2} &\approx 65096.48 \end{align*} \]

A new value, \(1251\) becomes available. Full recalculation of mean and variance yields:

\[ \begin{align*} \mu_{8} &\approx \mathbf{1141.11} \\ \sigma_{8}^{2} &\approx \mathbf{59372.99} \end{align*} \]

The mean and variance calculated using online update results in:

\[ \begin{align*} \mu_{9} &= \frac{1}{9}(8(1127.38) + 1251) \approx \mathbf{1141.11} \\ \sigma_{9}^{2} &= \frac{8}{9} (65096.48) + \frac{1}{8}(1251 - 1141.11)^{2} \approx \mathbf{59372.99}, \end{align*} \]

confirming agreement between the two approaches.

Note that the variance returned using the online update formula is the population variance. In order to return the updated unbiased sample variance, we need to multiply the variance returned by the online update formula by \((n+1)/n\), where \(n\) represents the length of the original array excluding the new observation. Thus, the updated sample variance after accounting for the new value is:

\[ \begin{align*} s_{n+1}^{2} &= \frac{n+1}{n}\big(\frac{n}{n+1}\sigma_{n}^{2} + \frac{1}{n}(x_{n+1} - \mu_{n+1})^{2}\big) \\ s_{9}^{2} &= \frac{9}{8}(59372.99) \approx 66794.61 \end{align*} \]

Implementation

A straightforward implementation in Python to handle online mean and variance updates, incorporating Bessel’s correction to return the unbiased sample variance is provided below:

import numpy as np



def online_mean(mean_init, n, new_obs):
    """
    Return updated mean in light of new observation without
    full recalculation.
    """
    return((n * mean_init + new_obs) / (n + 1))



def online_variance(var_init, mean_new, n, new_obs):
    """
    Return updated variance in light of new observation without
    full recalculation. Includes Bessel's correction to return
    unbiased sample variance. 
    """
    return ((n + 1) / n) * (((n * var_init) / (n + 1)) + (((new_obs - mean_new)**2) / n))




a0 = np.array([1154,  717,  958, 1476,  889, 1414, 1364, 1047])

a1 = np.array([1154,  717,  958, 1476,  889, 1414, 1364, 1047, 1251])


# Original mean and variance.
mean0 = a0.mean()    # 1127.38
variance0 = a0.var() # 65096.48

# Full recalculation mean and variance with new observation.
mean1 = a1.mean()    # 1141.11
variance1 = a1.var(ddof=1) # 59372.99

# Online update of mean and variance with bias correction.
mean2 = online_mean(mean0, a0.size, 1251)                     # 1141.11
variance2 = online_variance(variance0, mean2, a0.size, 1251)  # 66794.61

print(f"Full recalculation mean    : {mean1:,.5f}")
print(f"Full recalculation variance: {variance1:,.5f}")
print(f"Online calculation mean    : {mean2:,.5f}")
print(f"Online calculation variance: {variance2:,.5f}")

Full recalculation mean    : 1,141.11111
Full recalculation variance: 66,794.61111
Online calculation mean    : 1,141.11111
Online calculation variance: 66,794.61111