Imagine you are responsible for maintaining the mean and variance of a dataset that is frequently updated (new observations can be added every minute or less). For small-to-moderately sized datasets, you might not give much thought to the method used for the recalculation. However, with datasets consisting of hundreds of millions or billions of observations, full recomputation of the mean and variance at each refresh may require significant computational resources that may not be available.

Fortunately it isn’t necessary to perform a full recalculation of mean and variance when accounting for new observations. Recall that for a sequence of $$n$$ observations $$x_{1}, \cdots x_{n}$$,the sample mean $$\mu_{n}$$ and variance $$\sigma_{n}^{2}$$ are given by:

\begin{align*} \mu_{n} &= \sum_{i=1}^{n} x_{i} \\ \sigma_{n}^{2} &= \frac{1}{n-1}\sum_{i=1}^{n} (x_{i} - \mu_{n})^{2} \end{align*}

A new observation $$x_{n+1}$$ becomes available. To calculate the updated mean ($$\mu_{n+1}$$) and variance ($$\sigma_{n+1}^{2}$$) in light of this new observation without requiring full recalculation, we can use the following:

\begin{align*} \mu_{n+1} &= \frac{1}{n+1}(n\mu_{n} + x_{n+1}) \\ \sigma_{n+1}^{2} &= \frac{n}{n+1}\sigma_{n}^{2} + \frac{1}{n}(x_{n+1} - \mu_{n+1})^{2} \end{align*}

Demonstration

Consider the following values:

$$1154\hspace{2mm}717\hspace{2mm}958\hspace{2mm}1476\hspace{2mm}889\hspace{2mm}1414\hspace{2mm}1364\hspace{2mm}1047$$

The mean and variance for the observations:

\begin{align*} \mu_{8} &\approx 1127.38 \\ \sigma_{8}^{2} &\approx 65096.48 \end{align*}

A new value, $$1251$$ becomes available. Full recalculation of mean and variance yields:

\begin{align*} \mu_{8} &\approx \mathbf{1141.11} \\ \sigma_{8}^{2} &\approx \mathbf{59372.99} \end{align*}

The mean and variance calculated using online update results in:

\begin{align*} \mu_{9} &= \frac{1}{9}(8(1127.38) + 1251) \approx \mathbf{1141.11} \\ \sigma_{9}^{2} &= \frac{8}{9} (65096.48) + \frac{1}{8}(1251 - 1141.11)^{2} \approx \mathbf{59372.99}, \end{align*}

which confirms agreement between the two approaches.

Note that the variance returned using the online update formula is the population variance. In order to return the updated unbiased sample variance, we need to multiply the variance returned by the online update formula by $$(n+1)/n$$, where $$n$$ represents the length of the original array excluding the new observation. Thus, the updated sample variance after accounting for the new value is:

\begin{align*} s_{n+1}^{2} &= \frac{n+1}{n}\big(\frac{n}{n+1}\sigma_{n}^{2} + \frac{1}{n}(x_{n+1} - \mu_{n+1})^{2}\big) \\ s_{9}^{2} &= \frac{9}{8}(59372.99) \approx 66794.61 \end{align*}

Implementation

A straightforward implementation in Python to handle online mean and variance updates, incorporating Bessel’s correction to return the unbiased sample variance:

import numpy as np

def new_mu(mu_init, n, new_obs):
"""
Return updated mean in light of new observation without
full recalculation.
"""
return((n * mu_init + new_obs) / (n + 1))

def new_var(mu_init, var_init, mu_new, n, new_obs):
"""
Return updated variance in light of new observation without
full recalculation. Includes Bessel's correction to return
unbiased sample variance.
"""
return(((n + 1) / n) * (((n * var_init) / (n + 1)) + (((new_obs - mu_new)**2) /n )))

a0 = np.array([1154,  717,  958, 1476,  889, 1414, 1364, 1047])

a1 = np.array([1154,  717,  958, 1476,  889, 1414, 1364, 1047, 1251])

# Original mean and variance.
mu0 = a0.mean() # 1127.38
var0 = a0.var() # 65096.48

# Full recalculation mean and variance.
mu1 = a1.mean() # 1141.11
var1 = a1.var() # 59372.99

# Online update of mean and variance with variance bias correction.
mu1_ol  = new_mu(mu0, a0.size, 1251)                 # 1141.11
var1_ol = new_var(mu0, var0, mu1_ol, a0.size, 1251)  # 66794.61