Online Mean and Variance Update without Full Recalculation
Updating sample mean and variance to account for new observations without full recalculation
Statistical Modeling
Python
Published
February 18, 2024
Imagine you are responsible for maintaining the mean and variance of a dataset that is frequently updated. For small-to-moderately sized datasets, much thought might not be given to the method used for recalculation. However, with datasets consisting of hundreds of billions or trillions of observations, full recomputation of the mean and variance at each refresh may require significant computational resources that may not be available.
Fortunately it isn’t necessary to perform a full recalculation of mean and variance when accounting for new observations. Recall that for a sequence of \(n\) observations \(x_{1}, \dots x_{n}\) the sample mean \(\mu_{n}\) and variance \(\sigma_{n}^{2}\) are given by:
A new observation \(x_{n+1}\) becomes available. To calculate the updated mean \(\mu_{n+1}\) and variance \(\sigma_{n+1}^{2}\) in light of this new observation without requiring full recalculation, we can use the following:
Note that the variance returned using the online update formula is the population variance. In order to return the updated unbiased sample variance, we need to multiply the variance returned by the online update formula by \((n+1)/n\), where \(n\) represents the length of the original array excluding the new observation. Thus, the updated sample variance after accounting for the new value is:
A straightforward implementation in Python to handle online mean and variance updates, incorporating Bessel’s correction to return the unbiased sample variance is provided below:
import numpy as npdef online_mean(mean_init, n, new_obs):""" Return updated mean in light of new observation without full recalculation. """return((n * mean_init + new_obs) / (n +1))def online_variance(var_init, mean_new, n, new_obs):""" Return updated variance in light of new observation without full recalculation. Includes Bessel's correction to return unbiased sample variance. """return ((n +1) / n) * (((n * var_init) / (n +1)) + (((new_obs - mean_new)**2) / n))a0 = np.array([1154, 717, 958, 1476, 889, 1414, 1364, 1047])a1 = np.array([1154, 717, 958, 1476, 889, 1414, 1364, 1047, 1251])# Original mean and variance.mean0 = a0.mean() # 1127.38variance0 = a0.var() # 65096.48# Full recalculation mean and variance with new observation.mean1 = a1.mean() # 1141.11variance1 = a1.var(ddof=1) # 59372.99# Online update of mean and variance with bias correction.mean2 = online_mean(mean0, a0.size, 1251) # 1141.11variance2 = online_variance(variance0, mean2, a0.size, 1251) # 66794.61print(f"Full recalculation mean : {mean1:,.5f}")print(f"Full recalculation variance: {variance1:,.5f}")print(f"Online calculation mean : {mean2:,.5f}")print(f"Online calculation variance: {variance2:,.5f}")
Full recalculation mean : 1,141.11111
Full recalculation variance: 66,794.61111
Online calculation mean : 1,141.11111
Online calculation variance: 66,794.61111