Posted on June 23, 2025

Why do we divide by n-1 when estimating the variance?

Two answers to the question every first-year stats student asks

In every introductory statistics course, we learn the following estimators of the mean and variance of some distribution \(P\) from which we have \(n\) samples \(x_1, \dots, x_n\).

The first moment \(\mu_1\), i.e. the mean \(\mathbb{E}[X]\), is straightforward to estimate from samples as

\[\widehat{\mu_1} = \frac{1}{n} \sum_{i=1}^n x_i\]

For the second moment, i.e. \(\mathbb{E}_{X \sim P}[X^2]\) (this doesn’t have a standardized notation like the mean, so I’ll call it \(\mu_2\) for ‘moment 2’), we also have the straightforward estimator

\[\widehat{\mu_2} = \frac{1}{n} \sum_{i=1}^n x_i^2\]
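
As a quick numerical sanity check of these two estimators, here is a minimal numpy sketch, using an arbitrary Gaussian example with mean 2 and standard deviation 3:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)  # example distribution: N(2, 3^2)

mu1_hat = np.mean(x)       # first-moment estimator: (1/n) * sum of x_i
mu2_hat = np.mean(x ** 2)  # second-moment estimator: (1/n) * sum of x_i^2

print(mu1_hat)  # close to E[X] = 2
print(mu2_hat)  # close to E[X^2] = sigma^2 + mu^2 = 9 + 4 = 13
```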

But the variance \(\nu_2 = \mathbb{E}[(X - \mathbb{E}[X])^2]\) sometimes catches people by surprise. Naive pattern-matching would suggest its estimator should be

\[\widehat{\nu_2} = \frac{1}{n} \sum_{i=1}^n \left(x_i - \frac{1}{n}\sum_{j=1}^n x_j\right)^2\]

which is in fact wrong. Well, technically I don’t know if an estimator can be wrong, as that’s a bit of a subjective judgement, but it is definitely biased and will result in you under-estimating the actual variance. There are a number of ways to show this, but the simplest is to take \(n=1\), i.e. the case where we only have a single sample from the distribution. Then the quantity \(\frac{1}{n} \sum_{i=1}^n (x_i - \frac{1}{n}\sum_{j=1}^n x_j)^2 = 0\) no matter what distribution we are sampling from. This is not a desirable property in an estimator, so we can safely discard the naive \(\frac{1}{n}\) scaling.
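
The \(n=1\) argument may feel like a corner case, but a quick simulation makes the bias visible at any sample size. This is just an illustrative numpy sketch, assuming Gaussian samples with \(\sigma^2 = 9\) and \(n = 5\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000
sigma2 = 9.0  # true variance of the N(0, 3^2) samples below

# draw `trials` independent datasets of size n and apply the naive estimator to each
x = rng.normal(scale=3.0, size=(trials, n))
naive = np.mean((x - x.mean(axis=1, keepdims=True)) ** 2, axis=1)  # divide by n

print(naive.mean())          # roughly 7.2 on average, not 9
print((n - 1) / n * sigma2)  # 7.2: the shortfall is exactly (n-1)/n
```

The naive estimator averages out to roughly 7.2 rather than 9, which is exactly the \(\frac{n-1}{n}\) shortfall derived below.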

We’ve shown that \(\frac{1}{n}\) is wrong, but we haven’t shown why it fundamentally should be wrong, nor have we shown that the \(\frac{1}{n-1}\) scaling factor is correct. I’ll get to that last point in a moment, but it’s worth pausing on why dividing by \(n\) isn’t the right thing to do. We take an average over measurements when we think that those measurements are, on average (pardon the redundancy), capturing the underlying quantity they are measuring. The mean estimator, for example, averages a collection of unbiased single-sample estimates of the expected value. From another perspective, the sample mean is the point that minimizes the average squared distance to the sampled points. But the average squared distance from the distribution’s expected value \(\mu\) to sampled points is exactly the variance we want to estimate. Consequently, if we plug the sample mean in for \(\mu\) when estimating the average squared distance between sampled points and the actual expected value of the distribution, we are going to get an under-estimate.
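
To make the “closer to the data than the true mean” point concrete, here is another small illustrative sketch (again numpy, with an arbitrary Gaussian example) showing that the average squared distance measured around the sample mean can never exceed the one measured around \(\mu\):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 2.0, 3.0
x = rng.normal(loc=mu, scale=sigma, size=20)  # one small dataset

def mean_sq_dev(points, center):
    """Average squared distance of the points from a given center."""
    return np.mean((points - center) ** 2)

# the sample mean minimizes the average squared distance to the points,
# so spread measured around it can only understate spread around the true mean
print(mean_sq_dev(x, x.mean()))  # the smallest achievable value
print(mean_sq_dev(x, mu))        # always >= the line above; unbiased for sigma^2
```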

I’ll now show two ways of proving that the sample variance under-estimates the true variance by precisely the factor \(\frac{n-1}{n}\), which is why we need to divide by \(n-1\) to get an unbiased estimator.

Proof 1: decompose the sum

A simple way to show that you need to divide by \(n-1\) to get an unbiased estimate of a distribution’s variance is to compute the expected value of the sample variance explicitly and check that it is off by a factor of \(\frac{n-1}{n}\). Recall, we want to compute

\[\sigma^2 = \mathbb{E}[(x_i - \mu_1)^2]\]

If we know \(\mu_1\), then

\[\widehat{\nu_2} = \frac{1}{n} \sum_{i=1}^n (x_i - \mu_1)^2 \]

is an unbiased estimator, which is easy to show as

\[\mathbb{E}[\widehat{\nu_2}] = \mathbb{E}[\frac{1}{n} \sum_{i=1}^n (x_i - \mu_1)^2] =\frac{1}{n} \sum_{i=1}^n \mathbb{E}[(x_i - \mu_1)^2] = \mathbb{E}[(x_i - \mu_1)^2] \]

If we don’t know the mean, we have to use our best guess \(\widehat{\mu_1}\) instead, and this is where the bias creeps in.

\[\begin{align} \mathbb{E}[\widehat{\nu_2}] &= \mathbb{E}\left[\frac{1}{n} \sum_{i=1}^n (x_i - \widehat{\mu_1})^2\right] \\ &= \frac{1}{n} \sum_{i=1}^n \mathbb{E}[x_i^2 - 2x_i \widehat{\mu_1} + \widehat{\mu_1}^2 ] \\ &= \mathbb{E}[x_i^2 - 2x_i \widehat{\mu_1} + \widehat{\mu_1}^2 ] \\ &= \mathbb{E}[x_i^2] - 2\mathbb{E}[x_i \widehat{\mu_1}] + \mathbb{E}[\widehat{\mu_1}^2 ] \end{align} \]

Now, unfortunately for us, we can’t compute \(\mathbb{E}[x_i \widehat{\mu_1}]\) as \(\mathbb{E}[x_i]\mathbb{E}[\widehat{\mu_1}]\), because the estimator \(\widehat{\mu_1}\) depends on the value of \(x_i\). So instead we have to divvy up the sums into cross terms between independent variables, for which we do get \(\mathbb{E}[x_ix_j] = \mathbb{E}[x_i] \mathbb{E}[x_j] = \mathbb{E}[X]^2\), and self-product terms of the form \(\mathbb{E}[x_i^2]\).

\[ \begin{align} \mathbb{E}[\widehat{\nu_2}] &= \mathbb{E}[x_i^2] - 2\,\mathbb{E}\left[\frac{1}{n}\sum_{j=1}^n x_ix_j\right] + \mathbb{E}\left[\left(\frac{1}{n}\sum_{j=1}^n x_j\right)^2\right] \\ &= \mathbb{E}[x_i^2] - 2\left(\frac{1}{n}\sum_{j \neq i}\mathbb{E}[x_ix_j] + \frac{1}{n}\mathbb{E}[x_i^2]\right) + \frac{1}{n^2}\left(\sum_{j \neq k}\mathbb{E}[x_jx_k] + \sum_{j=1}^n \mathbb{E}[x_j^2]\right) \\ &= \mathbb{E}[x_i^2] - 2\left(\frac{n-1}{n}\mathbb{E}[x_i]^2 + \frac{1}{n}\mathbb{E}[x_i^2]\right) + \frac{n-1}{n}\mathbb{E}[x_i]^2 + \frac{1}{n}\mathbb{E}[x_i^2] \\ &= \frac{n-1}{n}\left[\mathbb{E}[x_i^2] - \mathbb{E}[x_i]^2\right] \end{align} \]
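
If you would rather not take the algebra on faith, a simulation agrees with it. Another illustrative numpy sketch, assuming Gaussian samples with \(\sigma^2 = 9\) and \(n = 8\): dividing by \(n\) lands on \(\frac{n-1}{n}\sigma^2\), while dividing by \(n-1\) recovers \(\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, sigma2 = 8, 200_000, 9.0

x = rng.normal(scale=3.0, size=(trials, n))         # `trials` datasets of size n
centered = x - x.mean(axis=1, keepdims=True)        # subtract each dataset's sample mean

biased = np.mean(centered ** 2, axis=1)             # divide by n
unbiased = np.sum(centered ** 2, axis=1) / (n - 1)  # divide by n - 1, same as np.var(x, axis=1, ddof=1)

print(biased.mean(), (n - 1) / n * sigma2)  # both roughly 7.875
print(unbiased.mean(), sigma2)              # both roughly 9
```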

Proof 2: use the error of the sample mean estimator

Another intuition we can use is that the sample mean will not be exactly equal to the true mean, and will by construction be closer to the sampled points than the population mean is. We can compute the expected squared error of the sample mean estimator as

\[ \begin{align} \mathbb{E}[(\mu - \widehat{\mu}_1)^2] &= \mathbb{E}\left[\left(\mu - \frac{1}{n} \sum_{i=1}^n x_i\right)^2\right] \\ &= \mathbb{E}\left[\mu^2 - 2\mu\sum_{i=1}^n \frac{x_i}{n} + \left(\frac{1}{n} \sum_{i=1}^n x_i\right)^2\right]\\ &= \mu^2 - 2\mu\,\mathbb{E}[x_i] + \frac{1}{n^2}\mathbb{E}\left[\left(\sum_{i=1}^n x_i\right)^2\right] \\ &= -\mu^2 + \frac{1}{n^2}\mathbb{E}\left[\sum_{i=1}^n \sum_{j \neq i} x_ix_j + \sum_{i=1}^n x_i^2\right] \\ &= -\mu^2 + \frac{1}{n^2}\left(\sum_{i=1}^n \sum_{j \neq i} \mathbb{E}[x_i]\mathbb{E}[x_j] + \sum_{i=1}^n \mathbb{E}[x_i^2] \right)\\ &= -\mu^2 + \frac{n-1}{n}\mu^2 + \frac{1}{n}\mu_2 \\ &= \frac{ \mu_2 - \mu^2}{n} \\ &= \frac{\sigma^2}{n} \end{align} \]
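
A quick numerical check of that \(\frac{\sigma^2}{n}\) figure, in the same illustrative spirit as the earlier sketches (Gaussian samples with \(\mu = 2\), \(\sigma = 3\), \(n = 10\)):

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 10, 200_000
mu, sigma = 2.0, 3.0

x = rng.normal(loc=mu, scale=sigma, size=(trials, n))
sample_means = x.mean(axis=1)             # one sample mean per dataset

print(np.mean((sample_means - mu) ** 2))  # expected squared error of the sample mean
print(sigma ** 2 / n)                     # sigma^2 / n = 0.9
```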

So we see that the estimated mean will be off from the true mean by \(\frac{\sigma^2}{n}\) in expected squared error, a quantity we need to account for if we want to estimate the average squared distance between sample points and the true mean. In particular, this leaves us with \[ \begin{align} \sigma^2 = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n(\mu -x_i)^2 \right] &= \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n({\color{blue}{\mu - \widehat{\mu}}} + {\color{red}{\widehat{\mu} - x_i}})^2\right] \\ &= \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n\left(({\color{blue}{\mu - \widehat{\mu}}})^2 + 2({\color{blue}{\mu - \widehat{\mu}}})({\color{red}{\widehat{\mu} - x_i}}) + ({\color{red}{\widehat{\mu} - x_i}})^2\right)\right] \\ &= \frac{\sigma^2}{n}+ \frac{2}{n}\mathbb{E}\left[\sum_{i=1}^n({\color{blue}{\mu - \widehat{\mu}}})({\color{red}{\widehat{\mu} - x_i}})\right] + \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n({\color{red}{\widehat{\mu} - x_i}})^2\right] \\ \sigma^2 &= \frac{\sigma^2}{n} + 0 + \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n({\color{red}{\widehat{\mu} - x_i}})^2\right] \\ \implies \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n({\color{red}{\widehat{\mu} - x_i}})^2\right] &= \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\sigma^2 \end{align} \]

where we have used the fact that the cross term vanishes:

\(\sum_{i=1}^n({\color{blue}{\mu - \widehat{\mu}}})({\color{red}{\widehat{\mu} - x_i}}) = ({\color{blue}{\mu - \widehat{\mu}}})\sum_{i=1}^n({\color{red}{\widehat{\mu} - x_i}}) = ({\color{blue}{\mu - \widehat{\mu}}})(n \widehat{\mu} - \sum_{i=1}^n x_i) = 0\)
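
The whole decomposition is also easy to check numerically. One last illustrative numpy sketch with an arbitrary Gaussian example: the spread around the true mean averages to \(\sigma^2\), the spread around the sample mean averages to \(\frac{n-1}{n}\sigma^2\), and the cross term is zero for every dataset.

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 10, 200_000
mu, sigma = 2.0, 3.0

x = rng.normal(loc=mu, scale=sigma, size=(trials, n))
mu_hat = x.mean(axis=1, keepdims=True)                     # sample mean of each dataset

spread_around_true = np.mean((x - mu) ** 2, axis=1)        # estimates sigma^2
spread_around_sample = np.mean((x - mu_hat) ** 2, axis=1)  # estimates (n-1)/n * sigma^2
cross = np.mean((mu - mu_hat) * (mu_hat - x), axis=1)      # identically zero per dataset

print(spread_around_true.mean())    # roughly 9.0
print(spread_around_sample.mean())  # roughly 8.1 = (n-1)/n * 9
print(np.abs(cross).max())          # zero up to floating-point noise
```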

Concluding remarks

So there you have it: two different perspectives on why we divide the sample variability by \(n-1\) instead of \(n\) to estimate the variance. From the random-variable perspective, the missing \(\frac{1}{n}\) comes from the correlation between the random sample \(x_i\) and the estimated mean \(\frac{1}{n}\sum_{j=1}^n{x_j}\): the only non-zero component of that correlation is the \(\frac{1}{n}\mathbb{E}[x_i^2]\) term, and it shrinks the observed variance. From a predictive-accuracy perspective, we know that the expected squared error of the mean estimator is \(\frac{\sigma^2}{n}\), and that this error will, by construction, always pull the estimate closer to your data than the actual mean, leading to an under-estimate of the variance.

I’m assuming universities, slow-moving as they are, haven’t updated their curricula to reflect that AI is now distributed systems engineering (as opposed to statistics, or before that search algorithms) and are still teaching statistics, so I hope this diversion was helpful to the odd student who stumbles on my website. Otherwise, it was at least a fun excuse for me to write multi-line equations for the first time in a very long while, and I thank whatever readers this blog has for putting up with my indulgences.