Why We Use n-1 Instead of n When Calculating Variance
When you calculate variance, you take the squared differences from the mean and average them out. But, unlike how we've been taught in school, when dealing with a sample you divide the total by n-1 instead of n.
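In symbols, the two versions look like this (population variance on the left, sample variance on the right):

$$
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2
\qquad\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
$$

Here \(\mu\) is the true population mean, while \(\bar{x}\) is the mean computed from the sample itself.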
The Intuition
When you calculate a sample's variance, you're using the data itself to define the center. This creates a bias: your sample mean is tailor-made to be as close as possible to your data points, so it inherently makes deviations look smaller than they truly are.
Think of it this way:
Your sample mean is a compromiser. It sits where it does to minimize the drama (i.e., the squared differences), so it's as close as possible to your sample, not to the true population. If you had the true population mean (which you never do), deviations from it would, on average, be larger.
So if you divide by n, you systematically underestimate the spread. Dividing by n-1 instead nudges the estimate back up to compensate.
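If you want to see the bias directly, here's a quick simulation sketch (using NumPy; the sample size and distribution are just illustrative choices): draw many small samples from a distribution whose true variance is 4, then average both estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0  # we sample from a normal distribution with sigma = 2

biased, corrected = [], []
for _ in range(100_000):
    sample = rng.normal(loc=0.0, scale=2.0, size=5)  # tiny sample: n = 5
    biased.append(sample.var(ddof=0))     # divide by n
    corrected.append(sample.var(ddof=1))  # divide by n - 1

print("true variance:      ", true_var)
print("average with /n:    ", round(np.mean(biased), 2))     # about 3.2 -- too small
print("average with /(n-1):", round(np.mean(corrected), 2))  # about 4.0 -- on target
```

On average, the divide-by-n estimate comes in around (n-1)/n of the true variance, which is exactly the gap that dividing by n-1 cancels out.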
Example Time
You’re estimating Pokémon card counts in a school. You sample three kids:
- Friend A: 50 cards
- Friend B: 60 cards
- Friend C: 70 cards
Sample mean = 60. Deviations: -10, 0, +10. Squared: 100, 0, 100.
If you divide by n = 3, you get (100 + 0 + 100) / 3 ≈ 66.7.
But this is only the variance of your sample of three friends. In reality, the population is the entire school, and there may be a kid with 40 cards and another with 75.
In general, the true population variance (if you could measure everyone) tends to come out higher. In this case, adding the two other kids shifts the population mean to (40 + 50 + 60 + 70 + 75) / 5 = 59.
Deviations: -19, -9, +1, +11, +16. Squared: 361, 81, 1, 121, 256. Population variance = 820 / 5 = 164.
You can see that 164 is quite a bit larger than the 66.7 you got by dividing the sample's sum of squared deviations by n.
Now, if you divide by n - 1 = 2 instead: (100 + 0 + 100) / 2 = 100.
Sure, 100 isn't the real variance either, but it's much closer to 164 than the 66.7 you started with.
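To double-check the arithmetic, here's a tiny plain-Python sketch (written just for this example) that reproduces the numbers above:

```python
def variance(values, ddof=0):
    """Average squared deviation from the mean, dividing by len(values) - ddof."""
    mean = sum(values) / len(values)
    squared_devs = [(x - mean) ** 2 for x in values]
    return sum(squared_devs) / (len(values) - ddof)

sample = [50, 60, 70]              # the three friends
population = [40, 50, 60, 70, 75]  # the whole school, in this toy example

print(variance(sample))            # 66.7  -> dividing by n
print(variance(sample, ddof=1))    # 100.0 -> dividing by n - 1
print(variance(population))        # 164.0 -> the true population variance
```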
The Takeaway
Use n-1 when estimating variance from a sample: the sample mean makes your data look less spread out than it really is, and dividing by n-1 corrects for that. Divide by n only when you truly have the entire population.