Intuition
Take any population - skewed, bimodal, uniform, it doesn't matter - and repeatedly draw random samples of size $n$. Compute the sample mean $\bar{X}$ each time. As $n$ grows, those sample means form a distribution that looks increasingly normal, regardless of what the original population looked like. This is the Central Limit Theorem (CLT), and it is the reason the normal distribution dominates statistics: even when individual data aren't Gaussian, averages of enough data points are.
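A quick simulation makes this concrete. The sketch below (Python standard library only; the Exponential(1) population and $n = 30$ are arbitrary choices for illustration) draws repeated samples from a right-skewed population and shows the sample means clustering tightly and symmetrically around the population mean:

```python
import random
import statistics

random.seed(42)

def sample_mean(n):
    """Mean of n draws from a right-skewed Exponential(1) population."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

# Sampling distribution of the mean: 10,000 sample means for n = 30.
means = [sample_mean(30) for _ in range(10_000)]

# The population is skewed (mean 1, sd 1), but the sample means center
# on the population mean with spread near sigma/sqrt(n) = 1/sqrt(30).
print(round(statistics.fmean(means), 2))   # close to 1.0
print(round(statistics.stdev(means), 2))   # close to 0.18
```

Plotting a histogram of `means` would show the familiar bell shape, even though a histogram of the raw exponential draws is sharply skewed.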
Definition
Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) random variables with mean $\mu$ and finite variance $\sigma^2$. The CLT states that as $n \to \infty$:

$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1)$$

where $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the sample mean and $\xrightarrow{d}$ denotes convergence in distribution.
In practice, the approximation is considered reliable when $n \geq 30$, though the threshold depends on how non-normal the underlying distribution is. Highly skewed populations may need larger $n$.
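The caveat about skewed populations can be checked empirically. This sketch (the Lognormal(0, 1) population and the repetition counts are illustrative assumptions, not part of the theorem) estimates the skewness of the sampling distribution of the mean and watches it decay toward zero - i.e., toward normality - as $n$ grows:

```python
import random
import statistics

random.seed(0)

def skew(xs):
    """Sample skewness: the third standardized moment."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

def mean_skewness(n, reps=5000):
    """Skewness of the sampling distribution of the mean for sample
    size n, drawing from a heavy-tailed Lognormal(0, 1) population."""
    means = [statistics.fmean(random.lognormvariate(0.0, 1.0) for _ in range(n))
             for _ in range(reps)]
    return skew(means)

# Skewness shrinks toward 0 as n grows, but slowly for a heavy-tailed
# population - n = 30 is not always enough.
results = {n: mean_skewness(n) for n in (5, 30, 200)}
for n, s in results.items():
    print(n, round(s, 2))
```

For a symmetric population the printed values would already be near zero at small $n$; here the residual skewness at $n = 30$ is exactly the kind of case the threshold caveat warns about.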
Key Formulas
Standard error of the mean:

$$\mathrm{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}}$$

The standard error shrinks as $1/\sqrt{n}$ - quadrupling the sample size halves the standard error.

Standardized test statistic:

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

When $\sigma$ is unknown and estimated by the sample standard deviation $s$, use the $t$-distribution instead:

$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

Sum version: The CLT also applies to sums. If $S_n = X_1 + X_2 + \cdots + X_n$, then $S_n$ is approximately $N(n\mu, n\sigma^2)$.
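Plugging numbers in ties the formulas together. All values below are made up for illustration ($\sigma = 12$, $n = 36$, $\bar{x} = 52$, $\mu = 50$); the point is how the standard error, the $z$-statistic, and the sum version relate:

```python
import math

# Hypothetical numbers for illustration only.
sigma, n, xbar, mu = 12.0, 36, 52.0, 50.0

se = sigma / math.sqrt(n)          # standard error: 12/6 = 2.0
z = (xbar - mu) / se               # standardized statistic: 2/2 = 1.0

# Quadrupling the sample size halves the standard error.
se_4n = sigma / math.sqrt(4 * n)   # 12/12 = 1.0

# Sum version: S_n = n * xbar is approximately N(n*mu, n*sigma^2).
sum_mean = n * mu                  # 36 * 50 = 1800.0
sum_sd = math.sqrt(n) * sigma      # 6 * 12 = 72.0

print(se, z, se_4n, sum_mean, sum_sd)
```

With real data, $\sigma$ is usually unknown; the same arithmetic with the sample standard deviation $s$ in place of $\sigma$ gives the $t$-statistic instead.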
Tip
The CLT explains why many test statistics (z-tests, t-tests) and confidence intervals rely on the normal distribution - even when the raw data are not normal.
Example
Resistor quality control. A factory produces resistors with mean resistance $\mu$ and standard deviation $\sigma$. The individual resistance distribution is right-skewed (not normal). A quality inspector samples $n$ resistors and measures the average $\bar{X}$.
By the CLT, $\bar{X}$ is approximately normal:

$$\bar{X} \approx N\!\left(\mu, \frac{\sigma^2}{n}\right)$$

What is the probability the sample average exceeds a threshold $1.5$ standard errors above $\mu$? Standardizing:

$$P\!\left(\bar{X} > \mu + 1.5\,\frac{\sigma}{\sqrt{n}}\right) = P(Z > 1.5) \approx 0.067$$
About 6.7% - even though individual resistances are skewed, the CLT lets us use normal probability calculations on the sample mean.
Notice that quadrupling the sample size to $4n$ would halve the standard error to $\frac{\sigma}{2\sqrt{n}}$, making the same deviation more significant ($z = 3.0$, $P(Z > 3) \approx 0.0013$). The CLT quantifies exactly how more data sharpens inference.
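The tail probabilities in this example can be computed without a normal table, using the complementary error function from Python's standard library (the identity $P(Z > z) = \tfrac{1}{2}\,\mathrm{erfc}(z/\sqrt{2})$):

```python
import math

def normal_tail(z):
    """P(Z > z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

print(round(normal_tail(1.5), 4))   # 0.0668 - the ~6.7% in the example
print(round(normal_tail(3.0), 4))   # 0.0013 - after the standard error halves
```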
Why It Matters in CS
- Monte Carlo simulation: averaging many random simulation runs yields normally distributed estimates, enabling confidence intervals on the result.
- Algorithm analysis: when benchmarking runtime over many random inputs, the mean runtime is approximately normal, justifying Gaussian-based statistical tests for performance comparisons.
- Large-scale data: in big-data pipelines, aggregate statistics (means, counts per partition) behave normally, which simplifies anomaly detection and threshold setting.
- A/B testing: conversion rate differences across thousands of users are approximately normal, which is why z-tests power most A/B testing frameworks.
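As a minimal illustration of the Monte Carlo bullet (the target quantity, $\pi$, and the sample size are arbitrary choices), each random point is a Bernoulli draw, and the CLT turns the running average into a normal-based 95% confidence interval:

```python
import math
import random

random.seed(7)

# Monte Carlo estimate of pi: the fraction of uniform points landing in
# the unit quarter-circle estimates pi/4. Each indicator is Bernoulli,
# so by the CLT the sample proportion is approximately normal.
N = 100_000
hits = sum(random.random() ** 2 + random.random() ** 2 <= 1.0 for _ in range(N))
p_hat = hits / N
se = math.sqrt(p_hat * (1 - p_hat) / N)        # standard error of the proportion
low, high = 4 * (p_hat - 1.96 * se), 4 * (p_hat + 1.96 * se)

print(f"pi ~ {4 * p_hat:.3f}, 95% CI ({low:.3f}, {high:.3f})")
```

The same recipe - average the runs, attach $\pm 1.96$ standard errors - applies to any Monte Carlo estimate with finite variance, which is exactly what the CLT licenses.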
Related Notes
- Normal Distribution - the distribution the CLT converges to
- Hypothesis Testing - CLT justifies z-tests and t-tests
- Probability Distributions - CLT connects non-normal populations to the normal family
- Bayesian Inference - large-sample posteriors become approximately normal (Bernstein–von Mises theorem)