Intuition
The normal distribution is the symmetric bell curve that shows up whenever many small, independent effects add together. Heights, measurement errors, exam scores - all tend to cluster around a central value with symmetric tails. The curve is entirely described by two numbers: where it is centered and how wide it spreads. This simplicity, combined with the Central Limit Theorem, makes it the single most important distribution in statistics.
Definition
A continuous random variable $X$ follows a normal distribution with mean $\mu$ and standard deviation $\sigma$ (written $X \sim \mathcal{N}(\mu, \sigma^2)$) if its probability density function is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
- $\mu$ controls the center (location) of the bell.
- $\sigma$ controls the width (spread); larger $\sigma$ means flatter and wider.
- The distribution is symmetric about $\mu$, so the mean, median, and mode coincide.
The standard normal distribution is the special case $\mu = 0$, $\sigma = 1$. Its CDF is denoted $\Phi(z)$ and serves as the universal reference for all normal probabilities.
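As a quick sanity check on the density formula, a direct implementation can be compared against Python's stdlib `statistics.NormalDist` (a sketch; available since Python 3.8):

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Peak of the standard normal is 1/sqrt(2*pi):
print(normal_pdf(0.0))            # ≈ 0.3989
print(NormalDist(0, 1).pdf(0.0))  # same value from the stdlib
```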
Key Formulas
Standardizing transformation - convert any normal variable to standard normal:

$$Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$$

This lets you look up probabilities in a single $z$-table or use a single CDF $\Phi$.
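The equivalence is easy to verify numerically: standardizing and then applying $\Phi$ gives the same probability as the original distribution's CDF (illustrative numbers; uses stdlib `statistics.NormalDist`):

```python
from statistics import NormalDist

# P(X <= 13) for X ~ N(10, 2^2), two ways:
mu, sigma, x = 10.0, 2.0, 13.0

z = (x - mu) / sigma                       # z = 1.5
p_via_phi = NormalDist().cdf(z)            # Phi(1.5), the "z-table" route
p_direct = NormalDist(mu, sigma).cdf(x)    # CDF of the original distribution

print(z, p_via_phi)  # 1.5, ≈ 0.9332
```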
The 68-95-99.7 rule (empirical rule) - worth memorizing:
| Interval | Probability |
|---|---|
| $\mu \pm 1\sigma$ | $\approx 68.3\%$ |
| $\mu \pm 2\sigma$ | $\approx 95.4\%$ |
| $\mu \pm 3\sigma$ | $\approx 99.7\%$ |
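The rule's probabilities come straight out of the standard normal CDF, as a short check shows:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal
for k in (1, 2, 3):
    prob = Z.cdf(k) - Z.cdf(-k)  # P(-k <= Z <= k)
    print(f"within {k} sigma: {prob:.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```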
Moment-generating function:

$$M_X(t) = \exp\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right)$$
Linear combinations: If $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$ are independent, then $\sum_i a_i X_i \sim \mathcal{N}\!\left(\sum_i a_i \mu_i,\; \sum_i a_i^2 \sigma_i^2\right)$.
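Python's `statistics.NormalDist` implements exactly this closure property: adding two independent instances, or scaling and shifting one, returns the correct resulting normal. A small example with made-up parameters:

```python
from statistics import NormalDist

X = NormalDist(mu=3, sigma=4)   # N(3, 16)
Y = NormalDist(mu=1, sigma=3)   # N(1, 9)

S = X + Y      # independent sum: N(4, 25), so stdev = 5
T = 2 * X - 1  # scale and shift: N(5, 64), so stdev = 8

print(S.mean, S.stdev)  # 4.0 5.0
print(T.mean, T.stdev)  # 5.0 8.0
```

Note that variances add, not standard deviations: $\sqrt{4^2 + 3^2} = 5$.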
Example
Manufacturing tolerances. A machine produces ball bearings whose diameters follow $\mathcal{N}(10, 0.02^2)$ mm. Bearings outside the tolerance $10 \pm 0.05$ mm are scrapped. What proportion is scrap?

Standardize the upper bound:

$$z = \frac{10.05 - 10}{0.02} = 2.5$$

By symmetry, the proportion outside tolerance is:

$$2\,\Phi(-2.5) = 2(0.0062) \approx 0.0124$$

So about 1.24% of production is scrapped - a number that directly informs cost analysis and quality control decisions.

If the process standard deviation drifted to $\sigma = 0.025$ mm, the scrap rate would jump to $2\,\Phi(-2) \approx 4.6\%$ - demonstrating how sensitive quality is to the spread parameter.
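The calculation is a one-liner with `statistics.NormalDist`; the sketch below uses diameters $\mathcal{N}(10, 0.02^2)$ mm and tolerance $10 \pm 0.05$ mm, numbers consistent with the 1.24% figure:

```python
from statistics import NormalDist

# Diameters ~ N(10, 0.02^2) mm; scrap anything outside 10 ± 0.05 mm.
D = NormalDist(mu=10.0, sigma=0.02)
scrap = D.cdf(9.95) + (1 - D.cdf(10.05))
print(f"scrap rate: {scrap:.2%}")   # ≈ 1.24%

# Same tolerance after spread drifts to sigma = 0.025 mm:
D2 = NormalDist(mu=10.0, sigma=0.025)
scrap2 = D2.cdf(9.95) + (1 - D2.cdf(10.05))
print(f"after drift: {scrap2:.2%}")  # ≈ 4.55%
```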
Why It Matters in CS
The 68-95-99.7 rule is burned into every engineer’s brain for a reason: it lets you eyeball whether data is behaving normally without running a formal test. If roughly 5% of your values fall outside two standard deviations, things are probably fine. If 20% do, something interesting is going on.
In ML, the normal shows up constantly. Weight initialization in neural networks samples from $\mathcal{N}(0, \sigma^2)$ because symmetric, light-tailed starting points help gradient flow. Gaussian Mixture Models are just “what if the data came from overlapping bell curves?” Variational autoencoders and diffusion models both lean on the normal as a latent prior because it’s easy to sample from and has nice analytic properties.
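Gaussian weight initialization can be sketched in a few lines of stdlib Python (layer sizes and $\sigma = 0.01$ are illustrative; real frameworks scale $\sigma$ by layer width, e.g. Xavier or He initialization):

```python
import random

random.seed(0)  # reproducible draws

def init_weights(fan_in, fan_out, std=0.01):
    """Sample a fan_in x fan_out weight matrix from N(0, std^2)."""
    return [[random.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]

W = init_weights(4, 3)
print(len(W), len(W[0]))  # 4 3
```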
Note
OLS regression assumes $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, which is what justifies $t$-tests on coefficients. If the residuals aren’t roughly normal, those p-values you’re reading off the regression output may not mean much.
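A crude empirical-rule check on residuals - what fraction lie outside two standard deviations - can be sketched with the stdlib (the residuals here are simulated from a normal, so the check should land near 5%):

```python
import random
from statistics import mean, pstdev

random.seed(1)
# Stand-in residuals; in practice, use the residuals from your fitted model.
residuals = [random.gauss(0, 1) for _ in range(10_000)]

m, s = mean(residuals), pstdev(residuals)
frac_outside = sum(abs(r - m) > 2 * s for r in residuals) / len(residuals)
print(f"outside 2 sigma: {frac_outside:.1%}")  # ~5% if roughly normal
```

If that fraction is far from 5%, the normality assumption - and the p-values that rest on it - deserve scrutiny.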
Related Notes
- Central Limit Theorem - explains why the normal distribution appears so often
- Probability Distributions - the normal in context with other distribution families
- Regression Fundamentals - normality assumption on residuals
- Bayesian Inference - normal priors and conjugate updating