Intuition

The normal distribution is the symmetric bell curve that shows up whenever many small, independent effects add together. Heights, measurement errors, exam scores - all tend to cluster around a central value with symmetric tails. The curve is entirely described by two numbers: where it is centered and how wide it spreads. This simplicity, combined with the Central Limit Theorem, makes it the single most important distribution in statistics.

Definition

A continuous random variable $X$ follows a normal distribution with mean $\mu$ and standard deviation $\sigma$ (written $X \sim N(\mu, \sigma^2)$) if its probability density function is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad -\infty < x < \infty$$

  • $\mu$ controls the center (location) of the bell.
  • $\sigma$ controls the width (spread); larger $\sigma$ means flatter and wider.
  • The distribution is symmetric about $\mu$, so the mean, median, and mode coincide.

The standard normal distribution is the special case $\mu = 0$, $\sigma = 1$. Its CDF is denoted $\Phi$ and serves as the universal reference for all normal probabilities.
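As a quick sanity check, the density and CDF can be coded directly from the formula. A minimal sketch in pure Python (`normal_pdf` and `phi` are names chosen here, not from any particular library; `phi` uses the identity $\Phi(x) = \tfrac{1}{2}(1 + \operatorname{erf}(x/\sqrt{2}))$):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def phi(x):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# The bell peaks at the mean, and half the mass lies below it.
print(round(normal_pdf(0.0), 4))  # 0.3989, i.e. 1/sqrt(2*pi)
print(phi(0.0))                   # 0.5
```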

Key Formulas

Standardizing transformation - convert any normal variable to standard normal:

$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$$

This lets you look up probabilities in a single $z$-table or use a single CDF $\Phi$.
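For instance, to find $P(X \le 8)$ for $X \sim N(5, 2^2)$ (numbers chosen purely for illustration), standardize once and evaluate one CDF:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

mu, sigma, x = 5.0, 2.0, 8.0
z = (x - mu) / sigma       # z = 1.5
prob = phi(z)
print(round(prob, 4))      # 0.9332, the same value a z-table lists for z = 1.5
```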

The 68-95-99.7 rule (empirical rule) - worth memorizing:

| Interval | Probability |
| --- | --- |
| $\mu \pm \sigma$ | $\approx 68.3\%$ |
| $\mu \pm 2\sigma$ | $\approx 95.4\%$ |
| $\mu \pm 3\sigma$ | $\approx 99.7\%$ |
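These probabilities follow directly from the standard normal CDF: the mass within $k$ standard deviations of the mean is $\Phi(k) - \Phi(-k)$, regardless of $\mu$ and $\sigma$. A quick check:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# P(mu - k*sigma <= X <= mu + k*sigma) = phi(k) - phi(-k) for any mu, sigma.
for k in (1, 2, 3):
    print(k, round(phi(k) - phi(-k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```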

Moment-generating function:

$$M_X(t) = \exp\!\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right)$$

Linear combinations: If $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$ are independent, then $aX + bY \sim N(a\mu_1 + b\mu_2,\; a^2\sigma_1^2 + b^2\sigma_2^2)$.
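A short simulation confirms the rule for one concrete pair (parameters here are illustrative): with $X \sim N(1, 2^2)$ and $Y \sim N(3, 1^2)$ independent, $2X + Y$ should be $N(5, 17)$.

```python
import random
import statistics

random.seed(0)  # reproducible run

n = 200_000
samples = [2.0 * random.gauss(1.0, 2.0) + random.gauss(3.0, 1.0)
           for _ in range(n)]

print(round(statistics.mean(samples), 2))      # ~5.0  = 2*1 + 3
print(round(statistics.variance(samples), 1))  # ~17.0 = 2^2 * 2^2 + 1^2 * 1^2
```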

Example

Manufacturing tolerances. A machine produces ball bearings whose diameters (in mm) follow a normal distribution $N(\mu, \sigma^2)$. Bearings whose diameters fall outside the tolerance band $\mu \pm 2.5\sigma$ are scrapped. What proportion is scrap?

Standardize the upper bound:

$$z = \frac{(\mu + 2.5\sigma) - \mu}{\sigma} = 2.5$$

By symmetry, the proportion outside tolerance is:

$$P(|Z| > 2.5) = 2\left(1 - \Phi(2.5)\right) = 2(1 - 0.9938) \approx 0.0124$$

So about 1.24% of production is scrapped - a number that directly informs cost analysis and quality control decisions.

If the process spread drifted up by 25%, the tolerance band would sit only 2 standard deviations from the mean, and the scrap rate would jump to $2(1 - \Phi(2)) \approx 4.55\%$ - demonstrating how sensitive quality is to the spread parameter.
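Both figures fall out of a two-line computation with $\Phi$ (sketched here in pure Python, assuming the tolerance sits $2.5\sigma$ from the mean nominally, consistent with the 1.24% figure, and $2\sigma$ after a drift in spread):

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def scrap_rate(z):
    """Two-sided tail mass beyond +/- z standard deviations."""
    return 2.0 * (1.0 - phi(z))

print(f"{scrap_rate(2.5):.2%}")  # 1.24% at nominal spread
print(f"{scrap_rate(2.0):.2%}")  # 4.55% once the band is only 2 sigma wide
```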

Why It Matters in CS

The 68-95-99.7 rule is burned into every engineer’s brain for a reason: it lets you eyeball whether data is behaving normally without running a formal test. If roughly 5% of your values fall outside two standard deviations, things are probably fine. If 20% do, something interesting is going on.
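That eyeball test is only a few lines of code (a sketch using synthetic data; in practice `data` would be your actual measurements):

```python
import random
import statistics

random.seed(1)
data = [random.gauss(50.0, 10.0) for _ in range(10_000)]  # stand-in for real data

mu = statistics.mean(data)
sd = statistics.stdev(data)
frac_outside = sum(abs(x - mu) > 2.0 * sd for x in data) / len(data)

# A normal distribution predicts roughly 5% outside 2 standard deviations.
print(f"{frac_outside:.1%}")
```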

In ML, the normal shows up constantly. Weight initialization in neural networks samples from zero-mean normals $N(0, \sigma^2)$ with small $\sigma$ because symmetric, light-tailed starting points help gradient flow. Gaussian Mixture Models are just “what if the data came from overlapping bell curves?” Variational autoencoders and diffusion models both lean on the normal as a latent prior because it’s easy to sample from and has nice analytic properties.
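For example, a He-style initializer draws each weight from $N(0, 2/\text{fan\_in})$ (one common scheme; the exact variance varies by framework and activation function, and `fan_in = 256` is an assumed layer width):

```python
import math
import random
import statistics

random.seed(0)
fan_in = 256  # illustrative layer width

sigma = math.sqrt(2.0 / fan_in)  # He initialization scale
weights = [random.gauss(0.0, sigma) for _ in range(fan_in)]

# The empirical spread should sit near the target sigma (~0.088 here).
print(round(statistics.stdev(weights), 3))
```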

Note

OLS regression assumes errors $\varepsilon_i \sim N(0, \sigma^2)$, which is what justifies $t$-tests on coefficients. If the residuals aren’t roughly normal, those p-values you’re reading off the regression output may not mean much.
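A crude version of that residual check: fit a toy regression and compare the residuals against the empirical rule (illustrative only; a Q-Q plot or a formal test such as Shapiro-Wilk is the real tool):

```python
import random
import statistics

random.seed(2)

# Toy data: y = 2x + 1 + normal noise, so the residuals *should* look normal.
n = 2_000
xs = [random.uniform(0.0, 10.0) for _ in range(n)]
ys = [2.0 * x + 1.0 + random.gauss(0.0, 1.0) for x in xs]

# Least-squares fit via the closed-form slope/intercept formulas.
mx, my = statistics.mean(xs), statistics.mean(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

resid = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
sd = statistics.stdev(resid)
frac_within_1sd = sum(abs(r) < sd for r in resid) / n

print(f"{frac_within_1sd:.0%}")  # near 68% if the residuals are roughly normal
```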