Intuition
You observe an effect and want to know which cause produced it. The trouble is that you know the probability in the forward direction - how likely each cause is to produce the effect - but you need the reverse: how likely each cause is, given that the effect occurred.
Bayes’ rule is the bridge. It takes a forward conditional probability $P(E \mid C)$, combines it with how common each cause is on its own (the prior $P(C)$), and flips the direction to give $P(C \mid E)$. The more strongly a cause predicts the observed evidence, and the more common that cause is a priori, the more posterior probability it gets.
Warning
The prior matters more than most people expect. When the cause is rare, even strong evidence may not make it the most probable explanation. This is the base rate fallacy, and it trips up both students and working engineers.
Definition
Given a partition $C_1, \dots, C_n$ of the sample space and an observed event $E$ with $P(E) > 0$, Bayes’ rule states:

$$P(C_i \mid E) = \frac{P(E \mid C_i)\,P(C_i)}{\sum_{j=1}^{n} P(E \mid C_j)\,P(C_j)}$$

The denominator is the law of total probability, $P(E) = \sum_{j} P(E \mid C_j)\,P(C_j)$ - it ensures the posterior sums to 1 across all $C_i$.
Tip
The denominator is often the hardest part to compute. In practice, you can evaluate the numerator $P(E \mid C_i)\,P(C_i)$ for each $C_i$ and then normalize. This is exactly what many inference algorithms do.
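The "evaluate the numerators, then normalize" trick can be sketched in a few lines. The priors and likelihoods below are made-up numbers for three hypothetical causes of an observed effect $E$:

```python
# Posterior via "numerator then normalize" (illustrative numbers).
priors = [0.5, 0.3, 0.2]        # P(C_i) for three candidate causes
likelihoods = [0.1, 0.4, 0.8]   # P(E | C_i)

numerators = [p * l for p, l in zip(priors, likelihoods)]
evidence = sum(numerators)       # P(E), via the law of total probability
posterior = [n / evidence for n in numerators]

print(posterior)                 # normalized: sums to 1
```

Note that the denominator never had to be computed separately: it falls out as the sum of the numerators, which is why normalization is the standard implementation strategy.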
Posterior form (continuous parameter)
When the “cause” is a continuous parameter $\theta$ with prior density $p(\theta)$ and the data $x$ have likelihood $p(x \mid \theta)$:

$$p(\theta \mid x) = \frac{p(x \mid \theta)\,p(\theta)}{\int p(x \mid \theta')\,p(\theta')\,d\theta'}$$
This is the starting point of Bayesian Inference, which builds a full framework around this formula - prior selection, conjugacy, sequential updating, and computational methods like MCMC.
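As a taste of the sequential updating mentioned above, here is a minimal sketch of a conjugate Beta-Bernoulli update for a coin's heads probability $\theta$; the prior and the flip sequence are illustrative assumptions, not from the text:

```python
# Sequential Bayesian updating with a conjugate Beta prior on theta.
alpha, beta = 1.0, 1.0             # Beta(1, 1): uniform prior on theta
flips = [1, 0, 1, 1, 0, 1, 1, 1]   # observed coin flips, 1 = heads

for x in flips:                    # each observation updates the posterior
    alpha += x                     # heads increment alpha
    beta += 1 - x                  # tails increment beta

posterior_mean = alpha / (alpha + beta)   # E[theta | data]
print(alpha, beta, posterior_mean)
```

Conjugacy means the posterior stays in the Beta family, so the whole update reduces to two counters; this is the simplest case of the machinery Bayesian Inference builds out.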
Key Formulas
Two-event form
For two events $A$ and $B$ with $P(B) > 0$:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
Odds form
Bayes’ rule is sometimes cleaner as an odds ratio. The posterior odds of $H_1$ vs. $H_2$ given evidence $E$:

$$\frac{P(H_1 \mid E)}{P(H_2 \mid E)} = \frac{P(E \mid H_1)}{P(E \mid H_2)} \cdot \frac{P(H_1)}{P(H_2)}$$

The likelihood ratio $\frac{P(E \mid H_1)}{P(E \mid H_2)}$ (also called the Bayes factor) measures how much the evidence favors $H_1$ over $H_2$.
Example
Defective product on an assembly line. A factory has three machines producing bolts:
| Machine | Share of production | Defect rate |
|---|---|---|
| $M_1$ | 30% | 3% |
| $M_2$ | 45% | 2% |
| $M_3$ | 25% | 4% |
A bolt is randomly selected and found to be defective (event $D$). Which machine most likely produced it?
First, compute $P(D)$ via total probability:

$$P(D) = 0.30 \times 0.03 + 0.45 \times 0.02 + 0.25 \times 0.04 = 0.009 + 0.009 + 0.010 = 0.028$$

Now apply Bayes’ rule for each machine:

$$P(M_1 \mid D) = \frac{0.009}{0.028} \approx 0.321, \qquad P(M_2 \mid D) = \frac{0.009}{0.028} \approx 0.321, \qquad P(M_3 \mid D) = \frac{0.010}{0.028} \approx 0.357$$
Machine $M_3$ is the most likely source despite producing only 25% of bolts, because its defect rate is the highest. Bayes’ rule balances volume against defect propensity.
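The same arithmetic, computed directly (the machine labels are just dictionary keys for the three machines in the table):

```python
# The assembly-line example: share of production and defect rate per machine.
shares = {"M1": 0.30, "M2": 0.45, "M3": 0.25}   # P(M_i)
defect = {"M1": 0.03, "M2": 0.02, "M3": 0.04}   # P(D | M_i)

# Law of total probability: P(D) = sum of P(D | M_i) * P(M_i).
p_d = sum(shares[m] * defect[m] for m in shares)

# Bayes' rule for each machine: numerator / P(D).
posterior = {m: shares[m] * defect[m] / p_d for m in shares}
print(p_d, posterior)
```

Running this reproduces the hand computation: the posteriors for the first two machines tie, and the 25%-share machine edges them out on its higher defect rate.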
Why It Matters in CS
The most famous application is the Naive Bayes classifier, which is really just this formula applied at scale. You compute $P(\text{class} \mid \text{words})$ by assuming each word contributes independently to the evidence. The “naive” part is that independence assumption, which is almost never true and yet the classifier works shockingly well for spam filtering and text categorization. Paul Graham’s 2002 essay on Bayesian spam filtering basically killed first-generation spam by computing $P(\text{spam} \mid \text{word})$ per word and combining evidence across the message.
Tip
Bayes’ rule is also the reason base rates matter so much in security. An intrusion detection system with a 99% true positive rate still generates mostly false alarms if only 0.1% of traffic is actually malicious. The prior dominates when the event is rare, and forgetting this is one of the classic mistakes in anomaly detection.
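The intrusion-detection numbers are easy to check. The tip gives a 99% true positive rate and a 0.1% prior; the 1% false positive rate below is an assumed figure added to make the computation concrete:

```python
# Base-rate effect in intrusion detection (illustrative rates).
p_attack = 0.001    # prior: 0.1% of traffic is actually malicious
tpr = 0.99          # P(alarm | attack), true positive rate
fpr = 0.01          # ASSUMED P(alarm | benign), false positive rate

# Total probability of an alarm, then Bayes' rule.
p_alarm = tpr * p_attack + fpr * (1 - p_attack)
p_attack_given_alarm = tpr * p_attack / p_alarm
print(p_attack_given_alarm)   # well under 10%: most alarms are false
```

Even with a detector this strong, fewer than one alarm in ten corresponds to a real attack, because the benign traffic base is a thousand times larger than the malicious one.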
Related Notes
- Conditional Probability - the foundation Bayes’ rule rearranges
- Bayesian Inference - the full inference framework built around this rule
- Probability Distributions - priors and likelihoods are distributions