Intuition
You observe an effect and want to know which cause produced it. The trouble is that you know the probability in the forward direction - how likely each cause is to produce the effect - but you need the reverse: how likely each cause is, given that the effect occurred.
Bayes’ rule is the bridge. It takes a forward conditional probability $P(E \mid C)$, combines it with how common each cause is on its own (the prior $P(C)$), and flips the direction to give $P(C \mid E)$. The more strongly a cause predicts the observed evidence, and the more common that cause is a priori, the more posterior probability it gets.
Warning
The prior matters more than most people expect. When the cause is rare, even strong evidence may not make it the most probable explanation. This is the base rate fallacy, and it trips up both students and working engineers.
Definition
Given a partition $C_1, \dots, C_n$ of the sample space and an observed event $E$ with $P(E) > 0$, Bayes’ rule states:

$$P(C_i \mid E) = \frac{P(E \mid C_i)\,P(C_i)}{\sum_{j=1}^{n} P(E \mid C_j)\,P(C_j)}$$

The denominator is the law of total probability, $P(E) = \sum_{j} P(E \mid C_j)\,P(C_j)$ - it ensures the posterior sums to 1 across all $C_i$.
Tip
The denominator is often the hardest part to compute. In practice, you can evaluate the numerator $P(E \mid C_i)\,P(C_i)$ for each $C_i$ and then normalize. This is exactly what many inference algorithms do.
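The "evaluate the numerators, then normalize" trick can be sketched in a few lines. The priors and likelihoods below are made-up numbers for three hypothetical causes of an observed effect $E$:

```python
# Posterior via "numerator then normalize" (illustrative numbers).
priors = [0.5, 0.3, 0.2]        # P(C_i) for three candidate causes
likelihoods = [0.1, 0.4, 0.8]   # P(E | C_i)

numerators = [p * l for p, l in zip(priors, likelihoods)]
evidence = sum(numerators)       # P(E), via the law of total probability
posterior = [n / evidence for n in numerators]

print(posterior)                 # normalized: sums to 1
```

Note that the denominator never had to be computed separately: it falls out as the sum of the numerators, which is why normalization is the standard implementation strategy.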
Posterior form (continuous parameter)
When the “cause” is a continuous parameter $\theta$ with prior density $p(\theta)$ and the data $x$ have likelihood $p(x \mid \theta)$:

$$p(\theta \mid x) = \frac{p(x \mid \theta)\,p(\theta)}{\int p(x \mid \theta')\,p(\theta')\,d\theta'}$$
This is the starting point of Bayesian Inference, which builds a full framework around this formula - prior selection, conjugacy, sequential updating, and computational methods like MCMC.
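As a taste of the sequential updating mentioned above, here is a minimal sketch of a conjugate Beta-Bernoulli update for a coin's heads probability $\theta$; the prior and the flip sequence are illustrative assumptions, not from the text:

```python
# Sequential Bayesian updating with a conjugate Beta prior on theta.
alpha, beta = 1.0, 1.0             # Beta(1, 1): uniform prior on theta
flips = [1, 0, 1, 1, 0, 1, 1, 1]   # observed coin flips, 1 = heads

for x in flips:                    # each observation updates the posterior
    alpha += x                     # heads increment alpha
    beta += 1 - x                  # tails increment beta

posterior_mean = alpha / (alpha + beta)   # E[theta | data]
print(alpha, beta, posterior_mean)
```

Conjugacy means the posterior stays in the Beta family, so the whole update reduces to two counters; this is the simplest case of the machinery Bayesian Inference builds out.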
Key Formulas
Two-event form
For two events $A$ and $B$ with $P(B) > 0$:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
Odds form
Bayes’ rule is sometimes cleaner as an odds ratio. The posterior odds of $H_1$ vs. $H_2$ given evidence $E$:

$$\frac{P(H_1 \mid E)}{P(H_2 \mid E)} = \frac{P(E \mid H_1)}{P(E \mid H_2)} \cdot \frac{P(H_1)}{P(H_2)}$$

The likelihood ratio $\frac{P(E \mid H_1)}{P(E \mid H_2)}$ (also called the Bayes factor) measures how much the evidence favors $H_1$ over $H_2$.
Example
Defective product on an assembly line. A factory has three machines producing bolts:
| Machine | Share of production | Defect rate |
|---|---|---|
| $M_1$ | 30% | 3% |
| $M_2$ | 45% | 2% |
| $M_3$ | 25% | 4% |
A bolt is randomly selected and found to be defective (event $D$). Which machine most likely produced it?
First, compute $P(D)$ via total probability:

$$P(D) = 0.30 \times 0.03 + 0.45 \times 0.02 + 0.25 \times 0.04 = 0.009 + 0.009 + 0.010 = 0.028$$

Now apply Bayes’ rule for each machine:

$$P(M_1 \mid D) = \frac{0.009}{0.028} \approx 0.321, \qquad P(M_2 \mid D) = \frac{0.009}{0.028} \approx 0.321, \qquad P(M_3 \mid D) = \frac{0.010}{0.028} \approx 0.357$$
Machine $M_3$ is the most likely source despite producing only 25% of bolts, because its defect rate is the highest. Bayes’ rule balances volume against defect propensity.
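The same arithmetic, computed directly (the machine labels are just dictionary keys for the three machines in the table):

```python
# The assembly-line example: share of production and defect rate per machine.
shares = {"M1": 0.30, "M2": 0.45, "M3": 0.25}   # P(M_i)
defect = {"M1": 0.03, "M2": 0.02, "M3": 0.04}   # P(D | M_i)

# Law of total probability: P(D) = sum of P(D | M_i) * P(M_i).
p_d = sum(shares[m] * defect[m] for m in shares)

# Bayes' rule for each machine: numerator / P(D).
posterior = {m: shares[m] * defect[m] / p_d for m in shares}
print(p_d, posterior)
```

Running this reproduces the hand computation: the posteriors for the first two machines tie, and the 25%-share machine edges them out on its higher defect rate.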
Why It Matters in CS
The most famous application is the Naive Bayes classifier, which is really just this formula applied at scale. You compute $P(\text{class} \mid \text{words})$ by assuming each word contributes independently to the evidence. The “naive” part is that independence assumption, which is almost never true and yet the classifier works shockingly well for spam filtering and text categorization. Paul Graham’s 2002 essay on Bayesian spam filtering basically killed first-generation spam by computing $P(\text{spam} \mid \text{word})$ per word and combining evidence across the message.
Tip
Bayes’ rule is also the reason base rates matter so much in security. An intrusion detection system with a 99% true positive rate still generates mostly false alarms if only 0.1% of traffic is actually malicious. The prior dominates when the event is rare, and forgetting this is one of the classic mistakes in anomaly detection.
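The intrusion-detection numbers are easy to check. The tip gives a 99% true positive rate and a 0.1% prior; the 1% false positive rate below is an assumed figure added to make the computation concrete:

```python
# Base-rate effect in intrusion detection (illustrative rates).
p_attack = 0.001    # prior: 0.1% of traffic is actually malicious
tpr = 0.99          # P(alarm | attack), true positive rate
fpr = 0.01          # ASSUMED P(alarm | benign), false positive rate

# Total probability of an alarm, then Bayes' rule.
p_alarm = tpr * p_attack + fpr * (1 - p_attack)
p_attack_given_alarm = tpr * p_attack / p_alarm
print(p_attack_given_alarm)   # well under 10%: most alarms are false
```

Even with a detector this strong, fewer than one alarm in ten corresponds to a real attack, because the benign traffic base is a thousand times larger than the malicious one.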
Related Notes
- Conditional Probability - the foundation Bayes’ rule rearranges
- Bayesian Inference - the full inference framework built around this rule
- Probability Distributions - priors and likelihoods are distributions