## Intuition
Bayesian inference treats probability as a measure of belief rather than a long-run frequency. You start with a belief about how the world works (a prior), observe data, and then update that belief proportionally to how well each possible explanation predicts what you saw. The result is a full distribution over possible answers - not just a single point estimate - so you always know how certain or uncertain you are.
The core mechanic is simple: explanations that predicted the data well gain probability mass; explanations that predicted poorly lose it. As more data arrives, the prior matters less and the data dominates. This is the self-correcting property of Bayesian reasoning.
## Core Idea

### Bayes’ theorem

For a parameter $\theta$ and observed data $D$:

$$
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}
$$
| Term | Name | Role |
|---|---|---|
| $p(\theta)$ | Prior | What you believed before seeing data |
| $p(D \mid \theta)$ | Likelihood | How probable the data are under each $\theta$ |
| $p(\theta \mid D)$ | Posterior | Updated belief after seeing data |
| $p(D)$ | Evidence (marginal likelihood) | Normalizing constant ensuring the posterior integrates to 1 |
Since $p(D)$ is constant with respect to $\theta$, the operational form is:

$$
p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)
$$
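The proportional form can be made concrete with a grid approximation: evaluate prior times likelihood on a grid of $\theta$ values and normalize. This is a sketch, assuming 7 heads in 10 flips and a flat prior (the numbers are illustrative):

```python
import numpy as np

# Grid approximation of Bayes' theorem for a coin-bias parameter theta.
theta = np.linspace(0, 1, 1001)            # candidate parameter values
prior = np.ones_like(theta)                # flat prior p(theta)
prior /= prior.sum()                       # normalize over the grid
likelihood = theta**7 * (1 - theta)**3     # p(D | theta): 7 heads, 3 tails
unnormalized = likelihood * prior          # the proportional form
posterior = unnormalized / unnormalized.sum()  # dividing by the evidence p(D)

print(theta[np.argmax(posterior)])         # posterior mode; equals the MLE under a flat prior
```

Normalizing at the end is exactly the role the evidence plays: it never changes which $\theta$ values are favored, only the scale.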
### Choosing a prior
The prior encodes what you know (or assume) before the experiment:
- Informative priors: encode domain knowledge. Example: “the coin is roughly fair” → a Beta prior concentrated near 0.5, such as $\mathrm{Beta}(10, 10)$.
- Weakly informative priors: constrain the parameter to plausible ranges without being dogmatic. Example: $\mathrm{Beta}(2, 2)$, which gently discourages extreme values of $\theta$.
- Non-informative (diffuse) priors: attempt minimal influence. Example: $\mathrm{Uniform}(0, 1)$, i.e. $\mathrm{Beta}(1, 1)$.
> **Tip:** No prior is truly “objective.” Even a uniform prior is a choice. The important thing is to state your prior explicitly and check how sensitive the posterior is to that choice (prior sensitivity analysis).
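A prior sensitivity analysis can be as simple as rerunning the update under several priors and comparing posteriors. A minimal sketch using the Beta-Binomial conjugate update from the table below, with illustrative data (7 heads in 10 flips):

```python
# Same data, three different Beta priors: how much does the answer move?
data_heads, data_tails = 7, 3

results = {}
for name, (a, b) in {"Beta(1,1) flat": (1, 1),
                     "Beta(2,2) weak": (2, 2),
                     "Beta(10,10) informative": (10, 10)}.items():
    # Conjugate update: posterior is Beta(a + heads, b + tails)
    post_mean = (a + data_heads) / (a + b + data_heads + data_tails)
    results[name] = post_mean
    print(f"{name}: posterior mean = {post_mean:.3f}")
```

If the three posterior means were far apart, the data are too weak to overcome the prior and the conclusion should be reported with that caveat.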
### Conjugate priors
When the prior and posterior belong to the same distribution family, the prior is conjugate to the likelihood. This yields closed-form updates:
| Likelihood | Conjugate prior | Posterior |
|---|---|---|
| Binomial ($k$ successes in $n$ trials) | $\mathrm{Beta}(\alpha, \beta)$ | $\mathrm{Beta}(\alpha + k,\, \beta + n - k)$ |
| Poisson (counts $x_1, \dots, x_n$) | $\mathrm{Gamma}(\alpha, \beta)$ | $\mathrm{Gamma}(\alpha + \sum_i x_i,\, \beta + n)$ |
| Normal (known $\sigma^2$) | $\mathrm{Normal}(\mu_0, \tau_0^2)$ | $\mathrm{Normal}(\mu_n, \tau_n^2)$, a precision-weighted combination of prior and data |
Conjugacy makes hand computation tractable. For non-conjugate models, numerical methods like Markov chain Monte Carlo (MCMC) sample from the posterior.
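For intuition on what MCMC does, here is a minimal random-walk Metropolis sampler, written as a sketch rather than a production implementation. It targets the coin-bias posterior (7 heads, 3 tails, flat prior), where the exact answer is known, so the sample mean can be checked against $8/12 \approx 0.667$:

```python
import math
import random

random.seed(0)

def log_post(t):
    """Unnormalized log-posterior: log-likelihood + flat log-prior."""
    if not 0.0 < t < 1.0:
        return -math.inf
    return 7 * math.log(t) + 3 * math.log(1 - t)

theta_cur = 0.5
samples = []
for _ in range(20000):
    prop = theta_cur + random.gauss(0, 0.1)          # symmetric random-walk proposal
    log_ratio = log_post(prop) - log_post(theta_cur)
    if log_ratio >= 0 or random.random() < math.exp(log_ratio):
        theta_cur = prop                             # accept the move
    samples.append(theta_cur)                        # reject keeps the old value

post_mean = sum(samples[5000:]) / len(samples[5000:])  # discard burn-in
print(round(post_mean, 3))   # close to the exact posterior mean 8/12
```

Real MCMC libraries add adaptation, diagnostics, and multiple chains, but the accept/reject mechanic above is the core idea.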
### Sequential updating

A distinctive feature of Bayesian inference is sequential updating: today’s posterior becomes tomorrow’s prior. After observing data $D_1$:

$$
p(\theta \mid D_1) \propto p(D_1 \mid \theta)\, p(\theta)
$$

Then upon observing $D_2$:

$$
p(\theta \mid D_1, D_2) \propto p(D_2 \mid \theta)\, p(\theta \mid D_1)
$$
The final result is identical to updating on both datasets at once (assuming the observations are conditionally independent given $\theta$) - order does not matter. This makes Bayesian methods natural for streaming data and online learning.
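The order-independence claim is easy to verify with the conjugate Beta-Binomial update; the $\mathrm{Beta}(2, 2)$ prior and the flip counts here are illustrative assumptions:

```python
# Sequential vs. batch updating for a Beta-Binomial model.
def update(a, b, heads, tails):
    """One conjugate Beta update: successes add to a, failures to b."""
    return a + heads, b + tails

# Batch: all 10 flips (7 heads, 3 tails) at once.
batch = update(2, 2, 7, 3)

# Sequential: first 4 flips (3 heads), then the remaining 6 (4 heads).
step1 = update(2, 2, 3, 1)
seq = update(*step1, 4, 2)

print(batch, seq)   # identical posterior parameters either way
```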
### Bayesian vs. frequentist
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Probability refers to | Long-run frequency | Degree of belief |
| Parameters are | Fixed but unknown | Random variables with distributions |
| Result | Point estimate + confidence interval | Full posterior distribution |
| Prior information | Not formally incorporated | Encoded as prior distribution |
> **Note:** The two frameworks often agree with large samples. The Bayesian advantage is most evident with small data, informative priors, or when you need to quantify uncertainty in a decision-theoretic way.
## Example
Estimating a coin’s bias. You suspect a coin may be unfair. Your prior: $\mathrm{Beta}(2, 2)$, mildly favoring fairness. You flip 10 times and observe 7 heads.

Prior parameters: $\alpha = 2$, $\beta = 2$. After updating:

$$
p(\theta \mid D) = \mathrm{Beta}(\alpha + 7,\, \beta + 3) = \mathrm{Beta}(9, 5)
$$

The posterior mean is $9/14 \approx 0.64$, pulled toward 0.5 from the naive MLE of $7/10 = 0.7$ by the prior. The 95% credible interval (the Bayesian analogue of a confidence interval) is approximately $[0.38, 0.87]$ - wide, reflecting genuine uncertainty from a small sample.

With 100 flips and 70 heads, the posterior becomes $\mathrm{Beta}(72, 32)$ with mean $72/104 \approx 0.69$ and a much tighter credible interval of roughly $[0.60, 0.78]$. The data now dominate the prior.
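These posterior summaries can be reproduced with SciPy's Beta distribution; the $\mathrm{Beta}(2, 2)$ prior is an assumed illustration (any mild prior behaves similarly):

```python
from scipy.stats import beta

post_10 = beta(2 + 7, 2 + 3)       # 10 flips, 7 heads   -> Beta(9, 5)
post_100 = beta(2 + 70, 2 + 30)    # 100 flips, 70 heads -> Beta(72, 32)

for label, post in [("n=10", post_10), ("n=100", post_100)]:
    lo, hi = post.interval(0.95)   # central 95% credible interval
    print(f"{label}: mean={post.mean():.3f}, 95% CI=[{lo:.2f}, {hi:.2f}]")
```

The interval shrinks roughly with $1/\sqrt{n}$, which is why the $n = 100$ posterior is so much tighter than the $n = 10$ one.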
## Related Notes
- Probability Distributions - priors and posteriors are distributions
- Hypothesis Testing - frequentist alternative; Bayesian methods compute $p(H \mid D)$ directly
- Regression Fundamentals - Bayesian regression places priors on coefficients