Intuition

Bayesian inference treats probability as a measure of belief rather than a long-run frequency. You start with a belief about how the world works (a prior), observe data, and then update that belief proportionally to how well each possible explanation predicts what you saw. The result is a full distribution over possible answers - not just a single point estimate - so you always know how certain or uncertain you are.

The core mechanic is simple: explanations that predicted the data well gain probability mass; explanations that predicted poorly lose it. As more data arrives, the prior matters less and the data dominates. This is the self-correcting property of Bayesian reasoning.
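This reweighting can be shown with a toy discrete example. The three hypotheses and their biases below are hypothetical, chosen just to illustrate the mechanic: each hypothesis is multiplied by how well it predicted the observation, then the weights are renormalized.

```python
# Toy Bayesian update over three discrete hypotheses for a coin's bias.
# Each hypothesis is reweighted by how well it predicted one observed head.

hypotheses = {"biased-tails": 0.2, "fair": 0.5, "biased-heads": 0.8}
prior = {h: 1 / 3 for h in hypotheses}          # equal belief to start

# Observe a single head: the likelihood of "heads" is the hypothesized bias
unnorm = {h: prior[h] * bias for h, bias in hypotheses.items()}
total = sum(unnorm.values())
posterior = {h: w / total for h, w in unnorm.items()}

print(posterior)  # "biased-heads" gains probability mass, "biased-tails" loses it
```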

Core Idea

Bayes’ theorem

For a parameter $\theta$ and observed data $D$:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$

| Term | Name | Role |
| --- | --- | --- |
| $p(\theta)$ | Prior | What you believed before seeing data |
| $p(D \mid \theta)$ | Likelihood | How probable the data are under each $\theta$ |
| $p(\theta \mid D)$ | Posterior | Updated belief after seeing data |
| $p(D)$ | Evidence (marginal likelihood) | Normalizing constant ensuring the posterior integrates to 1 |

Since $p(D)$ is constant with respect to $\theta$, the operational form is:

$$p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$$
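The proportional form can be applied directly on a grid of candidate parameter values: compute likelihood times prior at each point, then normalize. A minimal sketch, using hypothetical data of 7 heads in 10 coin flips and a uniform prior:

```python
# Grid approximation of Bayes' rule for a coin's bias theta.
# Hypothetical data: 7 heads and 3 tails; uniform prior over the grid.

n_heads, n_tails = 7, 3
grid = [i / 100 for i in range(1, 100)]          # candidate theta values
prior = [1.0 for _ in grid]                       # uniform prior (unnormalized)

# Likelihood of the observed flips under each candidate theta
likelihood = [t**n_heads * (1 - t)**n_tails for t in grid]

# Posterior is proportional to likelihood * prior; normalize so it sums to 1
unnorm = [lk * p for lk, p in zip(likelihood, prior)]
total = sum(unnorm)
posterior = [u / total for u in unnorm]

post_mean = sum(t * p for t, p in zip(grid, posterior))
print(round(post_mean, 3))  # close to the exact Beta(8, 4) mean of 8/12
```

Note that the normalizing constant $p(D)$ never needs to be computed analytically; dividing by the sum of the unnormalized weights plays its role.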

Choosing a prior

The prior encodes what you know (or assume) before the experiment:

  • Informative priors: encode domain knowledge. Example: “the coin is roughly fair” → $\theta \sim \mathrm{Beta}(10, 10)$.
  • Weakly informative priors: constrain the parameter to plausible ranges without being dogmatic. Example: $\theta \sim \mathcal{N}(0, 10^2)$.
  • Non-informative (diffuse) priors: attempt minimal influence. Example: $p(\theta) \propto 1$ over the parameter’s support.

Tip

No prior is truly “objective.” Even a uniform prior is a choice. The important thing is to state your prior explicitly and check how sensitive the posterior is to that choice (prior sensitivity analysis).
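A prior sensitivity analysis can be as simple as rerunning the update under several priors and comparing posteriors. A sketch with hypothetical data (7 heads in 10 flips) and the conjugate Beta-binomial update, where a $\mathrm{Beta}(a, b)$ prior becomes a $\mathrm{Beta}(a + \text{heads},\ b + \text{tails})$ posterior:

```python
# Prior sensitivity check: same data, several Beta priors.
# Hypothetical data: 7 heads, 3 tails in 10 flips.

heads, tails = 7, 3
priors = {
    "uniform Beta(1,1)": (1, 1),
    "mild Beta(2,2)": (2, 2),
    "strong Beta(50,50)": (50, 50),
}

for name, (a, b) in priors.items():
    post_a, post_b = a + heads, b + tails     # conjugate Beta-binomial update
    mean = post_a / (post_a + post_b)
    print(f"{name}: posterior mean = {mean:.3f}")
```

If the posterior means differ substantially (here the strong prior drags the mean from about 0.67 down toward 0.52), the data are not yet strong enough to overwhelm the prior, and the prior choice should be reported.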

Conjugate priors

When the prior and posterior belong to the same distribution family, the prior is conjugate to the likelihood. This yields closed-form updates:

| Likelihood | Conjugate prior | Posterior |
| --- | --- | --- |
| Binomial ($k$ successes in $n$ trials) | $\mathrm{Beta}(\alpha, \beta)$ | $\mathrm{Beta}(\alpha + k,\ \beta + n - k)$ |
| Poisson ($x_1, \dots, x_n$) | $\mathrm{Gamma}(\alpha, \beta)$ | $\mathrm{Gamma}(\alpha + \sum_i x_i,\ \beta + n)$ |
| Normal (known $\sigma^2$) | $\mathrm{Normal}(\mu_0, \tau_0^2)$ | $\mathrm{Normal}(\mu_n, \tau_n^2)$ |

Conjugacy makes hand computation tractable. For non-conjugate models, numerical methods like Markov chain Monte Carlo (MCMC) sample from the posterior.
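To make the MCMC idea concrete, here is a minimal random-walk Metropolis sampler, a sketch only. It targets the same hypothetical coin-bias posterior as above (7 heads, 3 tails, uniform prior), which is conjugate and does not actually need MCMC, but the identical loop works for any model where the unnormalized posterior density can be evaluated:

```python
import math
import random

# Minimal random-walk Metropolis sampler (sketch).
# Target: p(theta | data) proportional to theta^7 * (1 - theta)^3,
# i.e. the Beta(8, 4) posterior from 7 heads / 3 tails with a uniform prior.

def log_post(theta):
    """Unnormalized log-posterior; -inf outside the support (0, 1)."""
    if not 0 < theta < 1:
        return -math.inf
    return 7 * math.log(theta) + 3 * math.log(1 - theta)

random.seed(0)
theta, samples = 0.5, []
for _ in range(20000):
    prop = theta + random.gauss(0, 0.1)                  # symmetric proposal
    if log_post(prop) - log_post(theta) > math.log(random.random()):
        theta = prop                                      # accept the move
    samples.append(theta)                                 # else keep current theta

burned = samples[5000:]                                   # discard burn-in
est = sum(burned) / len(burned)
print(round(est, 2))  # should be near the exact Beta(8, 4) mean, 8/12
```

The proposal scale (0.1 here) is a tuning choice; real MCMC work would also check convergence diagnostics rather than trusting a single chain.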

Sequential updating

A distinctive feature of Bayesian inference is sequential updating: today’s posterior becomes tomorrow’s prior. After observing data $D_1$:

$$p(\theta \mid D_1) \propto p(D_1 \mid \theta)\, p(\theta)$$

Then upon observing $D_2$:

$$p(\theta \mid D_1, D_2) \propto p(D_2 \mid \theta)\, p(\theta \mid D_1)$$
The final result is identical to updating on both datasets at once - order does not matter. This makes Bayesian methods natural for streaming data and online learning.
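The order-invariance is easy to verify with the conjugate Beta-binomial update. The split of the flips below (hypothetical data: 10 flips with 7 heads, processed as 4 flips then 6) is arbitrary; any split gives the same posterior:

```python
# Sequential vs. batch updating with the conjugate Beta-binomial rule:
# Beta(a, b) plus (heads, tails) yields Beta(a + heads, b + tails).

def update(a, b, heads, tails):
    return a + heads, b + tails

# Batch: all 10 flips (7 heads, 3 tails) at once, from a Beta(2, 2) prior
batch = update(2, 2, 7, 3)

# Sequential: first 4 flips (3 heads, 1 tail), then 6 more (4 heads, 2 tails)
step1 = update(2, 2, 3, 1)      # yesterday's posterior ...
step2 = update(*step1, 4, 2)    # ... is today's prior

print(batch == step2)  # True: batching and order do not matter
```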

Bayesian vs. frequentist

| Aspect | Frequentist | Bayesian |
| --- | --- | --- |
| Probability refers to | Long-run frequency | Degree of belief |
| Parameters are | Fixed but unknown | Random variables with distributions |
| Result | Point estimate + confidence interval | Full posterior distribution |
| Prior information | Not formally incorporated | Encoded as prior distribution |

Note

The two frameworks often agree with large samples. The Bayesian advantage is most evident with small data, informative priors, or when you need to quantify uncertainty in a decision-theoretic way.

Example

Estimating a coin’s bias. You suspect a coin may be unfair. Your prior: $\theta \sim \mathrm{Beta}(2, 2)$, mildly favoring fairness. You flip 10 times and observe 7 heads.

Prior parameters: $\alpha = 2$, $\beta = 2$. After updating:

$$\theta \mid D \sim \mathrm{Beta}(\alpha + 7,\ \beta + 3) = \mathrm{Beta}(9, 5)$$

The posterior mean is $9/14 \approx 0.64$, pulled toward 0.5 from the naive MLE of $7/10 = 0.7$ by the prior. The 95% credible interval (the Bayesian analogue of a confidence interval) is approximately $(0.38, 0.87)$ - wide, reflecting genuine uncertainty from a small sample.

With 100 flips and 70 heads, the posterior becomes $\mathrm{Beta}(72, 32)$ with mean $\approx 0.69$ and a much tighter credible interval of roughly $(0.60, 0.78)$. The data now dominate the prior.
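The example’s numbers can be reproduced in a few lines, assuming a $\mathrm{Beta}(2, 2)$ prior and 7 heads in 10 flips. The sketch below uses a crude normal approximation to the 95% credible interval; a proper interval would use exact Beta quantiles (e.g. `scipy.stats.beta.ppf`):

```python
import math

# Beta-binomial posterior summaries: mean and an approximate 95% credible
# interval via a normal approximation (exact Beta quantiles are better).

def beta_summary(a, b):
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, (mean - 1.96 * sd, mean + 1.96 * sd)

# Beta(2, 2) prior; 7 heads, 3 tails  ->  Beta(9, 5)
print(beta_summary(2 + 7, 2 + 3))
# Beta(2, 2) prior; 70 heads, 30 tails  ->  Beta(72, 32)
print(beta_summary(2 + 70, 2 + 30))
```

Running both summaries shows the interval shrinking sharply as the sample grows from 10 to 100 flips, while the mean moves toward the data’s proportion of heads.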