## Intuition
Bayesian inference treats probability as a measure of belief rather than a long-run frequency. You start with a belief about how the world works (a prior), observe data, and then update that belief proportionally to how well each possible explanation predicts what you saw. The result is a full distribution over possible answers - not just a single point estimate - so you always know how certain or uncertain you are.
The core mechanic is simple: explanations that predicted the data well gain probability mass; explanations that predicted poorly lose it. As more data arrives, the prior matters less and the data dominates. This is the self-correcting property of Bayesian reasoning.
## Core Idea

### Bayes’ theorem

For a parameter $\theta$ and observed data $D$:

$$
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}
$$
| Term | Name | Role |
|---|---|---|
| $p(\theta)$ | Prior | What you believed before seeing data |
| $p(D \mid \theta)$ | Likelihood | How probable the data are under each $\theta$ |
| $p(\theta \mid D)$ | Posterior | Updated belief after seeing data |
| $p(D)$ | Evidence (marginal likelihood) | Normalizing constant ensuring the posterior integrates to 1 |
Since $p(D)$ is constant with respect to $\theta$, the operational form is:

$$
p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)
$$
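The proportional form can be made concrete with a grid approximation: evaluate prior times likelihood on a grid of $\theta$ values and normalize. This is a sketch, assuming 7 heads in 10 flips and a flat prior (the numbers are illustrative):

```python
import numpy as np

# Grid approximation of Bayes' theorem for a coin-bias parameter theta.
theta = np.linspace(0, 1, 1001)            # candidate parameter values
prior = np.ones_like(theta)                # flat prior p(theta)
prior /= prior.sum()                       # normalize over the grid
likelihood = theta**7 * (1 - theta)**3     # p(D | theta): 7 heads, 3 tails
unnormalized = likelihood * prior          # the proportional form
posterior = unnormalized / unnormalized.sum()  # dividing by the evidence p(D)

print(theta[np.argmax(posterior)])         # posterior mode; equals the MLE under a flat prior
```

Normalizing at the end is exactly the role the evidence plays: it never changes which $\theta$ values are favored, only the scale.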
### Choosing a prior
The prior encodes what you know (or assume) before the experiment:
- Informative priors: encode domain knowledge. Example: “the coin is roughly fair” → a Beta prior concentrated near 0.5, such as $\mathrm{Beta}(10, 10)$.
- Weakly informative priors: constrain the parameter to plausible ranges without being dogmatic. Example: $\mathrm{Beta}(2, 2)$, which gently discourages extreme values of $\theta$.
- Non-informative (diffuse) priors: attempt minimal influence. Example: $\mathrm{Uniform}(0, 1)$, i.e. $\mathrm{Beta}(1, 1)$.
> **Tip:** No prior is truly “objective.” Even a uniform prior is a choice. The important thing is to state your prior explicitly and check how sensitive the posterior is to that choice (prior sensitivity analysis).
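A prior sensitivity analysis can be as simple as rerunning the update under several priors and comparing posteriors. A minimal sketch using the Beta-Binomial conjugate update from the table below, with illustrative data (7 heads in 10 flips):

```python
# Same data, three different Beta priors: how much does the answer move?
data_heads, data_tails = 7, 3

results = {}
for name, (a, b) in {"Beta(1,1) flat": (1, 1),
                     "Beta(2,2) weak": (2, 2),
                     "Beta(10,10) informative": (10, 10)}.items():
    # Conjugate update: posterior is Beta(a + heads, b + tails)
    post_mean = (a + data_heads) / (a + b + data_heads + data_tails)
    results[name] = post_mean
    print(f"{name}: posterior mean = {post_mean:.3f}")
```

If the three posterior means were far apart, the data are too weak to overcome the prior and the conclusion should be reported with that caveat.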
### Conjugate priors
When the prior and posterior belong to the same distribution family, the prior is conjugate to the likelihood. This yields closed-form updates:
| Likelihood | Conjugate prior | Posterior |
|---|---|---|
| Binomial ($k$ successes in $n$ trials) | $\mathrm{Beta}(\alpha, \beta)$ | $\mathrm{Beta}(\alpha + k,\, \beta + n - k)$ |
| Poisson (counts $x_1, \dots, x_n$) | $\mathrm{Gamma}(\alpha, \beta)$ | $\mathrm{Gamma}(\alpha + \sum_i x_i,\, \beta + n)$ |
| Normal (known $\sigma^2$) | $\mathrm{Normal}(\mu_0, \tau_0^2)$ | $\mathrm{Normal}(\mu_n, \tau_n^2)$, a precision-weighted combination of prior and data |
Conjugacy makes hand computation tractable. For non-conjugate models, numerical methods like Markov chain Monte Carlo (MCMC) sample from the posterior.
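For intuition on what MCMC does, here is a minimal random-walk Metropolis sampler, written as a sketch rather than a production implementation. It targets the coin-bias posterior (7 heads, 3 tails, flat prior), where the exact answer is known, so the sample mean can be checked against $8/12 \approx 0.667$:

```python
import math
import random

random.seed(0)

def log_post(t):
    """Unnormalized log-posterior: log-likelihood + flat log-prior."""
    if not 0.0 < t < 1.0:
        return -math.inf
    return 7 * math.log(t) + 3 * math.log(1 - t)

theta_cur = 0.5
samples = []
for _ in range(20000):
    prop = theta_cur + random.gauss(0, 0.1)          # symmetric random-walk proposal
    log_ratio = log_post(prop) - log_post(theta_cur)
    if log_ratio >= 0 or random.random() < math.exp(log_ratio):
        theta_cur = prop                             # accept the move
    samples.append(theta_cur)                        # reject keeps the old value

post_mean = sum(samples[5000:]) / len(samples[5000:])  # discard burn-in
print(round(post_mean, 3))   # close to the exact posterior mean 8/12
```

Real MCMC libraries add adaptation, diagnostics, and multiple chains, but the accept/reject mechanic above is the core idea.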
### Sequential updating

A distinctive feature of Bayesian inference is sequential updating: today’s posterior becomes tomorrow’s prior. After observing data $D_1$:

$$
p(\theta \mid D_1) \propto p(D_1 \mid \theta)\, p(\theta)
$$

Then upon observing $D_2$:

$$
p(\theta \mid D_1, D_2) \propto p(D_2 \mid \theta)\, p(\theta \mid D_1)
$$
The final result is identical to updating on both datasets at once (assuming the observations are conditionally independent given $\theta$) - order does not matter. This makes Bayesian methods natural for streaming data and online learning.
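The order-independence claim is easy to verify with the conjugate Beta-Binomial update; the $\mathrm{Beta}(2, 2)$ prior and the flip counts here are illustrative assumptions:

```python
# Sequential vs. batch updating for a Beta-Binomial model.
def update(a, b, heads, tails):
    """One conjugate Beta update: successes add to a, failures to b."""
    return a + heads, b + tails

# Batch: all 10 flips (7 heads, 3 tails) at once.
batch = update(2, 2, 7, 3)

# Sequential: first 4 flips (3 heads), then the remaining 6 (4 heads).
step1 = update(2, 2, 3, 1)
seq = update(*step1, 4, 2)

print(batch, seq)   # identical posterior parameters either way
```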
### Bayesian vs. frequentist
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Probability refers to | Long-run frequency | Degree of belief |
| Parameters are | Fixed but unknown | Random variables with distributions |
| Result | Point estimate + confidence interval | Full posterior distribution |
| Prior information | Not formally incorporated | Encoded as prior distribution |
> **Note:** The two frameworks often agree with large samples. The Bayesian advantage is most evident with small data, informative priors, or when you need to quantify uncertainty in a decision-theoretic way.
## Example
Estimating a coin’s bias. You suspect a coin may be unfair. Your prior: $\mathrm{Beta}(2, 2)$, mildly favoring fairness. You flip 10 times and observe 7 heads.

Prior parameters: $\alpha = 2$, $\beta = 2$. After updating:

$$
p(\theta \mid D) = \mathrm{Beta}(\alpha + 7,\, \beta + 3) = \mathrm{Beta}(9, 5)
$$

The posterior mean is $9/14 \approx 0.64$, pulled toward 0.5 from the naive MLE of $7/10 = 0.7$ by the prior. The 95% credible interval (the Bayesian analogue of a confidence interval) is approximately $[0.38, 0.87]$ - wide, reflecting genuine uncertainty from a small sample.

With 100 flips and 70 heads, the posterior becomes $\mathrm{Beta}(72, 32)$ with mean $72/104 \approx 0.69$ and a much tighter credible interval of roughly $[0.60, 0.78]$. The data now dominate the prior.
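These posterior summaries can be reproduced with SciPy's Beta distribution; the $\mathrm{Beta}(2, 2)$ prior is an assumed illustration (any mild prior behaves similarly):

```python
from scipy.stats import beta

post_10 = beta(2 + 7, 2 + 3)       # 10 flips, 7 heads   -> Beta(9, 5)
post_100 = beta(2 + 70, 2 + 30)    # 100 flips, 70 heads -> Beta(72, 32)

for label, post in [("n=10", post_10), ("n=100", post_100)]:
    lo, hi = post.interval(0.95)   # central 95% credible interval
    print(f"{label}: mean={post.mean():.3f}, 95% CI=[{lo:.2f}, {hi:.2f}]")
```

The interval shrinks roughly with $1/\sqrt{n}$, which is why the $n = 100$ posterior is so much tighter than the $n = 10$ one.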
## Related Notes
- Probability Distributions - priors and posteriors are distributions
- Hypothesis Testing - frequentist alternative; Bayesian methods compute $p(H \mid D)$ directly
- Regression Fundamentals - Bayesian regression places priors on coefficients