Intuition

You observe data and suspect it came from a known family of distributions (normal, Poisson, etc.), but you don’t know the exact parameters. Maximum likelihood estimation (MLE) asks: which parameter values would have made the observed data most probable? Pick those values. It is a simple, principled idea - yet it underpins nearly all of modern statistical inference and machine learning.

Definition

Let $X_1, \dots, X_n$ be i.i.d. observations from a distribution with PDF (or PMF) $f(x \mid \theta)$, where $\theta$ is an unknown parameter (or parameter vector). The likelihood function treats the data as fixed and the parameter as variable:

$$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

The maximum likelihood estimator $\hat{\theta}$ is the value of $\theta$ that maximizes $L(\theta)$:

$$\hat{\theta} = \arg\max_{\theta}\, L(\theta)$$

Because products are awkward to differentiate, we almost always work with the log-likelihood instead:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$

Since $\log$ is monotonically increasing, maximizing $\ell(\theta)$ is equivalent to maximizing $L(\theta)$.
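The numerical reason for preferring the log-likelihood is worth seeing once: for even moderately large samples, a product of densities underflows double precision, while the sum of logs stays well-scaled. A quick sketch (the density values below are made up for illustration):

```python
import math

# Hypothetical sample: 1000 observations that each have density 0.4 under
# some model. The raw likelihood underflows to 0.0; the log-likelihood
# remains a perfectly ordinary finite number.
densities = [0.4] * 1000

likelihood = math.prod(densities)                    # product of 1000 small numbers
log_likelihood = sum(math.log(d) for d in densities) # sum of their logs

print(likelihood)       # 0.0 -- underflows in double precision
print(log_likelihood)   # about -916.3
```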

Key Formulas

Finding the MLE - set the score function to zero:

$$\frac{\partial \ell(\theta)}{\partial \theta} = 0$$

and verify the second derivative is negative (a maximum, not a minimum).

MLE for the normal distribution: Given $X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)$:

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$$

Note

The MLE for $\sigma^2$ divides by $n$, not $n - 1$. It is biased but asymptotically unbiased and consistent.
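A minimal numerical check of these formulas, using a made-up sample (the values are illustrative, not from the text):

```python
import statistics

# Hypothetical sample, invented for illustration.
x = [4.9, 5.1, 5.3, 4.7, 5.0, 5.2, 4.8, 5.0]
n = len(x)

mu_hat = sum(x) / n                                   # MLE of the mean
sigma2_hat = sum((xi - mu_hat) ** 2 for xi in x) / n  # MLE of the variance (divides by n)

# The unbiased sample variance divides by n - 1 and is slightly larger:
s2 = statistics.variance(x)

print(mu_hat, sigma2_hat, s2)
```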

Key asymptotic properties (for large $n$):

| Property | Meaning |
| --- | --- |
| Consistency | $\hat{\theta} \xrightarrow{p} \theta$ as $n \to \infty$ |
| Asymptotic normality | $\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} \mathcal{N}(0, I(\theta)^{-1})$ |
| Efficiency | Achieves the Cramér-Rao lower bound asymptotically |
| Invariance | If $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$ |

Here $I(\theta)$ is the Fisher information: $I(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f(X \mid \theta)\right]$.
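A small simulation can make asymptotic normality concrete. For the Bernoulli case, $I(p) = 1/\big(p(1-p)\big)$, so $\sqrt{n}(\hat{p} - p)$ should have standard deviation close to $\sqrt{p(1-p)}$. The parameter values and simulation sizes below are arbitrary choices for illustration:

```python
import random
import statistics

random.seed(0)
p, n, reps = 0.3, 500, 2000

# Draw many independent samples of size n, compute the MLE p_hat for each,
# and look at the spread of sqrt(n) * (p_hat - p).
z = []
for _ in range(reps):
    p_hat = sum(random.random() < p for _ in range(n)) / n
    z.append((n ** 0.5) * (p_hat - p))

sd = statistics.stdev(z)
print(sd)   # close to sqrt(0.3 * 0.7) ≈ 0.458, i.e. sqrt of I(p)^-1
```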

Example

Estimating failure probability. A component is tested $n$ times and fails $k$ times. Each test is Bernoulli with unknown failure probability $p$. The log-likelihood is:

$$\ell(p) = k \log p + (n - k) \log(1 - p)$$

Setting $\frac{d\ell}{dp} = \frac{k}{p} - \frac{n - k}{1 - p} = 0$:

$$\hat{p} = \frac{k}{n}$$

With the observed counts, $\hat{p} = k/n = 0.16$: the MLE says the best estimate of the failure probability is 16%. This is exactly the sample proportion - MLE often recovers familiar estimators as special cases. For lognormal data, the same approach yields MLEs for $\mu$ and $\sigma$ by maximizing the lognormal log-likelihood.
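A sketch of this example in code. The counts $n = 100$, $k = 16$ are hypothetical values chosen to be consistent with the 16% estimate above; as a sanity check, the closed-form MLE is compared against a brute-force grid search over the log-likelihood:

```python
import math

def log_lik(p, n, k):
    """Bernoulli log-likelihood: k log p + (n - k) log(1 - p)."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Hypothetical counts consistent with the 16% estimate in the text.
n, k = 100, 16

# Closed-form MLE: the sample proportion.
p_hat = k / n

# Brute-force check: no grid point has higher log-likelihood.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda p: log_lik(p, n, k))

print(p_hat, best)   # 0.16 0.16
```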

Why It Matters in CS

  • Neural network training: minimizing cross-entropy loss is equivalent to maximizing the log-likelihood of the training labels under the model’s predicted distribution.
  • Logistic regression: the coefficients are found by maximizing the Bernoulli log-likelihood - there is no closed-form solution, so gradient ascent (or Newton’s method) is used.
  • Language models: next-token prediction training maximizes $\sum_t \log p_\theta(x_t \mid x_{<t})$, which is MLE over the training corpus.
  • Model comparison: the Akaike Information Criterion (AIC) penalizes the maximized log-likelihood by the number of parameters, enabling principled model selection.
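The logistic-regression bullet can be sketched in a few lines: fit one coefficient and an intercept by plain gradient ascent on the Bernoulli log-likelihood. The synthetic data, learning rate, and iteration count below are assumptions for illustration, not a production recipe:

```python
import math
import random

# Synthetic data: labels drawn from a logistic model with known parameters
# (true_w, true_b are made up here so we can check recovery).
random.seed(1)
true_w, true_b = 2.0, -1.0
xs = [random.uniform(-3, 3) for _ in range(500)]
ys = [1 if random.random() < 1 / (1 + math.exp(-(true_w * x + true_b))) else 0
      for x in xs]

# Gradient ascent on ell(w, b) = sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)],
# whose gradient is sum_i (y_i - p_i) * x_i (and sum_i (y_i - p_i) for b).
w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    gw = gb = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(w * x + b)))
        gw += (y - p) * x
        gb += (y - p)
    w += lr * gw / len(xs)   # ascend the averaged gradient
    b += lr * gb / len(xs)

print(w, b)   # close to the generating values 2.0 and -1.0
```

Note the ascent direction: because we maximize the log-likelihood rather than minimize a loss, the parameters move with the gradient, not against it.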