Intuition
You observe data and suspect it came from a known family of distributions (normal, Poisson, etc.), but you don’t know the exact parameters. Maximum likelihood estimation (MLE) asks: which parameter values would have made the observed data most probable? Pick those values. It is a simple, principled idea - yet it underpins nearly all of modern statistical inference and machine learning.
Definition
Let $X_1, \dots, X_n$ be i.i.d. observations from a distribution with PDF (or PMF) $f(x \mid \theta)$, where $\theta$ is an unknown parameter (or parameter vector). The likelihood function treats the data as fixed and the parameter as variable:

$$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

The maximum likelihood estimator is the value of $\theta$ that maximizes $L(\theta)$:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} L(\theta)$$

Because products are awkward to differentiate, we almost always work with the log-likelihood instead:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$

Since $\log$ is monotonically increasing, maximizing $\ell(\theta)$ is equivalent to maximizing $L(\theta)$.
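The monotonicity argument is easy to check numerically. Here is a minimal sketch, assuming an exponential sample (the true rate, sample size, and parameter grid are arbitrary choices): a grid search over the likelihood and over the log-likelihood lands on the same maximizer, which for the exponential distribution is $1/\bar{x}$.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=30)   # true rate lambda = 0.5

# Exponential log-density: log f(x | lam) = log(lam) - lam * x
lams = np.linspace(0.1, 2.0, 1000)           # candidate parameter values
log_lik = np.array([np.sum(np.log(l) - l * data) for l in lams])
lik = np.exp(log_lik)                        # raw likelihood (tiny numbers!)

best_log = lams[np.argmax(log_lik)]          # argmax of log-likelihood
best_raw = lams[np.argmax(lik)]              # argmax of likelihood: same point
print(best_log, 1 / data.mean())             # grid argmax vs analytic MLE 1/x-bar
```

Note that the raw likelihood is already around $10^{-40}$ with only 30 observations; for realistic sample sizes it underflows to zero, which is a second, practical reason to work on the log scale.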
Key Formulas
Finding the MLE - set the score function to zero:

$$\frac{\partial \ell(\theta)}{\partial \theta} = 0$$

and verify the second derivative is negative (a maximum, not a minimum).

MLE for the normal distribution: Given $X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)$:

$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
Note
The MLE for $\sigma^2$ divides by $n$, not $n - 1$. It is biased but asymptotically unbiased and consistent.
Key asymptotic properties (for large $n$):
| Property | Meaning |
|---|---|
| Consistency | $\hat{\theta}_n \xrightarrow{p} \theta$ as $n \to \infty$ |
| Asymptotic normality | $\sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}\big(0,\, I(\theta)^{-1}\big)$ |
| Efficiency | Achieves the Cramér-Rao lower bound asymptotically |
| Invariance | If $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$ |

Here $I(\theta)$ is the Fisher information: $I(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f(X \mid \theta)\right]$.
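Asymptotic normality can be seen directly in simulation. A minimal sketch for the Bernoulli case, where $I(p) = \frac{1}{p(1-p)}$ (the parameter values and replication counts are arbitrary): the spread of $\sqrt{n}\,(\hat{p} - p)$ across many repeated samples should match $\sqrt{I(p)^{-1}} = \sqrt{p(1-p)}$.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, reps = 0.3, 400, 5000

# Draw `reps` independent samples of size n; the MLE for each is p_hat = k / n.
k = rng.binomial(n, p, size=reps)
p_hat = k / n

# Asymptotic normality: sqrt(n) * (p_hat - p) ~ N(0, 1 / I(p)),
# with Bernoulli Fisher information I(p) = 1 / (p * (1 - p)).
z = np.sqrt(n) * (p_hat - p)
predicted_sd = np.sqrt(p * (1 - p))   # = sqrt(1 / I(p))
print(z.std(), predicted_sd)          # empirical vs predicted spread
```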
Example
Estimating failure probability. A component is tested $n$ times and fails $k$ times. Each test is Bernoulli with unknown failure probability $p$. The log-likelihood is:

$$\ell(p) = k \log p + (n - k) \log(1 - p)$$

Setting $\ell'(p) = 0$:

$$\frac{k}{p} - \frac{n - k}{1 - p} = 0 \quad \Longrightarrow \quad \hat{p} = \frac{k}{n}$$

With the observed counts, $\hat{p} = 0.16$: the MLE says the best estimate of the failure probability is 16%. This is exactly the sample proportion - MLE often recovers familiar estimators as special cases. For lognormal data, the same approach yields MLEs for $\mu$ and $\sigma$ by maximizing the lognormal log-likelihood.
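The closed-form answer can be verified by maximizing the log-likelihood directly. A sketch with hypothetical counts chosen to match the 16% estimate (only the ratio $k/n$ matters, not these exact values):

```python
import numpy as np

n, k = 50, 8   # assumed counts for illustration: k / n = 0.16

# Bernoulli log-likelihood ell(p) = k log p + (n - k) log(1 - p),
# evaluated on a fine grid over the open interval (0, 1).
p = np.linspace(0.001, 0.999, 9981)
log_lik = k * np.log(p) + (n - k) * np.log(1 - p)

p_hat = p[np.argmax(log_lik)]
print(p_hat)   # matches the sample proportion k / n
```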
Why It Matters in CS
- Neural network training: minimizing cross-entropy loss is equivalent to maximizing the log-likelihood of the training labels under the model’s predicted distribution.
- Logistic regression: the coefficients are found by maximizing the Bernoulli log-likelihood - there is no closed-form solution, so gradient ascent (or Newton’s method) is used.
- Language models: next-token prediction training maximizes $\sum_t \log p_\theta(x_t \mid x_{<t})$, which is MLE over the training corpus.
- Model comparison: the Akaike Information Criterion (AIC) penalizes the maximized log-likelihood by the number of parameters, enabling principled model selection.
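The logistic regression point above can be sketched end to end: gradient ascent on the Bernoulli log-likelihood $\ell(w) = \sum_i \big[y_i \log p_i + (1-y_i)\log(1-p_i)\big]$, whose gradient is $X^\top(y - p)$. The true weights, learning rate, and iteration count below are arbitrary assumptions for the demonstration; minimizing cross-entropy is the same computation with the sign flipped.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data from a known logistic model (w_true is an assumed value).
w_true = np.array([1.5, -2.0])
X = rng.normal(size=(500, 2))
y = rng.binomial(1, 1 / (1 + np.exp(-X @ w_true)))

# Gradient ascent on the (averaged) Bernoulli log-likelihood.
# No closed form exists, so we iterate: grad = X^T (y - p) / n.
w = np.zeros(2)
lr = 0.5
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))       # model's predicted probabilities
    w += lr * X.T @ (y - p) / len(y)   # ascend the log-likelihood

print(w)   # close to w_true, up to sampling noise
```

In practice Newton's method (equivalently, iteratively reweighted least squares) converges in far fewer iterations, but plain gradient ascent shows the MLE mechanics most directly.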
Related Notes
- Bayesian Inference - Bayesian estimation uses priors instead of pure likelihood maximization
- Probability Distributions - MLE estimates the parameters of these distribution families
- Regression Fundamentals - under normality, OLS and MLE yield identical coefficient estimates
- Normal Distribution - canonical MLE example for $\mu$ and $\sigma^2$