Intuition

You observe data and suspect it came from a known family of distributions (normal, Poisson, etc.), but you don’t know the exact parameters. Maximum likelihood estimation (MLE) asks: which parameter values would have made the observed data most probable? Pick those values. It is a simple, principled idea - yet it underpins nearly all of modern statistical inference and machine learning.

Definition

Let $X_1, \dots, X_n$ be i.i.d. observations from a distribution with PDF (or PMF) $f(x \mid \theta)$, where $\theta$ is an unknown parameter (or parameter vector). The likelihood function treats the data as fixed and the parameter as variable:

$$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

The maximum likelihood estimator $\hat{\theta}$ is the value of $\theta$ that maximizes $L(\theta)$:

$$\hat{\theta} = \arg\max_{\theta}\, L(\theta)$$

Because products are awkward to differentiate, we almost always work with the log-likelihood instead:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$

Since $\log$ is monotonically increasing, maximizing $\ell(\theta)$ is equivalent to maximizing $L(\theta)$.
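The numerical reason for preferring the log-likelihood is worth seeing once: for even moderately large samples, a product of densities underflows double precision, while the sum of logs stays well-scaled. A quick sketch (the density values below are made up for illustration):

```python
import math

# Hypothetical sample: 1000 observations that each have density 0.4 under
# some model. The raw likelihood underflows to 0.0; the log-likelihood
# remains a perfectly ordinary finite number.
densities = [0.4] * 1000

likelihood = math.prod(densities)                    # product of 1000 small numbers
log_likelihood = sum(math.log(d) for d in densities) # sum of their logs

print(likelihood)       # 0.0 -- underflows in double precision
print(log_likelihood)   # about -916.3
```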

Key Formulas

Finding the MLE - set the score function to zero:

$$\frac{\partial \ell(\theta)}{\partial \theta} = 0$$

and verify the second derivative is negative (a maximum, not a minimum).

MLE for the normal distribution: Given $X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)$:

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$$

Note

The MLE for $\sigma^2$ divides by $n$, not $n - 1$. It is biased but asymptotically unbiased and consistent.
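A minimal numerical check of these formulas, using a made-up sample (the values are illustrative, not from the text):

```python
import statistics

# Hypothetical sample, invented for illustration.
x = [4.9, 5.1, 5.3, 4.7, 5.0, 5.2, 4.8, 5.0]
n = len(x)

mu_hat = sum(x) / n                                   # MLE of the mean
sigma2_hat = sum((xi - mu_hat) ** 2 for xi in x) / n  # MLE of the variance (divides by n)

# The unbiased sample variance divides by n - 1 and is slightly larger:
s2 = statistics.variance(x)

print(mu_hat, sigma2_hat, s2)
```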

Key asymptotic properties (for large $n$):

| Property | Meaning |
| --- | --- |
| Consistency | $\hat{\theta} \xrightarrow{p} \theta$ as $n \to \infty$ |
| Asymptotic normality | $\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} \mathcal{N}(0, I(\theta)^{-1})$ |
| Efficiency | Achieves the Cramér-Rao lower bound asymptotically |
| Invariance | If $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$ |

Here $I(\theta)$ is the Fisher information: $I(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f(X \mid \theta)\right]$.
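A small simulation can make asymptotic normality concrete. For the Bernoulli case, $I(p) = 1/\big(p(1-p)\big)$, so $\sqrt{n}(\hat{p} - p)$ should have standard deviation close to $\sqrt{p(1-p)}$. The parameter values and simulation sizes below are arbitrary choices for illustration:

```python
import random
import statistics

random.seed(0)
p, n, reps = 0.3, 500, 2000

# Draw many independent samples of size n, compute the MLE p_hat for each,
# and look at the spread of sqrt(n) * (p_hat - p).
z = []
for _ in range(reps):
    p_hat = sum(random.random() < p for _ in range(n)) / n
    z.append((n ** 0.5) * (p_hat - p))

sd = statistics.stdev(z)
print(sd)   # close to sqrt(0.3 * 0.7) ≈ 0.458, i.e. sqrt of I(p)^-1
```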

Example

Estimating failure probability. A component is tested $n$ times and fails $k$ times. Each test is Bernoulli with unknown failure probability $p$. The log-likelihood is:

$$\ell(p) = k \log p + (n - k) \log(1 - p)$$

Setting $\frac{d\ell}{dp} = \frac{k}{p} - \frac{n - k}{1 - p} = 0$:

$$\hat{p} = \frac{k}{n}$$

With the observed counts, $\hat{p} = k/n = 0.16$: the MLE says the best estimate of the failure probability is 16%. This is exactly the sample proportion - MLE often recovers familiar estimators as special cases. For lognormal data, the same approach yields MLEs for $\mu$ and $\sigma$ by maximizing the lognormal log-likelihood.
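A sketch of this example in code. The counts $n = 100$, $k = 16$ are hypothetical values chosen to be consistent with the 16% estimate above; as a sanity check, the closed-form MLE is compared against a brute-force grid search over the log-likelihood:

```python
import math

def log_lik(p, n, k):
    """Bernoulli log-likelihood: k log p + (n - k) log(1 - p)."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Hypothetical counts consistent with the 16% estimate in the text.
n, k = 100, 16

# Closed-form MLE: the sample proportion.
p_hat = k / n

# Brute-force check: no grid point has higher log-likelihood.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda p: log_lik(p, n, k))

print(p_hat, best)   # 0.16 0.16
```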

Why It Matters in CS

  • Neural network training: minimizing cross-entropy loss is equivalent to maximizing the log-likelihood of the training labels under the model’s predicted distribution.
  • Logistic regression: the coefficients are found by maximizing the Bernoulli log-likelihood - there is no closed-form solution, so gradient ascent (or Newton’s method) is used.
  • Language models: next-token prediction training maximizes $\sum_t \log p_\theta(x_t \mid x_{<t})$, which is MLE over the training corpus.
  • Model comparison: the Akaike Information Criterion (AIC) penalizes the maximized log-likelihood by the number of parameters, enabling principled model selection.
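The logistic-regression bullet can be sketched in a few lines: fit one coefficient and an intercept by plain gradient ascent on the Bernoulli log-likelihood. The synthetic data, learning rate, and iteration count below are assumptions for illustration, not a production recipe:

```python
import math
import random

# Synthetic data: labels drawn from a logistic model with known parameters
# (true_w, true_b are made up here so we can check recovery).
random.seed(1)
true_w, true_b = 2.0, -1.0
xs = [random.uniform(-3, 3) for _ in range(500)]
ys = [1 if random.random() < 1 / (1 + math.exp(-(true_w * x + true_b))) else 0
      for x in xs]

# Gradient ascent on ell(w, b) = sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)],
# whose gradient is sum_i (y_i - p_i) * x_i (and sum_i (y_i - p_i) for b).
w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    gw = gb = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(w * x + b)))
        gw += (y - p) * x
        gb += (y - p)
    w += lr * gw / len(xs)   # ascend the averaged gradient
    b += lr * gb / len(xs)

print(w, b)   # close to the generating values 2.0 and -1.0
```

Note the ascent direction: because we maximize the log-likelihood rather than minimize a loss, the parameters move with the gradient, not against it.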