Intuition
Hypothesis testing is a framework for making decisions from data. You start with a default assumption (nothing interesting is happening) and ask: is the observed data surprising enough to reject that assumption? It is the statistical equivalent of proof by contradiction - assume the boring explanation and see if the evidence forces you to abandon it.
The tension in every test is between two kinds of mistakes: declaring an effect that is not there (false alarm) and missing an effect that is real (missed detection). The entire framework is built around controlling these error rates.
Core Idea
The hypotheses
- Null hypothesis $H_0$: the default position. Typically "no effect" or "no difference." Example: $H_0: \mu = \mu_0$.
- Alternative hypothesis ($H_1$ or $H_a$): the claim you are testing. Example: $H_1: \mu \neq \mu_0$ (two-sided) or $H_1: \mu > \mu_0$ (one-sided).
The null is never “proven” - it is either rejected or not rejected.
Test statistic and p-value
A test statistic summarizes the data into a single number whose distribution under $H_0$ is known. For a sample mean with known variance:

$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$$

Under $H_0$, $Z \sim N(0, 1)$. When $\sigma$ is unknown, replace it with the sample standard deviation $s$ and use the $t$-distribution with $n - 1$ degrees of freedom.
The p-value is the probability of observing a test statistic at least as extreme as the one computed, assuming $H_0$ is true:

$$p = P\left(|Z| \geq |z_{\text{obs}}| \mid H_0\right) \quad \text{(two-sided)}$$

A small p-value means the data are unlikely under $H_0$.
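As a minimal sketch of the two steps above (the sample values here are illustrative, not from the text), the $z$ statistic and its two-sided p-value can be computed with the standard library alone:

```python
import math

# Hypothetical example: n = 25 observations, hypothesized mean mu0 = 50,
# known population sd sigma = 10, observed sample mean xbar = 54.
n, mu0, sigma, xbar = 25, 50.0, 10.0, 54.0

# Test statistic: Z = (xbar - mu0) / (sigma / sqrt(n))
z = (xbar - mu0) / (sigma / math.sqrt(n))

# Standard normal CDF via the error function: Phi(x) = (1 + erf(x/sqrt(2))) / 2
def phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Two-sided p-value: probability of a statistic at least this extreme under H0
p_value = 2.0 * (1.0 - phi(abs(z)))

print(z)        # 2.0
print(p_value)  # ~0.0455, so reject H0 at alpha = 0.05
```

With $z = 2.0$ the p-value (~0.046) falls just under 0.05, so this hypothetical sample would lead to rejection at the conventional level.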
Significance level and decision rule
Choose a significance level $\alpha$ (commonly 0.05) before seeing the data. Reject $H_0$ when $p \leq \alpha$. The value of $\alpha$ directly controls the Type I error rate.
Warning
A p-value is not the probability that $H_0$ is true. It is the probability of the observed (or more extreme) data given $H_0$. Confusing these is the single most common misinterpretation in applied statistics.
Error types
| | $H_0$ true | $H_0$ false |
|---|---|---|
| Reject $H_0$ | Type I error ($\alpha$) | Correct ($1 - \beta$, power) |
| Fail to reject $H_0$ | Correct ($1 - \alpha$) | Type II error ($\beta$) |
- Type I error (false positive): rejecting $H_0$ when it is true. Rate controlled at $\alpha$.
- Type II error (false negative): failing to reject $H_0$ when it is false. Probability denoted $\beta$.
- Power $1 - \beta$: the probability of correctly rejecting a false $H_0$.
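The claim that $\alpha$ controls the Type I error rate can be checked empirically. The sketch below (standard library only; the normal model and parameters are illustrative) simulates many datasets generated under $H_0$ and confirms the false-positive rate lands near $\alpha$:

```python
import math
import random

# Assumed setup: samples of size n drawn from N(mu0, sigma) -- i.e. H0 is
# true -- tested with a two-sided z-test at alpha = 0.05.
random.seed(0)
n, mu0, sigma = 30, 0.0, 1.0
z_crit = 1.959964  # two-sided critical value for alpha = 0.05

trials = 20000
rejections = 0
for _ in range(trials):
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    if abs(z) > z_crit:          # false positive: H0 is true here
        rejections += 1

print(rejections / trials)  # close to 0.05
```

Because every simulated dataset satisfies $H_0$, each rejection is a Type I error, and the observed rejection rate hovers around the nominal 5%.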
Power and sample size
Power depends on four quantities: the significance level $\alpha$, the effect size $\delta$, the sample size $n$, and the variability $\sigma$. Increasing $n$ or the effect size increases power. The relationship:

$$n \geq \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \sigma^2}{\delta^2}$$

gives the minimum sample size for a two-sided $z$-test to detect effect $\delta$ with power $1 - \beta$ at level $\alpha$.
Tip
In practice, always do a power analysis before collecting data. Running a test on a sample that is too small wastes resources and guarantees low power, meaning you will likely miss real effects.
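A power analysis of this kind can be coded directly from the sample-size formula. A sketch, using only the standard library (the inverse normal CDF is approximated by bisection, and the effect size and variability below are illustrative):

```python
import math

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def z_quantile(q):
    """Inverse standard normal CDF by bisection (adequate for this sketch)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if phi(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def min_sample_size(delta, sigma, alpha=0.05, power=0.80):
    """n >= (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2."""
    z_a = z_quantile(1.0 - alpha / 2.0)  # ~1.96 for alpha = 0.05
    z_b = z_quantile(power)              # ~0.84 for power = 0.80
    return math.ceil((z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)

# Detect a half-standard-deviation effect (delta = 0.5, sigma = 1)
# with 80% power at the 5% level:
print(min_sample_size(delta=0.5, sigma=1.0))  # 32
```

Note how the required $n$ scales with $1/\delta^2$: halving the effect size you want to detect quadruples the sample size.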
Multiple testing
When running $m$ tests simultaneously at level $\alpha$, the probability of at least one false positive rises to $1 - (1 - \alpha)^m$. Common corrections:
- Bonferroni: test each at $\alpha / m$. Simple but conservative.
- Benjamini–Hochberg: controls the false discovery rate (FDR) rather than the family-wise error rate. More powerful for large $m$.
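Both corrections are short enough to implement directly. A sketch on an illustrative set of p-values (not from the text):

```python
alpha = 0.05
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.368]
m = len(pvals)

# Bonferroni: reject test i if p_i <= alpha / m.
bonferroni = [p <= alpha / m for p in pvals]

# Benjamini-Hochberg: sort the p-values, find the largest rank k with
# p_(k) <= (k / m) * alpha, and reject the k smallest p-values.
order = sorted(range(m), key=lambda i: pvals[i])
k = 0
for rank, i in enumerate(order, start=1):
    if pvals[i] <= rank / m * alpha:
        k = rank
bh = [False] * m
for rank, i in enumerate(order, start=1):
    if rank <= k:
        bh[i] = True

print(sum(bonferroni))  # 1 rejection (only p = 0.001 survives alpha/m = 0.005)
print(sum(bh))          # 2 rejections (0.001 and 0.008)
```

The example shows the trade-off in miniature: Bonferroni's uniform $\alpha / m$ threshold rejects only the smallest p-value, while Benjamini–Hochberg's rank-dependent thresholds admit one more discovery at the cost of controlling FDR instead of family-wise error.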
Example
A/B test for click-through rate. A website tests a new button design against the current one. After $n = 2000$ visitors per group:
- Control: $\hat{p}_1 = 0.06$ (120 clicks)
- Treatment: $\hat{p}_2 = 0.075$ (150 clicks)

Under $H_0: p_1 = p_2$, the pooled proportion is $\hat{p} = \frac{120 + 150}{4000} = 0.0675$. The test statistic:

$$z = \frac{\hat{p}_2 - \hat{p}_1}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n} + \frac{1}{n}\right)}} = \frac{0.015}{0.0079} \approx 1.89$$

For a two-sided test at $\alpha = 0.05$, the critical value is $z_{0.975} \approx 1.96$. Since $1.89 < 1.96$, we fail to reject $H_0$; the p-value is approximately 0.06, just above the threshold - collecting more data would resolve the ambiguity.
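The arithmetic of this example can be verified with a short script (standard library only; assuming 2,000 visitors per group, so that 120 and 150 clicks give rates of 6% and 7.5%):

```python
import math

# Two-proportion z-test for the A/B example.
n = 2000                            # assumed visitors per group
p1, p2 = 120 / n, 150 / n           # 0.06 and 0.075
p_pool = (120 + 150) / (2 * n)      # pooled proportion under H0: 0.0675

# Standard error of the difference under the pooled null
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n + 1 / n))
z = (p2 - p1) / se

def phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

p_value = 2.0 * (1.0 - phi(abs(z)))

print(round(z, 2))        # 1.89
print(round(p_value, 3))  # 0.059
```

At $\alpha = 0.05$ the test narrowly fails to reject; as the text notes, a borderline result like this is best resolved by collecting more data rather than by agonizing over the threshold.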
Related Notes
- Probability Distributions - test statistics follow known distributions under $H_0$
- Regression Fundamentals - hypothesis tests on regression coefficients
- Bayesian Inference - an alternative framework that quantifies $P(H_0 \mid \text{data})$ directly