Intuition
Hypothesis testing is a framework for making decisions from data. You start with a default assumption (nothing interesting is happening) and ask: is the observed data surprising enough to reject that assumption? It is the statistical equivalent of proof by contradiction - assume the boring explanation and see if the evidence forces you to abandon it.
The tension in every test is between two kinds of mistakes: declaring an effect that is not there (false alarm) and missing an effect that is real (missed detection). The entire framework is built around controlling these error rates.
Core Idea
The hypotheses
- Null hypothesis $H_0$: the default position. Typically "no effect" or "no difference." Example: $H_0: \mu = \mu_0$.
- Alternative hypothesis ($H_1$ or $H_a$): the claim you are testing. Example: $H_1: \mu \neq \mu_0$ (two-sided) or $H_1: \mu > \mu_0$ (one-sided).
The null is never “proven” - it is either rejected or not rejected.
Test statistic and p-value
A test statistic summarizes the data into a single number whose distribution under $H_0$ is known. For a sample mean with known variance:

$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$$

Under $H_0$, $Z \sim N(0, 1)$. When $\sigma$ is unknown, replace it with the sample standard deviation $s$ and use the $t$-distribution with $n - 1$ degrees of freedom.
The p-value is the probability of observing a test statistic at least as extreme as the one computed, assuming $H_0$ is true:

$$p = P\left(|Z| \geq |z_{\text{obs}}| \mid H_0\right) \quad \text{(two-sided)}$$

A small p-value means the data are unlikely under $H_0$.
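As a minimal sketch of the two steps above (the sample values here are illustrative, not from the text), the $z$ statistic and its two-sided p-value can be computed with the standard library alone:

```python
import math

# Hypothetical example: n = 25 observations, hypothesized mean mu0 = 50,
# known population sd sigma = 10, observed sample mean xbar = 54.
n, mu0, sigma, xbar = 25, 50.0, 10.0, 54.0

# Test statistic: Z = (xbar - mu0) / (sigma / sqrt(n))
z = (xbar - mu0) / (sigma / math.sqrt(n))

# Standard normal CDF via the error function: Phi(x) = (1 + erf(x/sqrt(2))) / 2
def phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Two-sided p-value: probability of a statistic at least this extreme under H0
p_value = 2.0 * (1.0 - phi(abs(z)))

print(z)        # 2.0
print(p_value)  # ~0.0455, so reject H0 at alpha = 0.05
```

With $z = 2.0$ the p-value (~0.046) falls just under 0.05, so this hypothetical sample would lead to rejection at the conventional level.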
Significance level and decision rule
Choose a significance level $\alpha$ (commonly 0.05) before seeing the data. Reject $H_0$ when $p \leq \alpha$. The value of $\alpha$ directly controls the Type I error rate.
Warning
A p-value is not the probability that $H_0$ is true. It is the probability of the observed (or more extreme) data given $H_0$. Confusing these is the single most common misinterpretation in applied statistics.
Error types
| | $H_0$ true | $H_0$ false |
|---|---|---|
| Reject $H_0$ | Type I error ($\alpha$) | Correct ($1 - \beta$, power) |
| Fail to reject $H_0$ | Correct ($1 - \alpha$) | Type II error ($\beta$) |
- Type I error (false positive): rejecting $H_0$ when it is true. Rate controlled at $\alpha$.
- Type II error (false negative): failing to reject $H_0$ when it is false. Probability denoted $\beta$.
- Power $1 - \beta$: the probability of correctly rejecting a false $H_0$.
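The claim that $\alpha$ controls the Type I error rate can be checked empirically. The sketch below (standard library only; the normal model and parameters are illustrative) simulates many datasets generated under $H_0$ and confirms the false-positive rate lands near $\alpha$:

```python
import math
import random

# Assumed setup: samples of size n drawn from N(mu0, sigma) -- i.e. H0 is
# true -- tested with a two-sided z-test at alpha = 0.05.
random.seed(0)
n, mu0, sigma = 30, 0.0, 1.0
z_crit = 1.959964  # two-sided critical value for alpha = 0.05

trials = 20000
rejections = 0
for _ in range(trials):
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    if abs(z) > z_crit:          # false positive: H0 is true here
        rejections += 1

print(rejections / trials)  # close to 0.05
```

Because every simulated dataset satisfies $H_0$, each rejection is a Type I error, and the observed rejection rate hovers around the nominal 5%.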
Power and sample size
Power depends on four quantities: the significance level $\alpha$, the effect size $\delta$, the sample size $n$, and the variability $\sigma$. Increasing $n$ or the effect size increases power. The relationship:

$$n \geq \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \sigma^2}{\delta^2}$$

gives the minimum sample size for a two-sided $z$-test to detect effect $\delta$ with power $1 - \beta$ at level $\alpha$.
Tip
In practice, always do a power analysis before collecting data. Running a test on a sample that is too small wastes resources and guarantees low power, meaning you will likely miss real effects.
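A power analysis of this kind can be coded directly from the sample-size formula. A sketch, using only the standard library (the inverse normal CDF is approximated by bisection, and the effect size and variability below are illustrative):

```python
import math

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def z_quantile(q):
    """Inverse standard normal CDF by bisection (adequate for this sketch)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if phi(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def min_sample_size(delta, sigma, alpha=0.05, power=0.80):
    """n >= (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2."""
    z_a = z_quantile(1.0 - alpha / 2.0)  # ~1.96 for alpha = 0.05
    z_b = z_quantile(power)              # ~0.84 for power = 0.80
    return math.ceil((z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)

# Detect a half-standard-deviation effect (delta = 0.5, sigma = 1)
# with 80% power at the 5% level:
print(min_sample_size(delta=0.5, sigma=1.0))  # 32
```

Note how the required $n$ scales with $1/\delta^2$: halving the effect size you want to detect quadruples the sample size.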
Multiple testing
When running $m$ tests simultaneously at level $\alpha$, the probability of at least one false positive rises to $1 - (1 - \alpha)^m$. Common corrections:
- Bonferroni: test each at $\alpha / m$. Simple but conservative.
- Benjamini–Hochberg: controls the false discovery rate (FDR) rather than the family-wise error rate. More powerful for large $m$.
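Both corrections are short enough to implement directly. A sketch on an illustrative set of p-values (not from the text):

```python
alpha = 0.05
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.368]
m = len(pvals)

# Bonferroni: reject test i if p_i <= alpha / m.
bonferroni = [p <= alpha / m for p in pvals]

# Benjamini-Hochberg: sort the p-values, find the largest rank k with
# p_(k) <= (k / m) * alpha, and reject the k smallest p-values.
order = sorted(range(m), key=lambda i: pvals[i])
k = 0
for rank, i in enumerate(order, start=1):
    if pvals[i] <= rank / m * alpha:
        k = rank
bh = [False] * m
for rank, i in enumerate(order, start=1):
    if rank <= k:
        bh[i] = True

print(sum(bonferroni))  # 1 rejection (only p = 0.001 survives alpha/m = 0.005)
print(sum(bh))          # 2 rejections (0.001 and 0.008)
```

The example shows the trade-off in miniature: Bonferroni's uniform $\alpha / m$ threshold rejects only the smallest p-value, while Benjamini–Hochberg's rank-dependent thresholds admit one more discovery at the cost of controlling FDR instead of family-wise error.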
Example
A/B test for click-through rate. A website tests a new button design against the current one. After $n = 2000$ visitors per group:
- Control: $\hat{p}_1 = 0.06$ (120 clicks)
- Treatment: $\hat{p}_2 = 0.075$ (150 clicks)

Under $H_0: p_1 = p_2$, the pooled proportion is $\hat{p} = \frac{120 + 150}{4000} = 0.0675$. The test statistic:

$$z = \frac{\hat{p}_2 - \hat{p}_1}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n} + \frac{1}{n}\right)}} = \frac{0.015}{0.0079} \approx 1.89$$

For a two-sided test at $\alpha = 0.05$, the critical value is $z_{0.975} \approx 1.96$. Since $1.89 < 1.96$, we fail to reject $H_0$; the p-value is approximately 0.06, just above the threshold - collecting more data would resolve the ambiguity.
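The arithmetic of this example can be verified with a short script (standard library only; assuming 2,000 visitors per group, so that 120 and 150 clicks give rates of 6% and 7.5%):

```python
import math

# Two-proportion z-test for the A/B example.
n = 2000                            # assumed visitors per group
p1, p2 = 120 / n, 150 / n           # 0.06 and 0.075
p_pool = (120 + 150) / (2 * n)      # pooled proportion under H0: 0.0675

# Standard error of the difference under the pooled null
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n + 1 / n))
z = (p2 - p1) / se

def phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

p_value = 2.0 * (1.0 - phi(abs(z)))

print(round(z, 2))        # 1.89
print(round(p_value, 3))  # 0.059
```

At $\alpha = 0.05$ the test narrowly fails to reject; as the text notes, a borderline result like this is best resolved by collecting more data rather than by agonizing over the threshold.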
Related Notes
- Probability Distributions - test statistics follow known distributions under $H_0$
- Regression Fundamentals - hypothesis tests on regression coefficients
- Bayesian Inference - an alternative framework that quantifies $P(H_0 \mid \text{data})$ directly