Intuition

Regression answers the question: how does one variable change when another changes? If you plot study hours against exam scores and see an upward trend, regression draws the “best” line through the cloud of points. “Best” means the line that makes the smallest total prediction errors. Once you have the line, you can predict scores for new students and quantify how much each extra hour of study is worth on average.

The power of regression is that it extends naturally: from one predictor (simple regression) to many (multiple regression), and from straight lines to curves. It is the workhorse of applied statistics and machine learning alike.

Core Idea

Simple linear regression

Model a response $y$ as a linear function of one predictor $x$:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon_i$ is the error term.

Ordinary least squares (OLS)

OLS chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the residual sum of squares:

$$\mathrm{RSS} = \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2$$

Taking partial derivatives and setting them to zero gives:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
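A minimal NumPy sketch of these closed-form estimates, using synthetic data (the true coefficients 2 and 3 and all variable names are illustrative):

```python
import numpy as np

# Synthetic data roughly following y = 2 + 3x + noise (illustrative values)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, size=100)

# Closed-form OLS estimates for simple linear regression
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

print(beta0_hat, beta1_hat)  # should land near the true values 2 and 3
```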

Multiple linear regression

With $p$ predictors, the model becomes:

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

where $X$ is the $n \times (p+1)$ design matrix (including a column of ones for the intercept), $\boldsymbol{\beta}$ is the coefficient vector, and $\boldsymbol{\varepsilon}$ is the error vector. The OLS solution:

$$\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}$$
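A sketch of the matrix form on synthetic data (two made-up predictors; solving the normal equations directly is fine for a small, well-conditioned example like this):

```python
import numpy as np

# Illustrative data: two predictors plus an intercept column of ones
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # design matrix
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(0, 0.3, size=n)

# OLS via the normal equations: solve (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [1, -2, 0.5]
```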

Tip

In practice, never invert $X^\top X$ directly - use a QR decomposition or SVD for numerical stability. Most statistical libraries handle this automatically.
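As a sketch of the stabler route, the same least-squares problem can be solved through a QR factorization, or handed to `np.linalg.lstsq` (which uses an SVD-based LAPACK solver); the data here is again synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.1, size=50)

# QR route: X = QR, then solve the triangular system R beta = Q^T y
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# SVD route, as wrapped by np.linalg.lstsq
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_qr, beta_lstsq))  # True
```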

Residuals

The residual for observation $i$ is $e_i = y_i - \hat{y}_i$. Residuals are the diagnostic window into model quality:

  • Residuals vs. fitted values: should show no pattern (random scatter). Patterns indicate nonlinearity or heteroscedasticity.
  • Normal Q-Q plot: residuals should fall on a straight line if the normality assumption holds.
  • Scale-location plot: checks for constant variance (homoscedasticity).
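The plots above are the standard tools; as a rough numeric stand-in, one can at least confirm that residuals center on zero and show no obvious variance trend (synthetic, homoscedastic data for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 0.5 * x + rng.normal(0, 0.4, size=200)

X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
residuals = y - fitted

# With an intercept in the model, OLS residuals sum to (numerically) zero
print(residuals.mean())

# Crude heteroscedasticity check: |residual| should not correlate with fitted value
print(np.corrcoef(np.abs(residuals), fitted)[0, 1])
```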

Key assumptions (LINE)

| Letter | Assumption | Violation symptom |
| --- | --- | --- |
| L | Linearity - $\mathrm{E}[y \mid x]$ is linear in $x$ | Curved residual pattern |
| I | Independence - errors are independent | Autocorrelation in time-series data |
| N | Normality - errors are normally distributed | Heavy tails in Q-Q plot |
| E | Equal variance - $\mathrm{Var}(\varepsilon_i) = \sigma^2$ | Fan or funnel shape in residuals |

R-squared

The coefficient of determination $R^2$ measures the proportion of variance in $y$ explained by the model:

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}$$

$0 \le R^2 \le 1$. Higher is better, but adding predictors never decreases $R^2$, even when they are pure noise. Use adjusted $R^2$ to penalize for unnecessary predictors:

$$R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$
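Both quantities are a few lines of NumPy; the helper name `r_squared` and the data are illustrative:

```python
import numpy as np

def r_squared(y, fitted, p):
    """Return (R^2, adjusted R^2) for a model with p predictors plus an intercept."""
    n = len(y)
    rss = np.sum((y - fitted) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, r2_adj

# Illustrative data: y = 2x + noise
rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(0, 1.0, size=100)
X = np.column_stack([np.ones(100), x])
fitted = X @ np.linalg.lstsq(X, y, rcond=None)[0]

r2, r2_adj = r_squared(y, fitted, p=1)
print(r2, r2_adj)  # adjusted R^2 is slightly below R^2
```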

Warning

A high $R^2$ does not imply causation, nor does it mean the model is correctly specified. Always inspect residuals before trusting $R^2$.

Example

Predicting house prices. Suppose we regress sale price ($y$, in thousands of dollars) on square footage ($x$) for a sample of $n$ homes and obtain a fitted line of the form:

$$\hat{y} = \hat{\beta}_0 + 0.112\,x$$

Interpretation: each additional square foot is associated with a $112 increase in price, on average. The intercept $\hat{\beta}_0$ is the estimated price at zero square footage (not meaningful here - extrapolation beyond the data range).

With $R^2 = 0.74$, square footage explains 74% of the variance in sale price. The remaining 26% is due to factors not in the model (location, condition, lot size, etc.) - motivation for multiple regression.
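A synthetic version of this example can be simulated end to end; the slope 0.112 (thousands of dollars per square foot) matches the $112 figure in the text, while the intercept, noise level, and sample size here are made-up values for illustration:

```python
import numpy as np

# Simulated house-price data: price in $1000s, slope 0.112 as in the text;
# intercept (50) and noise scale (40) are assumed purely for the simulation
rng = np.random.default_rng(5)
sqft = rng.uniform(800, 3000, size=300)
price = 50.0 + 0.112 * sqft + rng.normal(0, 40.0, size=300)

X = np.column_stack([np.ones_like(sqft), sqft])
b0, b1 = np.linalg.lstsq(X, price, rcond=None)[0]
print(f"each extra sq ft adds about ${1000 * b1:.0f} on average")
```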