Intuition
Regression answers the question: how does one variable change when another changes? If you plot study hours against exam scores and see an upward trend, regression draws the “best” line through the cloud of points. “Best” means the line that makes the smallest total prediction errors. Once you have the line, you can predict scores for new students and quantify how much each extra hour of study is worth on average.
The power of regression is that it extends naturally: from one predictor (simple regression) to many (multiple regression), and from straight lines to curves. It is the workhorse of applied statistics and machine learning alike.
Core Idea
Simple linear regression
Model a response $y$ as a linear function of one predictor $x$:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon_i$ is the error term.
Ordinary least squares (OLS)
OLS chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the residual sum of squares:

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

Taking partial derivatives and setting them to zero gives:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
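These closed-form estimates can be computed in a few lines of NumPy; the data here (study hours vs. exam scores, echoing the intuition above) is made up for illustration:

```python
import numpy as np

# Illustrative data: study hours (x) vs. exam scores (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 61.0, 70.0, 74.0])

# Closed-form OLS estimates for simple linear regression
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

y_hat = beta0 + beta1 * x  # fitted values on the regression line
```

Here each extra study hour is worth `beta1` points on average, and `beta0` is the predicted score at zero hours.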
Multiple linear regression
With $p$ predictors, the model becomes:

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

where $X$ is the $n \times (p+1)$ design matrix (including a column of ones for the intercept), $\boldsymbol{\beta}$ is the coefficient vector, and $\boldsymbol{\varepsilon}$ is the error vector. The OLS solution:

$$\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}$$
Tip
In practice, never invert $X^\top X$ directly; use a QR decomposition or SVD for numerical stability. Most statistical libraries handle this automatically.
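A sketch of the matrix solution using `numpy.linalg.lstsq`, which solves the least-squares problem via SVD rather than forming $(X^\top X)^{-1}$ explicitly; the data is simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2

# Design matrix: column of ones for the intercept, then p predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
true_beta = np.array([1.0, 2.0, -0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=n)  # response with small noise

# Stable least-squares solve (SVD internally); avoids explicit inversion
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With this low noise level, `beta_hat` recovers the true coefficients closely.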
Residuals
The residual for observation $i$ is $e_i = y_i - \hat{y}_i$. Residuals are the diagnostic window into model quality:
- Residuals vs. fitted values: should show no pattern (random scatter). Patterns indicate nonlinearity or heteroscedasticity.
- Normal Q-Q plot: residuals should fall on a straight line if the normality assumption holds.
- Scale-location plot: checks for constant variance (homoscedasticity).
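Before plotting, two algebraic properties of OLS residuals make a quick sanity check: with an intercept in the model, residuals sum to zero and are uncorrelated with the predictor. A minimal sketch with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# polyfit with deg=1 is simple linear regression; returns (slope, intercept)
beta1, beta0 = np.polyfit(x, y, deg=1)
residuals = y - (beta0 + beta1 * x)

# Both quantities are ~0 up to floating-point error for an OLS fit
print(residuals.sum())
print((residuals * x).sum())
```

If either sum is far from zero, the fit itself is wrong; only after this do the diagnostic plots above become meaningful.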
Key assumptions (LINE)
| Letter | Assumption | Violation symptom |
|---|---|---|
| L | Linearity - $\mathbb{E}[y \mid x]$ is linear in $x$ | Curved residual pattern |
| I | Independence - errors are independent | Autocorrelation in time-series data |
| N | Normality - errors are normally distributed | Heavy tails in Q-Q plot |
| E | Equal variance - $\operatorname{Var}(\varepsilon_i) = \sigma^2$ | Fan or funnel shape in residuals |
R-squared
The coefficient of determination $R^2$ measures the proportion of variance in $y$ explained by the model:

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

Higher is better, but adding predictors never decreases $R^2$, even when they are pure noise. Use adjusted $R^2$ to penalize unnecessary predictors:

$$R^2_{\mathrm{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$
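Both quantities follow directly from the residuals; a sketch with illustrative, nearly linear data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.0, 5.1, 6.9, 9.2, 10.8, 13.1])
n, p = len(x), 1  # n observations, one predictor

beta1, beta0 = np.polyfit(x, y, deg=1)
y_hat = beta0 + beta1 * x

rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
r2 = 1 - rss / tss
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

Because the adjustment multiplies $(1 - R^2)$ by $(n-1)/(n-p-1) \ge 1$, adjusted $R^2$ is never larger than plain $R^2$.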
Warning
A high $R^2$ does not imply causation, nor does it mean the model is correctly specified. Always inspect residuals before trusting $R^2$.
Example
Predicting house prices. Suppose we regress sale price ($y$, in thousands of dollars) on square footage ($x$) for a sample of homes and obtain a fitted slope of $\hat{\beta}_1 = 0.112$.
Interpretation: each additional square foot is associated with a \$112 increase in price, on average. The intercept is the estimated price at zero square footage, which is not meaningful here because it extrapolates beyond the data range.
With $R^2 = 0.74$, square footage explains 74% of the variance in sale price. The remaining 26% is due to factors not in the model (location, condition, lot size, etc.), which motivates multiple regression.
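To make the slope interpretation concrete, a small sketch using the stated slope of 0.112 (thousands of dollars per square foot); the intercept is a hypothetical value chosen for illustration, since the fitted intercept is not given above:

```python
# Price model in thousands of dollars.
SLOPE = 0.112      # from the example: $112 per additional square foot
INTERCEPT = 50.0   # hypothetical intercept, for illustration only


def predict_price(sqft: float) -> float:
    """Predicted sale price in thousands of dollars."""
    return INTERCEPT + SLOPE * sqft


# An extra 100 sq ft adds 100 * 0.112 = 11.2 thousand dollars,
# regardless of the intercept's value
print(predict_price(2000) - predict_price(1900))  # ~11.2
```

Note that the price *difference* between two homes depends only on the slope; the hypothetical intercept cancels out.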
Related Notes
- Probability Distributions - OLS residuals are assumed normally distributed
- Hypothesis Testing - $t$-tests and $F$-tests assess coefficient significance
- Bayesian Inference - Bayesian regression places priors on $\boldsymbol{\beta}$