Intuition
Regression answers the question: how does one variable change when another changes? If you plot study hours against exam scores and see an upward trend, regression draws the “best” line through the cloud of points. “Best” means the line that makes the smallest total prediction errors. Once you have the line, you can predict scores for new students and quantify how much each extra hour of study is worth on average.
The power of regression is that it extends naturally: from one predictor (simple regression) to many (multiple regression), and from straight lines to curves. It is the workhorse of applied statistics and machine learning alike.
Core Idea
Simple linear regression
Model a response $y$ as a linear function of one predictor $x$:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon_i$ is the error term.
Ordinary least squares (OLS)
OLS chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the residual sum of squares:

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

Taking partial derivatives and setting them to zero gives:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
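These closed-form estimates can be computed in a few lines of NumPy; the data here (study hours vs. exam scores, echoing the intuition above) is made up for illustration:

```python
import numpy as np

# Illustrative data: study hours (x) vs. exam scores (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 61.0, 70.0, 74.0])

# Closed-form OLS estimates for simple linear regression
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

y_hat = beta0 + beta1 * x  # fitted values on the regression line
```

Here each extra study hour is worth `beta1` points on average, and `beta0` is the predicted score at zero hours.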
Multiple linear regression
With $p$ predictors, the model becomes:

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

where $X$ is the $n \times (p+1)$ design matrix (including a column of ones for the intercept), $\boldsymbol{\beta}$ is the coefficient vector, and $\boldsymbol{\varepsilon}$ is the error vector. The OLS solution:

$$\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}$$
Tip
In practice, never invert $X^\top X$ directly; use a QR decomposition or SVD for numerical stability. Most statistical libraries handle this automatically.
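A sketch of the matrix solution using `numpy.linalg.lstsq`, which solves the least-squares problem via SVD rather than forming $(X^\top X)^{-1}$ explicitly; the data is simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2

# Design matrix: column of ones for the intercept, then p predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
true_beta = np.array([1.0, 2.0, -0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=n)  # response with small noise

# Stable least-squares solve (SVD internally); avoids explicit inversion
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With this low noise level, `beta_hat` recovers the true coefficients closely.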
Residuals
The residual for observation $i$ is $e_i = y_i - \hat{y}_i$. Residuals are the diagnostic window into model quality:
- Residuals vs. fitted values: should show no pattern (random scatter). Patterns indicate nonlinearity or heteroscedasticity.
- Normal Q-Q plot: residuals should fall on a straight line if the normality assumption holds.
- Scale-location plot: checks for constant variance (homoscedasticity).
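Before plotting, two algebraic properties of OLS residuals make a quick sanity check: with an intercept in the model, residuals sum to zero and are uncorrelated with the predictor. A minimal sketch with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# polyfit with deg=1 is simple linear regression; returns (slope, intercept)
beta1, beta0 = np.polyfit(x, y, deg=1)
residuals = y - (beta0 + beta1 * x)

# Both quantities are ~0 up to floating-point error for an OLS fit
print(residuals.sum())
print((residuals * x).sum())
```

If either sum is far from zero, the fit itself is wrong; only after this do the diagnostic plots above become meaningful.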
Key assumptions (LINE)
| Letter | Assumption | Violation symptom |
|---|---|---|
| L | Linearity - $\mathbb{E}[y \mid x]$ is linear in $x$ | Curved residual pattern |
| I | Independence - errors are independent | Autocorrelation in time-series data |
| N | Normality - errors are normally distributed | Heavy tails in Q-Q plot |
| E | Equal variance - $\operatorname{Var}(\varepsilon_i) = \sigma^2$ | Fan or funnel shape in residuals |
R-squared
The coefficient of determination $R^2$ measures the proportion of variance in $y$ explained by the model:

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

Higher is better, but adding predictors never decreases $R^2$, even when they are pure noise. Use adjusted $R^2$ to penalize unnecessary predictors:

$$R^2_{\mathrm{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$
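Both quantities follow directly from the residuals; a sketch with illustrative, nearly linear data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.0, 5.1, 6.9, 9.2, 10.8, 13.1])
n, p = len(x), 1  # n observations, one predictor

beta1, beta0 = np.polyfit(x, y, deg=1)
y_hat = beta0 + beta1 * x

rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
r2 = 1 - rss / tss
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

Because the adjustment multiplies $(1 - R^2)$ by $(n-1)/(n-p-1) \ge 1$, adjusted $R^2$ is never larger than plain $R^2$.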
Warning
A high $R^2$ does not imply causation, nor does it mean the model is correctly specified. Always inspect residuals before trusting $R^2$.
Example
Predicting house prices. Suppose we regress sale price ($y$, in thousands of dollars) on square footage ($x$) for a sample of homes and obtain a fitted slope of $\hat{\beta}_1 = 0.112$.
Interpretation: each additional square foot is associated with a \$112 increase in price, on average. The intercept is the estimated price at zero square footage, which is not meaningful here because it extrapolates beyond the data range.
With $R^2 = 0.74$, square footage explains 74% of the variance in sale price. The remaining 26% is due to factors not in the model (location, condition, lot size, etc.), which motivates multiple regression.
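To make the slope interpretation concrete, a small sketch using the stated slope of 0.112 (thousands of dollars per square foot); the intercept is a hypothetical value chosen for illustration, since the fitted intercept is not given above:

```python
# Price model in thousands of dollars.
SLOPE = 0.112      # from the example: $112 per additional square foot
INTERCEPT = 50.0   # hypothetical intercept, for illustration only


def predict_price(sqft: float) -> float:
    """Predicted sale price in thousands of dollars."""
    return INTERCEPT + SLOPE * sqft


# An extra 100 sq ft adds 100 * 0.112 = 11.2 thousand dollars,
# regardless of the intercept's value
print(predict_price(2000) - predict_price(1900))  # ~11.2
```

Note that the price *difference* between two homes depends only on the slope; the hypothetical intercept cancels out.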
Related Notes
- Probability Distributions - OLS residuals are assumed normally distributed
- Hypothesis Testing - $t$-tests and $F$-tests assess coefficient significance
- Bayesian Inference - Bayesian regression places priors on $\boldsymbol{\beta}$