
Linear Models

Linear Regression, OLS, Regularization (Ridge/Lasso), and Evaluation Metrics

Ordinary Least Squares (OLS)

Finds the line (hyperplane) that minimizes the sum of squared vertical differences (residuals) between observed and predicted values.

Model: $y = X\beta + \epsilon$ where $X \in \mathbb{R}^{n \times p}$, $\beta \in \mathbb{R}^p$, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$.

Objective: $\min_\beta \|y - X\beta\|_2^2 = \min_\beta (y - X\beta)^T (y - X\beta)$

Derivation: $\frac{\partial}{\partial \beta} (y^T y - 2\beta^T X^T y + \beta^T X^T X \beta) = -2X^T y + 2X^T X \beta = 0$

Solution (Normal Equation): $\hat{\beta} = (X^T X)^{-1} X^T y$

Geometric Interpretation: $\hat{y} = X\hat{\beta}$ is the orthogonal projection of $y$ onto the column space of $X$.
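A minimal NumPy sketch of the closed-form fit (synthetic data and names are illustrative; in practice a QR/SVD-based solver such as `np.linalg.lstsq` is preferred to forming $X^T X$ explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept column + features
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Normal equation: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically safer equivalent
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta_hat          # orthogonal projection of y onto col(X)
residuals = y - y_hat
```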

Assumptions:

  1. Linearity: True relationship is linear.
  2. Independence: Errors are independent.
  3. Homoscedasticity: Constant variance $\text{Var}(\epsilon_i) = \sigma^2$.
  4. Normality: $\epsilon \sim \mathcal{N}(0, \sigma^2)$ (for inference).
  5. No multicollinearity: Columns of $X$ are linearly independent.

Properties of OLS Estimator

Unbiasedness: $E[\hat{\beta}] = \beta$.

Variance: $\text{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$.

Gauss-Markov Theorem: Under assumptions 1-3, OLS is BLUE (Best Linear Unbiased Estimator) - has minimum variance among all linear unbiased estimators.

Residuals: $e = y - \hat{y} = y - X\hat{\beta}$.

Residual Sum of Squares (RSS): $\text{RSS} = \|e\|^2 = \|y - X\hat{\beta}\|^2$

Unbiased Estimator of $\sigma^2$: $\hat{\sigma}^2 = \frac{\text{RSS}}{n - p}$

where $n - p$ is the degrees of freedom.

Statistical Inference for OLS

Sampling Distribution (under the normality assumption): $\hat{\beta} \sim \mathcal{N}(\beta, \sigma^2 (X^T X)^{-1})$

t-statistic for $H_0: \beta_j = 0$: $t_j = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)} = \frac{\hat{\beta}_j}{\hat{\sigma} \sqrt{[(X^T X)^{-1}]_{jj}}}$

Follows a $t_{n-p}$ distribution under $H_0$.

F-statistic for overall significance $H_0: \beta_1 = \cdots = \beta_p = 0$: $F = \frac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n-p-1)}$

Follows an $F_{p, n-p-1}$ distribution under $H_0$.

Confidence Interval for $\beta_j$ (at level $1-\alpha$): $\hat{\beta}_j \pm t_{n-p, \alpha/2} \cdot \text{SE}(\hat{\beta}_j)$
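A sketch of these quantities (the function name is illustrative; `X` is assumed to include the intercept column so that $p$ counts all columns and the degrees of freedom are $n - p$, matching $\hat{\sigma}^2$ above; statsmodels' `OLS` reports the same output):

```python
import numpy as np
from scipy import stats

def ols_inference(X, y, alpha=0.05):
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    dof = n - p
    sigma2_hat = resid @ resid / dof          # unbiased estimate of sigma^2
    cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
    se = np.sqrt(np.diag(cov_beta))
    t_stats = beta_hat / se
    p_values = 2 * stats.t.sf(np.abs(t_stats), dof)
    t_crit = stats.t.ppf(1 - alpha / 2, dof)
    ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])
    return beta_hat, se, t_stats, p_values, ci
```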

Weighted Least Squares (WLS)

When errors have non-constant variance: $\text{Var}(\epsilon_i) = \sigma^2 / w_i$.

Objective: $\min_\beta \sum_{i=1}^n w_i (y_i - x_i^T \beta)^2$

Solution: $\hat{\beta}_{WLS} = (X^T W X)^{-1} X^T W y$

where $W = \text{diag}(w_1, \ldots, w_n)$.
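A direct translation of the WLS solution, assuming the weights `w` are given:

```python
import numpy as np

def wls_fit(X, y, w):
    """Weighted least squares: beta_hat = (X^T W X)^{-1} X^T W y."""
    Xw = X * w[:, None]                   # equals W @ X with W = diag(w), without forming W
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)
```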

Generalized Least Squares (GLS)

When errors have an arbitrary covariance structure: $\text{Var}(\epsilon) = \sigma^2 \Omega$.

Solution: $\hat{\beta}_{GLS} = (X^T \Omega^{-1} X)^{-1} X^T \Omega^{-1} y$

Theorem (Aitken): GLS is BLUE when errors are heteroscedastic or correlated, provided $\Omega$ is known.

Regularization

1. Ridge Regression (L2 Regularization)

Adds an L2 penalty to OLS to handle multicollinearity and prevent overfitting. Corresponds to a Gaussian prior on $\beta$ in the Bayesian framework.

Objective: $\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$

Solution: $\hat{\beta}_{ridge} = (X^T X + \lambda I)^{-1} X^T y$

Properties:

  • Shrinkage: Coefficients shrink toward zero (but never exactly zero).
  • Bias-Variance Tradeoff: Increases bias, decreases variance.
  • Multicollinearity: Stabilizes estimates when $X^T X$ is ill-conditioned.
  • $\lambda \to 0$: Approaches OLS.
  • $\lambda \to \infty$: $\beta \to 0$.

Choosing $\lambda$: Cross-validation, GCV (Generalized Cross-Validation).
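A closed-form ridge sketch (by convention the intercept is usually left unpenalized and features are standardized; this minimal version penalizes every coefficient):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge: beta_hat = (X^T X + lambda I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# lam -> 0 recovers OLS; larger lam shrinks coefficients toward zero.
```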

2. Lasso Regression (L1 Regularization)

Adds L1 penalty. Performs variable selection (sets some coefficients exactly to zero).

Objective: $\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$

No closed-form solution. Solved via:

  • Coordinate Descent: Update one coefficient at a time.
  • LARS (Least Angle Regression): Efficient path algorithm.
  • Proximal Gradient (ISTA/FISTA): Iterative soft-thresholding.

Soft-Thresholding Operator: $S_\lambda(x) = \text{sign}(x) \max(|x| - \lambda, 0)$
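A compact ISTA-style sketch built on the soft-thresholding operator (step size and iteration count are illustrative; libraries such as scikit-learn's `Lasso` use coordinate descent instead):

```python
import numpy as np

def soft_threshold(x, lam):
    """S_lambda(x) = sign(x) * max(|x| - lambda, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for min ||y - X b||_2^2 + lam * ||b||_1."""
    p = X.shape[1]
    L = 2 * np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the squared-error gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - y)      # gradient of the smooth term
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta
```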

Properties:

  • Sparsity: Some $\hat{\beta}_j = 0$ (automatic feature selection).
  • Convex: Efficient to solve.
  • Corresponds to a Laplace prior on $\beta$.

3. Elastic Net

Combines L1 and L2 penalties.

Objective: $\min_\beta \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$

or equivalently: $\min_\beta \|y - X\beta\|_2^2 + \lambda \left( \alpha \|\beta\|_1 + \frac{1-\alpha}{2} \|\beta\|_2^2 \right)$

Advantages:

  • Handles correlated predictors better than Lasso.
  • Performs grouping (selects groups of correlated variables).
  • $\alpha = 1$: Lasso. $\alpha = 0$: Ridge.

4. Group Lasso

Encourages sparsity at the group level.

Objective: $\min_\beta \|y - X\beta\|_2^2 + \lambda \sum_{g=1}^G \sqrt{|g|} \, \|\beta_g\|_2$

where $\beta_g$ is the vector of coefficients in group $g$ and $|g|$ is the group size.

Bayesian Linear Regression

Probabilistic approach where we place a prior on the weights $\beta$ and compute the full posterior distribution $P(\beta \mid y, X)$ instead of a point estimate.

Model: $y = X\beta + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I)$

Likelihood: $P(y \mid X, \beta, \sigma^2) = \mathcal{N}(X\beta, \sigma^2 I)$

Prior (Conjugate Gaussian): $P(\beta) = \mathcal{N}(0, \tau^2 I)$

Posterior: Since the Gaussian prior is conjugate to the Gaussian likelihood, the posterior is also Gaussian: $\beta \mid y, X, \sigma^2 \sim \mathcal{N}(\mu_n, \Sigma_n)$

Posterior Mean and Covariance: $\Sigma_n = (\tau^{-2} I + \sigma^{-2} X^T X)^{-1}$, $\quad \mu_n = \sigma^{-2} \Sigma_n X^T y$

Interpretation:

  • Mean: Equivalent to the Ridge Regression estimate (MAP) with $\lambda = \sigma^2 / \tau^2$.
  • Variance: Captures uncertainty in weights.

Predictive Distribution: For a new input $x_{new}$: $P(y_{new} \mid x_{new}, X, y) = \int P(y_{new} \mid x_{new}, \beta) \, P(\beta \mid X, y) \, d\beta$

This yields a Gaussian: $y_{new} \sim \mathcal{N}(x_{new}^T \mu_n, \; \sigma^2 + x_{new}^T \Sigma_n x_{new})$

Variance Decomposition:

  • $\sigma^2$: Irreducible (aleatoric) uncertainty.
  • $x_{new}^T \Sigma_n x_{new}$: Model (epistemic) uncertainty.
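A sketch of the posterior and predictive formulas above; the noise and prior variances `sigma2` and `tau2` are assumed known:

```python
import numpy as np

def bayes_linreg_posterior(X, y, sigma2, tau2):
    """Posterior N(mu_n, Sigma_n) for a N(0, tau2 I) prior and N(0, sigma2 I) noise."""
    p = X.shape[1]
    Sigma_n = np.linalg.inv(np.eye(p) / tau2 + X.T @ X / sigma2)
    mu_n = Sigma_n @ X.T @ y / sigma2
    return mu_n, Sigma_n

def bayes_linreg_predict(x_new, mu_n, Sigma_n, sigma2):
    """Predictive mean and variance for a single new input x_new."""
    mean = x_new @ mu_n
    var = sigma2 + x_new @ Sigma_n @ x_new    # aleatoric + epistemic parts
    return mean, var
```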

Model Selection and Diagnostics

1. Coefficient of Determination ($R^2$)

Proportion of variance in dependent variable explained by the model.

$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$

  • $SS_{res} = \|y - \hat{y}\|^2$ (Residual Sum of Squares)
  • $SS_{tot} = \|y - \bar{y}\|^2$ (Total Sum of Squares)

Range: $(-\infty, 1]$. $R^2 = 1$ is a perfect fit.

Interpretation: $R^2 = 0.7$ means 70% of the variance is explained by the model.

Issue: $R^2$ never decreases when more features are added, even irrelevant ones.

2. Adjusted $R^2$

Penalizes model complexity.

$R_{adj}^2 = 1 - \frac{SS_{res}/(n-p-1)}{SS_{tot}/(n-1)}$

Can decrease when adding irrelevant features.

3. Akaike Information Criterion (AIC)

Information-theoretic model selection criterion.

$AIC = 2k - 2\ln(\hat{L})$

where $k$ is the number of parameters and $\hat{L}$ is the maximized likelihood.

For linear regression (Gaussian errors): $AIC = n \ln\left(\frac{RSS}{n}\right) + 2p$

Lower is better. Penalizes complexity.

4. Bayesian Information Criterion (BIC)

Similar to AIC but stronger penalty.

$BIC = k \ln(n) - 2\ln(\hat{L})$

For linear regression: $BIC = n \ln\left(\frac{RSS}{n}\right) + p \ln(n)$

BIC tends to select simpler models than AIC.
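A sketch computing these criteria from the residuals, with `p` taken as in the formulas above (conventions for counting the intercept and error variance vary between references):

```python
import numpy as np

def selection_criteria(y, y_hat, p):
    """R^2, adjusted R^2, AIC, and BIC from fitted values; p = number of predictors."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    r2_adj = 1 - (rss / (n - p - 1)) / (tss / (n - 1))
    aic = n * np.log(rss / n) + 2 * p
    bic = n * np.log(rss / n) + p * np.log(n)
    return r2, r2_adj, aic, bic
```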

5. Cross-Validation

Estimate generalization error by splitting data.

k-Fold CV:

  1. Split the data into $k$ folds.
  2. Train on $k-1$ folds, validate on the remaining fold.
  3. Repeat for all $k$ folds.
  4. Average the errors.

Leave-One-Out CV (LOOCV): $k = n$. Expensive but low bias.

CV Score: $CV = \frac{1}{k} \sum_{i=1}^k \text{Error}_i$
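A minimal k-fold CV loop; the `fit` and `error` callables are placeholders for any model and loss (scikit-learn's `cross_val_score` packages the same idea):

```python
import numpy as np

def k_fold_cv(X, y, fit, error, k=5, seed=0):
    """Average validation error over k folds; fit(X, y) must return a predict function."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        predict = fit(X[train], y[train])
        scores.append(error(y[val], predict(X[val])))
    return np.mean(scores)

# Example usage with an OLS fit and MSE:
# cv_mse = k_fold_cv(
#     X, y,
#     fit=lambda Xtr, ytr: (lambda Xv: Xv @ np.linalg.lstsq(Xtr, ytr, rcond=None)[0]),
#     error=lambda yv, yp: np.mean((yv - yp) ** 2),
# )
```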

6. Residual Analysis

Residual Plot: Plot $e_i$ vs. $\hat{y}_i$. Should show no pattern (random scatter).

Normal Q-Q Plot: Check normality of residuals.

Leverage: $h_{ii} = [X(X^T X)^{-1} X^T]_{ii}$. High-leverage points have unusual $x$ values.

Cook's Distance: Measures the influence of each observation. $D_i = \frac{e_i^2}{p \hat{\sigma}^2} \cdot \frac{h_{ii}}{(1-h_{ii})^2}$

Large $D_i$ (typically $> 1$) indicates an influential point.

VIF (Variance Inflation Factor): Detects multicollinearity. $VIF_j = \frac{1}{1 - R_j^2}$ where $R_j^2$ is the $R^2$ from regressing $X_j$ on the other predictors. $VIF_j > 10$ suggests multicollinearity.
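A sketch of the leverage, Cook's distance, and VIF computations; the hat matrix is formed explicitly here for clarity, which is only practical for small $n$:

```python
import numpy as np

def influence_diagnostics(X, y):
    """Leverages h_ii and Cook's distances, per the formulas above."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
    h = np.diag(H)                                  # leverages
    resid = y - H @ y
    sigma2_hat = resid @ resid / (n - p)
    cooks_d = resid ** 2 / (p * sigma2_hat) * h / (1 - h) ** 2
    return h, cooks_d

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the remaining columns."""
    p = X.shape[1]
    out = np.empty(p)
    for j in range(p):
        others = np.delete(X, j, axis=1)
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = 1 / (1 - r2)
    return out
```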

Advanced Topics

1. Principal Component Regression (PCR)

Perform PCA on $X$, then regress $y$ on the principal components.

Steps:

  1. Compute the PCA (via SVD): $X = U \Sigma V^T$.
  2. Project: $Z = XV$ (PC scores).
  3. Regress: $y = Z\theta + \epsilon$.
  4. Back-transform: $\hat{\beta} = V\hat{\theta}$.

Advantage: Handles multicollinearity. Uses only top PCs (dimensionality reduction).
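A PCR sketch via SVD, keeping the top `k` components (centering of `X` and `y` is assumed to have been done beforehand):

```python
import numpy as np

def pcr_fit(X, y, k):
    """Principal component regression using the top k components."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    V_k = Vt[:k].T                        # p x k loading matrix
    Z = X @ V_k                           # PC scores
    theta = np.linalg.solve(Z.T @ Z, Z.T @ y)
    return V_k @ theta                    # back-transform to coefficients on original features
```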

2. Partial Least Squares (PLS)

Like PCR, but finds directions that maximize covariance with $y$ (supervised).

Advantage: Often better than PCR when $p \gg n$.

3. Quantile Regression

Estimates conditional quantiles instead of conditional mean.

Objective (for the $\tau$-th quantile): $\min_\beta \sum_{i=1}^n \rho_\tau(y_i - x_i^T \beta)$

where $\rho_\tau(u) = u(\tau - \mathbb{1}_{u < 0})$ is the check function.

$\tau = 0.5$: Median regression (robust to outliers).

4. Robust Regression

Less sensitive to outliers.

Huber Loss: Combines L2 (for small residuals) and L1 (for large residuals). $L_\delta(e) = \begin{cases} \frac{1}{2}e^2 & |e| \leq \delta \\ \delta(|e| - \frac{1}{2}\delta) & |e| > \delta \end{cases}$

RANSAC: Random Sample Consensus. Iteratively fit model to random subsets, find consensus.

5. Logistic Regression

For binary classification, $y \in \{0, 1\}$.

Model: $P(y=1 \mid x) = \frac{1}{1 + e^{-x^T \beta}} = \sigma(x^T \beta)$

Log-Odds (Logit): $\log \frac{P(y=1 \mid x)}{P(y=0 \mid x)} = x^T \beta$

Loss (Negative Log-Likelihood): $L(\beta) = -\sum_{i=1}^n [y_i \log \hat{y}_i + (1-y_i) \log(1-\hat{y}_i)]$

where $\hat{y}_i = \sigma(x_i^T \beta)$.

No closed-form solution. Optimize via Newton-Raphson or IRLS (Iteratively Reweighted Least Squares).

Regularization: L1 (Lasso) or L2 (Ridge) can be added.
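A gradient-descent sketch of the negative log-likelihood above (Newton/IRLS converges faster; the learning rate, iteration count, and optional L2 term are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_fit(X, y, lr=0.1, n_iter=2000, l2=0.0):
    """Minimize the mean NLL, optionally with an L2 penalty, by gradient descent."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        p_hat = sigmoid(X @ beta)
        grad = X.T @ (p_hat - y) / n + l2 * beta   # gradient of mean NLL (+ ridge term)
        beta -= lr * grad
    return beta
```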

6. Multinomial Logistic Regression (Softmax Regression)

Generalization to $K$ classes.

Model: $P(y=k \mid x) = \frac{e^{x^T \beta_k}}{\sum_{j=1}^K e^{x^T \beta_j}}$

Loss (Cross-Entropy): $L(\beta) = -\sum_{i=1}^n \sum_{k=1}^K \mathbb{1}_{y_i = k} \log P(y_i = k \mid x_i)$

7. Generalized Linear Models (GLM)

Framework that extends linear regression to non-Gaussian responses.

Components:

  1. Random Component: $y$ follows a distribution from the exponential family.
  2. Systematic Component: Linear predictor $\eta = X\beta$.
  3. Link Function: $g(\mu) = \eta$ where $\mu = E[y]$.

Examples:

  • Linear Regression: Identity link, Gaussian distribution.
  • Logistic Regression: Logit link, Binomial distribution.
  • Poisson Regression: Log link, Poisson distribution.
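As one concrete GLM, a Poisson regression sketch using statsmodels (the data here is synthetic and purely illustrative; the log link is the Poisson family's default):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = sm.add_constant(rng.normal(size=(n, 2)))           # intercept + 2 predictors
y = rng.poisson(lam=np.exp(X @ np.array([0.3, 0.5, -0.2])))

model = sm.GLM(y, X, family=sm.families.Poisson())     # log link by default
result = model.fit()
print(result.summary())
```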

Evaluation Metrics

1. Mean Squared Error (MSE)

$\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$

2. Root Mean Squared Error (RMSE)

$\text{RMSE} = \sqrt{\text{MSE}}$

Same units as $y$.

3. Mean Absolute Error (MAE)

$\text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|$

More robust to outliers than MSE.
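The three regression metrics in a few lines of NumPy:

```python
import numpy as np

def regression_metrics(y, y_hat):
    """MSE, RMSE, and MAE for a vector of predictions."""
    mse = np.mean((y - y_hat) ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(y - y_hat))
    return mse, rmse, mae
```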

4. F1 Score (Classification)

Harmonic mean of Precision and Recall. Used for classification (especially imbalanced data).

$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}$

$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}$

5. ROC-AUC (Classification)

ROC Curve: Plot TPR vs. FPR at various thresholds. AUC: Area Under the Curve. $AUC = 1$ is perfect, $AUC = 0.5$ is random.

6. Log Loss (Classification)

$\text{LogLoss} = -\frac{1}{n} \sum_{i=1}^n [y_i \log \hat{y}_i + (1-y_i) \log(1-\hat{y}_i)]$

Penalizes confident wrong predictions.
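The classification metrics above are available in scikit-learn; a quick usage sketch with made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import f1_score, log_loss, roc_auc_score

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.8, 0.65, 0.3, 0.9, 0.4, 0.55, 0.7])   # predicted P(y=1)
y_pred = (y_prob >= 0.5).astype(int)                             # threshold at 0.5

print("F1:      ", f1_score(y_true, y_pred))
print("ROC-AUC: ", roc_auc_score(y_true, y_prob))
print("Log loss:", log_loss(y_true, y_prob))
```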