Linear Models
Linear Regression, OLS, Regularization (Ridge/Lasso), and Evaluation Metrics
Ordinary Least Squares (OLS)
Finds the line (hyperplane) that minimizes the sum of squared vertical differences (residuals) between observed and predicted values.
Model: $y = X\beta + \varepsilon$, where $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, $\beta \in \mathbb{R}^p$, and $\varepsilon \in \mathbb{R}^n$.
Objective: $\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2$
Derivation: Set the gradient to zero: $\nabla_{\beta}\,\|y - X\beta\|_2^2 = -2X^\top(y - X\beta) = 0 \;\Rightarrow\; X^\top X\,\beta = X^\top y$.
Solution (Normal Equation): $\hat{\beta} = (X^\top X)^{-1} X^\top y$
Geometric Interpretation: $\hat{y} = X\hat{\beta}$ is the orthogonal projection of $y$ onto the column space of $X$.
Assumptions:
- Linearity: True relationship is linear.
- Independence: Errors are independent.
- Homoscedasticity: Constant variance $\operatorname{Var}(\varepsilon_i) = \sigma^2$.
- Normality: $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ (for inference).
- No multicollinearity: Columns of are linearly independent.
Properties of OLS Estimator
Unbiasedness: $\mathbb{E}[\hat{\beta}] = \beta$.
Variance: $\operatorname{Var}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1}$.
Gauss-Markov Theorem: Under the linearity, independence, and homoscedasticity assumptions (normality is not required), OLS is BLUE (Best Linear Unbiased Estimator) - it has minimum variance among all linear unbiased estimators.
Residuals: $e = y - \hat{y} = y - X\hat{\beta}$.
Residual Sum of Squares (RSS): $\text{RSS} = \sum_{i=1}^{n} e_i^2 = \|y - X\hat{\beta}\|_2^2$
Unbiased Estimator of $\sigma^2$: $\hat{\sigma}^2 = \dfrac{\text{RSS}}{n - p}$,
where $n - p$ is the residual degrees of freedom.
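A minimal NumPy sketch of the normal-equation fit and the unbiased variance estimate; the synthetic data and variable names are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix with intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Normal equation: beta_hat = (X^T X)^{-1} X^T y (use a linear solver, not an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

residuals = y - X @ beta_hat
rss = residuals @ residuals
sigma2_hat = rss / (n - p)          # unbiased estimate of sigma^2
print(beta_hat, sigma2_hat)
```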
Statistical Inference for OLS
Sampling Distribution (under normality assumption): $\hat{\beta} \sim \mathcal{N}\big(\beta,\; \sigma^2 (X^\top X)^{-1}\big)$
t-statistic for $H_0: \beta_j = 0$: $t_j = \dfrac{\hat{\beta}_j}{\operatorname{SE}(\hat{\beta}_j)}$, with $\operatorname{SE}(\hat{\beta}_j) = \hat{\sigma}\sqrt{[(X^\top X)^{-1}]_{jj}}$.
Follows a $t_{n-p}$ distribution under $H_0$.
F-statistic for overall significance ($H_0: \beta_1 = \dots = \beta_{p-1} = 0$): $F = \dfrac{(\text{TSS} - \text{RSS})/(p - 1)}{\text{RSS}/(n - p)}$
Follows an $F_{p-1,\, n-p}$ distribution under $H_0$.
Confidence Interval for $\beta_j$ (at level $1 - \alpha$): $\hat{\beta}_j \pm t_{n-p,\, 1-\alpha/2} \cdot \operatorname{SE}(\hat{\beta}_j)$
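Continuing the sketch above, the standard errors, t-statistics, p-values, and confidence intervals follow directly (SciPy is assumed for the t quantiles):

```python
import numpy as np
from scipy import stats

# Var(beta_hat) = sigma^2 (X^T X)^{-1}; reuse X, y, beta_hat, sigma2_hat, n, p from the OLS sketch
XtX_inv = np.linalg.inv(X.T @ X)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))        # standard errors SE(beta_j)
t_stats = beta_hat / se                            # t-statistics for H0: beta_j = 0
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p)

# 95% confidence intervals: beta_j +/- t_crit * SE(beta_j)
t_crit = stats.t.ppf(0.975, df=n - p)
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])
```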
Weighted Least Squares (WLS)
When errors have non-constant variance: $\operatorname{Var}(\varepsilon_i) = \sigma_i^2$.
Objective: $\hat{\beta}_{\text{WLS}} = \arg\min_{\beta} \sum_{i=1}^{n} w_i (y_i - x_i^\top \beta)^2$
Solution: $\hat{\beta}_{\text{WLS}} = (X^\top W X)^{-1} X^\top W y$
where $W = \operatorname{diag}(w_1, \dots, w_n)$, typically with $w_i = 1/\sigma_i^2$.
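A short WLS sketch, assuming the per-observation noise levels $\sigma_i$ are known (here they are simulated):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma_i = 0.2 + np.abs(X[:, 1])                 # heteroscedastic noise levels
y = X @ np.array([1.0, 3.0]) + rng.normal(scale=sigma_i)

w = 1.0 / sigma_i**2                            # weights w_i = 1 / sigma_i^2
W = np.diag(w)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```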
Generalized Least Squares (GLS)
When errors have an arbitrary (known) covariance structure: $\operatorname{Cov}(\varepsilon) = \Sigma$.
Solution: $\hat{\beta}_{\text{GLS}} = (X^\top \Sigma^{-1} X)^{-1} X^\top \Sigma^{-1} y$
Theorem (Aitken): When the errors are heteroscedastic and/or correlated with known $\Sigma$, GLS is BLUE.
Regularization
1. Ridge Regression (L2 Regularization)
Adds an L2 penalty to OLS to handle multicollinearity and prevent overfitting. Corresponds to a Gaussian prior on $\beta$ in the Bayesian framework.
Objective: $\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$
Solution: $\hat{\beta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$
Properties:
- Shrinkage: Coefficients shrink toward zero (but never exactly zero).
- Bias-Variance Tradeoff: Increases bias, decreases variance.
- Multicollinearity: Stabilizes estimates when $X^\top X$ is ill-conditioned.
- $\lambda \to 0$: Approaches OLS.
- $\lambda \to \infty$: $\hat{\beta}_{\text{ridge}} \to 0$.
Choosing $\lambda$: Cross-validation, GCV (Generalized Cross-Validation).
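A minimal sketch of the closed-form ridge solution; it assumes the columns of X are already centered/standardized and leaves the intercept out rather than penalizing it:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (X assumed centered/standardized, no intercept column)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=50)   # nearly collinear columns
y = X @ np.array([1.0, 1.0, 0.0, 0.5, -0.5]) + rng.normal(scale=0.1, size=50)

print(ridge_fit(X, y, lam=1.0))                  # stabilized relative to lam -> 0 (OLS)
```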
2. Lasso Regression (L1 Regularization)
Adds L1 penalty. Performs variable selection (sets some coefficients exactly to zero).
Objective: $\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$
No closed-form solution. Solved via:
- Coordinate Descent: Update one coefficient at a time.
- LARS (Least Angle Regression): Efficient path algorithm.
- Proximal Gradient (ISTA/FISTA): Iterative soft-thresholding.
Soft-Thresholding Operator: $S_{\lambda}(z) = \operatorname{sign}(z)\,\max(|z| - \lambda, 0)$ (used in the coordinate-descent sketch after the properties below).
Properties:
- Sparsity: Some coefficients are exactly zero, $\hat{\beta}_j = 0$ (automatic feature selection).
- Convex: Efficient to solve.
- Corresponds to a Laplace prior on $\beta$ in the Bayesian framework (MAP estimate).
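A minimal coordinate-descent sketch in plain NumPy on synthetic data; it assumes the commonly used $\tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ scaling of the objective, so the threshold level is exactly $\lambda$:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S_t(z) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2) ||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)                   # ||x_j||^2 for each column
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]  # partial residual excluding x_j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=100)
print(lasso_cd(X, y, lam=5.0))                      # most coefficients end up exactly zero
```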
3. Elastic Net
Combines L1 and L2 penalties.
Objective: $\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$
or equivalently: $\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda\big(\alpha \|\beta\|_1 + (1 - \alpha)\|\beta\|_2^2\big)$
Advantages:
- Handles correlated predictors better than Lasso.
- Performs grouping (selects groups of correlated variables).
- $\alpha = 1$: Lasso. $\alpha = 0$: Ridge.
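As one possible illustration, scikit-learn's ElasticNet exposes this parameterization through alpha (overall strength) and l1_ratio (the $\alpha$ mix above); the data and values below are arbitrary:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 20))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=200)   # a pair of highly correlated predictors
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200)

# l1_ratio=1.0 -> Lasso, l1_ratio=0.0 -> Ridge; intermediate values mix the two penalties
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, y)
print(model.coef_[:2])   # correlated predictors tend to be kept together (grouping effect)
```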
4. Group Lasso
Encourages sparsity at the group level.
Objective: $\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \sum_{g=1}^{G} \sqrt{p_g}\, \|\beta_g\|_2$
where $\beta_g$ are the coefficients in group $g$ and $p_g$ is the size of group $g$.
Bayesian Linear Regression
Probabilistic approach where we place a prior on the weights and compute the full posterior distribution instead of a point estimate.
Model: $y = X\beta + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$.
Likelihood: $p(y \mid X, \beta) = \mathcal{N}(y \mid X\beta, \sigma^2 I)$
Prior (Conjugate Gaussian): $p(\beta) = \mathcal{N}(\beta \mid 0, \tau^2 I)$
Posterior: Since a Gaussian prior is conjugate to a Gaussian likelihood, the posterior is also Gaussian: $p(\beta \mid X, y) = \mathcal{N}(\beta \mid \mu_N, \Sigma_N)$
Mean and Variance: $\Sigma_N = \left(\tfrac{1}{\sigma^2} X^\top X + \tfrac{1}{\tau^2} I\right)^{-1}, \qquad \mu_N = \tfrac{1}{\sigma^2}\,\Sigma_N X^\top y$
Interpretation:
- Mean: Equivalent to the Ridge Regression estimate (MAP) with $\lambda = \sigma^2 / \tau^2$.
- Variance: Captures uncertainty in weights.
Predictive Distribution: For a new input $x_*$, integrating over the posterior on $\beta$:
This yields a Gaussian: $p(y_* \mid x_*, X, y) = \mathcal{N}\big(y_* \mid x_*^\top \mu_N,\; \sigma^2 + x_*^\top \Sigma_N x_*\big)$
Variance Decomposition:
- $\sigma^2$: Irreducible (aleatoric) uncertainty.
- $x_*^\top \Sigma_N x_*$: Model (epistemic) uncertainty.
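A compact sketch of the posterior and the predictive variance, treating $\sigma^2$ and $\tau^2$ as known hyperparameters (synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
sigma2, tau2 = 0.25, 1.0                        # known noise variance and prior variance
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Posterior: Sigma_N = (X^T X / sigma^2 + I / tau^2)^{-1}, mu_N = Sigma_N X^T y / sigma^2
Sigma_N = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / tau2)
mu_N = Sigma_N @ X.T @ y / sigma2

# Predictive distribution for a new input x_star
x_star = rng.normal(size=p)
pred_mean = x_star @ mu_N
pred_var = sigma2 + x_star @ Sigma_N @ x_star   # aleatoric + epistemic parts
```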
Model Selection and Diagnostics
1. Coefficient of Determination ($R^2$)
Proportion of variance in the dependent variable explained by the model: $R^2 = 1 - \dfrac{\text{RSS}}{\text{TSS}}$
- $\text{RSS} = \sum_i (y_i - \hat{y}_i)^2$ (Residual Sum of Squares)
- $\text{TSS} = \sum_i (y_i - \bar{y})^2$ (Total Sum of Squares)
Range: $0 \le R^2 \le 1$ for OLS with an intercept on the training data. $R^2 = 1$ is a perfect fit.
Interpretation: $R^2 = 0.7$ means 70% of the variance is explained by the model.
Issue: $R^2$ never decreases when more features are added (even if they are irrelevant).
2. Adjusted $R^2$
Penalizes model complexity: $R^2_{\text{adj}} = 1 - \dfrac{(1 - R^2)(n - 1)}{n - p - 1}$, where $n$ is the sample size and $p$ is the number of predictors (excluding the intercept).
Can decrease when adding irrelevant features.
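Both quantities take a few lines of NumPy; in this sketch p counts the predictors excluding the intercept, matching the formula above:

```python
import numpy as np

def r2_scores(y, y_hat, p):
    """Return (R^2, adjusted R^2) for a fit with p predictors plus an intercept."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adj_r2
```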
3. Akaike Information Criterion (AIC)
Information-theoretic model selection criterion.
$\text{AIC} = 2k - 2\ln \hat{L}$, where $k$ is the number of parameters and $\hat{L}$ is the maximized likelihood.
For linear regression (Gaussian errors): $\text{AIC} = n \ln(\text{RSS}/n) + 2k$ (up to an additive constant).
Lower is better. Penalizes complexity.
4. Bayesian Information Criterion (BIC)
Similar to AIC but with a stronger penalty: $\text{BIC} = k \ln n - 2\ln \hat{L}$.
For linear regression: $\text{BIC} = n \ln(\text{RSS}/n) + k \ln n$ (up to an additive constant).
BIC tends to select simpler models than AIC.
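For Gaussian linear regression both criteria reduce to functions of the RSS; this sketch drops the additive constant, which cancels when comparing models fit on the same data:

```python
import numpy as np

def aic_bic(y, y_hat, k):
    """AIC and BIC for Gaussian linear regression with k parameters (up to a constant)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    return aic, bic
```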
5. Cross-Validation
Estimate generalization error by splitting data.
k-Fold CV:
- Split data into $k$ folds.
- Train on $k - 1$ folds, validate on the remaining fold.
- Repeat for all folds.
- Average error.
Leave-One-Out CV (LOOCV): $k = n$. Expensive but low bias.
CV Score: $\text{CV}_{(k)} = \dfrac{1}{k} \sum_{j=1}^{k} \text{MSE}_j$
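A plain NumPy k-fold loop; fit_fn and predict_fn are illustrative placeholders for any model, shown here with the OLS normal equation:

```python
import numpy as np

def k_fold_cv_mse(X, y, fit_fn, predict_fn, k=5, seed=0):
    """Average validation MSE over k folds; fit_fn(X, y) -> model, predict_fn(model, X) -> y_hat."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errors = []
    for j in range(k):
        val = folds[j]
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        model = fit_fn(X[train], y[train])
        y_hat = predict_fn(model, X[val])
        errors.append(np.mean((y[val] - y_hat) ** 2))
    return np.mean(errors)

# Example: cross-validate a plain OLS fit
ols_fit = lambda X, y: np.linalg.solve(X.T @ X, X.T @ y)
ols_pred = lambda beta, X: X @ beta

rng = np.random.default_rng(7)
X = np.column_stack([np.ones(60), rng.normal(size=(60, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.3, size=60)
print(k_fold_cv_mse(X, y, ols_fit, ols_pred, k=5))
```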
6. Residual Analysis
Residual Plot: Plot residuals $e_i$ vs fitted values $\hat{y}_i$. Should show no pattern (random scatter).
Normal Q-Q Plot: Check normality of residuals.
Leverage: $h_{ii} = [X(X^\top X)^{-1}X^\top]_{ii}$. High-leverage points have unusual predictor values.
Cook's Distance: Measures the influence of each observation: $D_i = \dfrac{e_i^2}{p\,\hat{\sigma}^2} \cdot \dfrac{h_{ii}}{(1 - h_{ii})^2}$
Large $D_i$ (typically $D_i > 1$) indicates an influential point.
VIF (Variance Inflation Factor): Detects multicollinearity. $\text{VIF}_j = \dfrac{1}{1 - R_j^2}$, where $R_j^2$ is the $R^2$ from regressing $x_j$ on the other predictors. $\text{VIF}_j > 10$ suggests serious multicollinearity.
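Leverages, Cook's distances, and VIFs can all be computed directly from the design matrix and residuals; this sketch assumes column 0 of X is the intercept:

```python
import numpy as np

def ols_diagnostics(X, y):
    """Leverages, Cook's distances, and VIFs for an OLS fit (X includes an intercept in column 0)."""
    n, p = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ beta
    sigma2 = e @ e / (n - p)

    H = X @ np.linalg.inv(X.T @ X) @ X.T                 # hat matrix
    h = np.diag(H)                                       # leverages h_ii
    cooks_d = (e**2 / (p * sigma2)) * h / (1 - h)**2     # Cook's distances

    vif = []                                             # VIF for each non-intercept column
    for j in range(1, p):
        others = np.delete(X, j, axis=1)
        gamma = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        resid = X[:, j] - others @ gamma
        r2_j = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean())**2)
        vif.append(1.0 / (1.0 - r2_j))
    return h, cooks_d, np.array(vif)
```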
Advanced Topics
1. Principal Component Regression (PCR)
Perform PCA on $X$, then regress $y$ on the principal components.
Steps:
- Compute the PCA (e.g. via SVD): $X = U D V^\top$.
- Project: $Z = X V_k$ (PC scores for the top $k$ components).
- Regress: $y = Z\gamma + \varepsilon$, estimate $\hat{\gamma}$ by OLS.
- Back-transform: $\hat{\beta} = V_k \hat{\gamma}$.
Advantage: Handles multicollinearity. Uses only the top $k$ PCs (dimensionality reduction).
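A compact PCR sketch via the SVD; X is centered inside the function, and (as an assumption of this sketch) the intercept for prediction on centered inputs is simply the mean of y:

```python
import numpy as np

def pcr_fit(X, y, k):
    """Principal component regression: regress y on the top-k PC scores of centered X."""
    Xc = X - X.mean(axis=0)
    U, D, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:k].T                          # top-k principal directions
    Z = Xc @ V_k                            # PC scores
    gamma = np.linalg.solve(Z.T @ Z, Z.T @ y)
    beta = V_k @ gamma                      # back-transform to original feature space
    return beta, y.mean()                   # predictions: (X_new - X.mean(axis=0)) @ beta + y.mean()
```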
2. Partial Least Squares (PLS)
Like PCR, but finds directions that maximize covariance with $y$ (supervised).
Advantage: Often outperforms PCR when the high-variance directions of $X$ are not the ones most predictive of $y$; it typically needs fewer components.
3. Quantile Regression
Estimates conditional quantiles instead of conditional mean.
Objective (for the $\tau$-th quantile): $\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \rho_\tau(y_i - x_i^\top \beta)$
where $\rho_\tau(u) = u\,(\tau - \mathbb{1}\{u < 0\})$ is the check (pinball) function.
$\tau = 0.5$: Median regression (robust to outliers).
4. Robust Regression
Less sensitive to outliers.
Huber Loss: Combines L2 (for small residuals) and L1 (for large residuals).
RANSAC: Random Sample Consensus. Iteratively fit model to random subsets, find consensus.
5. Logistic Regression
For binary classification, $y_i \in \{0, 1\}$.
Model: $P(y = 1 \mid x) = \sigma(x^\top \beta) = \dfrac{1}{1 + e^{-x^\top \beta}}$
Log-Odds (Logit): $\log \dfrac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = x^\top \beta$
Loss (Negative Log-Likelihood): $\mathcal{L}(\beta) = -\sum_{i=1}^{n} \big[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\big]$
where $\hat{p}_i = \sigma(x_i^\top \beta)$.
No closed-form solution. Optimize via Newton-Raphson or IRLS (Iteratively Reweighted Least Squares).
Regularization: L1 (Lasso) or L2 (Ridge) can be added.
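A minimal Newton-Raphson (equivalently IRLS) sketch for unregularized logistic regression; in practice a small ridge term is often added to the Hessian for numerical stability, but it is omitted here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, n_iter=25):
    """Newton-Raphson / IRLS for logistic regression with y in {0, 1}."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        prob = sigmoid(X @ beta)
        W = prob * (1 - prob)                 # diagonal of the IRLS weight matrix
        grad = X.T @ (y - prob)               # gradient of the log-likelihood
        H = X.T @ (X * W[:, None])            # X^T W X
        beta += np.linalg.solve(H, grad)      # Newton update
    return beta

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
p_true = sigmoid(X @ np.array([-0.5, 2.0, -1.0]))
y = rng.binomial(1, p_true)
print(logistic_newton(X, y))
```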
6. Multinomial Logistic Regression (Softmax Regression)
Generalization to $K$ classes.
Model: $P(y = k \mid x) = \dfrac{e^{x^\top \beta_k}}{\sum_{j=1}^{K} e^{x^\top \beta_j}}$
Loss (Cross-Entropy): $\mathcal{L} = -\sum_{i=1}^{n} \sum_{k=1}^{K} \mathbb{1}\{y_i = k\}\, \log P(y_i = k \mid x_i)$
7. Generalized Linear Models (GLM)
Framework that extends linear regression to non-Gaussian responses.
Components:
- Random Component: $y_i$ follows a distribution from the exponential family.
- Systematic Component: Linear predictor $\eta_i = x_i^\top \beta$.
- Link Function: $g(\mu_i) = \eta_i$, where $\mu_i = \mathbb{E}[y_i]$.
Examples:
- Linear Regression: Identity link, Gaussian distribution.
- Logistic Regression: Logit link, Binomial distribution.
- Poisson Regression: Log link, Poisson distribution.
Evaluation Metrics
1. Mean Squared Error (MSE)
$\text{MSE} = \dfrac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
2. Root Mean Squared Error (RMSE)
$\text{RMSE} = \sqrt{\text{MSE}}$. Same units as $y$.
3. Mean Absolute Error (MAE)
$\text{MAE} = \dfrac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$. More robust to outliers than MSE.
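The three regression metrics above in a few lines of NumPy:

```python
import numpy as np

def regression_metrics(y, y_hat):
    """Return (MSE, RMSE, MAE) for predictions y_hat against targets y."""
    err = y - y_hat
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    return mse, rmse, mae
```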
4. F1 Score (Classification)
Harmonic mean of Precision and Recall: $F_1 = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$. Used for classification (especially with imbalanced data).
5. ROC-AUC (Classification)
ROC Curve: Plot TPR vs FPR at various thresholds. AUC: Area Under the Curve. $\text{AUC} = 1$ is perfect, $\text{AUC} = 0.5$ is random.
6. Log Loss (Classification)
$\text{LogLoss} = -\dfrac{1}{n}\sum_{i=1}^{n} \big[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\big]$. Penalizes confident wrong predictions.
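A clipped implementation sketch; the clipping guards against log(0) when predicted probabilities are exactly 0 or 1:

```python
import numpy as np

def log_loss(y, p_hat, eps=1e-15):
    """Binary cross-entropy; y in {0, 1}, p_hat = predicted P(y = 1)."""
    p = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```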