
Machine Learning Algorithms

K-Means, KNN, SVM, PCA, t-SNE, and Johnson-Lindenstrauss

Unsupervised Learning

1. K-Means Clustering

Partition $n$ observations into $k$ clusters $S_1, \dots, S_k$ to minimize the within-cluster sum of squares (variance).

Objective: $\min_S \sum_{i=1}^k \sum_{x \in S_i} \|x - \mu_i\|^2$ where $\mu_i$ is the mean of the points in $S_i$.

Algorithm (Lloyd's):

  1. Initialization: Randomly select $k$ centroids (or use K-Means++).
  2. Assignment: Assign each observation to the cluster with the nearest mean: $S_i^{(t)} = \{x_p : \|x_p - \mu_i^{(t)}\|^2 \le \|x_p - \mu_j^{(t)}\|^2 \ \forall j\}$
  3. Update: Recalculate the means (centroids) of the observations assigned to each cluster: $\mu_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j$
  4. Repeat 2-3 until convergence.

K-Means++ Initialization: Choose the first centroid randomly, then choose each subsequent centroid with probability proportional to its squared distance from the nearest existing centroid. Yields an $O(\log k)$-approximation in expectation.

Complexity: $O(nkdT)$ where $n$ = points, $k$ = clusters, $d$ = dimensions, $T$ = iterations.
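
A minimal NumPy sketch of Lloyd's algorithm, assuming plain random initialization rather than K-Means++ (function and variable names are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids (and hence assignments) stopped moving
        centroids = new_centroids
    return centroids, labels

# Toy usage: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
```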

Limitations:

  • Assumes spherical clusters of similar size.
  • Sensitive to initialization and outliers.
  • Must specify $k$ in advance.

Choosing $k$: Elbow method (plot inertia vs. $k$), Silhouette analysis.

2. Hierarchical Clustering

Build a tree of clusters.

Agglomerative (Bottom-Up):

  1. Start with each point as a cluster.
  2. Merge closest clusters.
  3. Repeat until one cluster remains.

Linkage Criteria:

  • Single: $\min\{d(a, b) : a \in A, b \in B\}$ (can form long chains).
  • Complete: $\max\{d(a, b) : a \in A, b \in B\}$ (compact clusters).
  • Average: Average of all pairwise distances.
  • Ward: Minimizes within-cluster variance (similar to K-Means).

Dendrogram: Tree visualization. Cut at the desired height to get $k$ clusters.

Complexity: $O(n^3)$ in general, or $O(n^2 \log n)$ with efficient data structures.
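
A short SciPy sketch of agglomerative clustering with Ward linkage, cutting the dendrogram into a fixed number of clusters (the toy data are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 3)                        # toy data: 20 points in 3-D
Z = linkage(X, method="ward")                    # bottom-up merge tree (linkage matrix)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters
```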

3. DBSCAN (Density-Based Spatial Clustering)

Clusters based on density. Can find arbitrary-shaped clusters and identify outliers.

Parameters:

  • $\epsilon$: Neighborhood radius.
  • MinPts: Minimum points to form dense region.

Point Types:

  • Core: Has $\geq$ MinPts points within $\epsilon$.
  • Border: Within $\epsilon$ of a core point but not a core point itself.
  • Noise: Neither core nor border.

Algorithm:

  1. For each unvisited point, check if it's a core point.
  2. If core, start a new cluster and expand by adding all reachable points.
  3. Points not assigned are noise.

Advantages: No need to specify $k$, finds arbitrary shapes, handles noise. Disadvantages: Sensitive to $\epsilon$ and MinPts, struggles with varying densities.
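
A quick scikit-learn sketch on toy data; `eps` and `min_samples` correspond to $\epsilon$ and MinPts, and the label $-1$ marks noise points:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 8])
db = DBSCAN(eps=0.5, min_samples=5).fit(X)  # eps = neighborhood radius, min_samples = MinPts
labels = db.labels_                         # cluster ids; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```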

4. Gaussian Mixture Models (GMM)

Probabilistic model assuming the data come from a mixture of Gaussians.

Model: $P(x) = \sum_{k=1}^K \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$

where $\sum_k \pi_k = 1$ (mixing coefficients).

EM Algorithm for GMM:

  • E-Step: Compute responsibilities (posterior probabilities): $\gamma_{ik} = \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$
  • M-Step: Update parameters using weighted MLE: $\pi_k = \frac{1}{n}\sum_i \gamma_{ik}$, $\mu_k = \frac{\sum_i \gamma_{ik} x_i}{\sum_i \gamma_{ik}}$, $\Sigma_k = \frac{\sum_i \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_i \gamma_{ik}}$

Advantages: Soft assignments, probabilistic, can model elliptical clusters.
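
A scikit-learn sketch on toy data; `predict_proba` returns the responsibilities $\gamma_{ik}$ from the E-step:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.randn(200, 2), np.random.randn(200, 2) + 4])
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
resp = gmm.predict_proba(X)        # soft assignments (responsibilities)
hard = gmm.predict(X)              # hard labels via argmax of responsibilities
print(gmm.weights_, gmm.means_)    # learned mixing coefficients and means
```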

5. Principal Component Analysis (PCA)

Orthogonal linear transformation that projects data to a new coordinate system where the greatest variance lies along the first coordinate (PC1), the second-greatest along the second, and so on.

Objective: Find directions of maximum variance.

Algorithm (Covariance Method):

  1. Center data: $X \leftarrow X - \bar{X}$.
  2. Compute Covariance Matrix: $C = \frac{1}{n-1} X^T X$.
  3. Eigen Decomposition: $Cv = \lambda v$.
  4. Sort eigenvectors by eigenvalues (descending).
  5. Project $X$ onto the top $k$ eigenvectors $W_k$: $Z = X W_k$.

SVD Approach: $X = U \Sigma V^T$. The principal component directions are the columns of $V$; the PC scores are $U \Sigma$.
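
A minimal NumPy sketch of PCA via the SVD of the centered data matrix (names are illustrative):

```python
import numpy as np

def pca(X, k):
    """PCA via SVD: directions are rows of V^T, scores are the projections."""
    Xc = X - X.mean(axis=0)                   # 1. center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                       # top-k principal directions
    scores = Xc @ components.T                # Z = X W_k (equivalently U * S, truncated)
    explained = (S ** 2) / (S ** 2).sum()     # variance-explained ratio per component
    return scores, components, explained[:k]

X = np.random.randn(100, 5) @ np.random.randn(5, 5)  # correlated toy data
Z, W, ratio = pca(X, k=2)
```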

Variance Explained: $\frac{\lambda_i}{\sum_j \lambda_j}$

Whitening: Transform to unit variance: $Z = X W_k \Lambda_k^{-1/2}$ where $\Lambda_k = \text{diag}(\lambda_1, \ldots, \lambda_k)$.

Kernel PCA: Apply PCA in feature space via kernel trick. Captures nonlinear relationships.

6. Independent Component Analysis (ICA)

Finds statistically independent components. Unlike PCA, components are maximally independent (not just uncorrelated).

Cocktail Party Problem: Separate mixed audio sources.

Objective: Maximize non-Gaussianity (e.g., kurtosis, negentropy).

Algorithms: FastICA, Infomax.

7. t-SNE (t-Distributed Stochastic Neighbor Embedding)

Non-linear dimensionality reduction for visualization. Minimizes the KL divergence between joint probabilities in the high-dimensional space ($P$) and the low-dimensional space ($Q$).

High-dimensional similarity ($p_{j|i}$): Gaussian. $p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$

Symmetrize: $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$.

Low-dimensional similarity ($q_{ij}$): Student t-distribution (1 degree of freedom). $q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$

Objective: $\min C = \sum_i D_{KL}(P_i \| Q_i) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$

Optimized via Gradient Descent with momentum.

Perplexity: Balances local vs global structure. Typical: 5-50.
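
A typical scikit-learn invocation on synthetic data; the perplexity value is just one choice from the usual 5-50 range:

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.randn(500, 50)                # toy high-dimensional data
emb = TSNE(n_components=2, perplexity=30,
           init="pca", random_state=0).fit_transform(X)  # 2-D embedding for plotting
```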

Properties:

  • Preserves local structure well.
  • Computationally expensive: $O(n^2)$, reducible to $O(n \log n)$ with the Barnes-Hut approximation.
  • Non-convex, sensitive to hyperparameters.

8. UMAP (Uniform Manifold Approximation and Projection)

Modern alternative to t-SNE. Faster, preserves global structure better.

Based on: Riemannian geometry and algebraic topology.

Advantages: Faster, scales better, preserves both local and global structure.

9. Autoencoders

Neural network that learns compressed representation.

Architecture: Encoder $f$, decoder $g$. Minimize reconstruction error: $\min \sum_i \|x_i - g(f(x_i))\|^2$

Variants:

  • Variational Autoencoder (VAE): Learns probabilistic latent space. Regularized with KL divergence.
  • Denoising Autoencoder: Trained to reconstruct from corrupted input.
  • Sparse Autoencoder: Adds sparsity penalty on hidden activations.

10. Dimensionality Reduction Theory

Johnson-Lindenstrauss Lemma: A set of $m$ points in high-dimensional space $\mathbb{R}^N$ can be embedded into $\mathbb{R}^k$ (where $k < N$) such that distances between points are nearly preserved.

Bound: $k \ge \frac{4 \ln m}{\epsilon^2 / 2 - \epsilon^3 / 3}$

Distances are preserved within a factor of $(1 \pm \epsilon)$ with high probability.

Method: Random Projection using a matrix $R \in \mathbb{R}^{k \times N}$ with entries $\sim \mathcal{N}(0, 1/k)$ or Rademacher ($\pm 1/\sqrt{k}$).
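
A minimal NumPy sketch of the Gaussian random-projection construction, using the bound above to choose $k$ (all names and sizes are illustrative):

```python
import numpy as np

def random_projection(X, k, seed=0):
    """Project N-dimensional points to k dimensions with a Gaussian random matrix."""
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, N))  # entries ~ N(0, 1/k)
    return X @ R.T

m, N, eps = 500, 5000, 0.2
k = int(np.ceil(4 * np.log(m) / (eps**2 / 2 - eps**3 / 3)))  # JL bound on target dimension
X = np.random.randn(m, N)
Y = random_projection(X, k)
# Pairwise distances in Y match those in X within (1 ± eps) with high probability.
```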

Applications: Fast nearest neighbor search, compressed sensing.

Supervised Learning

1. K-Nearest Neighbors (KNN)

Instance-based learning. Classify a new point based on the majority vote of its $k$ nearest neighbors.

Algorithm:

  1. Compute distances from $x_{\text{new}}$ to all training points.
  2. Select the $k$ nearest neighbors.
  3. Classification: Majority vote. Regression: Average of the neighbors' values (see the sketch below).
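
A minimal NumPy sketch of the brute-force classifier (Euclidean distance, majority vote; names are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=5):
    """Classify each query point by majority vote of its k nearest neighbors."""
    preds = []
    for x in X_new:
        d = np.linalg.norm(X_train - x, axis=1)           # distances to all training points
        idx = np.argsort(d)[:k]                           # indices of the k nearest
        preds.append(np.bincount(y_train[idx]).argmax())  # majority vote
    return np.array(preds)

# Toy usage with two Gaussian classes.
X_train = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 3])
y_train = np.array([0] * 50 + [1] * 50)
y_hat = knn_predict(X_train, y_train, np.array([[0.0, 0.0], [3.0, 3.0]]), k=5)
```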

Distance Metrics:

  • Euclidean: $\|x-y\|_2 = \sqrt{\sum_i (x_i - y_i)^2}$
  • Manhattan: $\|x-y\|_1 = \sum_i |x_i - y_i|$
  • Minkowski: $\|x-y\|_p = \left(\sum_i |x_i - y_i|^p\right)^{1/p}$
  • Cosine: $1 - \frac{x \cdot y}{\|x\| \, \|y\|}$
  • Mahalanobis: $\sqrt{(x-y)^T \Sigma^{-1} (x-y)}$ (accounts for covariance)

Choosing $k$:

  • Small $k$: Low bias, high variance (sensitive to noise).
  • Large $k$: High bias, low variance (smooth boundaries).
  • Use cross-validation.

Weighted KNN: Weight by inverse distance.

Complexity: $O(nd)$ per query (naive). A KD-Tree or Ball Tree gives roughly $O(\log n)$ per query in low dimensions.

Curse of Dimensionality: Performance degrades in high dimensions as distances become similar.

2. Decision Trees

Hierarchical model that splits data based on feature values.

CART (Classification and Regression Trees):

Splitting Criterion (Classification):

  • Gini Impurity: $G = 1 - \sum_{i=1}^C p_i^2$ where $p_i$ is the fraction of class $i$ (see the sketch after this list).
  • Entropy: $H = -\sum_{i=1}^C p_i \log p_i$.
  • Misclassification Rate: $1 - \max_i p_i$.
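
Small helper functions for the two most common impurity measures, computed from a node's class labels (a sketch; names are illustrative):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_i p_i^2 over the class proportions in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum_i p_i log2(p_i) over the class proportions in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

y = np.array([0, 0, 1, 1, 1, 2])
print(gini(y), entropy(y))   # impurity of a mixed node
```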

Splitting Criterion (Regression):

  • Variance Reduction: Choose the split that minimizes $\sum_{\text{left}} (y_i - \bar{y}_{\text{left}})^2 + \sum_{\text{right}} (y_i - \bar{y}_{\text{right}})^2$.

Algorithm:

  1. Find best split (feature and threshold) that maximizes information gain.
  2. Recursively split child nodes.
  3. Stop when: max depth reached, min samples per leaf, or no improvement.

Pruning: Remove branches to prevent overfitting.

  • Pre-pruning: Stop early based on criteria.
  • Post-pruning: Grow full tree, then remove branches (cost-complexity pruning).

Advantages: Interpretable, handles non-linear relationships, no scaling needed. Disadvantages: High variance (unstable), can overfit, axis-aligned splits only.

3. Random Forest

Ensemble of decision trees. Reduces variance via bagging and feature randomness.

Algorithm:

  1. For $b = 1$ to $B$ (number of trees):
    • Sample $n$ training examples with replacement (bootstrap).
    • At each split, consider a random subset of $m \approx \sqrt{p}$ features.
    • Grow tree to maximum depth (no pruning).
  2. Prediction: Average (regression) or majority vote (classification).

Out-of-Bag (OOB) Error: Evaluate each tree on samples not in its bootstrap sample. Provides validation without separate test set.
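
A scikit-learn sketch on synthetic data showing $\sqrt{p}$ feature subsampling, the OOB estimate, and impurity-based importances:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(
    n_estimators=200,      # number of trees B
    max_features="sqrt",   # random subset of ~sqrt(p) features at each split
    oob_score=True,        # evaluate each tree on its out-of-bag samples
    random_state=0,
).fit(X, y)
print(rf.oob_score_)                # OOB accuracy: built-in validation estimate
print(rf.feature_importances_[:5])  # impurity-based feature importances
```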

Feature Importance: Measure decrease in impurity when splitting on feature, averaged across all trees.

Advantages: Low variance, parallel, handles high-dimensional data, robust to outliers. Disadvantages: Less interpretable, can be slow for large forests.

4. Gradient Boosting

Ensemble method that builds trees sequentially, each correcting errors of previous ones.

Algorithm (for regression):

  1. Initialize model with a constant: $F_0(x) = \arg\min_\gamma \sum_{i=1}^n L(y_i, \gamma)$.
  2. For $m = 1$ to $M$:
    • Compute pseudo-residuals: $r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F=F_{m-1}}$.
    • Fit a regression tree to the $r_{im}$, giving terminal regions $R_{jm}$.
    • For each region, compute $\gamma_{jm} = \arg\min_\gamma \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)$.
    • Update: $F_m(x) = F_{m-1}(x) + \nu \sum_{j} \gamma_{jm} \mathbb{1}[x \in R_{jm}]$.

where $\nu$ is the learning rate (shrinkage).
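
A minimal from-scratch sketch for squared loss, where the pseudo-residuals reduce to ordinary residuals and the tree's leaf means already give the optimal $\gamma_{jm}$ (built on scikit-learn's DecisionTreeRegressor; names and data are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_trees=100, nu=0.1, max_depth=3):
    """Gradient boosting for squared loss: each tree fits the current residuals."""
    f0 = y.mean()                        # F_0: constant that minimizes squared loss
    F = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        r = y - F                        # pseudo-residuals (negative gradient of squared loss)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        F = F + nu * tree.predict(X)     # shrunken update with learning rate nu
        trees.append(tree)
    return f0, trees

def predict(f0, trees, X, nu=0.1):
    return f0 + nu * sum(t.predict(X) for t in trees)

X = np.random.randn(300, 5)
y = X[:, 0] ** 2 + 0.1 * np.random.randn(300)
f0, trees = gradient_boost(X, y)
y_hat = predict(f0, trees, X)
```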

Variants:

  • XGBoost: Adds regularization, parallel processing, handles missing values.
  • LightGBM: Histogram-based, leaf-wise growth. Very fast.
  • CatBoost: Handles categorical features natively, reduces overfitting.

Advantages: State-of-the-art performance on tabular data, handles mixed data types. Disadvantages: Prone to overfitting if not tuned, sequential (slow to train).

5. Support Vector Machines (SVM)

Finds hyperplane that maximizes the margin between classes.

Primal Problem (Hard Margin): $\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{s.t. } y_i(w^T x_i + b) \ge 1, \ \forall i$

Margin: $\frac{2}{\|w\|}$. Maximizing the margin is equivalent to minimizing $\|w\|^2$.

Support Vectors: Training points on the margin. Only these affect the solution.

Soft Margin (Hinge Loss): Allow some misclassifications with slack variables $\xi_i$.

$\min_{w, b} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t. } y_i(w^T x_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0$

$C$: Regularization parameter. Large $C$ = less regularization (closer to a hard margin). Small $C$ = more regularization (softer margin).

Dual Formulation (using Lagrange multipliers): $\max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{s.t. } 0 \le \alpha_i \le C, \ \sum_{i=1}^n \alpha_i y_i = 0$

Decision Function: $f(x) = \text{sign}\left(\sum_{i \in SV} \alpha_i y_i x_i^T x + b\right)$

Kernel Trick: Map $x \to \phi(x)$. Compute dot products in the high-dimensional space using $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ without explicitly computing $\phi$.

Common Kernels:

  • Linear: $K(x, y) = x^T y$
  • Polynomial: $K(x, y) = (x^T y + c)^d$
  • RBF (Gaussian): $K(x, y) = \exp(-\gamma \|x - y\|^2)$
  • Sigmoid: $K(x, y) = \tanh(\alpha x^T y + c)$

RBF $\gamma$ parameter: Controls the influence radius. Small $\gamma$ = wide influence (smooth boundary). Large $\gamma$ = narrow influence (complex boundary).
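
A scikit-learn sketch on a non-linearly separable toy problem; note the scaling step, with `C` and `gamma` playing the roles described above:

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
# Feature scaling matters for SVMs; C and gamma control regularization and kernel width.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.5)).fit(X, y)
print(clf.score(X, y))                  # training accuracy
print(clf[-1].support_vectors_.shape)   # only the support vectors define the boundary
```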

Multi-Class SVM:

  • One-vs-Rest (OvR): Train $K$ binary classifiers.
  • One-vs-One (OvO): Train $\binom{K}{2}$ classifiers, one for each pair of classes.

Advantages: Effective in high dimensions, memory efficient (only uses support vectors), versatile (kernels). Disadvantages: Slow for large datasets ($O(n^2)$ to $O(n^3)$ training), sensitive to feature scaling and to $C$, $\gamma$.

6. Naive Bayes

See Statistics section. Probabilistic classifier based on Bayes' Theorem with independence assumptions.

7. Ensemble Methods

Bagging (Bootstrap Aggregating): Train models on bootstrap samples, average predictions. Reduces variance.

Boosting: Train models sequentially, each correcting previous errors. Reduces bias. Examples: AdaBoost, Gradient Boosting.

Stacking: Train meta-model on predictions of base models.

Voting:

  • Hard Voting: Majority vote.
  • Soft Voting: Average predicted probabilities.

Recommendation Systems

1. Collaborative Filtering

Recommend based on user-item interactions.

User-Based: Find similar users, recommend items they liked.

Item-Based: Find similar items, recommend items similar to what user liked.

Similarity Measures: Cosine, Pearson correlation.

2. Matrix Factorization

Decompose the user-item matrix $R \approx UV^T$ where $U$ (users) and $V$ (items) are low-rank factor matrices.

Objective: $\min_{U, V} \sum_{(i,j) \in \text{observed}} (R_{ij} - U_i^T V_j)^2 + \lambda(\|U\|_F^2 + \|V\|_F^2)$

Algorithm: ALS (Alternating Least Squares), SGD.
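
A minimal NumPy sketch of the SGD variant, updating only on observed entries and with L2 regularization (names and toy data are illustrative):

```python
import numpy as np

def mf_sgd(R, mask, rank=10, lr=0.01, lam=0.1, epochs=50, seed=0):
    """Factor R ~ U V^T by SGD over the observed entries (mask == True)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = 0.1 * rng.standard_normal((n_users, rank))
    V = 0.1 * rng.standard_normal((n_items, rank))
    users, items = np.nonzero(mask)
    for _ in range(epochs):
        for i, j in zip(users, items):
            Ui = U[i].copy()
            err = R[i, j] - Ui @ V[j]              # error on one observed rating
            U[i] += lr * (err * V[j] - lam * Ui)   # gradient steps with L2 penalty
            V[j] += lr * (err * Ui - lam * V[j])
    return U, V

# Toy ratings matrix with ~30% of entries observed.
rng = np.random.default_rng(0)
true_R = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 40))
mask = rng.random((50, 40)) < 0.3
U, V = mf_sgd(true_R * mask, mask)
```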

SVD: Special case of matrix factorization.

3. Content-Based Filtering

Recommend based on item features and user profile.

4. Hybrid Methods

Combine collaborative and content-based filtering.

Model Evaluation

1. Confusion Matrix

For binary classification:

|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |

Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$

Precision: $\frac{TP}{TP + FP}$ (of predicted positives, how many are correct?)

Recall (Sensitivity, TPR): $\frac{TP}{TP + FN}$ (of actual positives, how many did we catch?)

Specificity (TNR): $\frac{TN}{TN + FP}$ (of actual negatives, how many did we correctly identify?)

F1 Score: $\frac{2TP}{2TP + FP + FN}$ (harmonic mean of precision and recall).

F-beta Score: $\frac{(1+\beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$. $\beta > 1$ weights recall more heavily.
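
A small helper computing these metrics directly from the confusion-matrix counts (a sketch; names are illustrative):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Standard confusion-matrix metrics for binary labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

print(binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```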

2. ROC and AUC

ROC Curve: Plot TPR (Recall) vs FPR at various thresholds.

AUC (Area Under Curve): Higher is better. $\text{AUC} = 1$ is perfect, $\text{AUC} = 0.5$ is random.

Interpretation: Probability that a random positive is scored higher than a random negative.

3. Precision-Recall Curve

Useful for imbalanced datasets. Plot Precision vs Recall at various thresholds.

Average Precision (AP): Area under PR curve.

4. Calibration

Do predicted probabilities match empirical frequencies?

Calibration Curve: Plot predicted probability vs observed frequency.

Brier Score: $\frac{1}{n}\sum_i (p_i - y_i)^2$ (lower is better). Measures both calibration and discrimination.
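
A short sketch computing the Brier score and the points of a calibration curve with scikit-learn's `calibration_curve` (the labels and probabilities here are synthetic):

```python
import numpy as np
from sklearn.calibration import calibration_curve

y_true = np.random.binomial(1, 0.3, size=1000)             # toy binary labels
p_hat = np.clip(0.3 + 0.1 * np.random.randn(1000), 0, 1)   # toy predicted probabilities

brier = np.mean((p_hat - y_true) ** 2)                     # Brier score (lower is better)
frac_pos, mean_pred = calibration_curve(y_true, p_hat, n_bins=10)  # calibration-curve points
```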