Machine Learning Algorithms
K-Means, KNN, SVM, PCA, t-SNE, and Johnson-Lindenstrauss
Unsupervised Learning
1. K-Means Clustering
Partition $n$ observations into $k$ clusters ($S_1, \dots, S_k$) to minimize the within-cluster sum of squares (variance).
Objective: $\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \|x - \mu_i\|^2$, where $\mu_i$ is the mean of the points in $S_i$.
Algorithm (Lloyd's):
- Initialization: Randomly select $k$ centroids (or use K-Means++).
- Assignment: Assign each observation to the cluster with the nearest mean.
- Update: Recalculate means (centroids) for observations assigned to each cluster.
- Repeat the assignment and update steps until convergence (see the NumPy sketch at the end of this section).
K-Means++ Initialization: Choose the first centroid randomly, then choose each subsequent centroid with probability proportional to the squared distance from the nearest existing centroid. Guarantees an $O(\log k)$-approximation in expectation.
Complexity: $O(nkdi)$ where $n$ = points, $k$ = clusters, $d$ = dimensions, $i$ = iterations.
Limitations:
- Assumes spherical clusters of similar size.
- Sensitive to initialization and outliers.
- Must specify $k$ in advance.
Choosing $k$: Elbow method (plot inertia vs. $k$), silhouette analysis.
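A minimal NumPy sketch of Lloyd's algorithm as described above (random initialization rather than K-Means++; the function name and defaults are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal Lloyd's algorithm: random init, assign, update."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment: index of the nearest centroid for each point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving -> converged
        centroids = new_centroids
    return labels, centroids
```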
2. Hierarchical Clustering
Build a tree of clusters.
Agglomerative (Bottom-Up):
- Start with each point as a cluster.
- Merge closest clusters.
- Repeat until one cluster remains.
Linkage Criteria:
- Single: $d(A, B) = \min_{a \in A,\, b \in B} d(a, b)$ (can form long chains).
- Complete: $d(A, B) = \max_{a \in A,\, b \in B} d(a, b)$ (compact clusters).
- Average: Average of all pairwise distances.
- Ward: Minimizes within-cluster variance (similar to K-Means).
Dendrogram: Tree visualization. Cut at desired height to get clusters.
Complexity: $O(n^3)$ in general, or $O(n^2 \log n)$ with efficient data structures.
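As a usage sketch, SciPy's scipy.cluster.hierarchy builds the merge tree and cuts it into flat clusters; the random data and the choice of 3 clusters here are arbitrary:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(50, 2))

# Build the merge tree with Ward linkage (minimizes within-cluster variance)
Z = linkage(X, method="ward")

# "Cut" the dendrogram to obtain a flat clustering with 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
```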
3. DBSCAN (Density-Based Spatial Clustering)
Clusters based on density. Can find arbitrary-shaped clusters and identify outliers.
Parameters:
- $\varepsilon$ (eps): Neighborhood radius.
- MinPts: Minimum points to form dense region.
Point Types:
- Core: Has at least MinPts points within $\varepsilon$.
- Border: Within $\varepsilon$ of a core point but not a core point itself.
- Noise: Neither core nor border.
Algorithm:
- For each unvisited point, check if it's a core point.
- If core, start a new cluster and expand by adding all reachable points.
- Points not assigned are noise.
Advantages: No need to specify the number of clusters, finds arbitrary shapes, handles noise. Disadvantages: Sensitive to $\varepsilon$ and MinPts, struggles with varying densities.
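A usage sketch with scikit-learn's DBSCAN, which exposes the two parameters as eps and min_samples; the parameter values here are arbitrary, not recommendations:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(200, 2))

# eps = neighborhood radius, min_samples = MinPts
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

labels = db.labels_          # cluster index per point; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```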
4. Gaussian Mixture Models (GMM)
Probabilistic model assuming data comes from mixture of Gaussians.
Model: $p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$
where $\sum_k \pi_k = 1$, $\pi_k \ge 0$ (mixing coefficients).
EM Algorithm for GMM:
- E-Step: Compute responsibilities (posterior probabilities).
- M-Step: Update parameters using weighted MLE.
Advantages: Soft assignments, probabilistic, can model elliptical clusters.
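A minimal EM sketch for a one-dimensional mixture, to make the E- and M-steps concrete (the function name and initialization are illustrative; real implementations add log-space computation and convergence checks):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, n_iters=100, seed=0):
    """Minimal EM for a 1-D Gaussian mixture (weights, means, variances)."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                   # mixing coefficients
    mu = rng.choice(x, size=K, replace=False)  # initialize means from the data
    var = np.full(K, x.var())                  # initialize variances
    for _ in range(n_iters):
        # E-step: responsibilities gamma[i, k] = P(component k | x_i)
        dens = np.stack([pi[k] * norm.pdf(x, mu[k], np.sqrt(var[k]))
                         for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates
        Nk = gamma.sum(axis=0)
        pi = Nk / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var
```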
5. Principal Component Analysis (PCA)
Orthogonal linear transformation that projects data to a new coordinate system where the greatest variance comes to lie on the first coordinate (PC1), etc.
Objective: Find directions of maximum variance.
Algorithm (Covariance Method):
- Center data: $X_c = X - \bar{X}$ (subtract each feature's mean).
- Compute Covariance Matrix: $C = \frac{1}{n-1} X_c^\top X_c$.
- Eigen Decomposition: $C = V \Lambda V^\top$.
- Sort eigenvectors by eigenvalues (descending).
- Project onto the top $k$ eigenvectors ($V_k$): $Z = X_c V_k$.
SVD Approach: $X_c = U \Sigma V^\top$. Principal components are the columns of $V$. PC scores are $U\Sigma = X_c V$.
Variance Explained: $\frac{\lambda_i}{\sum_j \lambda_j}$ for the $i$-th component.
Whitening: Transform components to unit variance: $Z_{\text{white}} = X_c V \Lambda^{-1/2}$, where $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_d)$.
Kernel PCA: Apply PCA in feature space via kernel trick. Captures nonlinear relationships.
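A compact NumPy sketch of the SVD approach to (linear) PCA described above; names and return values are illustrative:

```python
import numpy as np

def pca(X, k):
    """PCA via SVD of the centered data matrix."""
    X_c = X - X.mean(axis=0)               # center each feature
    U, S, Vt = np.linalg.svd(X_c, full_matrices=False)
    components = Vt[:k]                     # top-k principal directions
    scores = X_c @ components.T             # projection (equivalently U[:, :k] * S[:k])
    explained_var_ratio = (S**2) / np.sum(S**2)
    return scores, components, explained_var_ratio[:k]
```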
6. Independent Component Analysis (ICA)
Finds statistically independent components. Unlike PCA, components are maximally independent (not just uncorrelated).
Cocktail Party Problem: Separate mixed audio sources.
Objective: Maximize non-Gaussianity (e.g., kurtosis, negentropy).
Algorithms: FastICA, Infomax.
7. t-SNE (t-Distributed Stochastic Neighbor Embedding)
Non-linear dimensionality reduction for visualization. Minimizes the KL divergence between joint probabilities in the high-dimensional space ($p_{ij}$) and the low-dimensional space ($q_{ij}$).
High-dim similarity ($p_{j|i}$): Gaussian kernel, $p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$.
Symmetrize: $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$.
Low-dim similarity ($q_{ij}$): Student t-distribution with 1 degree of freedom, $q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$.
Objective: $C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$
Optimized via Gradient Descent with momentum.
Perplexity: Balances local vs global structure. Typical: 5-50.
Properties:
- Preserves local structure well.
- Computationally expensive ($O(n^2)$; the Barnes-Hut approximation reduces this to $O(n \log n)$).
- Non-convex, sensitive to hyperparameters.
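A typical usage sketch with scikit-learn's TSNE (the dataset and hyperparameter values are only an example):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# Perplexity balances local vs. global structure (typical range 5-50)
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
# emb has shape (n_samples, 2) and is intended for visualization only
```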
8. UMAP (Uniform Manifold Approximation and Projection)
Modern alternative to t-SNE. Faster, preserves global structure better.
Based on: Riemannian geometry and algebraic topology.
Advantages: Faster, scales better, preserves both local and global structure.
9. Autoencoders
Neural network that learns compressed representation.
Architecture: Encoder $z = f(x)$, Decoder $\hat{x} = g(z)$. Minimize reconstruction error: $\mathcal{L} = \|x - g(f(x))\|^2$
Variants:
- Variational Autoencoder (VAE): Learns probabilistic latent space. Regularized with KL divergence.
- Denoising Autoencoder: Trained to reconstruct from corrupted input.
- Sparse Autoencoder: Adds sparsity penalty on hidden activations.
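A minimal sketch of a plain autoencoder, assuming PyTorch; the layer sizes and the random stand-in batch are illustrative:

```python
import torch
from torch import nn

class Autoencoder(nn.Module):
    """Minimal fully connected autoencoder: 784 -> 32 -> 784."""
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(),
                                     nn.Linear(128, d_latent))
        self.decoder = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(),
                                     nn.Linear(128, d_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
loss_fn = nn.MSELoss()                 # reconstruction error ||x - g(f(x))||^2
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)                # stand-in batch of flattened images
loss = loss_fn(model(x), x)
opt.zero_grad()
loss.backward()
opt.step()
```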
10. Dimensionality Reduction Theory
Johnson-Lindenstrauss Lemma: A set of $n$ points in high-dimensional space can be embedded into $\mathbb{R}^k$ (where $k = O(\epsilon^{-2} \log n)$) such that distances between points are nearly preserved.
Bound: $(1 - \epsilon)\,\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \epsilon)\,\|u - v\|^2$ for all pairs $u, v$.
Distances preserved within a factor of $(1 \pm \epsilon)$ with high probability.
Method: Random Projection using a matrix with i.i.d. Gaussian entries $\mathcal{N}(0, 1/k)$ or Rademacher entries ($\pm 1/\sqrt{k}$).
Applications: Fast nearest neighbor search, compressed sensing.
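A sketch of Gaussian random projection; the target dimension uses one common statement of the JL bound, and the function name is illustrative:

```python
import numpy as np

def random_projection(X, eps=0.1, seed=0):
    """Project X (n x d) down to k = O(log n / eps^2) dimensions."""
    n, d = X.shape
    # Classic JL bound: k >= 4 ln(n) / (eps^2/2 - eps^3/3)
    k = int(np.ceil(4 * np.log(n) / (eps**2 / 2 - eps**3 / 3)))
    rng = np.random.default_rng(seed)
    R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))  # entries ~ N(0, 1/k)
    return X @ R
```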
Supervised Learning
1. K-Nearest Neighbors (KNN)
Instance-based learning. Classify a new point based on the majority vote of its $k$ nearest neighbors.
Algorithm:
- Compute distances from the query point $x$ to all training points.
- Select the $k$ nearest neighbors.
- Classification: Majority vote. Regression: Average of neighbors' values.
Distance Metrics:
- Euclidean: $\sqrt{\sum_i (x_i - y_i)^2}$
- Manhattan: $\sum_i |x_i - y_i|$
- Minkowski: $\left(\sum_i |x_i - y_i|^p\right)^{1/p}$
- Cosine: $1 - \frac{x \cdot y}{\|x\|\,\|y\|}$
- Mahalanobis: $\sqrt{(x - y)^\top S^{-1} (x - y)}$ (accounts for covariance)
Choosing $k$:
- Small $k$: Low bias, high variance (sensitive to noise).
- Large $k$: High bias, low variance (smooth boundaries).
- Use cross-validation.
Weighted KNN: Weight by inverse distance.
Complexity: $O(nd)$ per query (naive). Can use a KD-Tree or Ball Tree for $O(\log n)$ queries in low dimensions.
Curse of Dimensionality: Performance degrades in high dimensions as distances become similar.
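A minimal NumPy sketch of KNN classification with Euclidean distance for a single query point (unweighted majority vote; names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Classify one query point by majority vote among its k nearest neighbors."""
    y_train = np.asarray(y_train)
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]
```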
2. Decision Trees
Hierarchical model that splits data based on feature values.
CART (Classification and Regression Trees):
Splitting Criterion (Classification):
- Gini Impurity: $G = 1 - \sum_k p_k^2$, where $p_k$ is the fraction of class $k$ in the node.
- Entropy: $H = -\sum_k p_k \log_2 p_k$.
- Misclassification Rate: $1 - \max_k p_k$.
Splitting Criterion (Regression):
- Variance Reduction: Choose the split that minimizes the weighted variance of the child nodes, $\frac{n_L}{n}\mathrm{Var}(y_L) + \frac{n_R}{n}\mathrm{Var}(y_R)$.
Algorithm:
- Find best split (feature and threshold) that maximizes information gain.
- Recursively split child nodes.
- Stop when: max depth reached, min samples per leaf, or no improvement.
Pruning: Remove branches to prevent overfitting.
- Pre-pruning: Stop early based on criteria.
- Post-pruning: Grow full tree, then remove branches (cost-complexity pruning).
Advantages: Interpretable, handles non-linear relationships, no scaling needed. Disadvantages: High variance (unstable), can overfit, axis-aligned splits only.
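A sketch of the core CART step: exhaustive search for the (feature, threshold) split that minimizes weighted Gini impurity. Function names are illustrative, and the recursion and stopping criteria are omitted:

```python
import numpy as np

def gini(y):
    """Gini impurity 1 - sum_k p_k^2 of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

def best_split(X, y):
    """Search all (feature, threshold) pairs for the lowest weighted child impurity."""
    n, d = X.shape
    best = (None, None, np.inf)
    for j in range(d):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue  # skip degenerate splits
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / n
            if score < best[2]:
                best = (j, t, score)
    return best  # (feature index, threshold, weighted impurity)
```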
3. Random Forest
Ensemble of decision trees. Reduces variance via bagging and feature randomness.
Algorithm:
- For $b = 1$ to $B$ (number of trees):
- Sample $n$ training examples with replacement (bootstrap).
- At each split, consider a random subset of $m$ features (commonly $m \approx \sqrt{d}$ for classification).
- Grow tree to maximum depth (no pruning).
- Prediction: Average (regression) or majority vote (classification).
Out-of-Bag (OOB) Error: Evaluate each tree on samples not in its bootstrap sample. Provides validation without separate test set.
Feature Importance: Measure decrease in impurity when splitting on feature, averaged across all trees.
Advantages: Low variance, parallel, handles high-dimensional data, robust to outliers. Disadvantages: Less interpretable, can be slow for large forests.
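A usage sketch with scikit-learn's RandomForestClassifier, showing the OOB estimate and impurity-based importances (the dataset and settings are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=300,      # number of bootstrapped trees
    max_features="sqrt",   # random feature subset at each split
    oob_score=True,        # evaluate on out-of-bag samples
    random_state=0,
).fit(X, y)

print(rf.oob_score_)             # OOB accuracy estimate
print(rf.feature_importances_)   # impurity-based feature importances
```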
4. Gradient Boosting
Ensemble method that builds trees sequentially, each correcting errors of previous ones.
Algorithm (for regression):
- Initialize model with a constant: $F_0(x) = \arg\min_\gamma \sum_i L(y_i, \gamma)$ (the mean of $y$ for squared error).
- For $m = 1$ to $M$:
- Compute pseudo-residuals: $r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}$.
- Fit a regression tree to the $r_{im}$, giving terminal regions $R_{jm}$.
- For each region, compute: $\gamma_{jm} = \arg\min_\gamma \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)$.
- Update: $F_m(x) = F_{m-1}(x) + \nu \sum_j \gamma_{jm}\, \mathbf{1}(x \in R_{jm})$.
where $\nu$ is the learning rate (shrinkage).
Variants:
- XGBoost: Adds regularization, parallel processing, handles missing values.
- LightGBM: Histogram-based, leaf-wise growth. Very fast.
- CatBoost: Handles categorical features natively, reduces overfitting.
Advantages: State-of-the-art performance on tabular data, handles mixed data types. Disadvantages: Prone to overfitting if not tuned, sequential (slow to train).
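A minimal sketch of gradient boosting for squared-error regression, where the pseudo-residuals reduce to the plain residuals y - F(x). It uses scikit-learn trees as base learners and omits the per-region line search; names and defaults are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_trees=100, lr=0.1, max_depth=3):
    """Sequentially fit trees to the residuals of the current model."""
    F = np.full(len(y), y.mean())     # F_0: constant model
    trees = []
    for _ in range(n_trees):
        residuals = y - F             # pseudo-residuals for squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        F += lr * tree.predict(X)     # shrinkage (learning rate)
        trees.append(tree)
    return trees, y.mean()

def boosted_predict(trees, f0, X, lr=0.1):
    return f0 + lr * sum(t.predict(X) for t in trees)
```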
5. Support Vector Machines (SVM)
Finds hyperplane that maximizes the margin between classes.
Primal Problem (Hard Margin): $\min_{w, b} \tfrac{1}{2}\|w\|^2$ subject to $y_i(w^\top x_i + b) \ge 1$ for all $i$.
Margin: $\frac{2}{\|w\|}$. Maximizing the margin is equivalent to minimizing $\|w\|^2$.
Support Vectors: Training points on the margin. Only these affect the solution.
Soft Margin (Hinge Loss): Allow some misclassifications with slack variables $\xi_i \ge 0$: $\min_{w, b, \xi} \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i(w^\top x_i + b) \ge 1 - \xi_i$.
$C$: Regularization parameter. Large $C$ = less regularization (approaches hard margin). Small $C$ = more regularization (softer margin).
Dual Formulation (using Lagrange multipliers): $\max_\alpha \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j$ subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$.
Decision Function: $f(x) = \operatorname{sign}\left(\sum_i \alpha_i y_i K(x_i, x) + b\right)$
Kernel Trick: Map $x \mapsto \phi(x)$. Compute dot products in the high-dimensional feature space using $K(x, x') = \phi(x)^\top \phi(x')$ without explicitly computing $\phi(x)$.
Common Kernels:
- Linear: $K(x, x') = x^\top x'$
- Polynomial: $K(x, x') = (\gamma\, x^\top x' + r)^d$
- RBF (Gaussian): $K(x, x') = \exp(-\gamma \|x - x'\|^2)$
- Sigmoid: $K(x, x') = \tanh(\gamma\, x^\top x' + r)$
RBF parameter $\gamma$: Controls the influence radius. Small $\gamma$ = wide influence (smooth boundary). Large $\gamma$ = narrow influence (complex boundary).
Multi-Class SVM:
- One-vs-Rest (OvR): Train $K$ binary classifiers, one per class.
- One-vs-One (OvO): Train $K(K-1)/2$ classifiers, one for each pair of classes.
Advantages: Effective in high dimensions, memory efficient (only uses support vectors), versatile (kernels). Disadvantages: Slow for large datasets ($O(n^2)$ to $O(n^3)$ training), sensitive to feature scaling and to the choice of $C$ and kernel parameters.
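A usage sketch with scikit-learn's SVC (RBF kernel) wrapped in a scaling pipeline, since SVMs are sensitive to feature scaling; the dataset and hyperparameters are arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scaling matters for SVMs; C and gamma are the key RBF hyperparameters
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```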
6. Naive Bayes
See Statistics section. Probabilistic classifier based on Bayes' Theorem with independence assumptions.
7. Ensemble Methods
Bagging (Bootstrap Aggregating): Train models on bootstrap samples, average predictions. Reduces variance.
Boosting: Train models sequentially, each correcting previous errors. Reduces bias. Examples: AdaBoost, Gradient Boosting.
Stacking: Train meta-model on predictions of base models.
Voting:
- Hard Voting: Majority vote.
- Soft Voting: Average predicted probabilities.
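A usage sketch of soft voting with scikit-learn's VotingClassifier; the base models chosen here are arbitrary:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Soft voting averages predicted class probabilities across the base models
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
)
# ensemble.fit(X_train, y_train); ensemble.predict(X_test)
```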
Recommendation Systems
1. Collaborative Filtering
Recommend based on user-item interactions.
User-Based: Find similar users, recommend items they liked.
Item-Based: Find similar items, recommend items similar to what user liked.
Similarity Measures: Cosine, Pearson correlation.
2. Matrix Factorization
Decompose the user-item rating matrix $R \approx U V^\top$, where $U \in \mathbb{R}^{m \times k}$ (users) and $V \in \mathbb{R}^{n \times k}$ (items) are low-rank factor matrices.
Objective: $\min_{U, V} \sum_{(i, j)\ \text{observed}} (r_{ij} - u_i^\top v_j)^2 + \lambda \left(\|U\|_F^2 + \|V\|_F^2\right)$
Algorithm: ALS (Alternating Least Squares), SGD.
SVD: Special case of matrix factorization.
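A minimal SGD sketch of matrix factorization on observed (user, item, rating) triples; the function name, hyperparameters, and data format are illustrative:

```python
import numpy as np

def matrix_factorization(ratings, k=10, lr=0.01, reg=0.1, n_epochs=20, seed=0):
    """SGD on observed (user, item, rating) triples: R ~ U V^T."""
    rng = np.random.default_rng(seed)
    n_users = max(u for u, _, _ in ratings) + 1
    n_items = max(i for _, i, _ in ratings) + 1
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - U[u] @ V[i]
            U[u] += lr * (err * V[i] - reg * U[u])  # gradient step with L2 penalty
            V[i] += lr * (err * U[u] - reg * V[i])
    return U, V
```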
3. Content-Based Filtering
Recommend based on item features and user profile.
4. Hybrid Methods
Combine collaborative and content-based filtering.
Model Evaluation
1. Confusion Matrix
For binary classification:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$
Precision: $\frac{TP}{TP + FP}$ (of predicted positives, how many are correct?)
Recall (Sensitivity, TPR): $\frac{TP}{TP + FN}$ (of actual positives, how many did we catch?)
Specificity (TNR): $\frac{TN}{TN + FP}$ (of actual negatives, how many did we correctly identify?)
F1 Score: $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ (harmonic mean of precision and recall).
F-beta Score: $F_\beta = (1 + \beta^2)\,\frac{\text{Precision} \cdot \text{Recall}}{\beta^2\,\text{Precision} + \text{Recall}}$. $\beta > 1$ weighs recall higher.
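A small sketch computing precision, recall, and F1 directly from the confusion-matrix counts (zero-division edge cases are ignored for brevity):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 from confusion-matrix counts (labels 0/1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```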
2. ROC and AUC
ROC Curve: Plot TPR (Recall) vs FPR at various thresholds.
AUC (Area Under Curve): Higher is better. $1.0$ is perfect, $0.5$ is random.
Interpretation: Probability that a random positive is scored higher than a random negative.
3. Precision-Recall Curve
Useful for imbalanced datasets. Plot Precision vs Recall at various thresholds.
Average Precision (AP): Area under PR curve.
4. Calibration
Do predicted probabilities match empirical frequencies?
Calibration Curve: Plot predicted probability vs observed frequency.
Brier Score: $\frac{1}{N}\sum_{i=1}^{N} (\hat{p}_i - y_i)^2$ (lower is better). Measures calibration + discrimination.
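A usage sketch with scikit-learn's calibration_curve and brier_score_loss; the toy labels and probabilities are made up for illustration:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 1, 1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.1, 0.8, 0.7, 0.3, 0.9, 0.6, 0.2, 0.4, 0.65, 0.15])

# Bin predictions and compare mean predicted probability to observed frequency
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=5)

print(brier_score_loss(y_true, y_prob))  # mean squared error of the probabilities
```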