Optimization
Mathematical optimization techniques: Gradient Descent, Lagrange Multipliers, ADMM
First-Order Methods
1. Gradient Descent
Iterative optimization algorithm for finding a local minimum of a differentiable function.
Objective: $\min_w f(w)$
Update Rule: $w_{t+1} = w_t - \eta \nabla f(w_t)$
where $\eta$ is the learning rate.
Convergence (for convex $f$ with $L$-Lipschitz gradient): with step size $\eta = 1/L$, $f(w_t) - f(w^*) \le \frac{L \|w_0 - w^*\|^2}{2t}$, i.e. an $O(1/t)$ rate.
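A minimal sketch of this update, assuming the gradient of the objective is supplied as a callable grad_fn (a hypothetical name) and the learning rate is fixed:
import numpy as np

def gradient_descent(grad_fn, w_init, lr=0.1, steps=100):
    """Repeatedly step against the gradient: w <- w - lr * grad f(w)."""
    w = w_init
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w

# Example: f(w) = ||w||^2 has gradient 2w, so the iterates shrink toward zero.
w_star = gradient_descent(lambda w: 2 * w, w_init=np.array([3.0, -4.0]))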
2. Stochastic Gradient Descent (SGD)
Update using single sample (or mini-batch). High variance, faster per step.
Update: $w_{t+1} = w_t - \eta \nabla f_i(w_t)$
where the sample index $i$ is randomly selected.
Mini-Batch SGD: $w_{t+1} = w_t - \frac{\eta}{|B|} \sum_{i \in B} \nabla f_i(w_t)$
where $B$ is a random batch of size $|B|$.
Advantages: Faster, can escape local minima, handles large datasets. Disadvantages: Noisy updates, requires learning rate tuning.
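A minimal mini-batch SGD sketch, assuming grad_fn(w, batch) (a hypothetical signature) returns the average gradient over a batch of samples:
import numpy as np

def minibatch_sgd(grad_fn, data, w_init, lr=0.01, batch_size=32, epochs=10):
    """Shuffle once per epoch, then step on the average gradient of each mini-batch."""
    w = w_init
    n = len(data)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            w = w - lr * grad_fn(w, batch)
    return w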
3. Momentum
Accumulate past gradients to dampen oscillations and accelerate convergence.
Update: $v_{t+1} = \beta v_t + \nabla f(w_t)$, $\quad w_{t+1} = w_t - \eta v_{t+1}$
Typical $\beta = 0.9$. The velocity $v$ accumulates gradient history.
Nesterov Accelerated Gradient (NAG): Look ahead before computing gradient: $v_{t+1} = \beta v_t + \nabla f(w_t - \eta \beta v_t)$, $\quad w_{t+1} = w_t - \eta v_{t+1}$
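A minimal sketch of the classical (non-Nesterov) momentum update, under the same grad_fn convention as the sketches above:
import numpy as np

def momentum_sgd(grad_fn, w_init, lr=0.01, beta=0.9, steps=1000):
    """Accumulate a velocity from past gradients and step along it."""
    w = w_init
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad_fn(w)   # velocity accumulates gradient history
        w = w - lr * v
    return w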
4. AdaGrad
Adaptive learning rate for each parameter based on past gradients.
Update: $G_t = G_{t-1} + g_t^2$ (element-wise), $\quad w_{t+1} = w_t - \frac{\eta}{\sqrt{G_t} + \epsilon} g_t$
Pros: Good for sparse data. Cons: The accumulated sum $G_t$ only grows, so the effective learning rate decays too aggressively (vanishing steps).
5. RMSProp
Addresses AdaGrad's learning rate decay issue using exponential moving average.
Update: $E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho) g_t^2$, $\quad w_{t+1} = w_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} g_t$
Typical $\rho = 0.9$, $\eta = 0.001$.
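A minimal RMSProp sketch (same grad_fn convention as above); the only change from AdaGrad is the exponential moving average of squared gradients:
import numpy as np

def rmsprop(grad_fn, w_init, lr=0.001, rho=0.9, eps=1e-8, steps=1000):
    """Scale each coordinate's step by a running RMS of its gradients."""
    w = w_init
    sq_avg = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        sq_avg = rho * sq_avg + (1 - rho) * g ** 2   # EMA of squared gradients
        w = w - lr * g / (np.sqrt(sq_avg) + eps)
    return w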
6. Adam (Adaptive Moment Estimation)
Combines momentum and RMSProp. Most popular optimizer in deep learning.
Update: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$, $\quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$, $\quad w_{t+1} = w_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
Bias Correction: $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$, $\quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$
Typical: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\eta = 0.001$, $\epsilon = 10^{-8}$.
Variants: AdaMax, Nadam, AMSGrad.
import numpy as np

def adam(grad_fn, w_init, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    w = w_init
    m = np.zeros_like(w)   # first moment estimate
    v = np.zeros_like(w)   # second moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * (g ** 2)
        m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

Second-Order Methods
1. Newton's Method
Uses second-order (Hessian) information for faster convergence.
Update: $w_{t+1} = w_t - H^{-1} \nabla f(w_t)$
where $H = \nabla^2 f(w_t)$ is the Hessian matrix.
Convergence: Quadratic near the minimum (very fast). Drawback: Computing, storing, and inverting the Hessian costs $O(n^2)$ memory and $O(n^3)$ time per step for $n$ parameters.
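A minimal sketch, assuming callables for the gradient and Hessian; it solves the linear system $H d = \nabla f$ instead of forming $H^{-1}$ explicitly:
import numpy as np

def newton_method(grad_fn, hess_fn, w_init, steps=20):
    """Newton step: solve H d = grad f, then move by -d."""
    w = w_init
    for _ in range(steps):
        g = grad_fn(w)
        H = hess_fn(w)
        w = w - np.linalg.solve(H, g)
    return w

# Example: a quadratic f(w) = 0.5 w'Aw - b'w is minimized in a single Newton step.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w_star = newton_method(lambda w: A @ w - b, lambda w: A, np.zeros(2))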
2. Quasi-Newton Methods (BFGS, L-BFGS)
Approximate Hessian inverse using gradient information.
BFGS Update (inverse-Hessian approximation $H_k$): $H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T$, $\quad \rho_k = \frac{1}{y_k^T s_k}$
where $s_k = w_{k+1} - w_k$ and $y_k = \nabla f(w_{k+1}) - \nabla f(w_k)$.
L-BFGS: Limited-memory version. Stores only a few vectors instead of full matrix. Good for high dimensions.
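In practice L-BFGS is usually called from a library rather than hand-rolled; a minimal sketch using SciPy's implementation on the Rosenbrock function:
import numpy as np
from scipy.optimize import minimize

def f(w):
    # Rosenbrock function: minimum at (1, 1)
    return 100 * (w[1] - w[0] ** 2) ** 2 + (1 - w[0]) ** 2

def grad_f(w):
    return np.array([
        -400 * w[0] * (w[1] - w[0] ** 2) - 2 * (1 - w[0]),
        200 * (w[1] - w[0] ** 2),
    ])

result = minimize(f, x0=np.array([-1.0, 2.0]), jac=grad_f, method="L-BFGS-B")
print(result.x)   # approximately [1, 1]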
3. Conjugate Gradient
Iterative method that uses conjugate directions instead of steepest descent.
Update: $w_{k+1} = w_k + \alpha_k d_k$
where $d_k = -\nabla f(w_k) + \beta_k d_{k-1}$ is the search direction and $\alpha_k$ is chosen by line search.
Variants: Fletcher-Reeves, Polak-Ribière.
Constrained Optimization
1. Lagrange Multipliers
Strategy for finding local maxima/minima of a function subject to equality constraints.
Problem: Minimize $f(x)$ subject to $g(x) = 0$.
Lagrangian: $\mathcal{L}(x, \lambda) = f(x) + \lambda g(x)$
KKT Conditions (Karush-Kuhn-Tucker): Solve $\nabla_{x, \lambda} \mathcal{L}(x, \lambda) = 0$:
- $\nabla f(x) + \lambda \nabla g(x) = 0$ (Stationarity)
- $g(x) = 0$ (Primal Feasibility)
Example: Maximize $f(x, y) = xy$ subject to $x + y = 1$. Stationarity gives $x = y$, and the constraint then yields the solution $x = y = 1/2$.
2. KKT Conditions (Inequality Constraints)
For inequality constraints $g_i(x) \le 0$ and equality constraints $h_j(x) = 0$.
Lagrangian: $\mathcal{L}(x, \mu, \lambda) = f(x) + \sum_i \mu_i g_i(x) + \sum_j \lambda_j h_j(x)$
Conditions:
- Stationarity: $\nabla f(x^*) + \sum_i \mu_i \nabla g_i(x^*) + \sum_j \lambda_j \nabla h_j(x^*) = 0$
- Primal Feasibility: $g_i(x^*) \le 0$, $h_j(x^*) = 0$
- Dual Feasibility: $\mu_i \ge 0$
- Complementary Slackness: $\mu_i \, g_i(x^*) = 0$
3. Penalty Methods
Convert constrained problem to unconstrained by adding penalty term.
Quadratic Penalty: $\min_x \; f(x) + \frac{\mu}{2} \sum_j h_j(x)^2$ for equality constraints $h_j(x) = 0$ (use $\max(0, g_i(x))^2$ for inequalities).
Increase $\mu$ over iterations; as $\mu \to \infty$ the minimizers approach the constrained solution, but very large $\mu$ makes the subproblems ill-conditioned.
4. Augmented Lagrangian Method
Combines Lagrange multipliers and penalty methods: $\mathcal{L}_\mu(x, \lambda) = f(x) + \sum_j \lambda_j h_j(x) + \frac{\mu}{2} \sum_j h_j(x)^2$, alternating minimization over $x$ with the multiplier update $\lambda_j \leftarrow \lambda_j + \mu \, h_j(x)$.
Advantage: A finite $\mu$ suffices for convergence (unlike the pure penalty method, which needs $\mu \to \infty$).
Convex Optimization
1. Convex Functions
$f$ is convex if for all $x, y$ and $\theta \in [0, 1]$: $f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y)$
Properties:
- Any local minimum is a global minimum.
- First-order condition: $f(y) \ge f(x) + \nabla f(x)^T (y - x)$ for all $x, y$.
- Second-order condition: Hessian $\nabla^2 f(x) \succeq 0$ (positive semidefinite).
2. Subgradients
Generalization of gradients for non-differentiable convex functions.
$g$ is a subgradient of $f$ at $x$ if: $f(y) \ge f(x) + g^T (y - x)$ for all $y$.
Example: For $f(x) = |x|$, the subgradient at $x = 0$ is any $g \in [-1, 1]$.
Subgradient Method: $x_{t+1} = x_t - \eta_t g_t$, where $g_t \in \partial f(x_t)$ (any subgradient).
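A minimal sketch of the subgradient method with a diminishing step size $\eta_t = 1/\sqrt{t}$; since the objective need not decrease monotonically, the best iterate seen so far is tracked:
import numpy as np

def subgradient_method(subgrad_fn, f, x_init, steps=500):
    """Step against any subgradient; keep the best iterate found."""
    x, best_x, best_f = x_init, x_init, f(x_init)
    for t in range(1, steps + 1):
        x = x - (1.0 / np.sqrt(t)) * subgrad_fn(x)   # eta_t = 1/sqrt(t)
        if f(x) < best_f:
            best_x, best_f = x, f(x)
    return best_x

# Example: f(x) = |x - 3| with subgradient sign(x - 3).
x_star = subgradient_method(lambda x: np.sign(x - 3.0), lambda x: abs(x - 3.0), x_init=0.0)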
3. Proximal Gradient Methods
For composite objectives $F(x) = f(x) + g(x)$, where $f$ is smooth and $g$ is non-smooth but has a simple prox operator.
Proximal Operator: $\operatorname{prox}_{\eta g}(v) = \arg\min_x \left( g(x) + \frac{1}{2\eta} \|x - v\|^2 \right)$
ISTA (Iterative Shrinkage-Thresholding Algorithm): $x_{t+1} = \operatorname{prox}_{\eta g}\!\left( x_t - \eta \nabla f(x_t) \right)$
FISTA: Accelerated version with Nesterov momentum.
Example: Lasso regression has soft-thresholding prox.
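A minimal ISTA sketch for the Lasso, $\min_w \frac{1}{2}\|Xw - y\|^2 + \lambda \|w\|_1$, where the prox of the L1 term is soft-thresholding and the step size is $1/L$ with $L$ the largest eigenvalue of $X^T X$:
import numpy as np

def soft_threshold(v, tau):
    """Prox of tau * ||.||_1: shrink each coordinate toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista_lasso(X, y, lam, steps=500):
    """ISTA: gradient step on the smooth part, then soft-threshold."""
    eta = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L, L = largest eigenvalue of X^T X
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y)             # gradient of 0.5*||Xw - y||^2
        w = soft_threshold(w - eta * grad, eta * lam)
    return w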
4. Coordinate Descent
Optimize one coordinate at a time, cycling through all coordinates.
Update: $x_i^{(t+1)} = \arg\min_{x_i} f\big(x_1^{(t+1)}, \dots, x_{i-1}^{(t+1)}, x_i, x_{i+1}^{(t)}, \dots, x_n^{(t)}\big)$
Use: Lasso, SVM, Matrix Factorization.
Distributed and Parallel Optimization
1. ADMM (Alternating Direction Method of Multipliers)
Algorithm that solves convex optimization problems by breaking them into smaller pieces. Combines benefits of Dual Decomposition and Augmented Lagrangian methods.
Problem: $\min_{x, z} \; f(x) + g(z)$ subject to $Ax + Bz = c$.
Augmented Lagrangian: $\mathcal{L}_\rho(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + \frac{\rho}{2} \|Ax + Bz - c\|^2$
Updates:
- x-update: $x^{k+1} = \arg\min_x \mathcal{L}_\rho(x, z^k, y^k)$
- z-update: $z^{k+1} = \arg\min_z \mathcal{L}_\rho(x^{k+1}, z, y^k)$
- y-update (Dual): $y^{k+1} = y^k + \rho (Ax^{k+1} + Bz^{k+1} - c)$
Applications: Lasso, Robust PCA, Distributed Optimization, Consensus problems.
Convergence: Guaranteed for closed, proper, convex $f$ and $g$ (assuming the Lagrangian has a saddle point).
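A minimal ADMM sketch for the Lasso, splitting $\frac{1}{2}\|Xw - y\|^2 + \lambda\|z\|_1$ with the constraint $w - z = 0$ and using the scaled dual variable $u = y/\rho$; the x-update is a ridge-like linear solve and the z-update is soft-thresholding:
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def admm_lasso(X, y, lam, rho=1.0, steps=200):
    """ADMM for min_w 0.5*||Xw - y||^2 + lam*||z||_1 subject to w - z = 0."""
    d = X.shape[1]
    w = np.zeros(d)
    z = np.zeros(d)
    u = np.zeros(d)
    XtX_rhoI = X.T @ X + rho * np.eye(d)   # reused in every x-update
    Xty = X.T @ y
    for _ in range(steps):
        w = np.linalg.solve(XtX_rhoI, Xty + rho * (z - u))   # x-update
        z = soft_threshold(w + u, lam / rho)                 # z-update (prox of L1)
        u = u + w - z                                        # scaled dual update
    return z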
2. Hogwild! (Asynchronous SGD)
Parallel SGD without locks. Multiple threads update parameters asynchronously.
Surprisingly, this works well when gradients are sparse: most updates touch disjoint parameters, so conflicts are rare.
3. Parameter Server
Distributed architecture: Workers compute gradients, servers aggregate and update parameters.
Non-Convex Optimization
1. Simulated Annealing
Probabilistic technique inspired by annealing in metallurgy.
Accept a worse solution with probability $\exp(-\Delta E / T)$, where $\Delta E$ is the increase in objective value and the temperature $T$ decreases over time.
Allows escaping local minima.
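A minimal simulated annealing sketch with a geometric cooling schedule; the neighbor_fn callable (a hypothetical name) proposes a random perturbation of the current point:
import numpy as np

def simulated_annealing(f, x_init, neighbor_fn, t_init=1.0, cooling=0.995, steps=5000):
    """Accept uphill moves with probability exp(-delta / T); T decays geometrically."""
    rng = np.random.default_rng(0)
    x, fx = x_init, f(x_init)
    best_x, best_f = x, fx
    T = t_init
    for _ in range(steps):
        cand = neighbor_fn(x, rng)
        delta = f(cand) - fx
        if delta < 0 or rng.random() < np.exp(-delta / T):
            x, fx = cand, f(cand)
            if fx < best_f:
                best_x, best_f = x, fx
        T *= cooling
    return best_x

# Example: a 1-D multimodal function, with Gaussian perturbations as the neighbor move.
f = lambda x: x ** 2 + 10 * np.sin(3 * x)
x_star = simulated_annealing(f, x_init=4.0, neighbor_fn=lambda x, rng: x + rng.normal(0, 0.5))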
2. Genetic Algorithms
Evolutionary approach: Population of solutions, select fittest, crossover, mutate, repeat.
3. Particle Swarm Optimization
Population of candidate solutions (particles) move in search space influenced by their own best position and global best.
4. Trust Region Methods
Define a region where model is trusted. Optimize model within region. Expand/shrink region based on agreement.
Line Search
1. Backtracking Line Search
Choose step size $\alpha$ satisfying the Armijo (sufficient decrease) condition: $f(x + \alpha d) \le f(x) + c_1 \alpha \nabla f(x)^T d$
where $c_1 \in (0, 1)$ (typically 0.1-0.3) and $d$ is a descent direction.
Algorithm: Start with $\alpha = 1$, decrease by a factor $\beta \in (0, 1)$ (e.g., 0.5) until the condition is satisfied.
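A minimal backtracking sketch, assuming callables f and grad for the objective and its gradient:
import numpy as np

def backtracking_line_search(f, grad, x, d, alpha0=1.0, c1=0.2, beta=0.5):
    """Shrink alpha until the Armijo sufficient-decrease condition holds."""
    alpha = alpha0
    fx = f(x)
    slope = grad(x) @ d   # directional derivative; negative for a descent direction
    while f(x + alpha * d) > fx + c1 * alpha * slope:
        alpha *= beta
    return alpha

# Example: steepest-descent direction on f(x) = ||x||^2.
f = lambda x: x @ x
grad = lambda x: 2 * x
x = np.array([2.0, -1.0])
alpha = backtracking_line_search(f, grad, x, d=-grad(x))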
2. Wolfe Conditions
Stronger than Armijo. Additionally requires the curvature condition: $\nabla f(x + \alpha d)^T d \ge c_2 \nabla f(x)^T d$
where $c_1 < c_2 < 1$ (a typical choice is $c_2 = 0.9$ for quasi-Newton methods).
Special Topics
1. Frank-Wolfe Algorithm (Conditional Gradient)
For constrained problems with simple linear optimization oracle.
Update: $s_k = \arg\min_{s \in \mathcal{C}} \langle \nabla f(x_k), s \rangle$, $\quad x_{k+1} = (1 - \gamma_k) x_k + \gamma_k s_k$, with e.g. $\gamma_k = \frac{2}{k + 2}$.
Advantage: Maintains feasibility, projection-free.
2. Mirror Descent
Generalization of gradient descent using Bregman divergence.
Update: $x_{t+1} = \arg\min_x \left( \eta \langle \nabla f(x_t), x \rangle + D_\psi(x, x_t) \right)$
where $D_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla \psi(y), x - y \rangle$ is the Bregman divergence induced by $\psi$.
Choice of $\psi$: Euclidean ($\psi(x) = \frac{1}{2}\|x\|^2$) gives standard GD. Negative entropy ($\psi(x) = \sum_i x_i \log x_i$) gives exponentiated gradient.
3. Natural Gradient Descent
Use Fisher Information Matrix as metric instead of Euclidean.
Update: $\theta_{t+1} = \theta_t - \eta F^{-1} \nabla_\theta L(\theta_t)$, where $F$ is the Fisher Information Matrix.
Advantage: Invariant to reparameterization.
4. Variance Reduction (SVRG, SAGA)
Reduce variance of SGD while maintaining per-iteration cost.
SVRG (Stochastic Variance Reduced Gradient): Periodically compute full gradient, use it to correct stochastic gradients.
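A minimal SVRG sketch, assuming grad_i(w, i) (a hypothetical signature) returns the gradient of the i-th sample's loss and w_init is a NumPy array:
import numpy as np

def svrg(grad_i, n, w_init, lr=0.01, epochs=10, inner_steps=None):
    """Outer loop: full gradient at a snapshot. Inner loop: variance-reduced steps."""
    rng = np.random.default_rng(0)
    w = w_init
    m = inner_steps or n
    for _ in range(epochs):
        w_snap = w.copy()
        full_grad = np.mean([grad_i(w_snap, i) for i in range(n)], axis=0)
        for _ in range(m):
            i = rng.integers(n)
            g = grad_i(w, i) - grad_i(w_snap, i) + full_grad   # variance-reduced estimate
            w = w - lr * g
    return w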
5. Hyperparameter Optimization
Grid Search: Exhaustive search over grid of hyperparameters.
Random Search: Sample random combinations; often more efficient than grid search when only a few hyperparameters matter.
Bayesian Optimization: Model objective as Gaussian Process, use acquisition function (EI, UCB) to select next point.
Hyperband: Resource allocation strategy combining random search and early stopping.
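A minimal random search sketch; the search space is a dictionary of samplers and the objective is a hypothetical validation-loss callable (lower is better):
import numpy as np

def random_search(objective, space, n_trials=50, seed=0):
    """Sample each hyperparameter independently and keep the best trial."""
    rng = np.random.default_rng(seed)
    best_params, best_score = None, np.inf
    for _ in range(n_trials):
        params = {name: sampler(rng) for name, sampler in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Example: log-uniform learning rate, categorical batch size, toy objective minimized near lr = 1e-3.
space = {
    "lr": lambda rng: 10 ** rng.uniform(-4, -1),
    "batch_size": lambda rng: int(rng.choice([32, 64, 128])),
}
best, score = random_search(lambda p: (np.log10(p["lr"]) + 3) ** 2, space)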