Optimization

Mathematical optimization techniques: Gradient Descent, Lagrange Multipliers, ADMM

First-Order Methods

1. Gradient Descent

Iterative optimization algorithm for finding a local minimum of a differentiable function.

Objective: $\min_w J(w)$

Update Rule: $w_{t+1} = w_t - \eta \nabla J(w_t)$

where $\eta$ is the learning rate.

Convergence (for convex $J$ with an $L$-Lipschitz gradient): $J(w_t) - J(w^*) \leq \frac{L \|w_0 - w^*\|^2}{2t}$
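
A minimal sketch of the update rule above, assuming the caller supplies a gradient function `grad_fn` and a fixed learning rate (names and defaults here are illustrative):

import numpy as np

def gradient_descent(grad_fn, w_init, lr=0.1, steps=1000, tol=1e-8):
    # Repeatedly step in the direction of steepest descent.
    w = np.asarray(w_init, dtype=float)
    for _ in range(steps):
        g = grad_fn(w)
        w = w - lr * g
        if np.linalg.norm(g) < tol:   # stop once the gradient is (nearly) zero
            break
    return w

# Example: minimize J(w) = ||w||^2, whose gradient is 2w; the minimizer is 0.
w_star = gradient_descent(lambda w: 2 * w, w_init=np.array([3.0, -4.0]))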

2. Stochastic Gradient Descent (SGD)

Update using single sample (or mini-batch). High variance, faster per step.

Update: $w_{t+1} = w_t - \eta \nabla J_i(w_t)$

where $i$ is randomly selected.

Mini-Batch SGD: $w_{t+1} = w_t - \eta \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla J_i(w_t)$

where $\mathcal{B}$ is a random batch of size $B$.

Advantages: Cheap iterations, can escape shallow local minima, scales to large datasets. Disadvantages: Noisy updates, requires learning-rate tuning.
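
A minimal mini-batch SGD sketch, assuming a caller-supplied `grad_batch(w, X_batch, y_batch)` that returns the average gradient over a batch (all names here are illustrative):

import numpy as np

def minibatch_sgd(grad_batch, w_init, X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    w = np.asarray(w_init, dtype=float)
    n = len(X)
    for _ in range(epochs):
        perm = rng.permutation(n)                  # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            w = w - lr * grad_batch(w, X[idx], y[idx])
    return w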

3. Momentum

Accumulate past gradients to dampen oscillations and accelerate convergence.

Update: $v_{t+1} = \gamma v_t + \eta \nabla J(w_t)$, $w_{t+1} = w_t - v_{t+1}$

Typical $\gamma = 0.9$. The velocity $v$ accumulates gradient history.

Nesterov Accelerated Gradient (NAG): Look ahead before computing the gradient: $v_{t+1} = \gamma v_t + \eta \nabla J(w_t - \gamma v_t)$, $w_{t+1} = w_t - v_{t+1}$
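
A sketch covering both variants, assuming a caller-supplied `grad_fn`; the only difference is where the gradient is evaluated:

import numpy as np

def momentum_step(w, v, grad_fn, lr=0.01, gamma=0.9, nesterov=False):
    # Classical momentum evaluates the gradient at w;
    # Nesterov evaluates it at the look-ahead point w - gamma * v.
    lookahead = w - gamma * v if nesterov else w
    v_new = gamma * v + lr * grad_fn(lookahead)
    return w - v_new, v_new

# Usage: carry (w, v) across iterations, starting from v = np.zeros_like(w).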

4. AdaGrad

Adaptive learning rate for each parameter based on past gradients.

Update: $g_t = \nabla J(w_t)$, $G_t = G_{t-1} + g_t \odot g_t$ (element-wise), $w_{t+1} = w_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$

Pros: Good for sparse data. Cons: Learning rate decays too aggressively (vanishing).

5. RMSProp

Addresses AdaGrad's learning rate decay issue using exponential moving average.

Update: $E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta) g_t^2$, $w_{t+1} = w_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$

Typical $\beta = 0.9$.
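
A sketch of the RMSProp update, assuming a caller-supplied `grad_fn`; replacing the exponential moving average with a running sum of squared gradients recovers AdaGrad:

import numpy as np

def rmsprop(grad_fn, w_init, lr=0.001, beta=0.9, eps=1e-8, steps=1000):
    w = np.asarray(w_init, dtype=float)
    avg_sq = np.zeros_like(w)                     # E[g^2], tracked per parameter
    for _ in range(steps):
        g = grad_fn(w)
        avg_sq = beta * avg_sq + (1 - beta) * g ** 2
        w = w - lr * g / np.sqrt(avg_sq + eps)    # per-parameter effective step size
    return w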

6. Adam (Adaptive Moment Estimation)

Combines momentum and RMSProp. Most popular optimizer in deep learning.

Update: $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ (first moment, momentum), $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ (second moment, RMSProp)

Bias Correction: $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

$w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$

Typical: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.

Variants: AdaMax, Nadam, AMSGrad.

import numpy as np

def adam(grad_fn, w_init, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    w = w_init
    m = np.zeros_like(w)  # first moment (momentum term)
    v = np.zeros_like(w)  # second moment (uncentered variance)

    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * (g ** 2)

        # Bias correction counteracts the zero initialization of m and v.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)

        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)

    return w

Second-Order Methods

1. Newton's Method

Uses second-order (Hessian) information for faster convergence.

Update: $w_{t+1} = w_t - H^{-1} \nabla J(w_t)$

where $H = \nabla^2 J(w_t)$ is the Hessian matrix.

Convergence: Quadratic near the minimum (very fast). Drawback: Computing and inverting the Hessian is $O(n^3)$.
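
A sketch assuming the caller supplies both `grad_fn` and `hess_fn`; solving the linear system avoids forming the explicit inverse:

import numpy as np

def newton(grad_fn, hess_fn, w_init, steps=20, tol=1e-10):
    w = np.asarray(w_init, dtype=float)
    for _ in range(steps):
        g = grad_fn(w)
        if np.linalg.norm(g) < tol:
            break
        # Solve H p = g instead of computing H^{-1} explicitly.
        p = np.linalg.solve(hess_fn(w), g)
        w = w - p
    return w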

2. Quasi-Newton Methods (BFGS, L-BFGS)

Approximate the Hessian (or its inverse) using only gradient information.

BFGS Update: $B_{t+1} = B_t + \frac{y_t y_t^T}{y_t^T s_t} - \frac{B_t s_t s_t^T B_t}{s_t^T B_t s_t}$

where $s_t = w_{t+1} - w_t$, $y_t = \nabla J(w_{t+1}) - \nabla J(w_t)$.

L-BFGS: Limited-memory version. Stores only a few vectors instead of full matrix. Good for high dimensions.
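
In practice L-BFGS is usually called through a library rather than implemented by hand; a sketch using SciPy's built-in Rosenbrock test objective and its analytic gradient:

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Minimize the Rosenbrock function with L-BFGS, supplying the gradient via jac.
result = minimize(rosen, x0=np.zeros(5), jac=rosen_der, method="L-BFGS-B")
print(result.x, result.fun)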

3. Conjugate Gradient

Iterative method that uses conjugate directions instead of steepest descent.

Update: $w_{t+1} = w_t + \alpha_t d_t$

where $d_t = -\nabla J(w_t) + \beta_t d_{t-1}$ (search direction).

Variants: Fletcher-Reeves, Polak-Ribière.

Constrained Optimization

1. Lagrange Multipliers

Strategy for finding local maxima/minima of a function subject to equality constraints.

Problem: Minimize $f(x)$ subject to $g(x) = 0$.

Lagrangian: $\mathcal{L}(x, \lambda) = f(x) + \lambda g(x)$

First-Order Conditions (stationary points of the Lagrangian): Solve $\nabla \mathcal{L} = 0$:

  1. $\nabla_x \mathcal{L} = \nabla f(x) + \lambda \nabla g(x) = 0$ (Stationarity)
  2. $g(x) = 0$ (Primal Feasibility)

Example: Maximize $f(x, y) = xy$ subject to $x^2 + y^2 = 1$. $\mathcal{L} = xy + \lambda(x^2 + y^2 - 1)$. Solution: $x = y = \pm \frac{1}{\sqrt{2}}$.
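
The example above can be checked symbolically; a minimal sketch using SymPy (variable names are arbitrary):

import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
L = x * y + lam * (x**2 + y**2 - 1)          # Lagrangian for max xy s.t. x^2 + y^2 = 1

# Stationarity in x and y plus the constraint itself.
sols = sp.solve([sp.diff(L, x), sp.diff(L, y), x**2 + y**2 - 1], [x, y, lam], dict=True)
print(sols)   # includes x = y = ±1/sqrt(2) (maxima) and x = -y (minima)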

2. KKT Conditions (Inequality Constraints)

For inequality constraints $g_i(x) \leq 0$ and equality constraints $h_j(x) = 0$.

Lagrangian: $\mathcal{L}(x, \mu, \lambda) = f(x) + \sum_i \mu_i g_i(x) + \sum_j \lambda_j h_j(x)$

Conditions:

  1. Stationarity: $\nabla_x \mathcal{L} = 0$
  2. Primal Feasibility: $g_i(x) \leq 0$, $h_j(x) = 0$
  3. Dual Feasibility: $\mu_i \geq 0$
  4. Complementary Slackness: $\mu_i g_i(x) = 0$

3. Penalty Methods

Convert constrained problem to unconstrained by adding penalty term.

Quadratic Penalty: $\min_x f(x) + \frac{\rho}{2} \sum_i \max(0, g_i(x))^2$

Increase $\rho$ over iterations.

4. Augmented Lagrangian Method

Combines Lagrange multipliers and penalty methods.

$\mathcal{L}_{\rho}(x, \lambda) = f(x) + \lambda^T g(x) + \frac{\rho}{2} \|g(x)\|^2$

Advantage: A finite $\rho$ suffices for convergence (unlike the pure penalty method).

Convex Optimization

1. Convex Functions

$f$ is convex if for all $x, y$ and $\theta \in [0, 1]$: $f(\theta x + (1-\theta) y) \leq \theta f(x) + (1-\theta) f(y)$

Properties:

  • Any local minimum is a global minimum.
  • First-order condition: $f(y) \geq f(x) + \nabla f(x)^T (y - x)$.
  • Second-order condition: Hessian $\nabla^2 f(x) \succeq 0$ (positive semidefinite).

2. Subgradients

Generalization of gradients for non-differentiable convex functions.

$g$ is a subgradient of $f$ at $x$ if: $f(y) \geq f(x) + g^T(y - x) \quad \forall y$

Example: For $f(x) = |x|$, the subgradient at $x = 0$ is any $g \in [-1, 1]$.

Subgradient Method: $w_{t+1} = w_t - \eta_t g_t$ where $g_t \in \partial f(w_t)$ (any subgradient).
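
A sketch of the subgradient method for the non-differentiable example $f(w) = \|w\|_1$, using $\text{sign}(w)$ as a valid subgradient and a diminishing step size $\eta_t = \eta_0 / \sqrt{t}$:

import numpy as np

def subgradient_method(w_init, steps=500, eta0=1.0):
    # Minimize f(w) = ||w||_1; sign(w) is a subgradient (sign(0) = 0 lies in [-1, 1]).
    w = np.asarray(w_init, dtype=float)
    best = w.copy()
    for t in range(1, steps + 1):
        g = np.sign(w)
        w = w - (eta0 / np.sqrt(t)) * g
        if np.abs(w).sum() < np.abs(best).sum():
            best = w.copy()          # subgradient steps are not monotone; track the best iterate
    return best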

3. Proximal Gradient Methods

For $f(x) = g(x) + h(x)$ where $g$ is smooth and $h$ is non-smooth but has a simple prox operator.

Proximal Operator: $\text{prox}_h(x) = \arg\min_z \left( h(z) + \frac{1}{2}\|z - x\|^2 \right)$

ISTA (Iterative Shrinkage-Thresholding Algorithm): $w_{t+1} = \text{prox}_{\eta h}(w_t - \eta \nabla g(w_t))$

FISTA: Accelerated version with Nesterov momentum.

Example: Lasso regression $h(w) = \lambda \|w\|_1$ has a soft-thresholding prox.
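
A sketch of ISTA for the Lasso, $\min_w \frac{1}{2}\|Xw - y\|^2 + \lambda \|w\|_1$, where the prox of $\eta \lambda \|\cdot\|_1$ is element-wise soft-thresholding (function names are illustrative):

import numpy as np

def soft_threshold(z, tau):
    # prox of tau * ||.||_1: shrink each coordinate toward zero by tau.
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(X, y, lam=0.1, steps=500):
    w = np.zeros(X.shape[1])
    eta = 1.0 / np.linalg.norm(X, 2) ** 2      # 1/L, with L the largest eigenvalue of X^T X
    for _ in range(steps):
        grad = X.T @ (X @ w - y)               # gradient of the smooth part
        w = soft_threshold(w - eta * grad, eta * lam)
    return w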

4. Coordinate Descent

Optimize one coordinate at a time, cycling through all coordinates.

Update: $w_i^{(t+1)} = \arg\min_{w_i} f(w_1^{(t+1)}, \ldots, w_{i-1}^{(t+1)}, w_i, w_{i+1}^{(t)}, \ldots)$

Use: Lasso, SVM, Matrix Factorization.

Distributed and Parallel Optimization

1. ADMM (Alternating Direction Method of Multipliers)

Algorithm that solves convex optimization problems by breaking them into smaller pieces. Combines benefits of Dual Decomposition and Augmented Lagrangian methods.

Problem: $\min_{x, z} f(x) + g(z) \quad \text{s.t. } Ax + Bz = c$

Augmented Lagrangian: $\mathcal{L}_\rho(x, z, y) = f(x) + g(z) + y^T(Ax + Bz - c) + \frac{\rho}{2} \|Ax + Bz - c\|_2^2$

Updates:

  1. x-update: $x^{k+1} := \arg\min_x \mathcal{L}_\rho(x, z^k, y^k)$
  2. z-update: $z^{k+1} := \arg\min_z \mathcal{L}_\rho(x^{k+1}, z, y^k)$
  3. y-update (dual): $y^{k+1} := y^k + \rho(Ax^{k+1} + Bz^{k+1} - c)$

Applications: Lasso, Robust PCA, Distributed Optimization, Consensus problems.

Convergence: Guaranteed for convex $f$ and $g$.
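
A sketch of the three updates for the Lasso, taking $f(w) = \frac{1}{2}\|Xw - y\|^2$, $g(z) = \lambda \|z\|_1$, and the constraint $w - z = 0$, written with a scaled dual variable $u$ (names and defaults here are illustrative):

import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def admm_lasso(X, y, lam=0.1, rho=1.0, steps=200):
    n_features = X.shape[1]
    w = np.zeros(n_features)
    z = np.zeros(n_features)
    u = np.zeros(n_features)                       # scaled dual variable
    # The w-update solves (X^T X + rho I) w = X^T y + rho (z - u).
    A = X.T @ X + rho * np.eye(n_features)
    Xty = X.T @ y
    for _ in range(steps):
        w = np.linalg.solve(A, Xty + rho * (z - u))        # w-update (ridge-like solve)
        z = soft_threshold(w + u, lam / rho)               # z-update (prox of the l1 term)
        u = u + w - z                                      # dual update
    return z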

2. Hogwild! (Asynchronous SGD)

Parallel SGD without locks. Multiple threads update parameters asynchronously.

Surprisingly: Works well when gradients are sparse (most updates don't conflict).

3. Parameter Server

Distributed architecture: Workers compute gradients, servers aggregate and update parameters.

Non-Convex Optimization

1. Simulated Annealing

Probabilistic technique inspired by annealing in metallurgy.

Accept worse solutions with probability $\exp(-\Delta E / T)$, where the temperature $T$ decreases over time.

Allows escaping local minima.
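
A sketch assuming a caller-supplied objective `energy_fn`, Gaussian proposal moves, and a geometric cooling schedule (all choices here are illustrative):

import numpy as np

def simulated_annealing(energy_fn, x_init, T0=1.0, cooling=0.995, steps=5000,
                        step_size=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x_init, dtype=float)
    E = energy_fn(x)
    best_x, best_E = x.copy(), E
    T = T0
    for _ in range(steps):
        candidate = x + step_size * rng.standard_normal(x.shape)   # random proposal
        E_new = energy_fn(candidate)
        # Always accept improvements; accept worse moves with probability exp(-dE/T).
        if E_new < E or rng.random() < np.exp(-(E_new - E) / T):
            x, E = candidate, E_new
            if E < best_E:
                best_x, best_E = x.copy(), E
        T *= cooling                                               # cool down
    return best_x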

2. Genetic Algorithms

Evolutionary approach: Population of solutions, select fittest, crossover, mutate, repeat.

3. Particle Swarm Optimization

Population of candidate solutions (particles) move in search space influenced by their own best position and global best.

4. Trust Region Methods

Define a region around the current iterate where a local model of the objective is trusted, optimize the model within that region, and expand or shrink the region based on how well the model agrees with the true function.

Line Search

1. Backtracking Line Search (Armijo Condition)

Choose a step size $\alpha$ satisfying the Armijo (sufficient decrease) condition: $f(x + \alpha d) \leq f(x) + c_1 \alpha \nabla f(x)^T d$

where $c_1 \in (0, 1)$ (typically 0.1-0.3).

Algorithm: Start with $\alpha = 1$ and decrease it by a factor $\beta < 1$ until the condition is satisfied.
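
A sketch of backtracking with the Armijo condition, assuming the caller supplies `f`, `grad_fn`, and a descent direction `d`:

import numpy as np

def backtracking_line_search(f, grad_fn, x, d, alpha0=1.0, beta=0.5, c1=0.1):
    alpha = alpha0
    fx = f(x)
    slope = grad_fn(x) @ d          # directional derivative; negative for a descent direction
    for _ in range(50):             # cap the number of shrink steps
        if f(x + alpha * d) <= fx + c1 * alpha * slope:
            break                   # sufficient decrease achieved
        alpha *= beta
    return alpha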

2. Wolfe Conditions

Stronger than Armijo. Additionally requires: $\nabla f(x + \alpha d)^T d \geq c_2 \nabla f(x)^T d$

where $c_2 \in (c_1, 1)$ (curvature condition).

Special Topics

1. Frank-Wolfe Algorithm (Conditional Gradient)

For constrained problems with simple linear optimization oracle.

Update: $s_t = \arg\min_{s \in \mathcal{D}} \nabla f(x_t)^T s$, $x_{t+1} = (1-\gamma_t) x_t + \gamma_t s_t$

Advantage: Maintains feasibility, projection-free.
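
A sketch over the probability simplex, where the linear minimization oracle simply picks the vertex $e_i$ with the most negative gradient coordinate (assuming a caller-supplied `grad_fn`):

import numpy as np

def frank_wolfe_simplex(grad_fn, n, steps=200):
    x = np.ones(n) / n                      # start at the center of the simplex
    for t in range(steps):
        g = grad_fn(x)
        s = np.zeros(n)
        s[np.argmin(g)] = 1.0               # LMO over the simplex: best vertex e_i
        gamma = 2.0 / (t + 2.0)             # standard step-size schedule
        x = (1 - gamma) * x + gamma * s     # convex combination stays feasible
    return x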

2. Mirror Descent

Generalization of gradient descent using Bregman divergence.

Update: $w_{t+1} = \arg\min_w \left( \nabla J(w_t)^T w + \frac{1}{\eta} D_\phi(w, w_t) \right)$

where $D_\phi(w, w_t) = \phi(w) - \phi(w_t) - \nabla \phi(w_t)^T(w - w_t)$ is the Bregman divergence.

Choice of $\phi$: Euclidean ($\phi(w) = \frac{1}{2}\|w\|^2$) gives standard gradient descent. Negative entropy gives exponentiated gradient.
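
A sketch of the entropy/exponentiated-gradient case for weights on the probability simplex, assuming a caller-supplied `grad_fn`:

import numpy as np

def exponentiated_gradient(grad_fn, n, eta=0.1, steps=200):
    # Mirror descent with the negative-entropy mirror map: multiplicative updates
    # followed by renormalization keep w on the probability simplex.
    w = np.ones(n) / n
    for _ in range(steps):
        w = w * np.exp(-eta * grad_fn(w))
        w = w / w.sum()
    return w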

3. Natural Gradient Descent

Use Fisher Information Matrix as metric instead of Euclidean.

Update: $\theta_{t+1} = \theta_t - \eta I(\theta_t)^{-1} \nabla J(\theta_t)$

Advantage: Invariant to reparameterization.

4. Variance Reduction (SVRG, SAGA)

Reduce the variance of SGD's gradient estimates while keeping the per-iteration cost comparable to SGD.

SVRG (Stochastic Variance Reduced Gradient): Periodically compute full gradient, use it to correct stochastic gradients.

5. Hyperparameter Optimization

Grid Search: Exhaustive search over grid of hyperparameters.

Random Search: Sample random combinations. Often better than grid.

Bayesian Optimization: Model objective as Gaussian Process, use acquisition function (EI, UCB) to select next point.

Hyperband: Resource allocation strategy combining random search and early stopping.