Neural Networks
Deep Learning foundations: Backpropagation, Activations, and Loss Functions
Fundamentals
1. The Perceptron
Basic unit of a neural network. Linear transformation followed by activation.
Forward Pass: $y = f(\mathbf{w}^\top \mathbf{x} + b)$,
where $f$ is an activation function, $\mathbf{w}$ are the weights, and $b$ is the bias.
Decision Boundary (binary classification): the hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$.
2. Multi-Layer Perceptron (MLP)
Stack of layers. Each layer: $\mathbf{a}^{(l)} = f\big(W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\big)$.
Universal Approximation Theorem: A feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of $\mathbb{R}^n$ (under mild assumptions on the activation function).
Activation Functions
1. Sigmoid
$\sigma(x) = \frac{1}{1 + e^{-x}}$. Squashes output to $(0, 1)$. Used for binary classification probabilities.
Derivative: $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$.
Pros: Smooth, outputs interpretable as probabilities. Cons: Vanishing gradients ($\sigma'(x) \le 0.25$, and $\approx 0$ for large $|x|$), not zero-centered, computationally expensive.
2. Tanh (Hyperbolic Tangent)
$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. Squashes output to $(-1, 1)$. Zero-centered (better than Sigmoid).
Derivative: $\tanh'(x) = 1 - \tanh^2(x)$.
Pros: Zero-centered. Cons: Still suffers from vanishing gradients.
3. ReLU (Rectified Linear Unit)
$\text{ReLU}(x) = \max(0, x)$. Standard for hidden layers. Introduces non-linearity while being computationally efficient.
Derivative: $\text{ReLU}'(x) = 1$ for $x > 0$, $0$ for $x < 0$ (undefined at 0; set to 0 in practice).
Pros:
- Solves vanishing gradient (for $x > 0$ the gradient is exactly 1).
- Computationally efficient.
- Sparse activation (neurons with negative input are inactive).
- Empirically converges faster than Sigmoid/Tanh.
Cons:
- Dead ReLU: Neurons can die if their inputs are always $< 0$. The gradient is then always 0, so the weights never update.
- Not zero-centered.
4. Leaky ReLU
Addresses the dead ReLU problem by allowing a small negative slope: $\text{LeakyReLU}(x) = x$ for $x > 0$, $\alpha x$ otherwise.
Typical $\alpha = 0.01$.
Parametric ReLU (PReLU): Learn $\alpha$ as a parameter.
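A minimal NumPy sketch of the activations above (Sigmoid, Tanh, ReLU, Leaky ReLU) and their derivatives; the function names and the element-wise, array-based interface are illustrative choices, not a fixed API:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1 - s)            # sigma(x) * (1 - sigma(x))

def d_tanh(x):
    return 1 - np.tanh(x) ** 2    # 1 - tanh^2(x)

def relu(x):
    return np.maximum(0, x)

def d_relu(x):
    return (x > 0).astype(float)  # 1 for x > 0, else 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def d_leaky_relu(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)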
5. ELU (Exponential Linear Unit)
Smooth approximation to ReLU with negative values: $\text{ELU}(x) = x$ for $x > 0$, $\alpha(e^{x} - 1)$ otherwise.
Pros: Mean activation closer to zero, smooth, no dead neurons. Cons: Computationally more expensive (exponential).
6. GELU (Gaussian Error Linear Unit)
Used in Transformers (BERT, GPT).
$\text{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the Gaussian CDF. Approximation: $\text{GELU}(x) \approx 0.5\,x\left(1 + \tanh\!\left[\sqrt{2/\pi}\,\big(x + 0.044715\,x^{3}\big)\right]\right)$.
7. Swish (SiLU)
Self-gated activation: $\text{Swish}(x) = x \cdot \sigma(\beta x)$; SiLU is the $\beta = 1$ case, $x \cdot \sigma(x)$.
Pros: Smooth, non-monotonic, performs well in deep networks.
8. Softmax
Generalization of Sigmoid to Multi-Class ($K$ classes): $\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$. The output vector sums to 1 (a probability distribution).
Derivative (Jacobian): $\frac{\partial\,\text{softmax}(\mathbf{z})_i}{\partial z_j} = \text{softmax}(\mathbf{z})_i\,\big(\delta_{ij} - \text{softmax}(\mathbf{z})_j\big)$.
Numerical Stability: Subtract $\max_j z_j$ from the logits before exponentiating.
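A minimal NumPy sketch of the max-subtraction trick for a batch of logits (the row-wise shape convention is an assumption):

import numpy as np

def softmax(z):
    # z: (batch, K) logits; subtracting the row max leaves the result unchanged
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0]])  # naive exp() would overflow here
print(softmax(logits))                          # ~[0.090, 0.245, 0.665]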
9. Softplus
Smooth approximation to ReLU: $\text{softplus}(x) = \log(1 + e^{x})$.
Derivative: $\sigma(x)$ (the Sigmoid).
Loss Functions
1. Mean Squared Error (MSE)
Standard for regression: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$.
Gradient: $\frac{\partial\,\text{MSE}}{\partial \hat{y}_i} = \frac{2}{n}(\hat{y}_i - y_i)$.
2. Mean Absolute Error (MAE)
$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$. More robust to outliers than MSE.
Gradient: $\frac{1}{n}\,\text{sign}(\hat{y}_i - y_i)$ (discontinuous at 0).
3. Huber Loss
Combines MSE (for small errors) and MAE (for large errors): with $e = y - \hat{y}$, $L_\delta(e) = \frac{1}{2}e^2$ if $|e| \le \delta$, else $\delta\left(|e| - \frac{1}{2}\delta\right)$.
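A minimal NumPy sketch of the Huber loss as defined above (the default $\delta = 1.0$ is an illustrative choice):

import numpy as np

def huber(y, y_pred, delta=1.0):
    e = y - y_pred
    quad = 0.5 * e ** 2                       # MSE-like region, |e| <= delta
    lin = delta * (np.abs(e) - 0.5 * delta)   # MAE-like region, |e| > delta
    return np.mean(np.where(np.abs(e) <= delta, quad, lin))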
4. Binary Cross-Entropy
Standard for binary classification: $\text{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big]$.
Gradient (w.r.t. the logit $z$ if $\hat{y} = \sigma(z)$): $\frac{\partial L}{\partial z} = \hat{y} - y$.
5. Categorical Cross-Entropy
Standard for multi-class classification: $\text{CCE} = -\sum_{k=1}^{K} y_k \log \hat{y}_k$,
where $\mathbf{y}$ is one-hot encoded.
With Softmax ($\hat{y}_k = \text{softmax}(\mathbf{z})_k$), the gradient w.r.t. the logits is remarkably simple:
$\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k$, i.e., prediction minus target.
6. Hinge Loss (SVM Loss)
For maximum-margin classification: $L = \max(0,\, 1 - y \cdot s)$,
where $y \in \{-1, +1\}$ is the label and $s$ is the model score.
Multi-Class Hinge: $L = \sum_{k \ne y}\max(0,\, s_k - s_y + 1)$, where $s_k$ are the class scores and $y$ is the correct class.
7. Kullback-Leibler Divergence
Measures the difference between two probability distributions: $D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)}$.
Used in VAEs (regularization term).
8. Focal Loss
Addresses class imbalance by down-weighting easy examples: $\text{FL}(p_t) = -(1 - p_t)^{\gamma}\log(p_t)$, where $p_t$ is the predicted probability of the true class.
$\gamma$ (focusing parameter, typically 2) reduces the loss for well-classified examples.
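A minimal NumPy sketch of the binary focal loss described above; the optional class-balance weight alpha is a common extra term and an assumption here, not part of the definition above:

import numpy as np

def focal_loss(y, p, gamma=2.0, alpha=0.25, eps=1e-12):
    # y in {0, 1}; p = predicted probability of the positive class
    p_t = np.where(y == 1, p, 1 - p)          # probability of the true class
    a_t = np.where(y == 1, alpha, 1 - alpha)  # optional class-balance weight (assumption)
    return np.mean(-a_t * (1 - p_t) ** gamma * np.log(p_t + eps))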
Backpropagation
Algorithm to compute gradients of loss function with respect to weights using Chain Rule. Enables efficient training of deep networks.
Goal: Efficiently compute $\frac{\partial L}{\partial W^{(l)}}$ and $\frac{\partial L}{\partial b^{(l)}}$ for all weights.
Forward Pass (Computation Graph): $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$, $\quad a^{(l)} = f(z^{(l)})$, for $l = 1, \ldots, L$.
Backward Pass (Error Propagation): The core idea is to compute the "error" $\delta^{(l)} = \frac{\partial L}{\partial z^{(l)}}$ at layer $l$, which represents the sensitivity of the loss to the pre-activations.
- Output Layer Error: $\delta^{(L)} = \nabla_{a^{(L)}} L \odot f'(z^{(L)})$.
  For MSE: $\delta^{(L)} = (a^{(L)} - y) \odot f'(z^{(L)})$.
  For Cross-Entropy + Softmax: $\delta^{(L)} = \hat{y} - y$ (simplified gradient).
- Hidden Layer Error: Propagate the error backwards using the weights and the activation derivative: $\delta^{(l)} = \big(W^{(l+1)\top} \delta^{(l+1)}\big) \odot f'(z^{(l)})$.
- Gradients: Using the computed $\delta^{(l)}$: $\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \big(a^{(l-1)}\big)^{\top}$ and $\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}$.
Why Backward?: Since a neural network is a composite function $L = \ell\big(f^{(L)}(f^{(L-1)}(\cdots f^{(1)}(x)))\big)$, the Chain Rule multiplies derivatives from the outside in (output to input). Computing gradients in forward mode would require a Jacobian-matrix multiplication for every layer, which is computationally expensive (roughly $O(n^2)$ to $O(n^3)$ per layer for width-$n$ layers), compared to the backward vector-matrix products, which cost $O(W)$ overall, where $W$ is the number of weights. Backprop is essentially Reverse-Mode Automatic Differentiation.
Computational Graph: Represent computation as directed acyclic graph. Forward pass computes values, backward pass computes gradients.
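A toy illustration of reverse-mode automatic differentiation on a scalar computational graph; the Value class and its operator set are illustrative, not a library API. Forward operations record their parents, and backward() walks the graph in reverse topological order, accumulating gradients via the chain rule:

import math

class Value:
    """Scalar node in a computational graph; backward() runs reverse-mode AD."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad            # d(a+b)/da = 1
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t ** 2) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Build a topological order, then propagate gradients from output to inputs
        order, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                order.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# Tiny "neuron": y = tanh(w*x + b)
x, w, b = Value(0.5), Value(-2.0), Value(1.0)
y = (w * x + b).tanh()
y.backward()
print(y.data, w.grad, x.grad)   # 0.0, 0.5, -2.0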
Regularization Techniques
1. L2 Regularization (Weight Decay)
Add penalty term to loss: $L_{\text{total}} = L + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$.
Effect: Weights decay toward zero. Prevents large weights, reduces overfitting.
Update (in SGD): $w \leftarrow w - \eta\,(\nabla_w L + \lambda w) = (1 - \eta\lambda)\,w - \eta\,\nabla_w L$.
2. L1 Regularization
Add penalty term to loss: $L_{\text{total}} = L + \lambda\|\mathbf{w}\|_1$.
Effect: Encourages sparsity (some weights become exactly zero).
3. Dropout
During training, randomly set a fraction $p$ of neurons to zero.
Training: $\tilde{\mathbf{a}} = \mathbf{m} \odot \mathbf{a}$, where the mask $m_i \sim \text{Bernoulli}(1 - p)$.
Test: Use all neurons but scale activations by $1 - p$ (or use inverted dropout, dividing by $1 - p$ during training).
Effect: Prevents co-adaptation of neurons, acts as ensemble of networks.
Typical: $p = 0.5$ for hidden layers, $p \approx 0.2$ for the input layer.
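A minimal NumPy sketch of inverted dropout (rescaling by $1/(1-p)$ at training time so that test-time code needs no change); shapes and the interface are assumptions:

import numpy as np

def dropout_forward(a, p=0.5, training=True):
    # a: activations; p: drop probability
    if not training or p == 0.0:
        return a
    mask = (np.random.rand(*a.shape) >= p) / (1.0 - p)  # keep and rescale
    return a * mask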
4. Batch Normalization
Normalize inputs to each layer to have zero mean and unit variance.
Transform: $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$, $\quad y_i = \gamma\,\hat{x}_i + \beta$,
where $\mu_B, \sigma_B^2$ are the mini-batch mean/variance and $\gamma, \beta$ are learned parameters.
Benefits:
- Reduces internal covariate shift.
- Allows higher learning rates.
- Acts as regularizer (adds noise).
- Reduces dependence on initialization.
At Test Time: Use running averages of $\mu$ and $\sigma^2$ from training.
Variants: Layer Norm (normalize across features), Group Norm, Instance Norm.
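A minimal NumPy sketch of a batch-norm forward pass that keeps running statistics for test time; the momentum value and the function interface are assumptions:

import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.9, eps=1e-5):
    if training:
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        # update running statistics for use at test time
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize per feature
    out = gamma * x_hat + beta              # learned scale and shift
    return out, running_mean, running_var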
5. Early Stopping
Stop training when validation error starts increasing.
Patience: Allow $k$ epochs of no improvement before stopping.
6. Data Augmentation
Artificially expand training set via transformations (rotations, crops, flips, noise, etc.).
Optimization for Deep Learning
1. Gradient Descent Variants
See Optimization section for details on SGD, Momentum, Adam, etc.
2. Learning Rate Schedules
Step Decay: Reduce the LR by a factor $\gamma$ every $k$ epochs.
Exponential Decay: $\eta_t = \eta_0\, e^{-kt}$.
1/t Decay: $\eta_t = \frac{\eta_0}{1 + kt}$.
Cosine Annealing: $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{t\pi}{T}\right)$.
Warmup: Gradually increase the LR at the start of training (combined with cosine annealing in the sketch below).
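A minimal sketch combining linear warmup with cosine annealing; the parameter names and defaults are illustrative assumptions:

import math

def lr_at_step(t, total_steps, warmup_steps=500, lr_max=1e-3, lr_min=1e-5):
    if t < warmup_steps:
        return lr_max * (t + 1) / warmup_steps            # linear warmup
    progress = (t - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))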
3. Gradient Clipping
Prevent exploding gradients (common in RNNs).
Norm Clipping: If $\|\mathbf{g}\| > \tau$, scale: $\mathbf{g} \leftarrow \tau\,\frac{\mathbf{g}}{\|\mathbf{g}\|}$.
Value Clipping: Clip each gradient element to $[-\tau, \tau]$.
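A minimal NumPy sketch of both clipping variants over a list of gradient arrays (threshold names are assumptions):

import numpy as np

def clip_by_norm(grads, max_norm=1.0):
    # Rescale all gradients together if the global norm exceeds max_norm
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

def clip_by_value(grads, clip=1.0):
    return [np.clip(g, -clip, clip) for g in grads]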
Weight Initialization
Poor initialization can cause vanishing/exploding gradients or slow convergence.
1. Zero Initialization
Bad: All neurons learn the same features (symmetry problem).
2. Small Random Values
$W_{ij} \sim \mathcal{N}(0, \sigma^2)$ with small $\sigma$ (e.g., 0.01). Can work for shallow networks but causes vanishing activations/gradients in deep networks.
3. Xavier/Glorot Initialization
For Sigmoid/Tanh activations. Maintains variance of activations across layers.
$W \sim \mathcal{N}\!\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$, or Uniform: $W \sim U\!\left[-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},\ \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right]$.
4. He Initialization
For ReLU activations. Accounts for the fact that half of the neurons are inactive: $W \sim \mathcal{N}\!\left(0, \frac{2}{n_{\text{in}}}\right)$.
5. Bias Initialization
Typically initialize to zero. For ReLU, can use small positive value (e.g., 0.01) to ensure neurons are initially active.
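A minimal NumPy sketch of the Gaussian forms of Xavier and He initialization for a dense layer:

import numpy as np

def xavier_init(n_in, n_out):
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / (n_in + n_out))

def he_init(n_in, n_out):
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)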
Convolutional Neural Networks (CNNs)
1. Convolution Layer
Apply filters (kernels) to extract spatial features.
Operation: $(I * K)(i, j) = \sum_m \sum_n I(i + m,\, j + n)\, K(m, n)$.
Hyperparameters:
- Filter Size: $K \times K$ (typically 3x3 or 5x5).
- Stride: Step size (1 = dense, 2 = skip every other).
- Padding: Add zeros around border (same = preserve size, valid = no padding).
- Number of Filters: Determines depth of output.
Output Size: $O = \frac{W - K + 2P}{S} + 1$,
where $W$ = input size, $P$ = padding, $K$ = kernel size, $S$ = stride.
Parameters: $(K \times K \times C_{\text{in}} + 1) \times C_{\text{out}}$ (much fewer than fully connected).
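A minimal sketch of the output-size and parameter-count formulas above (square inputs and kernels assumed):

def conv_output_size(w, k, p=0, s=1):
    # (W - K + 2P) / S + 1 for a square input and kernel
    return (w - k + 2 * p) // s + 1

def conv_param_count(k, c_in, c_out):
    return (k * k * c_in + 1) * c_out   # +1 for each filter's bias

print(conv_output_size(224, 3, p=1, s=1))   # 224 ("same" padding)
print(conv_param_count(3, 64, 128))         # 73856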
2. Pooling Layer
Downsample to reduce spatial dimensions.
Max Pooling: Take max in each window (most common).
Average Pooling: Take average.
Global Average Pooling: Average entire feature map to single value (used before final FC layer).
Effect: Translation invariance, reduces overfitting, fewer parameters.
3. Architecture Patterns
- Conv-ReLU-Pool (repeat)
- Conv-ReLU-Conv-ReLU-Pool (repeat, deeper)
Famous Architectures: LeNet, AlexNet, VGGNet, GoogLeNet (Inception), ResNet, EfficientNet.
Recurrent Neural Networks (RNNs)
1. Vanilla RNN
Process sequences by maintaining hidden state.
Forward Pass: $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$, $\quad y_t = W_{hy} h_t + b_y$.
Backpropagation Through Time (BPTT): Unroll network through time, apply backprop.
Problem: Vanishing/exploding gradients over long sequences.
2. Long Short-Term Memory (LSTM)
Addresses vanishing gradient via gating mechanisms and cell state.
Gates:
- Forget Gate: $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$ (what to forget from the cell state)
- Input Gate: $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$ (what new info to add)
- Candidate: $\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$
- Output Gate: $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$ (what to output)
Updates: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$, $\quad h_t = o_t \odot \tanh(C_t)$.
Key Idea: The cell state $C_t$ provides a highway for gradients to flow through time.
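A minimal NumPy sketch of a single LSTM step following the gate equations above; the concatenated $[h_{t-1}, x_t]$ weight layout and argument names are assumptions:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z + b_f)            # forget gate
    i = sigmoid(W_i @ z + b_i)            # input gate
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate cell state
    o = sigmoid(W_o @ z + b_o)            # output gate
    c = f * c_prev + i * c_tilde          # new cell state
    h = o * np.tanh(c)                    # new hidden state
    return h, c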
3. Gated Recurrent Unit (GRU)
Simpler alternative to LSTM with fewer parameters.
Gates:
- Reset Gate: $r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)$
- Update Gate: $z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)$
- Candidate: $\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h)$
Update: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$.
Comparison: GRU has fewer parameters, LSTM slightly more expressive. Performance often similar.
Attention and Transformers
1. Attention Mechanism
Allows model to focus on relevant parts of input.
Scaled Dot-Product Attention: $\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$,
where $Q$ (queries), $K$ (keys), and $V$ (values) are linear projections of the input, and $d_k$ is the key dimension.
Interpretation: Compute similarity (dot product) between queries and keys, use as weights for values.
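A minimal NumPy sketch of scaled dot-product attention (single head, no masking, no batch dimension):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V, weights                      # weighted sum of values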
2. Self-Attention
Attention where $Q$, $K$, and $V$ all come from the same input. Captures dependencies within the sequence.
3. Multi-Head Attention
Run multiple attention mechanisms in parallel, concatenate outputs.
$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$, where $\text{head}_i = \text{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V)$.
Benefits: Model different types of relationships simultaneously.
4. Transformer Architecture
Encoder-Decoder architecture based entirely on attention (no recurrence).
Encoder: Stack of (Multi-Head Self-Attention + FFN) layers with residual connections and layer norm.
Decoder: Stack of (Masked Multi-Head Self-Attention + Multi-Head Cross-Attention + FFN) layers.
Positional Encoding: Add position information since there is no recurrence: $PE_{(pos,\,2i)} = \sin\!\big(pos / 10000^{2i/d_{\text{model}}}\big)$, $\quad PE_{(pos,\,2i+1)} = \cos\!\big(pos / 10000^{2i/d_{\text{model}}}\big)$.
Advantages: Parallelizable (unlike RNNs), long-range dependencies, state-of-the-art in NLP.
Famous Models: BERT, GPT, T5, Vision Transformer (ViT).
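A minimal NumPy sketch of the sinusoidal positional encoding above (assumes an even $d_{\text{model}}$):

import numpy as np

def positional_encoding(max_len, d_model):
    # d_model assumed even so sin/cos pairs fill all dimensions
    pos = np.arange(max_len)[:, None]         # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]     # even dimension indices
    angle = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)               # even dims: sin
    pe[:, 1::2] = np.cos(angle)               # odd dims: cos
    return pe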
Advanced Topics
1. Residual Connections (ResNets)
Skip connections that add the input to the output: $y = F(x) + x$, where $F$ is the residual mapping learned by the block.
Benefits: Solves vanishing gradient problem, enables very deep networks (100+ layers).
Identity Mapping: If $F(x) = 0$, the block reduces to the identity, which is easy for the network to learn, so adding layers cannot easily hurt.
2. Transfer Learning
Use pre-trained model on new task.
Approaches:
- Feature Extraction: Freeze early layers, train only final layers.
- Fine-Tuning: Unfreeze some layers, train with small LR.
Benefits: Faster training, better performance with limited data.
3. Generative Adversarial Networks (GANs)
Two networks compete: Generator creates fake data, Discriminator distinguishes real from fake.
Objective: $\min_G \max_D\; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$.
Training: Alternate updating $D$ and $G$.
Challenges: Mode collapse, training instability.
Variants: DCGAN, Wasserstein GAN, StyleGAN.
4. Variational Autoencoders (VAEs)
Learns probabilistic latent representation.
Encoder: $q_\phi(z \mid x)$ approximates the posterior. Decoder: $p_\theta(x \mid z)$ reconstructs the input.
Loss (negative ELBO): $\mathcal{L} = -\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] + D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$.
Reconstruction term + KL regularization (enforces the latent prior, typically $\mathcal{N}(0, I)$).
Reparameterization Trick: $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ allows backprop through the sampling step.
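A minimal NumPy sketch of the reparameterization trick together with the closed-form KL term against a standard normal prior used in the VAE loss above; parameterizing the encoder output as $\log \sigma^2$ is a common choice and an assumption here:

import numpy as np

def reparameterize(mu, log_var):
    eps = np.random.randn(*mu.shape)           # epsilon ~ N(0, I)
    return mu + np.exp(0.5 * log_var) * eps    # z = mu + sigma * eps

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions
    return -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))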
5. Neural Architecture Search (NAS)
Automate design of neural network architectures.
Approaches: Reinforcement learning, evolutionary algorithms, gradient-based (DARTS).
6. Knowledge Distillation
Train small "student" network to mimic large "teacher" network.
Loss: Minimize KL divergence between student and teacher outputs (softened with temperature).
Benefits: Model compression, faster inference.
Pseudocode: Neural Network Training
# Simple 2-layer NN (NumPy)
import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y, y_pred, eps=1e-12):
    # y: one-hot labels, y_pred: predicted probabilities; mean negative log-likelihood
    return -np.mean(np.sum(y * np.log(y_pred + eps), axis=1))

def train(X, y, hidden_size=64, lr=0.01, epochs=100):
    # X: (n, input_dim), y: one-hot labels of shape (n, output_dim)
    n, input_dim = X.shape
    output_dim = y.shape[1]
    W1 = np.random.randn(input_dim, hidden_size) * np.sqrt(2 / input_dim)  # He init
    b1 = np.zeros(hidden_size)
    W2 = np.random.randn(hidden_size, output_dim) * np.sqrt(2 / hidden_size)
    b2 = np.zeros(output_dim)
    for epoch in range(epochs):
        # Forward
        z1 = X @ W1 + b1
        a1 = relu(z1)
        z2 = a1 @ W2 + b2
        y_pred = softmax(z2)
        # Loss
        loss = cross_entropy(y, y_pred)
        # Backward (gradients)
        dz2 = (y_pred - y) / n        # gradient of CE + Softmax, averaged over the batch
        dW2 = a1.T @ dz2
        db2 = dz2.sum(axis=0)
        da1 = dz2 @ W2.T
        dz1 = da1 * (z1 > 0)          # ReLU derivative
        dW1 = X.T @ dz1
        db1 = dz1.sum(axis=0)
        # Update (plain SGD; momentum could be added)
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
        if epoch % 10 == 0:
            print(f"Epoch {epoch}, Loss: {loss:.4f}")
    return W1, b1, W2, b2