Optimizer
What is an Optimizer?
An optimizer is an algorithm that adjusts the parameters of a machine learning model to minimize the loss function during training. Optimizers implement the core learning mechanism by determining how model parameters should be updated based on the computed gradients, effectively guiding the model toward optimal performance.
Key Characteristics
- Parameter Update: Adjusts model weights based on gradients
- Learning Rate Control: Manages step size during optimization
- Convergence Acceleration: Speeds up training process
- Local Minima Escape: Helps avoid suboptimal solutions
- Adaptive Learning: Adjusts to different parameter scales
- Memory Efficiency: Balances computational resources
- Hyperparameter Sensitivity: Requires careful tuning
How Optimizers Work
- Initialization: Set initial model parameters
- Forward Pass: Compute predictions and loss
- Backward Pass: Calculate gradients via backpropagation
- Parameter Update: Adjust parameters using optimizer rules
- Iteration: Repeat until convergence or stopping criteria
- Evaluation: Assess model performance on validation data
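These steps map directly onto a few lines of framework code. Below is a minimal PyTorch sketch of the loop on a synthetic regression task (the model, data, and hyperparameters are illustrative assumptions, not a recommended recipe):

import torch
import torch.nn as nn

# Synthetic regression data (illustrative)
X = torch.randn(256, 5)
y = X @ torch.randn(5, 1)

# 1. Initialization
model = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # any optimizer fits here

for epoch in range(100):            # 5. Iteration until a stopping criterion
    optimizer.zero_grad()           # clear gradients from the previous step
    loss = loss_fn(model(X), y)     # 2. Forward pass: predictions and loss
    loss.backward()                 # 3. Backward pass: gradients via backpropagation
    optimizer.step()                # 4. Parameter update using the optimizer's rule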
Optimization Process Diagram
Start Training
      │
      ▼
Initialize Parameters
      │
      ▼
Forward Pass → Compute Loss
      │
      ▼
Backward Pass → Compute Gradients
      │
      ▼
Optimizer Update → Adjust Parameters
      │
      ▼
Evaluate on Validation Data
      │
      ▼
Converged?
      ├── Yes → Stop Training
      └── No  → Repeat from Forward Pass
Common Optimizers
Gradient Descent Variants
Stochastic Gradient Descent (SGD)
- Formula: $$ \theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t) $$
- Characteristics: Simple, noisy updates, requires learning rate tuning
- Use Case: General optimization, convex problems
- Pros: Simple, memory efficient
- Cons: Slow convergence, sensitive to learning rate
SGD with Momentum
- Formula: $$ v_{t+1} = \gamma v_t + \eta \nabla_\theta \mathcal{L}(\theta_t) $$ $$ \theta_{t+1} = \theta_t - v_{t+1} $$
- Characteristics: Accelerates convergence, reduces oscillations
- Use Case: Deep learning, non-convex problems
- Pros: Faster convergence, escapes local minima
- Cons: Additional hyperparameter (momentum)
Adaptive Optimizers
AdaGrad (Adaptive Gradient Algorithm)
- Formula: $$ G_t = G_{t-1} + \nabla_\theta \mathcal{L}(\theta_t)^2 $$ $$ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \nabla_\theta \mathcal{L}(\theta_t) $$
- Characteristics: Adapts learning rate per parameter
- Use Case: Sparse data, natural language processing
- Pros: Automatic learning rate adaptation
- Cons: Learning rate decays too aggressively
RMSProp (Root Mean Square Propagation)
- Formula: $$ E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta) \nabla_\theta \mathcal{L}(\theta_t)^2 $$ $$ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \nabla_\theta \mathcal{L}(\theta_t) $$
- Characteristics: Exponentially weighted moving average
- Use Case: Recurrent neural networks, non-convex problems
- Pros: Handles non-stationary objectives
- Cons: Requires tuning of decay rate
Adam (Adaptive Moment Estimation)
- Formula: $$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta \mathcal{L}(\theta_t) $$ $$ v_t = \beta_2 v_{t-1} + (1 - \beta_2) \nabla_\theta \mathcal{L}(\theta_t)^2 $$ $$ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} $$ $$ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $$ $$ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t $$
- Characteristics: Combines momentum and adaptive learning rates
- Use Case: Most deep learning applications
- Pros: Works well in practice, adaptive learning rates
- Cons: Can converge to suboptimal solutions
Advanced Optimizers
AdamW (Adam with Weight Decay)
- Characteristics: Fixes weight decay implementation in Adam
- Use Case: Modern deep learning architectures
- Pros: Better regularization, improved generalization
- Cons: Slightly more complex than Adam
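The fix can be shown in a few lines: Adam with classic L2 regularization folds the decay term into the gradient, where the adaptive denominator rescales it; AdamW applies the decay directly to the weights. A single-parameter NumPy sketch of the two update rules (bias correction omitted for brevity; all names are illustrative):

import numpy as np

def adam_l2_step(param, grad, m, v, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad + wd * param                  # L2: decay enters the gradient...
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    return param - lr * m / (np.sqrt(v) + eps), m, v  # ...and is rescaled adaptively

def adamw_step(param, grad, m, v, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Decoupled: decay is applied directly to the weights, untouched by sqrt(v)
    return param - lr * (m / (np.sqrt(v) + eps) + wd * param), m, v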
Nadam (Nesterov-accelerated Adam)
- Characteristics: Combines Nesterov momentum with Adam
- Use Case: Deep learning with momentum
- Pros: Faster convergence than Adam
- Cons: More complex implementation
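For reference, one common way to write the Nadam update (Dozat, 2016), reusing Adam's $\hat{m}_t$ and $\hat{v}_t$ from above:
$$ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \left( \beta_1 \hat{m}_t + \frac{(1 - \beta_1)\, \nabla_\theta \mathcal{L}(\theta_t)}{1 - \beta_1^t} \right) $$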
AMSGrad
- Characteristics: Variant of Adam that maintains maximum of past squared gradients
- Use Case: Problems where Adam fails to converge
- Pros: Theoretical convergence guarantees
- Cons: Slower than Adam in practice
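The change relative to Adam is essentially one line: the denominator uses the running maximum of the second-moment estimate, so the effective per-parameter learning rate can never increase. A minimal NumPy sketch of a single step (bias correction omitted, as in the paper's practical variant; names are illustrative):

import numpy as np

def amsgrad_step(param, grad, m, v, v_max, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)  # the AMSGrad change: the denominator never shrinks
    param = param - lr * m / (np.sqrt(v_max) + eps)
    return param, m, v, v_max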
Lion (Evolved Sign Momentum)
- Formula: $$ c_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta \mathcal{L}(\theta_t) $$ $$ \theta_{t+1} = \theta_t - \eta \left( \text{sign}(c_t) + \lambda \theta_t \right) $$ $$ m_t = \beta_2 m_{t-1} + (1 - \beta_2) \nabla_\theta \mathcal{L}(\theta_t) $$
- Characteristics: Sign-based updates with weight decay
- Use Case: Large-scale deep learning
- Pros: Memory efficient, good generalization
- Cons: Newer, less established
Mathematical Foundations
Optimization Objective
Optimizers solve the optimization problem:
$$ \theta^* = \arg\min_\theta \mathcal{L}(\theta) = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i; \theta)) $$
Gradient Descent
Basic gradient descent update:
$$ \theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t) $$
where $\eta$ is the learning rate.
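For example, with $\mathcal{L}(\theta) = \theta^2$ (so $\nabla_\theta \mathcal{L}(\theta) = 2\theta$), $\theta_0 = 2$, and $\eta = 0.1$: $\theta_1 = 2 - 0.1 \cdot 4 = 1.6$, $\theta_2 = 1.6 - 0.1 \cdot 3.2 = 1.28$, and each step moves $\theta$ closer to the minimum at $\theta = 0$.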
Momentum
Momentum helps accelerate gradients in consistent directions:
$$ v_{t+1} = \gamma v_t + \eta \nabla_\theta \mathcal{L}(\theta_t) $$ $$ \theta_{t+1} = \theta_t - v_{t+1} $$
Adaptive Learning Rates
Adaptive methods adjust learning rates per parameter:
$$ \theta_{t+1, i} = \theta_{t, i} - \frac{\eta}{\sqrt{G_{t, ii}} + \epsilon} \nabla_{\theta_i} \mathcal{L}(\theta_t) $$
where $G_t$ is a diagonal matrix of past squared gradients.
Optimizer Implementation
Python Examples
Stochastic Gradient Descent
import numpy as np

class SGD:
    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate

    def update(self, params, grads):
        for key in params.keys():
            params[key] -= self.learning_rate * grads[key]

# Example usage
params = {'W1': np.random.randn(10, 5), 'b1': np.zeros(5)}
grads = {'W1': np.random.randn(10, 5), 'b1': np.random.randn(5)}
optimizer = SGD(learning_rate=0.01)
optimizer.update(params, grads)
SGD with Momentum
class Momentum:
    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            self.v = {}
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)
        for key in params.keys():
            self.v[key] = self.momentum * self.v[key] - self.learning_rate * grads[key]
            params[key] += self.v[key]
AdaGrad
class AdaGrad:
    def __init__(self, learning_rate=0.01, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.epsilon = epsilon
        self.G = None

    def update(self, params, grads):
        if self.G is None:
            self.G = {}
            for key, val in params.items():
                self.G[key] = np.zeros_like(val)
        for key in params.keys():
            self.G[key] += grads[key] ** 2
            params[key] -= self.learning_rate * grads[key] / (np.sqrt(self.G[key]) + self.epsilon)
RMSProp
class RMSProp:
    def __init__(self, learning_rate=0.001, decay_rate=0.99, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.decay_rate = decay_rate
        self.epsilon = epsilon
        self.E = None

    def update(self, params, grads):
        if self.E is None:
            self.E = {}
            for key, val in params.items():
                self.E[key] = np.zeros_like(val)
        for key in params.keys():
            self.E[key] = self.decay_rate * self.E[key] + (1 - self.decay_rate) * grads[key] ** 2
            params[key] -= self.learning_rate * grads[key] / (np.sqrt(self.E[key]) + self.epsilon)
Adam
class Adam:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None
        self.v = None
        self.t = 0

    def update(self, params, grads):
        if self.m is None:
            self.m = {}
            self.v = {}
            for key, val in params.items():
                self.m[key] = np.zeros_like(val)
                self.v[key] = np.zeros_like(val)
        self.t += 1
        for key in params.keys():
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * grads[key] ** 2
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)
            params[key] -= self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon)
TensorFlow/Keras Implementation
import tensorflow as tf
from tensorflow.keras.optimizers import SGD, Adam, RMSprop, Adagrad, Nadam, AdamW  # AdamW is built in from TF 2.11 onward
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# Stochastic Gradient Descent
sgd = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model_sgd = Sequential([Dense(10, input_dim=5, activation='relu'), Dense(1)])
model_sgd.compile(optimizer=sgd, loss='mse')
# Adam optimizer
adam = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7)
model_adam = Sequential([Dense(10, input_dim=5, activation='relu'), Dense(1)])
model_adam.compile(optimizer=adam, loss='mse')
# RMSProp optimizer
rmsprop = RMSprop(learning_rate=0.001, rho=0.9, epsilon=1e-7)
model_rmsprop = Sequential([Dense(10, input_dim=5, activation='relu'), Dense(1)])
model_rmsprop.compile(optimizer=rmsprop, loss='mse')
# AdaGrad optimizer
adagrad = Adagrad(learning_rate=0.01, epsilon=1e-7)
model_adagrad = Sequential([Dense(10, input_dim=5, activation='relu'), Dense(1)])
model_adagrad.compile(optimizer=adagrad, loss='mse')
# Nadam optimizer
nadam = Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7)
model_nadam = Sequential([Dense(10, input_dim=5, activation='relu'), Dense(1)])
model_nadam.compile(optimizer=nadam, loss='mse')
# AdamW optimizer
adamw = AdamW(learning_rate=0.001, weight_decay=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-7)
model_adamw = Sequential([Dense(10, input_dim=5, activation='relu'), Dense(1)])
model_adamw.compile(optimizer=adamw, loss='mse')
# Custom optimizer (uses the pre-Keras-3 OptimizerV2 subclassing API;
# on TF >= 2.11 subclass tf.keras.optimizers.legacy.Optimizer instead)
class Lion(tf.keras.optimizers.Optimizer):
    def __init__(self, learning_rate=0.0001, beta1=0.9, beta2=0.99, weight_decay=0.0, name='Lion', **kwargs):
        super().__init__(name, **kwargs)
        self._set_hyper('learning_rate', kwargs.get('lr', learning_rate))
        self._set_hyper('beta1', beta1)
        self._set_hyper('beta2', beta2)
        self.weight_decay = weight_decay

    def _create_slots(self, var_list):
        for var in var_list:
            self.add_slot(var, 'm')

    def _resource_apply_dense(self, grad, var, apply_state=None):
        lr = self._get_hyper('learning_rate', var.dtype)
        beta1 = self._get_hyper('beta1', var.dtype)
        beta2 = self._get_hyper('beta2', var.dtype)
        m = self.get_slot(var, 'm')
        # Lion: the update direction interpolates with beta1...
        c_t = beta1 * m + (1 - beta1) * grad
        var_update = var - lr * (tf.sign(c_t) + self.weight_decay * var)
        # ...while the momentum buffer is tracked with beta2
        m_update = beta2 * m + (1 - beta2) * grad
        return tf.group(var.assign(var_update), m.assign(m_update))

    def get_config(self):
        config = super().get_config()
        config.update({
            'learning_rate': self._serialize_hyperparameter('learning_rate'),
            'beta1': self._serialize_hyperparameter('beta1'),
            'beta2': self._serialize_hyperparameter('beta2'),
            'weight_decay': self.weight_decay,
        })
        return config
PyTorch Implementation
import torch
import torch.optim as optim

# Any nn.Module can supply the parameters; a small example model:
model = torch.nn.Sequential(torch.nn.Linear(5, 10), torch.nn.ReLU(), torch.nn.Linear(10, 1))

# Stochastic Gradient Descent
sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
# Adam optimizer
adam = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)
# RMSProp optimizer
rmsprop = optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99, eps=1e-8)
# AdaGrad optimizer
adagrad = optim.Adagrad(model.parameters(), lr=0.01, eps=1e-8)
# AdamW optimizer
adamw = optim.AdamW(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01)
# Custom optimizer example
class Lion(optim.Optimizer):
    def __init__(self, params, lr=0.0001, betas=(0.9, 0.99), weight_decay=0.0):
        defaults = dict(lr=lr, betas=betas, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad
                if grad.is_sparse:
                    raise RuntimeError('Lion does not support sparse gradients')
                state = self.state[p]
                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    state['m'] = torch.zeros_like(p)
                m = state['m']
                beta1, beta2 = group['betas']
                state['step'] += 1
                # Update direction: sign of the beta1-interpolated momentum,
                # plus decoupled weight decay
                update = (beta1 * m + (1 - beta1) * grad).sign_()
                update.add_(p, alpha=group['weight_decay'])
                p.add_(update, alpha=-group['lr'])
                # The momentum buffer is tracked with beta2
                m.mul_(beta2).add_(grad, alpha=1 - beta2)
        return loss
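Usage then mirrors any built-in PyTorch optimizer:

optimizer = Lion(model.parameters(), lr=1e-4, betas=(0.9, 0.99), weight_decay=0.01)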
Optimizer Selection Guide
Comparison of Optimizers
| Optimizer | Learning Rate Adaptation | Momentum | Memory Usage | Convergence Speed | Best For | Hyperparameters |
|---|---|---|---|---|---|---|
| SGD | No | Optional | Low | Slow | Convex problems, simple models | lr, momentum |
| Momentum | No | Yes | Medium | Medium | Deep learning, non-convex problems | lr, momentum |
| AdaGrad | Yes (per parameter) | No | Medium | Medium | Sparse data, NLP | lr, epsilon |
| RMSProp | Yes (per parameter) | No | Medium | Fast | RNNs, non-convex problems | lr, rho, epsilon |
| Adam | Yes (per parameter) | Yes | High | Fast | Most deep learning tasks | lr, beta1, beta2 |
| AdamW | Yes (per parameter) | Yes | High | Fast | Modern deep learning | lr, beta1, beta2, wd |
| NAdam | Yes (per parameter) | Yes | High | Fast | Deep learning with momentum | lr, beta1, beta2 |
| AMSGrad | Yes (per parameter) | Yes | High | Medium | Problems where Adam fails | lr, beta1, beta2 |
| Lion | No (sign-based) | Yes | Low | Fast | Large-scale models | lr, beta1, beta2 |
Optimizer Selection by Problem Type
Computer Vision
# For CNN-based models
optimizer = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
# For large vision models
optimizer = AdamW(learning_rate=0.0001, weight_decay=0.01)
# For vision transformers (using the custom Lion class defined earlier)
optimizer = Lion(learning_rate=0.0001, beta1=0.9, beta2=0.99)
Natural Language Processing
# For transformer models
optimizer = AdamW(learning_rate=0.0001, weight_decay=0.01, beta_1=0.9, beta_2=0.98)
# For RNNs/LSTMs
optimizer = RMSprop(learning_rate=0.001, rho=0.9)
# For large language models (using the custom Lion class defined earlier)
optimizer = Lion(learning_rate=0.00005, beta1=0.9, beta2=0.99)
Reinforcement Learning
# For policy gradient methods
optimizer = Adam(learning_rate=0.0003, beta_1=0.9, beta_2=0.999)
# For Q-learning
optimizer = RMSprop(learning_rate=0.0005, rho=0.95)
# For actor-critic methods
optimizer = Adam(learning_rate=0.0001, epsilon=1e-5)  # Keras spells the argument 'epsilon', not 'eps'
Tabular Data
# For gradient boosting machines
# (Note: Optimizers are typically built into the library)
# XGBoost example:
import xgboost as xgb
model = xgb.XGBClassifier(
    learning_rate=0.1,
    n_estimators=100,
    max_depth=3
)
# For neural networks on tabular data
optimizer = Adam(learning_rate=0.001)
Optimizer Hyperparameters
Key Hyperparameters
| Hyperparameter | Typical Range | Effect | Notes |
|---|---|---|---|
| Learning Rate | 1e-5 to 1e-1 | Step size for parameter updates | Most critical parameter |
| Momentum | 0.8 to 0.99 | Accelerates convergence in consistent directions | For momentum-based optimizers |
| Beta1 | 0.8 to 0.99 | Exponential decay rate for first moment | Adam, AdamW, etc. |
| Beta2 | 0.9 to 0.999 | Exponential decay rate for second moment | Adam, AdamW, etc. |
| Epsilon | 1e-8 to 1e-6 | Numerical stability term | Prevents division by zero |
| Weight Decay | 0.0 to 0.1 | L2 regularization strength | AdamW, Lion |
| Rho | 0.8 to 0.99 | Decay rate for moving average | RMSProp |
Learning Rate Scheduling
Fixed Learning Rate
optimizer = Adam(learning_rate=0.001)
Step Decay
from tensorflow.keras.optimizers.schedules import PiecewiseConstantDecay
lr_schedule = PiecewiseConstantDecay(
    boundaries=[10000, 20000],
    values=[0.001, 0.0005, 0.0001]
)
optimizer = Adam(learning_rate=lr_schedule)
Exponential Decay
from tensorflow.keras.optimizers.schedules import ExponentialDecay
lr_schedule = ExponentialDecay(
    initial_learning_rate=0.001,
    decay_steps=10000,
    decay_rate=0.9
)
optimizer = Adam(learning_rate=lr_schedule)
Cosine Decay
from tensorflow.keras.optimizers.schedules import CosineDecay
lr_schedule = CosineDecay(
    initial_learning_rate=0.001,
    decay_steps=10000
)
optimizer = Adam(learning_rate=lr_schedule)
Cyclical Learning Rates
import tensorflow as tf

# Triangular learning rate policy, implemented as a Keras LearningRateSchedule.
# (Keras optimizers accept a float, a LearningRateSchedule, or a zero-argument
# callable, but not a plain function of the step, so a schedule subclass is
# needed here; get_config is omitted for brevity.)
class TriangularLR(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, min_lr=0.0001, max_lr=0.001, step_size=2000):
        self.min_lr = min_lr
        self.max_lr = max_lr
        self.step_size = step_size

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        cycle = tf.floor(1 + step / (2 * self.step_size))
        x = tf.abs(step / self.step_size - 2 * cycle + 1)
        return self.min_lr + (self.max_lr - self.min_lr) * tf.maximum(0.0, 1 - x)

optimizer = Adam(learning_rate=TriangularLR())
PyTorch Learning Rate Schedulers
from torch.optim.lr_scheduler import StepLR, ExponentialLR, CosineAnnealingLR, ReduceLROnPlateau
# Step decay
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
# Exponential decay
scheduler = ExponentialLR(optimizer, gamma=0.9)
# Cosine annealing
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=0)
# Reduce on plateau
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10)
# Cyclical learning rate
scheduler = optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=0.0001,
    max_lr=0.001,
    step_size_up=2000,
    mode='triangular'
)
Optimizer Theory and Research
Theoretical Foundations
Convergence Analysis
For convex functions, gradient descent converges at rate:
$$ \mathcal{L}(\theta_t) - \mathcal{L}(\theta^*) \leq \frac{\|\theta_0 - \theta^*\|^2}{2\eta t} $$
For strongly convex functions, the rate improves to:
$$ \mathcal{L}(\theta_t) - \mathcal{L}(\theta^*) \leq (1 - \eta \mu)^t \left( \mathcal{L}(\theta_0) - \mathcal{L}(\theta^*) \right) $$
where $\mu$ is the strong convexity parameter.
Momentum Analysis
Momentum helps accelerate convergence in directions of persistent gradient:
$$ \theta_{t+1} = \theta_t - \eta \sum_{k=0}^{t} \gamma^{t-k} \nabla_\theta \mathcal{L}(\theta_k) $$
Adaptive Methods
Adaptive methods like Adam can be viewed as combining:
- Momentum: First moment estimate
- RMSProp: Second moment estimate
- Bias Correction: For initialization
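Bias correction matters because $m_0 = v_0 = 0$, so early estimates are pulled toward zero. A quick numeric check with a constant gradient (illustrative):

# With m starting at 0 and a constant gradient g, the raw first moment is
# biased low, but dividing by (1 - beta1**t) recovers g exactly at every step.
g, beta1, m = 1.0, 0.9, 0.0
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    print(t, round(m, 3), m / (1 - beta1 ** t))  # m = 0.1, 0.19, 0.271; corrected value = 1.0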
Key Research Papers
- "Adam: A Method for Stochastic Optimization" (Kingma & Ba, 2014)
- Introduced Adam optimizer
- Combined momentum and adaptive learning rates
- Demonstrated effectiveness on various tasks
- "On the Variance of the Adaptive Learning Rate and Beyond" (Reddi et al., 2018)
- Identified convergence issues with Adam
- Proposed AMSGrad as a fix
- Theoretical analysis of adaptive methods
- "Decoupled Weight Decay Regularization" (Loshchilov & Hutter, 2017)
- Introduced AdamW optimizer
- Fixed weight decay implementation in Adam
- Improved generalization performance
- "Symbolic Discovery of Optimization Algorithms" (Bello et al., 2022)
- Introduced Lion optimizer
- Used symbolic programming to discover new optimizers
- Demonstrated memory efficiency and good generalization
- "An overview of gradient descent optimization algorithms" (Ruder, 2016)
- Comprehensive survey of optimization algorithms
- Comparison of different approaches
- Practical recommendations
Optimizer Best Practices
When to Use Different Optimizers
SGD with Momentum
- Use When: Training simple models, convex problems
- Advantages: Simple, memory efficient
- Disadvantages: Slow convergence, sensitive to learning rate
Adam
- Use When: Most deep learning tasks
- Advantages: Works well out of the box, adaptive learning rates
- Disadvantages: Can converge to suboptimal solutions
AdamW
- Use When: Modern deep learning architectures
- Advantages: Better regularization, improved generalization
- Disadvantages: Slightly more complex than Adam
RMSProp
- Use When: Recurrent neural networks, non-convex problems
- Advantages: Handles non-stationary objectives
- Disadvantages: Requires tuning of decay rate
Lion
- Use When: Large-scale models, memory-constrained environments
- Advantages: Memory efficient, good generalization
- Disadvantages: Newer, less established
Optimizer Configuration Guidelines
| Model Type | Recommended Optimizer | Learning Rate | Momentum/Betas | Weight Decay | Notes |
|---|---|---|---|---|---|
| CNN (small) | Adam | 0.001 | (0.9, 0.999) | 0.0 | Default settings work well |
| CNN (large) | AdamW | 0.0001 | (0.9, 0.999) | 0.01 | Better regularization |
| Vision Transformer | AdamW or Lion | 0.0001 | (0.9, 0.999) | 0.05 | Higher weight decay |
| RNN/LSTM | RMSProp | 0.001 | rho=0.9 | 0.0 | Handles non-stationary data |
| Transformer (NLP) | AdamW | 0.0001 | (0.9, 0.98) | 0.01 | Different beta2 for NLP |
| Reinforcement Learning | Adam | 0.0003 | (0.9, 0.999) | 0.0 | Lower learning rate |
| GAN | Adam | 0.0002 | (0.5, 0.999) | 0.0 | Lower beta1 for GANs |
| Large Language Model | Lion or AdamW | 0.00005 | (0.9, 0.99) | 0.1 | Very low learning rate |
Learning Rate Selection
Learning Rate Finder
import numpy as np
import matplotlib.pyplot as plt

def find_learning_rate(model, train_data, loss_fn, optimizer, min_lr=1e-6, max_lr=1, num_iter=100):
    """Learning rate finder (PyTorch): sweep the lr log-uniformly and record the loss."""
    lr_schedule = np.logspace(np.log10(min_lr), np.log10(max_lr), num_iter)
    losses = []
    lrs = []
    for i, lr in enumerate(lr_schedule):
        # Update learning rate
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr
        # Get batch
        X_batch, y_batch = next(train_data)
        # Forward pass
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Record
        losses.append(loss.item())
        lrs.append(lr)
        # Early stopping if loss explodes
        if i > 0 and loss.item() > 4 * losses[0]:
            break
    # Plot learning rate vs loss
    plt.figure(figsize=(10, 6))
    plt.plot(lrs, losses)
    plt.xscale('log')
    plt.xlabel('Learning Rate (log scale)')
    plt.ylabel('Loss')
    plt.title('Learning Rate Finder')
    plt.show()
    # Return a conservative learning rate: one order of magnitude
    # below the lr that achieved the minimum loss
    min_loss_idx = np.argmin(losses)
    optimal_lr = lrs[min_loss_idx] / 10
    return optimal_lr
Learning Rate Range Test
def lr_range_test(model, train_loader, optimizer, criterion, device, min_lr=1e-7, max_lr=10, num_iter=200):
    """Learning rate range test for PyTorch."""
    model.train()
    losses = []
    log_lrs = []
    for i, (data, target) in enumerate(train_loader):
        if i >= num_iter:
            break
        # Geometric sweep from min_lr to max_lr; the lr is set directly because
        # LambdaLR would multiply the optimizer's base lr rather than replace it
        lr = min_lr * (max_lr / min_lr) ** (i / num_iter)
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
        log_lrs.append(np.log10(lr))
    # Plot
    plt.figure(figsize=(10, 6))
    plt.plot(log_lrs, losses)
    plt.xlabel('Learning Rate (log10)')
    plt.ylabel('Loss')
    plt.title('Learning Rate Range Test')
    plt.show()
    # Find optimal learning rate: one order of magnitude below the loss minimum
    min_loss = min(losses)
    optimal_idx = losses.index(min_loss)
    optimal_lr = 10 ** log_lrs[optimal_idx] / 10  # divide by 10 for safety
    return optimal_lr
Optimizer Challenges
Common Issues and Solutions
| Issue | Possible Cause | Solution |
|---|---|---|
| Slow convergence | Learning rate too small | Increase learning rate |
| Divergence | Learning rate too large | Decrease learning rate |
| Oscillations | Learning rate too large | Decrease learning rate or add momentum |
| Getting stuck in local minima | Poor initialization | Try different optimizer |
| Overfitting | Insufficient regularization | Add weight decay or dropout; use early stopping |
| Vanishing gradients | Poor optimizer choice | Use adaptive optimizers |
| Exploding gradients | Learning rate too large | Gradient clipping (see the sketch after this table) |
| Poor generalization | Over-optimization | Early stopping |
| Memory issues | Large batch size | Use memory-efficient optimizers |
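For exploding gradients specifically, clipping caps the gradient norm between the backward pass and the optimizer step. A minimal PyTorch sketch (the model, data, and max_norm value are illustrative assumptions):

import torch
import torch.nn as nn

model = nn.Linear(5, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = nn.functional.mse_loss(model(torch.randn(8, 5)), torch.randn(8, 1))
loss.backward()
# Cap the global gradient norm at 1.0 before the update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()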
Debugging Optimizers
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.keras.callbacks import Callback

class OptimizerMonitor(Callback):
    def __init__(self, X_val, y_val, loss_fn):
        super().__init__()
        # self.model is set by Keras when the callback is attached
        self.X_val = X_val
        self.y_val = y_val
        self.loss_fn = loss_fn
        self.loss_history = []
        self.lr_history = []
        self.grad_norms = []

    def on_epoch_end(self, epoch, logs=None):
        # Record validation loss
        y_pred = self.model.predict(self.X_val, verbose=0)
        loss = float(tf.reduce_mean(self.loss_fn(self.y_val, y_pred)))
        self.loss_history.append(loss)
        # Record learning rate (handles both plain values and schedules)
        lr = self.model.optimizer.learning_rate
        if callable(lr):
            lr = lr(self.model.optimizer.iterations)
        self.lr_history.append(float(lr))
        # Record the gradient norm of the first trainable variable, computed
        # on the validation set with a GradientTape (optimizer.get_gradients
        # and model.total_loss are not available in TF2 eager mode)
        with tf.GradientTape() as tape:
            pred = self.model(self.X_val, training=False)
            val_loss = tf.reduce_mean(self.loss_fn(self.y_val, pred))
        grads = tape.gradient(val_loss, self.model.trainable_variables)
        if grads and grads[0] is not None:
            self.grad_norms.append(float(tf.norm(grads[0])))
        # Plot diagnostics
        if epoch % 10 == 0:
            self.plot_diagnostics()

    def plot_diagnostics(self):
        fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 5))
        # Loss history
        ax1.plot(self.loss_history)
        ax1.set_title('Validation Loss')
        ax1.set_xlabel('Epoch')
        ax1.set_ylabel('Loss')
        # Learning rate history
        ax2.plot(self.lr_history)
        ax2.set_title('Learning Rate')
        ax2.set_xlabel('Epoch')
        ax2.set_ylabel('Learning Rate')
        ax2.set_yscale('log')
        # Gradient norms
        if self.grad_norms:
            ax3.plot(self.grad_norms)
            ax3.set_title('Gradient Norm (First Variable)')
            ax3.set_xlabel('Epoch')
            ax3.set_ylabel('Gradient Norm')
            ax3.set_yscale('log')
        plt.tight_layout()
        plt.show()

# Example usage
monitor = OptimizerMonitor(X_val, y_val, tf.keras.losses.mean_squared_error)
model.fit(X_train, y_train, epochs=100, validation_data=(X_val, y_val), callbacks=[monitor])
Optimizer in Practice
Optimizer for Different Tasks
Image Classification
# For CNN models
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import ExponentialDecay

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])
# Adam optimizer with learning rate scheduling
initial_learning_rate = 0.001
lr_schedule = ExponentialDecay(
    initial_learning_rate,
    decay_steps=10000,
    decay_rate=0.9,
    staircase=True
)
optimizer = Adam(learning_rate=lr_schedule)
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
Natural Language Processing
from transformers import TFBertForSequenceClassification
from tensorflow.keras.optimizers import AdamW  # Keras AdamW (TF >= 2.11); transformers' own AdamW is a deprecated PyTorch class

# For transformer models
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')
# AdamW optimizer with weight decay
optimizer = AdamW(
    learning_rate=2e-5,
    weight_decay=0.01,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-8
)
# The model outputs logits, so the loss must be built with from_logits=True
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
Reinforcement Learning
# For DQN (state_dim and action_dim depend on the environment)
state_dim, action_dim = 4, 2  # e.g. CartPole-style dimensions
model = Sequential([
    Dense(64, activation='relu', input_dim=state_dim),
    Dense(64, activation='relu'),
    Dense(action_dim, activation='linear')
])
# RMSProp optimizer
optimizer = RMSprop(learning_rate=0.00025, rho=0.95, epsilon=1e-6)
model.compile(optimizer=optimizer, loss='mse')
Generative Adversarial Networks
# latent_dim and img_dim are task-specific; example values for MNIST-sized images
from tensorflow.keras.layers import BatchNormalization
latent_dim, img_dim = 100, 784

# Generator
generator = Sequential([
    Dense(256, activation='relu', input_dim=latent_dim),
    BatchNormalization(),
    Dense(512, activation='relu'),
    BatchNormalization(),
    Dense(1024, activation='relu'),
    BatchNormalization(),
    Dense(img_dim, activation='tanh')
])
# Discriminator
discriminator = Sequential([
    Dense(512, activation='relu', input_dim=img_dim),
    Dense(256, activation='relu'),
    Dense(1, activation='sigmoid')
])
# Adam optimizer with lower beta1 for GANs
optimizer = Adam(learning_rate=0.0002, beta_1=0.5, beta_2=0.999)
Optimizer Workflow
- Problem Analysis: Understand the problem and model architecture
- Optimizer Selection: Choose appropriate optimizer based on problem type
- Hyperparameter Initialization: Set learning rate, momentum, etc.
- Learning Rate Scheduling: Configure learning rate schedule
- Training: Train model with selected optimizer
- Monitoring: Track loss, gradients, and other metrics
- Diagnosis: Analyze convergence and performance
- Iteration: Adjust optimizer parameters if needed
- Final Model: Train with optimal optimizer configuration
Optimizer and Model Architecture
- Convolutional Neural Networks: Adam, AdamW, or SGD with momentum
- Recurrent Neural Networks: RMSProp or Adam
- Transformers: AdamW with custom learning rate schedules
- Graph Neural Networks: Adam or AdamW
- Autoencoders: Adam or RMSProp
- Generative Models: Adam with lower beta1 for GANs
Future Directions
- Automated Optimizer Selection: AutoML for optimizer configuration
- Neural Optimizers: Optimizers parameterized by neural networks
- Meta-Learning Optimizers: Optimizers that learn to optimize
- Memory-Efficient Optimizers: Optimizers for large-scale models
- Quantum Optimizers: Optimizers for quantum machine learning
- Explainable Optimizers: Interpretable optimization processes
- Adaptive Optimizer Ensembles: Combining multiple optimizers
- Optimizer-Aware Architecture Search: Co-design of models and optimizers
External Resources
- Adam: A Method for Stochastic Optimization (arXiv)
- Decoupled Weight Decay Regularization (arXiv)
- Symbolic Discovery of Optimization Algorithms (arXiv)
- An overview of gradient descent optimization algorithms (arXiv)
- Keras Optimizers Documentation
- PyTorch Optimizers Documentation
- Optimizers in TensorFlow Documentation
- Deep Learning Book - Optimization Chapter
- Practical Recommendations for Gradient-Based Training (arXiv)