Learning Rate
What is Learning Rate?
The learning rate is a hyperparameter that sets the step size an optimizer uses when adjusting a machine learning model's parameters during training. It controls how much the weights change in response to the estimated error on each update, directly influencing both the speed and the quality of learning.
Key Characteristics
- Step Size Control: Determines magnitude of parameter updates
- Convergence Speed: Affects how quickly model learns
- Optimization Stability: Balances between fast learning and overshooting
- Hyperparameter: Must be set before training begins
- Problem-Specific: Optimal value depends on model and data
- Dynamic Adjustment: Can be fixed or adaptive during training
- Critical Impact: One of the most important hyperparameters
How Learning Rate Works
- Gradient Calculation: Compute gradient of loss function
- Parameter Update: Adjust weights based on gradient
- Step Size Application: Scale update by learning rate
- Iteration: Repeat until convergence or stopping criteria
Learning Rate Process Diagram
Compute Gradient → Scale by Learning Rate → Update Parameters → Check Convergence → (loop back to Compute Gradient until converged)
Mathematical Foundations
Basic Gradient Descent Update
The fundamental update rule with learning rate $\eta$:
$$ \theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t) $$
where:
- $\theta_t$ are model parameters at time $t$
- $\eta$ is the learning rate
- $\nabla_\theta \mathcal{L}(\theta_t)$ is the gradient of the loss function
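As a minimal illustration of this update rule, the sketch below runs plain gradient descent on a hypothetical quadratic loss (the names `loss` and `grad` are illustrative, not from any specific library):
import numpy as np

def loss(theta):
    # Hypothetical quadratic loss: L(theta) = ||theta - 3||^2, minimized at theta = [3, 3]
    return np.sum((theta - 3.0) ** 2)

def grad(theta):
    # Gradient of the quadratic loss above
    return 2.0 * (theta - 3.0)

eta = 0.1               # learning rate
theta = np.zeros(2)     # initial parameters
for t in range(100):
    theta = theta - eta * grad(theta)   # theta_{t+1} = theta_t - eta * grad L(theta_t)

print(theta)  # approaches the minimizer [3., 3.]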
Learning Rate and Convergence
For an $L$-smooth convex function minimized with a fixed step size $\eta \leq 1/L$, gradient descent satisfies:
$$ \mathcal{L}(\theta_t) - \mathcal{L}(\theta^*) \leq \frac{\|\theta_0 - \theta^*\|^2}{2\eta t} $$
where $\theta^*$ is the optimal solution.
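As a hedged numeric check of this bound, the sketch below runs gradient descent on the 1-smooth quadratic $\mathcal{L}(\theta) = \tfrac{1}{2}\theta^2$ (so $\theta^* = 0$) and asserts the inequality at every step:
eta = 0.1        # step size satisfying eta <= 1/L with L = 1
theta0 = 5.0
theta = theta0
for t in range(1, 21):
    theta = theta - eta * theta                    # gradient of 0.5 * theta**2 is theta
    gap = 0.5 * theta ** 2                         # L(theta_t) - L(theta*)
    bound = (theta0 - 0.0) ** 2 / (2 * eta * t)    # right-hand side of the bound
    assert gap <= bound, "convergence bound violated"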
Learning Rate in Adaptive Optimizers
Adaptive optimizers like Adam use per-parameter learning rates:
$$ \theta_{t+1, i} = \theta_{t, i} - \frac{\eta}{\sqrt{\hat{v}_{t, i}} + \epsilon} \hat{m}_{t, i} $$
where $\hat{m}_{t, i}$ and $\hat{v}_{t, i}$ are bias-corrected estimates of the first and second moments of the gradient for parameter $i$.
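The sketch below implements a single Adam step in NumPy to make the per-parameter scaling explicit (a minimal illustration, not a replacement for a framework optimizer; the default $\beta_1$, $\beta_2$, and $\epsilon$ values are assumed):
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are the running first/second moment estimates."""
    m = beta1 * m + (1 - beta1) * grad                      # first moment
    v = beta2 * v + (1 - beta2) * grad ** 2                 # second moment
    m_hat = m / (1 - beta1 ** t)                            # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)    # per-parameter effective step
    return theta, m, v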
Learning Rate Values and Effects
Learning Rate Spectrum
| Learning Rate Range | Effect on Training | Typical Outcome |
|---|---|---|
| Too small (<1e-6) | Extremely slow convergence | May never reach optimal solution |
| Small (1e-5 to 1e-3) | Slow but stable convergence | Good final performance, slow training |
| Optimal (1e-3 to 1e-1) | Balanced convergence speed and stability | Best trade-off |
| Large (1e-1 to 1) | Fast initial progress, unstable training | May overshoot, oscillate, diverge |
| Too large (>1) | Usually diverges within a few updates | Training typically fails |
Visualization of Learning Rate Effects
import matplotlib.pyplot as plt
import numpy as np

def loss_landscape(x):
    return x**4 - 4*x**2 + 5  # Simple non-convex function with minima at x = ±sqrt(2)

def gradient(x):
    return 4*x**3 - 8*x

x = np.linspace(-2.5, 2.5, 100)
plt.figure(figsize=(15, 8))

# Plot the loss landscape in the first panel of a 2x3 grid
plt.subplot(2, 3, 1)
plt.plot(x, loss_landscape(x))
plt.title('Loss Landscape')
plt.xlabel('Parameter Value')
plt.ylabel('Loss')

# Run plain gradient descent from the same starting point with different learning rates
learning_rates = [0.01, 0.1, 0.2, 0.5]
start = 2.0
for i, lr in enumerate(learning_rates):
    plt.subplot(2, 3, i + 2)        # panels 2-5 of the 2x3 grid
    current = np.float64(start)     # NumPy float overflows to inf/NaN instead of raising if the run diverges
    trajectory = [current]
    for _ in range(20):
        current = current - lr * gradient(current)
        trajectory.append(current)
    trajectory = np.array(trajectory)
    plt.plot(x, loss_landscape(x))
    plt.scatter(trajectory, loss_landscape(trajectory), c=range(len(trajectory)), cmap='viridis')
    plt.plot(trajectory, loss_landscape(trajectory), 'r--')
    plt.xlim(-2.5, 2.5)             # keep the axes on the landscape even if the trajectory diverges
    plt.ylim(0, 20)
    plt.title(f'LR = {lr}')
    plt.xlabel('Parameter Value')
    plt.ylabel('Loss')

plt.tight_layout()
plt.show()
Learning Rate Selection Strategies
Fixed Learning Rate
# Simple fixed learning rate
from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=0.001)
Learning Rate Schedules
Step Decay
from tensorflow.keras.optimizers.schedules import PiecewiseConstantDecay
lr_schedule = PiecewiseConstantDecay(
boundaries=[10000, 20000],
values=[0.001, 0.0005, 0.0001]
)
optimizer = Adam(learning_rate=lr_schedule)
Exponential Decay
from tensorflow.keras.optimizers.schedules import ExponentialDecay
lr_schedule = ExponentialDecay(
initial_learning_rate=0.001,
decay_steps=10000,
decay_rate=0.9
)
optimizer = Adam(learning_rate=lr_schedule)
Cosine Decay
from tensorflow.keras.optimizers.schedules import CosineDecay
lr_schedule = CosineDecay(
initial_learning_rate=0.001,
decay_steps=10000
)
optimizer = Adam(learning_rate=lr_schedule)
Cyclical Learning Rates
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

max_lr = 0.001
min_lr = 0.0001
step_size = 2000  # half-cycle length in optimizer steps

class TriangularLR(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Triangular cyclical learning rate (Smith, 2017)."""
    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        cycle = tf.floor(1 + step / (2 * step_size))
        x = tf.abs(step / step_size - 2 * cycle + 1)
        return min_lr + (max_lr - min_lr) * tf.maximum(0.0, 1 - x)

optimizer = Adam(learning_rate=TriangularLR())
Adaptive Learning Rates
Optimizer-Based Adaptation
# Adam automatically adapts the effective learning rate per parameter
from tensorflow.keras.optimizers import Adam, RMSprop

optimizer = Adam(learning_rate=0.001)  # Initial (global) learning rate

# RMSProp also adapts per-parameter learning rates
optimizer = RMSprop(learning_rate=0.001)
Learning Rate Finder
import numpy as np
import matplotlib.pyplot as plt
import torch.optim as optim

def find_learning_rate(model, train_loader, optimizer, criterion, device, min_lr=1e-7, max_lr=10, num_iter=100):
    """Learning rate range test: exponentially increase the LR and record the loss."""
    model.train()
    losses = []
    log_lrs = []
    # LambdaLR multiplies the optimizer's base LR by the returned factor,
    # so set the base LR to min_lr and grow it exponentially towards max_lr.
    for group in optimizer.param_groups:
        group['lr'] = min_lr
    lr_lambda = lambda x: (max_lr / min_lr) ** (x / num_iter)
    scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    for i, (data, target) in enumerate(train_loader):
        if i >= num_iter:
            break
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
        log_lrs.append(np.log10(optimizer.param_groups[0]['lr']))
        scheduler.step()
    # Heuristic: take the LR at the minimum loss and divide by 10 for safety
    min_loss = min(losses)
    optimal_idx = losses.index(min_loss)
    optimal_lr = 10 ** log_lrs[optimal_idx] / 10
    # Plot loss vs. learning rate
    plt.figure(figsize=(10, 6))
    plt.plot(log_lrs, losses)
    plt.xlabel('Learning Rate (log10)')
    plt.ylabel('Loss')
    plt.title('Learning Rate Range Test')
    plt.show()
    return optimal_lr
Learning Rate for Different Models
Computer Vision
# CNN models
from tensorflow.keras.optimizers import Adam, AdamW
from tensorflow.keras.optimizers.schedules import ExponentialDecay

initial_learning_rate = 0.001
lr_schedule = ExponentialDecay(
    initial_learning_rate,
    decay_steps=10000,
    decay_rate=0.9,
    staircase=True
)
optimizer = Adam(learning_rate=lr_schedule)

# Vision Transformers
optimizer = AdamW(learning_rate=0.0001, weight_decay=0.05)
Natural Language Processing
# Transformer models
from tensorflow.keras.optimizers import AdamW

optimizer = AdamW(
    learning_rate=2e-5,
    weight_decay=0.01,
    beta_1=0.9,
    beta_2=0.98  # beta_2 = 0.98 is common for transformer training
)

# "Noam" schedule from the original Transformer: linear warmup, then inverse-square-root decay
def transformer_lr_schedule(step, warmup_steps=4000, model_size=512):
    step = max(step, 1)  # avoid division by zero at step 0
    return (model_size ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)
Reinforcement Learning
# Policy gradient methods
optimizer = Adam(learning_rate=0.0003)
# Q-learning
optimizer = RMSprop(learning_rate=0.0005, rho=0.95)
Tabular Data
# Gradient boosting machines
import xgboost as xgb
model = xgb.XGBClassifier(learning_rate=0.1) # Typically higher for GBMs
# Neural networks for tabular data
optimizer = Adam(learning_rate=0.001)
Learning Rate Best Practices
Selection Guidelines
| Model Type | Recommended Learning Rate | Schedule Type | Notes |
|---|---|---|---|
| CNN | 0.001 - 0.0001 | Exponential decay | Start with 0.001 |
| Vision Transformer | 0.0001 - 0.00001 | Cosine decay | Lower learning rates |
| RNN/LSTM | 0.001 - 0.0001 | Fixed or step decay | RMSProp often works better |
| Transformer (NLP) | 2e-5 - 5e-5 | Warmup + decay | Custom schedules common |
| GAN | 0.0002 | Fixed | Lower than standard |
| Reinforcement Learning | 0.0003 - 0.0001 | Fixed | Lower learning rates |
| Gradient Boosting | 0.01 - 0.3 | Fixed | Higher than neural networks |
Common Pitfalls and Solutions
| Issue | Cause | Solution |
|---|---|---|
| Slow convergence | Learning rate too small | Increase learning rate |
| Divergence | Learning rate too large | Decrease learning rate |
| Oscillations | Learning rate too large | Decrease learning rate or add momentum |
| Getting stuck in local minima | Learning rate too small | Increase learning rate or use schedule |
| Overfitting | Learning rate too large | Decrease learning rate or add regularization |
| Poor generalization | Learning rate schedule issues | Use proper learning rate scheduling |
| Training instability | Learning rate too large | Use gradient clipping (see the sketch below) |
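For the last row, most frameworks expose gradient clipping directly on the optimizer; for example, Keras optimizers accept `clipnorm` or `clipvalue` arguments (the values below are illustrative):
from tensorflow.keras.optimizers import Adam

# Clip each gradient tensor so its L2 norm does not exceed 1.0 before the update
optimizer = Adam(learning_rate=0.001, clipnorm=1.0)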
Learning Rate and Batch Size
The relationship between learning rate and batch size is important:
- Linear Scaling Rule: When increasing batch size by factor $k$, increase learning rate by same factor
- Square Root Scaling: Some evidence suggests square root scaling works better
- Practical Guideline: For batch size $B$, try learning rate $\eta_{\text{base}} \times B / B_{\text{base}}$, where $\eta_{\text{base}}$ was tuned at a reference batch size $B_{\text{base}}$ (e.g., 256)
base_lr = 0.001
base_batch_size = 32
current_batch_size = 256
# Linear scaling
scaled_lr = base_lr * current_batch_size / base_batch_size
optimizer = Adam(learning_rate=scaled_lr)
Learning Rate Research
Key Research Papers
- "Cyclical Learning Rates for Training Neural Networks" (Smith, 2017)
- Introduced cyclical learning rates
- Demonstrated improved training speed and performance
- Proposed learning rate range test
- "Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates" (Smith & Topin, 2019)
- Showed that very large learning rates can work with proper scheduling
- Introduced 1cycle policy
- Demonstrated faster convergence
- "A Disciplined Approach to Neural Network Hyperparameters" (Smith, 2018)
- Comprehensive guidelines for hyperparameter selection
- Emphasized learning rate importance
- Practical recommendations
- "The Marginal Value of Adaptive Gradient Methods in Machine Learning" (Wilson et al., 2017)
- Compared adaptive vs non-adaptive methods
- Showed adaptive methods can generalize worse
- Recommended SGD with momentum for many tasks
- "Large Batch Training of Convolutional Networks" (Goyal et al., 2017)
- Demonstrated linear scaling rule
- Showed how to train with large batches
- Practical guidelines for distributed training
Learning Rate in Practice
Learning Rate Workflow
- Initial Selection: Choose based on model type and experience
- Range Testing: Perform learning rate range test
- Schedule Selection: Choose appropriate schedule
- Training: Train model with selected learning rate
- Monitoring: Track loss and metrics
- Adjustment: Modify learning rate if needed
- Final Training: Train with optimal configuration
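The Monitoring and Adjustment steps are often automated with a callback that lowers the learning rate when a tracked metric stops improving; a minimal Keras sketch (the model and data names are placeholders):
import tensorflow as tf

# Halve the learning rate if validation loss has not improved for 3 epochs
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=3, min_lr=1e-6
)
# model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=[reduce_lr])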
Learning Rate and Regularization
Learning rate interacts with regularization techniques:
- Weight Decay: Higher learning rates may require stronger weight decay
- Dropout: Can allow higher learning rates
- Early Stopping: Complements learning rate scheduling
- Batch Normalization: Allows higher learning rates
# Learning rate with weight decay (decoupled, as in AdamW)
from tensorflow.keras.optimizers import Adam, AdamW
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

optimizer = AdamW(learning_rate=0.001, weight_decay=0.01)

# Learning rate with dropout
model = Sequential([
    Dense(128, activation='relu'),
    Dropout(0.5),  # Regularization that can tolerate a higher learning rate
    Dense(64, activation='relu'),
    Dense(10, activation='softmax')
])
optimizer = Adam(learning_rate=0.001)
Learning Rate and Model Architecture
Different architectures respond differently to learning rates, and different parts of the same model can warrant different rates (see the sketch after this list):
- Deep Networks: Often require lower learning rates
- Wide Networks: Can tolerate higher learning rates
- Residual Networks: Benefit from higher learning rates
- Transformers: Require custom learning rate schedules
- RNNs: Often need lower learning rates
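One practical way to act on these differences is to give separate parts of a model their own learning rates; a hedged PyTorch sketch using parameter groups (the backbone/head split and the layer sizes are hypothetical):
import torch
import torch.nn as nn

# Hypothetical two-part model: a feature extractor ("backbone") and a classifier ("head")
backbone = nn.Linear(128, 64)
head = nn.Linear(64, 10)

# Deeper or pretrained parts often need a lower rate; a freshly initialized head can use a higher one
optimizer = torch.optim.Adam([
    {'params': backbone.parameters(), 'lr': 1e-4},
    {'params': head.parameters(), 'lr': 1e-3},
])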
Future Directions
- Automated Learning Rate Selection: AutoML for optimal learning rate discovery
- Neural Learning Rate Schedules: Learning rate schedules parameterized by neural networks
- Adaptive Learning Rate Ensembles: Combining multiple learning rate strategies
- Learning Rate Transfer: Transferring optimal learning rates across tasks
- Quantum Learning Rates: Learning rates for quantum machine learning
- Explainable Learning Rates: Interpretable learning rate strategies
- Learning Rate-Aware Architecture Search: Co-design of models and learning rates
External Resources
- Cyclical Learning Rates for Training Neural Networks (arXiv)
- Super-Convergence: Very Fast Training of Neural Networks (arXiv)
- A Disciplined Approach to Neural Network Hyperparameters (arXiv)
- The Marginal Value of Adaptive Gradient Methods (arXiv)
- Keras Learning Rate Schedules Documentation
- PyTorch Learning Rate Schedulers Documentation
- Learning Rate Finder Implementation
- Fast.ai Learning Rate Lesson