๐ฏ Learning Rate โ The gas pedal of your neural network! ๐๐จ
๐ Definition
Learning Rate = the speed at which your neural network learns! Too slow = takes forever. Too fast = crashes into walls and explodes! It's the most critical hyperparameter that controls how big the steps are during gradient descent.
Principle:
- Step size: how much to adjust weights after each batch
- Gradient descent: Learning Rate ร Gradient = weight update
- Balancing act: fast enough to converge, slow enough to be stable
- Schedules: adaptive strategies (decay, warm-up, cosine)
- Make or break: wrong LR = total failure, right LR = magic! ๐ฅ
โก Advantages / Disadvantages / Limitations
โ Advantages (when well tuned)
- Controls convergence speed: right LR = fast training
- Prevents instability: small LR = stable gradients
- Simple concept: one number to rule them all
- Universal: works for any gradient-based optimization
- Huge impact: 10x better results with right LR
โ Disadvantages
- Hard to tune: trial and error, time-consuming
- Problem dependent: LR for MNIST โ LR for ImageNet
- Layer dependent: early layers need different LR than late layers
- Batch size dependent: larger batch = can use larger LR
- Architecture dependent: ResNet LR โ Transformer LR
โ ๏ธ Limitations
- One size doesn't fit all: needs adjustment per problem
- No universal value: 0.001 works sometimes, not always
- Sensitive: 0.01 vs 0.001 = completely different results
- Requires monitoring: loss exploding/flatline = wrong LR
- Replaced by adaptive: Adam/AdamW auto-adjust (but still need base LR)
๐ ๏ธ Practical Tutorial: My Real Case
๐ Setup
- Model: ResNet-18 on CIFAR-10
- Dataset: 50k train images, 10k test images
- Hardware: GTX 1080 Ti 11GB (batch size 128 fits perfectly!)
- Optimizer: SGD with momentum (0.9)
- Epochs: 100
๐ Results Obtained
Learning Rate = 0.1 (TOO HIGH):
- Epoch 1: Loss = 2.3 โ 5.8 โ NaN
- Accuracy: 10% (random guessing)
- Status: EXPLODED โ๐ฅ
Learning Rate = 0.01 (GOOD):
- Epoch 1: Loss = 2.3 โ 1.8
- Epoch 50: Loss = 0.4, Acc = 85%
- Epoch 100: Loss = 0.2, Acc = 91%
- Status: CONVERGED โ
Learning Rate = 0.001 (TOO LOW):
- Epoch 1: Loss = 2.3 โ 2.28
- Epoch 50: Loss = 1.2, Acc = 65%
- Epoch 100: Loss = 0.8, Acc = 74%
- Status: TOO SLOW, STUCK โ๐
Learning Rate = 0.01 with Cosine Decay:
- Starts at 0.01, decays smoothly
- Epoch 100: Loss = 0.15, Acc = 93%
- Best performance! โ
๐
Learning Rate Warm-up (0 โ 0.01 over 5 epochs):
- Epoch 1: LR = 0.002 (gentle start)
- Epoch 5: LR = 0.01 (full speed)
- Final: Acc = 92.5%
- More stable early training โ
๐งช Real-world Testing on GTX 1080 Ti
ResNet-18 CIFAR-10 (GTX 1080 Ti):
- Batch size 128: 11GB VRAM โ 8.5GB used
- LR = 0.01: 180 it/s, converges epoch 80
- LR = 0.1: crashes epoch 3 (NaN loss)
- LR = 0.001: 180 it/s, stuck at 74% acc
Transformer (GPT-2 Small) on GTX 1080 Ti:
- Batch size 8: 10.8GB VRAM used
- LR = 0.0001: stable, converges
- LR = 0.001: diverges after epoch 2
- Warm-up crucial: 0 โ 0.0001 over 1000 steps
GAN Training (StyleGAN2):
- Generator LR = 0.002
- Discriminator LR = 0.0002 (10x lower!)
- Balance crucial: same LR = mode collapse
- GTX 1080 Ti: batch size 16 max
Verdict: ๐ฏ LEARNING RATE = MOST CRITICAL HYPERPARAMETER
๐ก Concrete Examples
Visual metaphor: Driving down a mountain
LR = 0.1 (Too high) ๐๐ฅ
You're driving at 200 km/h down a winding mountain road
โ Miss the turns
โ Fly off the cliff
โ CRASH AND BURN
โ Loss = NaN
LR = 0.01 (Perfect) ๐โ
You're driving at 60 km/h
โ Take the curves safely
โ Reach the bottom smoothly
โ Optimal convergence
โ Loss โ minimum
LR = 0.001 (Too low) ๐๐
You're driving at 5 km/h
โ You'll reach the bottom... eventually
โ But it takes FOREVER
โ Might get stuck in a pothole (local minimum)
โ Waste of time
Learning Rate Schedules
Step Decay ๐
Epochs 1-30: LR = 0.01
Epochs 31-60: LR = 0.001 (รท10)
Epochs 61-90: LR = 0.0001 (รท10)
Epochs 91-100: LR = 0.00001 (รท10)
Effect: Big steps at start, refinement at end
Exponential Decay ๐
LR(epoch) = initial_LR ร decay_rate^epoch
Example:
Epoch 0: LR = 0.01
Epoch 10: LR = 0.01 ร 0.95^10 = 0.006
Epoch 50: LR = 0.01 ร 0.95^50 = 0.0008
Effect: Smooth continuous decrease
Cosine Annealing ๐
LR(epoch) = min_LR + 0.5 ร (max_LR - min_LR) ร (1 + cos(ฯ ร epoch / max_epochs))
Example (max_LR=0.01, min_LR=0.0001):
Epoch 0: LR = 0.01
Epoch 25: LR = 0.007
Epoch 50: LR = 0.005
Epoch 75: LR = 0.002
Epoch 100: LR = 0.0001
Effect: Smooth wave-like decrease, popular in modern training
Warm-up + Cosine ๐ฅ
Phase 1 (Warm-up): Linear increase 0 โ max_LR
Phase 2 (Cosine): Smooth decrease max_LR โ min_LR
Epochs 1-5: 0 โ 0.01 (warm-up)
Epochs 6-100: 0.01 โ 0.0001 (cosine)
Effect: Gentle start, stable convergence
Used by: BERT, GPT, modern Transformers
Real applications
Computer Vision (ResNet, EfficientNet) ๐ธ
- Base LR: 0.01-0.1 with SGD
- Schedule: Cosine or Step Decay
- Warm-up: 5-10 epochs
- Batch size: as large as VRAM allows
NLP (BERT, GPT, Transformers) ๐
- Base LR: 0.0001-0.001 with Adam
- Schedule: Linear decay or Cosine
- Warm-up: CRITICAL (1000-10000 steps)
- Gradient clipping: max_norm=1.0
GANs (StyleGAN, DCGAN) ๐จ
- Generator: 0.002
- Discriminator: 0.0002 (lower!)
- Optimizer: Adam with ฮฒ1=0.0, ฮฒ2=0.99
- No warm-up, constant or slight decay
Reinforcement Learning (PPO, DQN) ๐ฎ
- Base LR: 0.0001-0.001
- Decay: often constant (no schedule)
- Optimizer: Adam
- Highly sensitive to LR
๐ Cheat Sheet: Learning Rate
๐ Symptoms & Solutions
Loss explodes (โ NaN) ๐ฅ
Symptom: Loss goes 2.3 โ 10.5 โ 500 โ NaN
Cause: Learning Rate TOO HIGH
Solution: Divide LR by 10 (0.01 โ 0.001)
Loss barely decreases ๐
Symptom: Loss goes 2.3 โ 2.28 โ 2.25 (super slow)
Cause: Learning Rate TOO LOW
Solution: Multiply LR by 10 (0.0001 โ 0.001)
Loss oscillates wildly ๐ข
Symptom: Loss goes 1.5 โ 0.8 โ 2.1 โ 1.2 (chaos)
Cause: LR too high OR batch size too small
Solution: Reduce LR or increase batch size
Stuck in plateau ๐๏ธ
Symptom: Loss stuck at 0.5 for 20 epochs
Cause: LR too low to escape local minimum
Solution: Learning Rate schedule (decay) or increase LR
โ๏ธ Recommended Starting Values
SGD (no momentum):
- Simple: 0.01
- Complex: 0.001
SGD with momentum:
- CV (ResNet, VGG): 0.01-0.1
- Decay: Cosine or Step รท10 every 30 epochs
Adam/AdamW:
- NLP (BERT, GPT): 0.0001-0.001
- CV (ViT): 0.001-0.003
- Small models: 0.001
- Large models (GPT-3): 0.00001
RMSprop:
- Default: 0.001
- GANs: 0.0002
Batch size scaling:
- LR scales with sqrt(batch_size)
- Batch 32 โ LR = 0.001
- Batch 128 โ LR = 0.002
- Batch 512 โ LR = 0.004
๐ ๏ธ LR Finder Trick
1. Start with tiny LR (0.000001)
2. Increase exponentially each batch (ร1.1)
3. Plot loss vs LR
4. Find where loss drops fastest
5. Use LR slightly before minimum
Example plot:
LR 0.00001: Loss = 2.3
LR 0.0001: Loss = 2.3
LR 0.001: Loss = 1.5 โ Drops fast!
LR 0.01: Loss = 0.8 โ Sweet spot
LR 0.1: Loss = 3.5 โ Too high
Choose: 0.01 (or 0.005 to be safe)
๐ป Simplified Concept (minimal code)
import torch
import torch.nn as nn
import torch.optim as optim
# Learning Rate comparison - ultra-simple
class LearningRateComparison:
def __init__(self, model):
self.model = model
self.criterion = nn.CrossEntropyLoss()
def train_with_lr(self, train_loader, lr, epochs):
"""Train with specific Learning Rate"""
optimizer = optim.SGD(
self.model.parameters(),
lr=lr,
momentum=0.9
)
for epoch in range(epochs):
total_loss = 0
for inputs, labels in train_loader:
optimizer.zero_grad()
outputs = self.model(inputs)
loss = self.criterion(outputs, labels)
loss.backward()
optimizer.step()
total_loss += loss.item()
avg_loss = total_loss / len(train_loader)
print(f"Epoch {epoch+1}, LR={lr}, Loss={avg_loss:.4f}")
def train_with_schedule(self, train_loader, initial_lr, epochs):
"""Train with Cosine Annealing schedule"""
optimizer = optim.SGD(
self.model.parameters(),
lr=initial_lr,
momentum=0.9
)
# Cosine Annealing scheduler
scheduler = optim.lr_scheduler.CosineAnnealingLR(
optimizer,
T_max=epochs,
eta_min=initial_lr * 0.01
)
for epoch in range(epochs):
for inputs, labels in train_loader:
optimizer.zero_grad()
outputs = self.model(inputs)
loss = self.criterion(outputs, labels)
loss.backward()
optimizer.step()
current_lr = scheduler.get_last_lr()[0]
print(f"Epoch {epoch+1}, LR={current_lr:.6f}")
scheduler.step()
def train_with_warmup(self, train_loader, target_lr, warmup_epochs, total_epochs):
"""Train with warm-up + cosine decay"""
optimizer = optim.SGD(
self.model.parameters(),
lr=target_lr * 0.1,
momentum=0.9
)
for epoch in range(total_epochs):
# Warm-up phase
if epoch < warmup_epochs:
lr = target_lr * (epoch + 1) / warmup_epochs
for param_group in optimizer.param_groups:
param_group['lr'] = lr
for inputs, labels in train_loader:
optimizer.zero_grad()
outputs = self.model(inputs)
loss = self.criterion(outputs, labels)
loss.backward()
optimizer.step()
print(f"Epoch {epoch+1}, LR={optimizer.param_groups[0]['lr']:.6f}")
# Usage comparison on GTX 1080 Ti
model = ResNet18()
# Test 1: Too high
trainer = LearningRateComparison(model)
trainer.train_with_lr(train_loader, lr=0.1, epochs=10)
# Test 2: Perfect
trainer.train_with_lr(train_loader, lr=0.01, epochs=100)
# Test 3: With schedule (best!)
trainer.train_with_schedule(train_loader, initial_lr=0.01, epochs=100)
The key concept: Learning Rate controls how fast you update weights. Too fast = unstable, too slow = inefficient. Modern approach: start high, decay smoothly (cosine annealing) with warm-up for stability! ๐ฏ
๐ Summary
Learning Rate = gas pedal of neural networks! Controls step size during gradient descent. Too high = explosion, too low = snail pace. Modern training uses schedules (cosine annealing, step decay) and warm-up for stability. Most critical hyperparameter: right LR = convergence, wrong LR = total failure. On GTX 1080 Ti, typical values: 0.01 for CV (SGD), 0.0001 for NLP (Adam)! ๐๐จ
๐ฏ Conclusion
Learning Rate is the single most important hyperparameter in deep learning. A wrong LR can make even the best architecture fail completely. Too high = divergence, too low = waste of time. Modern techniques (warm-up, cosine annealing, adaptive optimizers) have made training more robust, but you still need to tune the base LR. Rule of thumb: start with standard values (0.01 for SGD, 0.001 for Adam), use LR finder to refine, add schedule for best results. On GTX 1080 Ti, batch size affects optimal LR, so experiment! The difference between 91% and 93% accuracy is often just the right LR! ๐๐ฅ
โ Questions & Answers
Q: My training loss goes to NaN after a few epochs, what's happening? A: Your Learning Rate is way too high! The gradients are exploding. Divide your LR by 10 (if you used 0.01, try 0.001). Also check: (1) Gradient clipping (clip max norm to 1.0), (2) Batch normalization in your architecture, (3) Weight initialization (use Xavier or He initialization). If it still happens, your data might have extreme outliers or your architecture is unstable.
Q: How do I know if my Learning Rate is too low? A: Look at the loss curve: if it's decreasing super slowly (2.3 โ 2.28 โ 2.25 over 10 epochs), your LR is too low. Also, if you're stuck in the same accuracy for many epochs, try increasing LR by 5-10x. Use LR finder to identify the sweet spot: plot loss vs LR and pick where loss drops fastest!
Q: Should I use the same Learning Rate for all layers? A: Not always! Modern approaches use layer-wise LR: (1) Transfer learning: lower LR for pretrained layers (0.0001), higher for new head (0.001), (2) Discriminative fine-tuning: each layer group gets different LR, (3) BERT-style: layer decay (lower layers = lower LR). For training from scratch, same LR usually works. But for fine-tuning, definitely use different LRs!
๐ค Did You Know?
The Learning Rate was first identified as critical in the 1980s during early neural network research, but it became truly famous in 2012 when AlexNet won ImageNet. The team discovered that using LR=0.01 with momentum and dividing by 10 every 30 epochs was the secret sauce! Before that, researchers used constant LR and wondered why training was so unstable. Fun fact: Geoffrey Hinton (one of the godfathers of deep learning) once said "tuning the learning rate is 90% of deep learning"! The invention of Adam optimizer in 2014 by Kingma and Ba was revolutionary because it adapts the LR automatically for each parameter, but even Adam needs a good base LR! Modern breakthroughs like GPT and BERT all use warm-up schedules (invented around 2017) where LR starts at 0 and increases linearly for the first 1000-10000 steps. This prevents the "cold start problem" where early aggressive updates mess up initialization. Today, cosine annealing (smoothly decreasing LR following a cosine curve) is the most popular schedule, used by almost every state-of-the-art model! ๐ฏ๐ฅ๐
Thรฉo CHARLET
IT Systems & Networks Student - AI/ML Specialization
Creator of AG-BPE (Attention-Guided Byte-Pair Encoding)
๐ LinkedIn: https://www.linkedin.com/in/thรฉo-charlet
๐ Seeking internship opportunities
๐ Website : https://rdtvlokip.fr