🎯 Learning Rate — The gas pedal of your neural network! 🚗💨

Community Article Published December 21, 2025

📖 Definition

⚡ Advantages / Disadvantages / Limitations
✅ Advantages (when well tuned)

❌ Disadvantages

⚠️ Limitations

🛠️ Practical Tutorial: My Real Case
📊 Setup

📈 Results Obtained

🧪 Real-world Testing on GTX 1080 Ti

💡 Concrete Examples
Visual metaphor: Driving down a mountain

Learning Rate Schedules

Real applications

📋 Cheat Sheet: Learning Rate
🔍 Symptoms & Solutions

⚙️ Recommended Starting Values

🛠️ LR Finder Trick

💻 Simplified Concept (minimal code)

📝 Summary

🎯 Conclusion

❓ Questions & Answers

🤓 Did You Know?

📖 Definition

Learning Rate = the speed at which your neural network learns! Too slow = takes forever. Too fast = crashes into walls and explodes! It's the most critical hyperparameter that controls how big the steps are during gradient descent.

Principle:

Step size: how much to adjust weights after each batch
Gradient descent: Learning Rate × Gradient = weight update
Balancing act: fast enough to converge, slow enough to be stable
Schedules: adaptive strategies (decay, warm-up, cosine)
Make or break: wrong LR = total failure, right LR = magic! 🔥

⚡ Advantages / Disadvantages / Limitations

✅ Advantages (when well tuned)

Controls convergence speed: right LR = fast training
Prevents instability: small LR = stable gradients
Simple concept: one number to rule them all
Universal: works for any gradient-based optimization
Huge impact: 10x better results with right LR

❌ Disadvantages

Hard to tune: trial and error, time-consuming
Problem dependent: LR for MNIST ≠ LR for ImageNet
Layer dependent: early layers need different LR than late layers
Batch size dependent: larger batch = can use larger LR
Architecture dependent: ResNet LR ≠ Transformer LR

⚠️ Limitations

One size doesn't fit all: needs adjustment per problem
No universal value: 0.001 works sometimes, not always
Sensitive: 0.01 vs 0.001 = completely different results
Requires monitoring: loss exploding/flatline = wrong LR
Replaced by adaptive: Adam/AdamW auto-adjust (but still need base LR)

🛠️ Practical Tutorial: My Real Case

📊 Setup

Model: ResNet-18 on CIFAR-10
Dataset: 50k train images, 10k test images
Hardware: GTX 1080 Ti 11GB (batch size 128 fits perfectly!)
Optimizer: SGD with momentum (0.9)
Epochs: 100

📈 Results Obtained

Learning Rate = 0.1 (TOO HIGH):
- Epoch 1: Loss = 2.3 → 5.8 → NaN
- Accuracy: 10% (random guessing)
- Status: EXPLODED ❌💥

Learning Rate = 0.01 (GOOD):
- Epoch 1: Loss = 2.3 → 1.8
- Epoch 50: Loss = 0.4, Acc = 85%
- Epoch 100: Loss = 0.2, Acc = 91%
- Status: CONVERGED ✅

Learning Rate = 0.001 (TOO LOW):
- Epoch 1: Loss = 2.3 → 2.28
- Epoch 50: Loss = 1.2, Acc = 65%
- Epoch 100: Loss = 0.8, Acc = 74%
- Status: TOO SLOW, STUCK ❌🐌

Learning Rate = 0.01 with Cosine Decay:
- Starts at 0.01, decays smoothly
- Epoch 100: Loss = 0.15, Acc = 93%
- Best performance! ✅🏆

Learning Rate Warm-up (0 → 0.01 over 5 epochs):
- Epoch 1: LR = 0.002 (gentle start)
- Epoch 5: LR = 0.01 (full speed)
- Final: Acc = 92.5%
- More stable early training ✅

🧪 Real-world Testing on GTX 1080 Ti

ResNet-18 CIFAR-10 (GTX 1080 Ti):
- Batch size 128: 11GB VRAM → 8.5GB used
- LR = 0.01: 180 it/s, converges epoch 80
- LR = 0.1: crashes epoch 3 (NaN loss)
- LR = 0.001: 180 it/s, stuck at 74% acc

Transformer (GPT-2 Small) on GTX 1080 Ti:
- Batch size 8: 10.8GB VRAM used
- LR = 0.0001: stable, converges
- LR = 0.001: diverges after epoch 2
- Warm-up crucial: 0 → 0.0001 over 1000 steps

GAN Training (StyleGAN2):
- Generator LR = 0.002
- Discriminator LR = 0.0002 (10x lower!)
- Balance crucial: same LR = mode collapse
- GTX 1080 Ti: batch size 16 max

Verdict: 🎯 LEARNING RATE = MOST CRITICAL HYPERPARAMETER

💡 Concrete Examples

Visual metaphor: Driving down a mountain

LR = 0.1 (Too high) 🚗💥

You're driving at 200 km/h down a winding mountain road
→ Miss the turns
→ Fly off the cliff
→ CRASH AND BURN
→ Loss = NaN

LR = 0.01 (Perfect) 🚗✅

You're driving at 60 km/h
→ Take the curves safely
→ Reach the bottom smoothly
→ Optimal convergence
→ Loss → minimum

LR = 0.001 (Too low) 🚗🐌

You're driving at 5 km/h
→ You'll reach the bottom... eventually
→ But it takes FOREVER
→ Might get stuck in a pothole (local minimum)
→ Waste of time

Learning Rate Schedules

Step Decay 📉

Epochs 1-30: LR = 0.01
Epochs 31-60: LR = 0.001 (÷10)
Epochs 61-90: LR = 0.0001 (÷10)
Epochs 91-100: LR = 0.00001 (÷10)

Effect: Big steps at start, refinement at end

Exponential Decay 📉

LR(epoch) = initial_LR × decay_rate^epoch

Example:
Epoch 0: LR = 0.01
Epoch 10: LR = 0.01 × 0.95^10 = 0.006
Epoch 50: LR = 0.01 × 0.95^50 = 0.0008

Effect: Smooth continuous decrease

Cosine Annealing 🌊

LR(epoch) = min_LR + 0.5 × (max_LR - min_LR) × (1 + cos(π × epoch / max_epochs))

Example (max_LR=0.01, min_LR=0.0001):
Epoch 0: LR = 0.01
Epoch 25: LR = 0.007
Epoch 50: LR = 0.005
Epoch 75: LR = 0.002
Epoch 100: LR = 0.0001

Effect: Smooth wave-like decrease, popular in modern training

Warm-up + Cosine 🔥

Phase 1 (Warm-up): Linear increase 0 → max_LR
Phase 2 (Cosine): Smooth decrease max_LR → min_LR

Epochs 1-5: 0 → 0.01 (warm-up)
Epochs 6-100: 0.01 → 0.0001 (cosine)

Effect: Gentle start, stable convergence
Used by: BERT, GPT, modern Transformers

Real applications

Computer Vision (ResNet, EfficientNet) 📸

Base LR: 0.01-0.1 with SGD
Schedule: Cosine or Step Decay
Warm-up: 5-10 epochs
Batch size: as large as VRAM allows

NLP (BERT, GPT, Transformers) 📝

Base LR: 0.0001-0.001 with Adam
Schedule: Linear decay or Cosine
Warm-up: CRITICAL (1000-10000 steps)
Gradient clipping: max_norm=1.0

GANs (StyleGAN, DCGAN) 🎨

Generator: 0.002
Discriminator: 0.0002 (lower!)
Optimizer: Adam with β1=0.0, β2=0.99
No warm-up, constant or slight decay

Reinforcement Learning (PPO, DQN) 🎮

Base LR: 0.0001-0.001
Decay: often constant (no schedule)
Optimizer: Adam
Highly sensitive to LR

📋 Cheat Sheet: Learning Rate

🔍 Symptoms & Solutions

Loss explodes (→ NaN) 💥

Symptom: Loss goes 2.3 → 10.5 → 500 → NaN
Cause: Learning Rate TOO HIGH
Solution: Divide LR by 10 (0.01 → 0.001)

Loss barely decreases 🐌

Symptom: Loss goes 2.3 → 2.28 → 2.25 (super slow)
Cause: Learning Rate TOO LOW
Solution: Multiply LR by 10 (0.0001 → 0.001)

Loss oscillates wildly 🎢

Symptom: Loss goes 1.5 → 0.8 → 2.1 → 1.2 (chaos)
Cause: LR too high OR batch size too small
Solution: Reduce LR or increase batch size

Stuck in plateau 🏔️

Symptom: Loss stuck at 0.5 for 20 epochs
Cause: LR too low to escape local minimum
Solution: Learning Rate schedule (decay) or increase LR

⚙️ Recommended Starting Values

SGD (no momentum):
- Simple: 0.01
- Complex: 0.001

SGD with momentum:
- CV (ResNet, VGG): 0.01-0.1
- Decay: Cosine or Step ÷10 every 30 epochs

Adam/AdamW:
- NLP (BERT, GPT): 0.0001-0.001
- CV (ViT): 0.001-0.003
- Small models: 0.001
- Large models (GPT-3): 0.00001

RMSprop:
- Default: 0.001
- GANs: 0.0002

Batch size scaling:
- LR scales with sqrt(batch_size)
- Batch 32 → LR = 0.001
- Batch 128 → LR = 0.002
- Batch 512 → LR = 0.004

🛠️ LR Finder Trick

1. Start with tiny LR (0.000001)
2. Increase exponentially each batch (×1.1)
3. Plot loss vs LR
4. Find where loss drops fastest
5. Use LR slightly before minimum

Example plot:
LR 0.00001: Loss = 2.3
LR 0.0001: Loss = 2.3
LR 0.001: Loss = 1.5 ← Drops fast!
LR 0.01: Loss = 0.8 ← Sweet spot
LR 0.1: Loss = 3.5 ← Too high

Choose: 0.01 (or 0.005 to be safe)

💻 Simplified Concept (minimal code)

import torch
import torch.nn as nn
import torch.optim as optim

# Learning Rate comparison - ultra-simple
class LearningRateComparison:
    def __init__(self, model):
        self.model = model
        self.criterion = nn.CrossEntropyLoss()
    
    def train_with_lr(self, train_loader, lr, epochs):
        """Train with specific Learning Rate"""
        
        optimizer = optim.SGD(
            self.model.parameters(),
            lr=lr,
            momentum=0.9
        )
        
        for epoch in range(epochs):
            total_loss = 0
            
            for inputs, labels in train_loader:
                optimizer.zero_grad()
                outputs = self.model(inputs)
                loss = self.criterion(outputs, labels)
                loss.backward()
                optimizer.step()
                
                total_loss += loss.item()
            
            avg_loss = total_loss / len(train_loader)
            print(f"Epoch {epoch+1}, LR={lr}, Loss={avg_loss:.4f}")
    
    def train_with_schedule(self, train_loader, initial_lr, epochs):
        """Train with Cosine Annealing schedule"""
        
        optimizer = optim.SGD(
            self.model.parameters(),
            lr=initial_lr,
            momentum=0.9
        )
        
        # Cosine Annealing scheduler
        scheduler = optim.lr_scheduler.CosineAnnealingLR(
            optimizer,
            T_max=epochs,
            eta_min=initial_lr * 0.01
        )
        
        for epoch in range(epochs):
            for inputs, labels in train_loader:
                optimizer.zero_grad()
                outputs = self.model(inputs)
                loss = self.criterion(outputs, labels)
                loss.backward()
                optimizer.step()
            
            current_lr = scheduler.get_last_lr()[0]
            print(f"Epoch {epoch+1}, LR={current_lr:.6f}")
            
            scheduler.step()
    
    def train_with_warmup(self, train_loader, target_lr, warmup_epochs, total_epochs):
        """Train with warm-up + cosine decay"""
        
        optimizer = optim.SGD(
            self.model.parameters(),
            lr=target_lr * 0.1,
            momentum=0.9
        )
        
        for epoch in range(total_epochs):
            # Warm-up phase
            if epoch < warmup_epochs:
                lr = target_lr * (epoch + 1) / warmup_epochs
                for param_group in optimizer.param_groups:
                    param_group['lr'] = lr
            
            for inputs, labels in train_loader:
                optimizer.zero_grad()
                outputs = self.model(inputs)
                loss = self.criterion(outputs, labels)
                loss.backward()
                optimizer.step()
            
            print(f"Epoch {epoch+1}, LR={optimizer.param_groups[0]['lr']:.6f}")

# Usage comparison on GTX 1080 Ti
model = ResNet18()

# Test 1: Too high
trainer = LearningRateComparison(model)
trainer.train_with_lr(train_loader, lr=0.1, epochs=10)

# Test 2: Perfect
trainer.train_with_lr(train_loader, lr=0.01, epochs=100)

# Test 3: With schedule (best!)
trainer.train_with_schedule(train_loader, initial_lr=0.01, epochs=100)

The key concept: Learning Rate controls how fast you update weights. Too fast = unstable, too slow = inefficient. Modern approach: start high, decay smoothly (cosine annealing) with warm-up for stability! 🎯

📝 Summary

Learning Rate = gas pedal of neural networks! Controls step size during gradient descent. Too high = explosion, too low = snail pace. Modern training uses schedules (cosine annealing, step decay) and warm-up for stability. Most critical hyperparameter: right LR = convergence, wrong LR = total failure. On GTX 1080 Ti, typical values: 0.01 for CV (SGD), 0.0001 for NLP (Adam)! 🚗💨

🎯 Conclusion

Learning Rate is the single most important hyperparameter in deep learning. A wrong LR can make even the best architecture fail completely. Too high = divergence, too low = waste of time. Modern techniques (warm-up, cosine annealing, adaptive optimizers) have made training more robust, but you still need to tune the base LR. Rule of thumb: start with standard values (0.01 for SGD, 0.001 for Adam), use LR finder to refine, add schedule for best results. On GTX 1080 Ti, batch size affects optimal LR, so experiment! The difference between 91% and 93% accuracy is often just the right LR! 🏆🔥

❓ Questions & Answers

Q: My training loss goes to NaN after a few epochs, what's happening? A: Your Learning Rate is way too high! The gradients are exploding. Divide your LR by 10 (if you used 0.01, try 0.001). Also check: (1) Gradient clipping (clip max norm to 1.0), (2) Batch normalization in your architecture, (3) Weight initialization (use Xavier or He initialization). If it still happens, your data might have extreme outliers or your architecture is unstable.

Q: How do I know if my Learning Rate is too low? A: Look at the loss curve: if it's decreasing super slowly (2.3 → 2.28 → 2.25 over 10 epochs), your LR is too low. Also, if you're stuck in the same accuracy for many epochs, try increasing LR by 5-10x. Use LR finder to identify the sweet spot: plot loss vs LR and pick where loss drops fastest!

Q: Should I use the same Learning Rate for all layers? A: Not always! Modern approaches use layer-wise LR: (1) Transfer learning: lower LR for pretrained layers (0.0001), higher for new head (0.001), (2) Discriminative fine-tuning: each layer group gets different LR, (3) BERT-style: layer decay (lower layers = lower LR). For training from scratch, same LR usually works. But for fine-tuning, definitely use different LRs!

🤓 Did You Know?

The Learning Rate was first identified as critical in the 1980s during early neural network research, but it became truly famous in 2012 when AlexNet won ImageNet. The team discovered that using LR=0.01 with momentum and dividing by 10 every 30 epochs was the secret sauce! Before that, researchers used constant LR and wondered why training was so unstable. Fun fact: Geoffrey Hinton (one of the godfathers of deep learning) once said "tuning the learning rate is 90% of deep learning"! The invention of Adam optimizer in 2014 by Kingma and Ba was revolutionary because it adapts the LR automatically for each parameter, but even Adam needs a good base LR! Modern breakthroughs like GPT and BERT all use warm-up schedules (invented around 2017) where LR starts at 0 and increases linearly for the first 1000-10000 steps. This prevents the "cold start problem" where early aggressive updates mess up initialization. Today, cosine annealing (smoothly decreasing LR following a cosine curve) is the most popular schedule, used by almost every state-of-the-art model! 🎯🔥🚀

Théo CHARLET

IT Systems & Networks Student - AI/ML Specialization

Creator of AG-BPE (Attention-Guided Byte-Pair Encoding)

🔗 LinkedIn: https://www.linkedin.com/in/théo-charlet

🚀 Seeking internship opportunities

🔗 Website : https://rdtvlokip.fr

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote