🎯 Learning Rate - The gas pedal of your neural network! 🚗💨

Community Article Published December 21, 2025

📖 Definition

Learning Rate = the speed at which your neural network learns! Too slow = takes forever. Too fast = crashes into walls and explodes! It's the most critical hyperparameter that controls how big the steps are during gradient descent.

Principle:

  • Step size: how much to adjust weights after each batch
  • Gradient descent: weight update = -(Learning Rate × gradient), see the sketch below
  • Balancing act: fast enough to converge, slow enough to be stable
  • Schedules: adaptive strategies (decay, warm-up, cosine)
  • Make or break: wrong LR = total failure, right LR = magic! 🔥
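To make the update rule concrete, here is a minimal sketch in plain Python (a single weight on the toy loss (w - 3)^2; the three learning rates and the numbers are purely illustrative):

# Gradient descent on loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
def gradient(w):
    return 2 * (w - 3.0)

for lr in (0.001, 0.1, 1.1):            # too low, good, too high
    w = 0.0                              # starting weight
    for step in range(50):
        w = w - lr * gradient(w)         # weight update = -(learning rate * gradient)
    print(f"lr={lr}: w after 50 steps = {w:.4f} (optimum is 3.0)")

# Roughly: lr=0.001 barely moves, lr=0.1 lands near 3.0, lr=1.1 blows up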

⚡ Advantages / Disadvantages / Limitations

✅ Advantages (when well tuned)

  • Controls convergence speed: right LR = fast training
  • Prevents instability: small LR = stable gradients
  • Simple concept: one number to rule them all
  • Universal: works for any gradient-based optimization
  • Huge impact: 10x better results with right LR

โŒ Disadvantages

  • Hard to tune: trial and error, time-consuming
  • Problem dependent: LR for MNIST โ‰  LR for ImageNet
  • Layer dependent: early layers need different LR than late layers
  • Batch size dependent: larger batch = can use larger LR
  • Architecture dependent: ResNet LR โ‰  Transformer LR

โš ๏ธ Limitations

  • One size doesn't fit all: needs adjustment per problem
  • No universal value: 0.001 works sometimes, not always
  • Sensitive: 0.01 vs 0.001 = completely different results
  • Requires monitoring: loss exploding/flatline = wrong LR
  • Replaced by adaptive: Adam/AdamW auto-adjust (but still need base LR)

๐Ÿ› ๏ธ Practical Tutorial: My Real Case

๐Ÿ“Š Setup

  • Model: ResNet-18 on CIFAR-10
  • Dataset: 50k train images, 10k test images
  • Hardware: GTX 1080 Ti 11GB (batch size 128 fits perfectly!)
  • Optimizer: SGD with momentum (0.9)
  • Epochs: 100

📈 Results Obtained

Learning Rate = 0.1 (TOO HIGH):
- Epoch 1: Loss = 2.3 → 5.8 → NaN
- Accuracy: 10% (random guessing)
- Status: EXPLODED ❌💥

Learning Rate = 0.01 (GOOD):
- Epoch 1: Loss = 2.3 → 1.8
- Epoch 50: Loss = 0.4, Acc = 85%
- Epoch 100: Loss = 0.2, Acc = 91%
- Status: CONVERGED ✅

Learning Rate = 0.001 (TOO LOW):
- Epoch 1: Loss = 2.3 → 2.28
- Epoch 50: Loss = 1.2, Acc = 65%
- Epoch 100: Loss = 0.8, Acc = 74%
- Status: TOO SLOW, STUCK ❌🐌

Learning Rate = 0.01 with Cosine Decay:
- Starts at 0.01, decays smoothly
- Epoch 100: Loss = 0.15, Acc = 93%
- Best performance! ✅🏆

Learning Rate Warm-up (0 → 0.01 over 5 epochs):
- Epoch 1: LR = 0.002 (gentle start)
- Epoch 5: LR = 0.01 (full speed)
- Final: Acc = 92.5%
- More stable early training ✅

🧪 Real-world Testing on GTX 1080 Ti

ResNet-18 CIFAR-10 (GTX 1080 Ti):
- Batch size 128: 11GB VRAM → 8.5GB used
- LR = 0.01: 180 it/s, converges epoch 80
- LR = 0.1: crashes epoch 3 (NaN loss)
- LR = 0.001: 180 it/s, stuck at 74% acc

Transformer (GPT-2 Small) on GTX 1080 Ti:
- Batch size 8: 10.8GB VRAM used
- LR = 0.0001: stable, converges
- LR = 0.001: diverges after epoch 2
- Warm-up crucial: 0 → 0.0001 over 1000 steps

GAN Training (StyleGAN2):
- Generator LR = 0.002
- Discriminator LR = 0.0002 (10x lower!)
- Balance crucial: same LR = mode collapse
- GTX 1080 Ti: batch size 16 max

Verdict: 🎯 LEARNING RATE = MOST CRITICAL HYPERPARAMETER


💡 Concrete Examples

Visual metaphor: Driving down a mountain

LR = 0.1 (Too high) 🚗💥

You're driving at 200 km/h down a winding mountain road
→ Miss the turns
→ Fly off the cliff
→ CRASH AND BURN
→ Loss = NaN

LR = 0.01 (Perfect) 🚗✅

You're driving at 60 km/h
→ Take the curves safely
→ Reach the bottom smoothly
→ Optimal convergence
→ Loss → minimum

LR = 0.001 (Too low) 🚗🐌

You're driving at 5 km/h
→ You'll reach the bottom... eventually
→ But it takes FOREVER
→ Might get stuck in a pothole (local minimum)
→ Waste of time

Learning Rate Schedules

Step Decay 📉

Epochs 1-30: LR = 0.01
Epochs 31-60: LR = 0.001 (÷10)
Epochs 61-90: LR = 0.0001 (÷10)
Epochs 91-100: LR = 0.00001 (÷10)

Effect: Big steps at start, refinement at end
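A minimal sketch of this schedule using PyTorch's built-in MultiStepLR (the model and the train_one_epoch helper are assumed to exist and are only placeholders here):

import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Divide the LR by 10 at epochs 30, 60 and 90 (gamma = 0.1)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)

for epoch in range(100):
    train_one_epoch(model, optimizer)         # placeholder training loop
    scheduler.step()                          # update the LR at the end of the epoch
    print(epoch + 1, scheduler.get_last_lr())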

Exponential Decay 📉

LR(epoch) = initial_LR × decay_rate^epoch

Example:
Epoch 0: LR = 0.01
Epoch 10: LR = 0.01 × 0.95^10 = 0.006
Epoch 50: LR = 0.01 × 0.95^50 = 0.0008

Effect: Smooth continuous decrease
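The same idea with PyTorch's ExponentialLR, where gamma is the per-epoch decay rate (the optimizer setup and training loop are illustrative placeholders):

import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)  # LR is multiplied by 0.95 each epoch

for epoch in range(100):
    train_one_epoch(model, optimizer)   # placeholder training loop
    scheduler.step()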

Cosine Annealing 🌊

LR(epoch) = min_LR + 0.5 × (max_LR - min_LR) × (1 + cos(π × epoch / max_epochs))

Example (max_LR=0.01, min_LR=0.0001, max_epochs=100):
Epoch 0: LR = 0.01
Epoch 25: LR ≈ 0.0086
Epoch 50: LR ≈ 0.005
Epoch 75: LR ≈ 0.0015
Epoch 100: LR = 0.0001

Effect: Smooth wave-like decrease, popular in modern training
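A small pure-Python helper to reproduce these numbers from the formula above (values rounded):

import math

def cosine_lr(epoch, max_epochs=100, max_lr=0.01, min_lr=0.0001):
    # LR(epoch) = min_LR + 0.5 * (max_LR - min_LR) * (1 + cos(pi * epoch / max_epochs))
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * epoch / max_epochs))

for epoch in (0, 25, 50, 75, 100):
    print(epoch, round(cosine_lr(epoch), 4))
# prints roughly: 0.01, 0.0086, 0.005, 0.0016, 0.0001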

Warm-up + Cosine 🔥

Phase 1 (Warm-up): Linear increase 0 → max_LR
Phase 2 (Cosine): Smooth decrease max_LR → min_LR

Epochs 1-5: 0 → 0.01 (warm-up)
Epochs 6-100: 0.01 → 0.0001 (cosine)

Effect: Gentle start, stable convergence
Used by: BERT, GPT, modern Transformers
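One way to sketch this combination in PyTorch is a LambdaLR multiplier on top of a base LR of 0.01 (the epoch counts and the minimum ratio are illustrative, and the model and training loop are assumed to exist):

import math
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)   # base LR = 0.01

def warmup_cosine(epoch, warmup=5, total=100, min_ratio=0.01):
    if epoch < warmup:
        return (epoch + 1) / warmup                        # linear warm-up: 0 -> 1
    progress = (epoch - warmup) / (total - warmup)
    return min_ratio + 0.5 * (1 - min_ratio) * (1 + math.cos(math.pi * progress))  # cosine: 1 -> min_ratio

scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

for epoch in range(100):
    train_one_epoch(model, optimizer)   # placeholder training loop
    scheduler.step()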

Real applications

Computer Vision (ResNet, EfficientNet) 📸

  • Base LR: 0.01-0.1 with SGD
  • Schedule: Cosine or Step Decay
  • Warm-up: 5-10 epochs
  • Batch size: as large as VRAM allows (typical setup sketched below)
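A typical setup in this spirit, sketched with PyTorch (the model and training loop are assumed to exist; exact values vary by architecture):

import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-4)
# A linear warm-up over the first 5-10 epochs can be layered on top (e.g. with a LambdaLR, as above)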

NLP (BERT, GPT, Transformers) 📝

  • Base LR: 0.0001-0.001 with Adam
  • Schedule: Linear decay or Cosine
  • Warm-up: CRITICAL (1000-10000 steps)
  • Gradient clipping: max_norm=1.0 (see the sketch below)
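A hedged sketch of this Transformer-style recipe with AdamW, step-based linear warm-up and gradient clipping (model, train_loader and the 1e-4 target LR are illustrative assumptions):

import torch
import torch.nn as nn
import torch.optim as optim

optimizer = optim.AdamW(model.parameters(), lr=1e-4)

warmup_steps = 1000
scheduler = optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)   # linear warm-up 0 -> 1e-4
)

for inputs, labels in train_loader:
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(inputs), labels)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()   # step-based schedule: the LR is updated every batch, not every epoch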

GANs (StyleGAN, DCGAN) 🎨

  • Generator: 0.002
  • Discriminator: 0.0002 (lower!)
  • Optimizer: Adam with β1=0.0, β2=0.99
  • No warm-up, constant or slight decay (see the sketch below)
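The two-optimizer setup from the list above, sketched with PyTorch Adam (the generator and discriminator modules are assumed to exist):

import torch.optim as optim

g_optimizer = optim.Adam(generator.parameters(), lr=0.002, betas=(0.0, 0.99))
d_optimizer = optim.Adam(discriminator.parameters(), lr=0.0002, betas=(0.0, 0.99))
# Separate LRs let you balance the two networks; the run described earlier
# reports mode collapse when both networks share the same LR.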

Reinforcement Learning (PPO, DQN) 🎮

  • Base LR: 0.0001-0.001
  • Decay: often constant (no schedule)
  • Optimizer: Adam
  • Highly sensitive to LR

📋 Cheat Sheet: Learning Rate

🔍 Symptoms & Solutions

Loss explodes (→ NaN) 💥

Symptom: Loss goes 2.3 → 10.5 → 500 → NaN
Cause: Learning Rate TOO HIGH
Solution: Divide LR by 10 (0.01 → 0.001)

Loss barely decreases 🐌

Symptom: Loss goes 2.3 → 2.28 → 2.25 (super slow)
Cause: Learning Rate TOO LOW
Solution: Multiply LR by 10 (0.0001 → 0.001)

Loss oscillates wildly 🎢

Symptom: Loss goes 1.5 → 0.8 → 2.1 → 1.2 (chaos)
Cause: LR too high OR batch size too small
Solution: Reduce LR or increase batch size

Stuck in plateau 🏔️

Symptom: Loss stuck at 0.5 for 20 epochs
Cause: LR too low to escape local minimum
Solution: Learning Rate schedule (decay) or increase LR (see the sketch below)
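A built-in way to react to plateaus is PyTorch's ReduceLROnPlateau, which lowers the LR when a monitored metric stops improving (the factor, the patience and the train_one_epoch/validate helpers are illustrative assumptions):

import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=5   # divide LR by 10 after 5 stagnant epochs
)

for epoch in range(100):
    train_one_epoch(model, optimizer)   # placeholder training loop
    val_loss = validate(model)          # placeholder: returns validation loss
    scheduler.step(val_loss)            # the scheduler watches the metric, not the epoch count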

โš™๏ธ Recommended Starting Values

SGD (no momentum):
- Simple: 0.01
- Complex: 0.001

SGD with momentum:
- CV (ResNet, VGG): 0.01-0.1
- Decay: Cosine or Step ÷10 every 30 epochs

Adam/AdamW:
- NLP (BERT, GPT): 0.0001-0.001
- CV (ViT): 0.001-0.003
- Small models: 0.001
- Large models (GPT-3): 0.00001

RMSprop:
- Default: 0.001
- GANs: 0.0002

Batch size scaling (see the sketch below):
- LR scales with sqrt(batch_size)
- Batch 32 → LR = 0.001
- Batch 128 → LR = 0.002
- Batch 512 → LR = 0.004
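A tiny helper for this square-root scaling rule, matching the numbers above (note that a linear scaling rule is also widely used; pick one and stay consistent):

import math

def scaled_lr(batch_size, base_lr=0.001, base_batch=32):
    # Square-root scaling: LR grows with sqrt(batch_size / base_batch)
    return base_lr * math.sqrt(batch_size / base_batch)

for bs in (32, 128, 512):
    print(bs, scaled_lr(bs))   # 0.001, 0.002, 0.004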

๐Ÿ› ๏ธ LR Finder Trick

1. Start with tiny LR (0.000001)
2. Increase exponentially each batch (ร—1.1)
3. Plot loss vs LR
4. Find where loss drops fastest
5. Use LR slightly before minimum

Example plot:
LR 0.00001: Loss = 2.3
LR 0.0001: Loss = 2.3
LR 0.001: Loss = 1.5 โ† Drops fast!
LR 0.01: Loss = 0.8 โ† Sweet spot
LR 0.1: Loss = 3.5 โ† Too high

Choose: 0.01 (or 0.005 to be safe)
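A minimal LR range test in PyTorch, in the spirit of the steps above (model, criterion and train_loader are assumed to exist; the probe copy is there because this run deliberately pushes the LR until the loss blows up):

import copy
import torch.optim as optim

probe_model = copy.deepcopy(model)                # keep the real model untouched
optimizer = optim.SGD(probe_model.parameters(), lr=1e-6, momentum=0.9)

lrs, losses = [], []
for inputs, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(probe_model(inputs), labels)
    loss.backward()
    optimizer.step()

    lrs.append(optimizer.param_groups[0]['lr'])
    losses.append(loss.item())
    if loss.item() > 4 * min(losses):             # stop once the loss clearly explodes
        break
    for group in optimizer.param_groups:          # multiply the LR by 1.1 each batch
        group['lr'] *= 1.1

# Plot losses against lrs (log x-axis) and pick an LR slightly before the curve's minimum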

💻 Simplified Concept (minimal code)

import math

import torch
import torch.nn as nn
import torch.optim as optim

# Learning Rate comparison - ultra-simple
class LearningRateComparison:
    def __init__(self, model):
        self.model = model
        self.criterion = nn.CrossEntropyLoss()
    
    def train_with_lr(self, train_loader, lr, epochs):
        """Train with specific Learning Rate"""
        
        optimizer = optim.SGD(
            self.model.parameters(),
            lr=lr,
            momentum=0.9
        )
        
        for epoch in range(epochs):
            total_loss = 0
            
            for inputs, labels in train_loader:
                optimizer.zero_grad()
                outputs = self.model(inputs)
                loss = self.criterion(outputs, labels)
                loss.backward()
                optimizer.step()
                
                total_loss += loss.item()
            
            avg_loss = total_loss / len(train_loader)
            print(f"Epoch {epoch+1}, LR={lr}, Loss={avg_loss:.4f}")
    
    def train_with_schedule(self, train_loader, initial_lr, epochs):
        """Train with Cosine Annealing schedule"""
        
        optimizer = optim.SGD(
            self.model.parameters(),
            lr=initial_lr,
            momentum=0.9
        )
        
        # Cosine Annealing scheduler
        scheduler = optim.lr_scheduler.CosineAnnealingLR(
            optimizer,
            T_max=epochs,
            eta_min=initial_lr * 0.01
        )
        
        for epoch in range(epochs):
            for inputs, labels in train_loader:
                optimizer.zero_grad()
                outputs = self.model(inputs)
                loss = self.criterion(outputs, labels)
                loss.backward()
                optimizer.step()
            
            current_lr = scheduler.get_last_lr()[0]
            print(f"Epoch {epoch+1}, LR={current_lr:.6f}")
            
            scheduler.step()
    
    def train_with_warmup(self, train_loader, target_lr, warmup_epochs, total_epochs):
        """Train with warm-up + cosine decay"""
        
        optimizer = optim.SGD(
            self.model.parameters(),
            lr=target_lr * 0.1,  # placeholder; overwritten at the start of each epoch
            momentum=0.9
        )
        
        for epoch in range(total_epochs):
            if epoch < warmup_epochs:
                # Warm-up phase: linear increase 0 -> target_lr
                lr = target_lr * (epoch + 1) / warmup_epochs
            else:
                # Cosine decay phase: smooth decrease target_lr -> ~0
                progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
                lr = target_lr * 0.5 * (1 + math.cos(math.pi * progress))
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr
            
            for inputs, labels in train_loader:
                optimizer.zero_grad()
                outputs = self.model(inputs)
                loss = self.criterion(outputs, labels)
                loss.backward()
                optimizer.step()
            
            print(f"Epoch {epoch+1}, LR={optimizer.param_groups[0]['lr']:.6f}")

# Usage comparison on GTX 1080 Ti
# Assumes a CIFAR-10 train_loader already exists; resnet18 comes from torchvision
from torchvision.models import resnet18

# Test 1: Too high
trainer = LearningRateComparison(resnet18(num_classes=10))
trainer.train_with_lr(train_loader, lr=0.1, epochs=10)

# Test 2: Perfect (fresh model so the NaN run above doesn't contaminate it)
trainer = LearningRateComparison(resnet18(num_classes=10))
trainer.train_with_lr(train_loader, lr=0.01, epochs=100)

# Test 3: With schedule (best!)
trainer = LearningRateComparison(resnet18(num_classes=10))
trainer.train_with_schedule(train_loader, initial_lr=0.01, epochs=100)

The key concept: Learning Rate controls how fast you update weights. Too fast = unstable, too slow = inefficient. Modern approach: start high, decay smoothly (cosine annealing) with warm-up for stability! 🎯


๐Ÿ“ Summary

Learning Rate = gas pedal of neural networks! Controls step size during gradient descent. Too high = explosion, too low = snail pace. Modern training uses schedules (cosine annealing, step decay) and warm-up for stability. Most critical hyperparameter: right LR = convergence, wrong LR = total failure. On GTX 1080 Ti, typical values: 0.01 for CV (SGD), 0.0001 for NLP (Adam)! ๐Ÿš—๐Ÿ’จ


🎯 Conclusion

Learning Rate is the single most important hyperparameter in deep learning. A wrong LR can make even the best architecture fail completely. Too high = divergence, too low = waste of time. Modern techniques (warm-up, cosine annealing, adaptive optimizers) have made training more robust, but you still need to tune the base LR. Rule of thumb: start with standard values (0.01 for SGD, 0.001 for Adam), use the LR finder to refine, and add a schedule for best results. On GTX 1080 Ti, batch size affects the optimal LR, so experiment! The difference between 91% and 93% accuracy is often just the right LR! 🏆🔥


โ“ Questions & Answers

Q: My training loss goes to NaN after a few epochs, what's happening? A: Your Learning Rate is way too high! The gradients are exploding. Divide your LR by 10 (if you used 0.01, try 0.001). Also check: (1) Gradient clipping (clip max norm to 1.0), (2) Batch normalization in your architecture, (3) Weight initialization (use Xavier or He initialization). If it still happens, your data might have extreme outliers or your architecture is unstable.

Q: How do I know if my Learning Rate is too low? A: Look at the loss curve: if it's decreasing super slowly (2.3 → 2.28 → 2.25 over 10 epochs), your LR is too low. Also, if you're stuck at the same accuracy for many epochs, try increasing LR by 5-10x. Use the LR finder to identify the sweet spot: plot loss vs LR and pick where loss drops fastest!

Q: Should I use the same Learning Rate for all layers? A: Not always! Modern approaches use layer-wise LR: (1) Transfer learning: lower LR for pretrained layers (0.0001), higher for new head (0.001), (2) Discriminative fine-tuning: each layer group gets different LR, (3) BERT-style: layer decay (lower layers = lower LR). For training from scratch, same LR usually works. But for fine-tuning, definitely use different LRs!


🤓 Did You Know?

The Learning Rate was first identified as critical in the 1980s during early neural network research, but it became truly famous in 2012 when AlexNet won ImageNet. The team discovered that using LR=0.01 with momentum and dividing by 10 every 30 epochs was the secret sauce! Before that, researchers used constant LRs and wondered why training was so unstable.

Fun fact: Geoffrey Hinton (one of the godfathers of deep learning) once said "tuning the learning rate is 90% of deep learning"! The invention of the Adam optimizer in 2014 by Kingma and Ba was revolutionary because it adapts the LR automatically for each parameter, but even Adam needs a good base LR!

Modern breakthroughs like GPT and BERT all use warm-up schedules (popularized around 2017) where the LR starts at 0 and increases linearly for the first 1000-10000 steps. This prevents the "cold start problem" where early aggressive updates mess up the initialization. Today, cosine annealing (smoothly decreasing the LR following a cosine curve) is the most popular schedule, used by almost every state-of-the-art model! 🎯🔥🚀


Théo CHARLET

IT Systems & Networks Student - AI/ML Specialization

Creator of AG-BPE (Attention-Guided Byte-Pair Encoding)

🔗 LinkedIn: https://www.linkedin.com/in/théo-charlet

🚀 Seeking internship opportunities

🔗 Website: https://rdtvlokip.fr
