Gradient Descent
The algorithm that trains nearly every neural network in use today. What it actually computes, where it silently breaks, what each variant fixes, and when to abandon it entirely.
Foundation Algorithm
Complexity: O(n) per step (n = number of parameters)
Convergence: not guaranteed
Ubiquity: near-universal
Invented: Cauchy, 1847
Popularised for neural networks: Rumelhart et al., 1986
1
What It Actually Computes
Strip away the metaphors — the precise mechanical operation
The textbook metaphor — a blindfolded hiker descending a mountain — is useful but dangerously incomplete. Here is what gradient descent does, stated precisely: it iteratively adjusts parameters in the direction that most steeply reduces the loss function, by an amount controlled by the learning rate.
The update rule is deceptively simple: θ ← θ − α · ∇J(θ). Three objects. The weight vector θ. The learning rate α. The gradient ∇J(θ) — the direction of steepest ascent, negated. The algorithm has no memory of where it has been. No map. No knowledge of the landscape's shape beyond the local slope at this exact point.
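The three objects and the single update can be written out directly. A minimal sketch, using a hypothetical quadratic loss J(θ) = ½‖θ‖² (whose gradient is simply θ) as a stand-in for a real loss:

```python
import numpy as np

# Hypothetical loss J(θ) = ½‖θ‖², so ∇J(θ) = θ.
def grad_J(theta):
    return theta

theta = np.array([4.0, -2.0])  # current parameters θ
alpha = 0.1                    # learning rate α

# One gradient descent step: θ ← θ − α · ∇J(θ)
theta = theta - alpha * grad_J(theta)
```

Note there is no other state: the step depends only on the current θ and the local gradient, exactly as described above.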
"Gradient descent is not a solver — it is a direction follower. It finds where the slope points down and takes a step. It has no concept of destination, only of local improvement. This is both its power and its fundamental limitation."
The Loss Landscape — Iterative Descent with Failure Regions Marked
1
Compute gradient ∇J(θ) — partial derivatives of loss w.r.t. every parameter. Points in the direction of steepest increase.
2
Negate and scale by α. Move opposite the gradient — downhill — by a controlled step size.
3
Repeat until gradient ≈ 0. That's a stationary point — but whether it's a minimum, saddle, or local trap is unknowable locally.
The loss landscape is non-convex in real neural networks — local minima, saddle points, and plateaus coexist. GD offers no guarantees about which stationary point it finds.
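The three-step loop above can be sketched as a complete routine. This is an illustrative implementation, again on the hypothetical quadratic loss; the stopping test on the gradient norm is the "gradient ≈ 0" condition, and on a real non-convex loss it would only certify a stationary point, not a minimum:

```python
import numpy as np

# Minimal descent loop: compute gradient, step downhill, repeat
# until the gradient norm is ≈ 0 (a stationary point).
def gradient_descent(grad, theta0, alpha=0.1, tol=1e-6, max_steps=10_000):
    theta = np.asarray(theta0, dtype=float)
    for step in range(max_steps):
        g = grad(theta)                 # 1. gradient: steepest ascent
        if np.linalg.norm(g) < tol:     # 3. stationary point reached
            break
        theta = theta - alpha * g       # 2. negate, scale by α, step
    return theta, step

# On the quadratic J(θ) = ½‖θ‖², the gradient is θ itself.
theta_star, steps = gradient_descent(lambda t: t, [4.0, -2.0])
```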
2
The Algorithm — Layers of Difficulty
Trivial formula, non-trivial implementation — where it actually breaks
The update rule fits on one napkin. Making gradient descent reliably train a modern model spans thousands of engineering decisions. Every major advance in deep learning — batch norm, residual connections, Adam, gradient clipping — is a targeted patch for a specific failure mode of vanilla GD.
Difficulty Stack — From Formula to Production-Grade Training
Update Rule
one line
θ ← θ − α∇J(θ). Any undergraduate implements this in NumPy in 5 minutes.
TRIVIAL
Backpropagation
autograd / chain rule
Chain rule through the computation graph. PyTorch/JAX handle this.
EASY W/ LIBRARIES
Mini-Batch SGD
stochastic variant
Compute the gradient on a random batch of B samples. Introduces noise that helps escape local minima.
MODERATE
LR Schedules
warmup, cosine, decay
A fixed learning rate almost never works for large models.
NON-TRIVIAL
Gradient Pathologies
vanishing / exploding
In deep networks, gradients shrink to zero or explode to NaN. Requires clipping, normalisation, and proper initialisation.
HARD
Ill-Conditioning
curvature mismatch
Loss surface curvature varies by direction; vanilla GD oscillates across steep ravines while crawling along flat ones.
HARD
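Two of the stack's middle layers, mini-batch sampling and gradient clipping, can be sketched together. This is an illustrative example on a synthetic least-squares problem (the data, batch size, and clip threshold are chosen for the demo, not prescribed by the text): each step estimates the gradient from a random batch of B samples, and the clip caps the step when the gradient norm spikes.

```python
import numpy as np

# Synthetic least-squares problem: recover true_w from noisy samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w, alpha, B = np.zeros(5), 0.05, 32
for _ in range(2000):
    idx = rng.integers(0, len(X), size=B)      # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    g = (2.0 / B) * Xb.T @ (Xb @ w - yb)       # noisy batch gradient of MSE
    norm = np.linalg.norm(g)
    if norm > 1.0:                             # clip to max norm 1.0
        g *= 1.0 / norm
    w -= alpha * g
```

The batch gradient is an unbiased but noisy estimate of the full gradient; that noise is exactly the "helps escape local minima" effect named in the stack, and the clip is the standard guard against the exploding-gradient pathology.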
3
Variant Analysis — What Each Fix Buys You
SGD → Momentum → RMSProp → Adam: the lineage of patches
Each major variant is a direct response to a specific failure mode. Every variant adds state — which adds memory cost but buys more intelligent navigation of the landscape.
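The simplest example of "adding state" is momentum: one extra velocity vector per parameter tensor. A minimal sketch (hyperparameter values are illustrative):

```python
import numpy as np

# Momentum: the added state is a velocity vector v, an exponentially
# decaying sum of past gradients. In ravines it damps oscillation across
# the steep direction and accelerates along the shallow one.
def momentum_step(theta, v, grad, alpha=0.1, beta=0.9):
    v = beta * v + grad(theta)    # accumulate velocity
    theta = theta - alpha * v     # step along the smoothed direction
    return theta, v

# Drive it on the quadratic J(θ) = ½‖θ‖², where ∇J(θ) = θ.
theta, v = np.array([4.0, -2.0]), np.zeros(2)
for _ in range(300):
    theta, v = momentum_step(theta, v, lambda t: t)
```

RMSProp and Adam extend the same pattern: more state (per-parameter second-moment estimates) buying per-direction step sizes.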
If gradient descent is the algorithm, learning rate is the dial that determines whether it works. The right answer in modern training is usually a schedule, not a constant.
Learning Rate Spectrum — Divergence to Stagnation
Too high (α ≈ 0.1–1.0): loss diverges or explodes.
Just right (α ≈ 1e-3 to 3e-4): smooth, stable descent.
Too low (α ≈ 1e-6): barely moves.
6
Where the Real Insight Lives
From obvious textbook facts to deeper intuition
Insight Pyramid
Update rule
θ ← θ − α∇J(θ)
OBVIOUS
Mini-batch noise helps generalisation
Noise acts as implicit regularisation.
KNOWN
GD has implicit bias toward simpler solutions
Why this implicit bias lets deep networks generalise well remains an open research frontier.
DEEPEST
7
Implementation Complexity by Variant
How hard each variant is to implement correctly
Build Complexity — NumPy from Scratch
Variant       | Implementation Effort            | LoC | Hardest Step
Vanilla GD    | trivial                          | ~5  | Nothing
Momentum      | velocity state                   | ~15 | State update correctness
Adam / AdamW  | bias correction + decoupled WD   | ~55 | Numerical stability and decay handling
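The hardest row of the table, bias correction plus decoupled weight decay, can be sketched as a single AdamW step. This is an illustrative implementation with the common default hyperparameters, not code from the text:

```python
import numpy as np

# One AdamW step. State: first-moment EMA m, second-moment EMA v,
# and the step counter t (needed for bias correction).
def adamw_step(theta, m, v, g, t, alpha=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * g                  # first-moment EMA
    v = b2 * v + (1 - b2) * g * g              # second-moment EMA
    m_hat = m / (1 - b1 ** t)                  # bias correction: both EMAs
    v_hat = v / (1 - b2 ** t)                  # start at zero and need rescaling
    # Decoupled weight decay: wd * θ is added to the step directly,
    # not folded into the gradient (the AdamW vs Adam distinction).
    theta = theta - alpha * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v
```

The two numerically delicate spots match the table's "hardest step": the `eps` in the denominator guards against division by a near-zero second moment, and applying decay to the weights rather than the gradient keeps the decay from being rescaled by `v_hat`.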
8
When to Abandon Gradient Descent
Use GD vs alternatives
Decision Table — Use GD or Alternatives?
Situation                     | Use GD? | Alternative                     | Reason
Training a neural network     | ✓ Yes   | AdamW                           | Standard at scale.
Non-differentiable objective  | ✗ No    | Evolutionary / Bayesian methods | No gradient exists.
Hyperparameter search         | ✗ No    | Bayesian optimisation           | Validation objective is not directly differentiable.