Optimisation Theory · Machine Learning Foundations · Learning Deep Dive

Gradient Descent

The algorithm that trains nearly every neural network alive. What it actually computes, where it silently breaks, what each variant fixes, and when to abandon it entirely.

Foundation Algorithm
Complexity: O(n) per step
Convergence: not guaranteed
Ubiquity: near-universal
Invented: Cauchy, 1847
Popularised for neural networks: Rumelhart, Hinton & Williams, 1986
1
What It Actually Computes
Strip away the metaphors — the precise mechanical operation

The textbook metaphor — a blindfolded hiker descending a mountain — is useful but dangerously incomplete. Here is what gradient descent does, stated precisely: it iteratively adjusts parameters in the direction that most steeply reduces the loss function, by an amount controlled by the learning rate.

The update rule is deceptively simple: θ ← θ − α · ∇J(θ). Three objects. The weight vector θ. The learning rate α. The gradient ∇J(θ) — the direction of steepest ascent, negated. The algorithm has no memory of where it has been. No map. No knowledge of the landscape's shape beyond the local slope at this exact point.
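The update rule can be sketched in a few lines of NumPy. This is a minimal illustration, not production code; the quadratic objective, step count, and learning rate are illustrative assumptions:

```python
import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, steps=100):
    """Vanilla gradient descent: theta <- theta - alpha * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - alpha * grad(theta)  # step opposite the local slope
    return theta

# Minimise J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_star = gradient_descent(lambda t: 2 * (t - 3.0), theta0=0.0)
```

Note that the loop knows nothing beyond the gradient at the current point: no memory, no map, exactly as described above.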

"Gradient descent is not a solver — it is a direction follower. It finds where the slope points down and takes a step. It has no concept of destination, only of local improvement. This is both its power and its fundamental limitation."
The Loss Landscape — Iterative Descent with Failure Regions Marked
[Figure: loss curve J(θ) over parameter θ, marking the start θ₀, a local minimum, a saddle point, a plateau region, and the global minimum θ*; the arrow −∇J indicates the descent direction.]
1. Compute the gradient ∇J(θ) — partial derivatives of the loss w.r.t. every parameter. It points in the direction of steepest increase.
2. Negate and scale by α. Move opposite the gradient — downhill — by a controlled step size.
3. Repeat until the gradient ≈ 0. That's a stationary point — but whether it's a minimum, a saddle, or a local trap is unknowable locally.
The loss landscape is non-convex in real neural networks — local minima, saddle points, and plateaus coexist. GD offers no guarantees about which stationary point it finds.
2
The Algorithm — Layers of Difficulty
Trivial formula, non-trivial implementation — where it actually breaks

The update rule fits on one napkin. Making gradient descent reliably train a modern model spans thousands of engineering decisions. Every major advance in deep learning — batch norm, residual connections, Adam, gradient clipping — is a targeted patch for a specific failure mode of vanilla GD.

Difficulty Stack — From Formula to Production-Grade Training
Update Rule (one line) — θ ← θ − α∇J(θ). Any undergraduate implements this in NumPy in 5 minutes. [TRIVIAL]
Backpropagation (autograd / chain rule) — Chain rule through the computation graph. PyTorch/JAX handle this. [EASY W/ LIBRARIES]
Mini-Batch SGD (stochastic variant) — Compute the gradient on a random batch of B samples. Introduces noise that helps escape local minima. [MODERATE]
LR Schedules (warmup, cosine, decay) — A fixed learning rate almost never works for large models. [NON-TRIVIAL]
Gradient Pathologies (vanishing / exploding) — In deep networks, gradients shrink to zero or explode to NaN. Requires clipping, normalisation, and proper initialisation. [HARD]
Ill-Conditioning (curvature mismatch) — The loss surface has different curvature in different directions; vanilla GD oscillates in ravines. [HARD]
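The mini-batch variant in the stack above can be sketched in NumPy. This is a minimal illustration on a least-squares objective; the data shape, batch size, and learning rate are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_sgd(X, y, theta, alpha=0.1, batch=32, epochs=50):
    """Mini-batch SGD for least squares: each step uses a random batch's gradient."""
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)             # reshuffle the data every epoch
        for start in range(0, n, batch):
            b = idx[start:start + batch]
            resid = X[b] @ theta - y[b]
            g = 2 * X[b].T @ resid / len(b)  # batch estimate of the full gradient
            theta = theta - alpha * g
    return theta

# Noiseless synthetic data: SGD should recover the true weights [1, -2].
X = rng.normal(size=(200, 2))
w_true = np.array([1.0, -2.0])
theta = minibatch_sgd(X, X @ w_true, theta=np.zeros(2))
```

Each step sees only B samples, so the gradient is a noisy estimate — the noise that, as noted above, can help escape shallow local minima.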
3
Variant Analysis — What Each Fix Buys You
SGD → Momentum → RMSProp → Adam: the lineage of patches

Each major variant is a direct response to a specific failure mode. Every variant adds state — which adds memory cost but buys more intelligent navigation of the landscape.

Variant Comparison — Formulas and Trade-offs
Vanilla SGD — θ ← θ − α · ∇J(θ_batch). Stochastic mini-batch updates. Oscillates in ravines.
SGD + Momentum — v ← βv + α·∇J(θ); θ ← θ − v. Smooths oscillation; faster on plateaus.
RMSProp — E[g²] ← βE[g²] + (1−β)g²; θ ← θ − α·g/√(E[g²]+ε). Per-parameter adaptive LR.
Adam — m̂, v̂ bias-corrected first and second moments; θ ← θ − α·m̂/(√v̂ + ε). Industry default for transformers.
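A single Adam step, following the moment formulas in the comparison above, fits in a few lines. The learning rate and step count in the usage loop are illustrative, not prescriptive:

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: bias-corrected first and second moment estimates."""
    m = b1 * m + (1 - b1) * g              # first moment (EMA of gradients)
    v = b2 * v + (1 - b2) * g**2           # second moment (EMA of squared gradients)
    m_hat = m / (1 - b1**t)                # bias correction for the zero init
    v_hat = v / (1 - b2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise (theta - 3)^2; t starts at 1 so the bias correction is defined.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 3001):
    theta, m, v = adam_step(theta, 2 * (theta - 3.0), m, v, t)
```

The extra state (m, v) is exactly the memory cost the section mentions: two additional arrays the size of the parameters.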
4
Failure Mode Map
Impact × Frequency map for practical training
Risk Matrix — GD Failure Modes
High Impact · High Frequency
Vanishing gradients
Bad learning rate
Exploding gradients
High Impact · Lower Frequency
Local minima traps
Sharp minima
Low Impact · High Frequency
Saddle point slowdown
Plateau regions
Low Impact · Low Frequency
Ravine oscillation
Float underflow
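Exploding gradients — the high-impact, high-frequency quadrant above — are usually handled with global-norm clipping. A minimal sketch; the 1e-12 guard against division by zero is a defensive assumption:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total = float(np.sqrt(sum(np.sum(g * g) for g in grads)))
    scale = min(1.0, max_norm / (total + 1e-12))   # no-op when already small enough
    return [g * scale for g in grads], total

# A gradient of norm 5 gets rescaled to norm 1.
grads, norm = clip_by_global_norm([np.array([3.0, 4.0])], max_norm=1.0)
```

Clipping preserves the gradient's direction and only caps its magnitude, which is why it is a patch for explosion rather than for ill-conditioning.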
5
The Learning Rate Spectrum
The single most impactful hyperparameter

If gradient descent is the algorithm, learning rate is the dial that determines whether it works. The right answer in modern training is usually a schedule, not a constant.

Learning Rate Spectrum — Divergence to Stagnation
Too high (α ≈ 0.1–1.0): loss diverges or explodes.
Just right (α ≈ 1e-3 to 3e-4): smooth, stable descent.
Too low (α ≈ 1e-6): barely moves.
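A schedule rather than a constant usually means warmup followed by decay. A minimal sketch of linear warmup into cosine decay; the base rate, warmup length, and horizon are illustrative:

```python
import math

def lr_at(step, total_steps, base_lr=3e-4, warmup=100):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup:
        return base_lr * (step + 1) / warmup        # ramp up from near zero
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(s, total_steps=1000) for s in range(1000)]
```

Warmup keeps early, poorly-scaled gradients from blowing up the weights; the cosine tail shrinks steps as the optimum nears, covering both ends of the spectrum above.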
6
Where the Real Insight Lives
From obvious textbook facts to deeper intuition
Insight Pyramid
Update rule: θ ← θ − α∇J(θ). [OBVIOUS]
Mini-batch noise helps generalisation: the noise acts as implicit regularisation. [KNOWN]
GD has an implicit bias toward simpler solutions: why this bias lets overparameterised networks generalise well remains an open research frontier. [DEEPEST]
7
Implementation Complexity by Variant
How hard each variant is to implement correctly
Build Complexity — NumPy from Scratch
Variant · Implementation Effort · Lines of Code · Hardest Step
Vanilla GD · trivial · ~5 · nothing
Momentum · velocity state · ~15 · state-update correctness
Adam / AdamW · bias correction + decoupled weight decay · ~55 · numerical stability and decay handling
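The momentum row's hardest step — getting the state update right — is easiest to see in code. A minimal sketch of the update exactly as written in the variant table; the objective and hyperparameters are illustrative:

```python
def momentum_step(theta, g, v, alpha=0.01, beta=0.9):
    """Momentum update: v <- beta*v + alpha*g, then theta <- theta - v."""
    v = beta * v + alpha * g   # velocity accumulates past gradients
    return theta - v, v

# Minimise (theta - 3)^2; the velocity smooths the trajectory across steps.
theta, v = 5.0, 0.0
for _ in range(500):
    theta, v = momentum_step(theta, 2 * (theta - 3.0), v)
```

The classic correctness bug is updating theta with the stale velocity (or resetting v each step), which silently degrades momentum back to vanilla GD.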
8
When to Abandon Gradient Descent
Use GD vs alternatives
Decision Table — Use GD or Alternatives?
Situation · Use GD? · Alternative · Reason
Training a neural network · ✓ Yes · AdamW · Standard at scale.
Non-differentiable objective · ✗ No · Evolutionary / Bayesian methods · No gradient exists.
Hyperparameter search · ✗ No · Bayesian optimisation · The validation objective is not directly differentiable.
Very small datasets · ⚠ Maybe · L-BFGS / Newton · Second-order methods can converge faster.

Key references: Cauchy (1847) · Rumelhart, Hinton & Williams (1986) · AdaGrad (2011) · Adam (2014) · AdamW (2019) · Double Descent (2019) · Spectral Bias (2019) · Lion (2023). Learning foundations volume. Feb 2026.