I. Why Manual Backprop Matters
PyTorch’s autograd computes gradients automatically. So why do it by hand? Because understanding beats automation. When your model doesn’t converge, when gradients explode, when loss plateaus — you need to reason about gradient flow. That requires knowing exactly how gradients propagate through matmul, tanh, softmax, cross-entropy, and batch normalization.
II. Gradient Through Matrix Multiply
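For a linear layer h = X @ W, the chain rule gives dW = X.T @ dh and dX = dh @ W.T, where dh is the gradient flowing in from downstream. A minimal sketch verifying both against autograd (shapes and variable names here are illustrative, not from any particular codebase):

```python
import torch

torch.manual_seed(0)
X = torch.randn(32, 10, requires_grad=True)
W = torch.randn(10, 20, requires_grad=True)

h = X @ W                      # forward: (32, 10) @ (10, 20) -> (32, 20)
dh = torch.randn_like(h)       # pretend upstream gradient
h.backward(dh)

with torch.no_grad():
    dW_manual = X.T @ dh       # (10, 32) @ (32, 20) -> (10, 20), matches W
    dX_manual = dh @ W.T       # (32, 20) @ (20, 10) -> (32, 10), matches X

assert torch.allclose(dW_manual, W.grad, atol=1e-5)
assert torch.allclose(dX_manual, X.grad, atol=1e-5)
```

A useful sanity check: the shapes alone almost force the answer, since dW must match W and dX must match X.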
III. Gradient Through BatchNorm
The Verification Pattern
After computing each gradient manually, compare it to PyTorch’s autograd result: torch.allclose(dW_manual, W.grad, atol=1e-5). If they don’t match, your manual derivation has a bug. This is the gold standard for verifying gradient correctness.
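The simplest instance of this pattern is tanh, whose local derivative is 1 - tanh(x)^2. A sketch of the check (variable names are mine):

```python
import torch

torch.manual_seed(0)
x = torch.randn(32, 100, requires_grad=True)
t = torch.tanh(x)
dout = torch.randn_like(t)     # pretend upstream gradient
t.backward(dout)

# d/dx tanh(x) = 1 - tanh(x)^2, scaled by the upstream gradient
dx_manual = (1 - t.detach() ** 2) * dout

assert torch.allclose(dx_manual, x.grad, atol=1e-5)
```

Reusing t rather than recomputing tanh(x) mirrors what you do in a real backward pass: cache forward activations and express local derivatives in terms of them.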
IV. The Matrix — What Matters Today
DO FIRST
Manually backprop through the forward pass: embedding lookup → matmul → BatchNorm → tanh → matmul → cross_entropy.
DO IF TIME
Verify every manual gradient against torch.autograd using allclose.
DO CAREFULLY
Derive the cross-entropy gradient on paper first. The math is cleaner than you’d expect: softmax(logits) - one_hot(target).
AVOID TODAY
Skipping any operation. Every single gradient must be computed manually. No shortcuts.
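The cross-entropy derivation above can be checked numerically in a few lines. One detail the paper derivation often omits: F.cross_entropy averages over the batch by default, so the manual gradient must be divided by the batch size. A sketch (sizes are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, C = 32, 27
logits = torch.randn(B, C, requires_grad=True)
targets = torch.randint(0, C, (B,))

loss = F.cross_entropy(logits, targets)   # reduction='mean' by default
loss.backward()

# softmax(logits) - one_hot(targets), divided by B for the mean reduction
dlogits = F.softmax(logits.detach(), dim=1)
dlogits[range(B), targets] -= 1.0
dlogits /= B

assert torch.allclose(dlogits, logits.grad, atol=1e-5)
```

Subtracting 1 only at the target indices is the one_hot subtraction done in place, without materializing the one-hot matrix.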
V. Today’s Deliverables
- Matmul gradient: Derive and implement dW = X.T @ dh, dX = dh @ W.T
- BatchNorm gradient: Full manual backward through normalize, gamma, beta
- Cross-entropy gradient: Derive the elegant softmax - one_hot form
- Tanh gradient: dx = (1 - tanh(x)**2) * dout
- Verification: Every manual gradient matches torch.autograd within 1e-5
- Full chain: Manual backward pass through the entire makemore MLP
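The deliverables above chain together into one end-to-end backward pass. A compact sketch of the whole pipeline — embedding lookup, matmul, BatchNorm, tanh, matmul, cross-entropy — with layer sizes chosen for illustration, not the actual makemore configuration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, block_size, n_emb, n_hidden, vocab = 32, 3, 10, 64, 27
eps = 1e-5
C  = torch.randn(vocab, n_emb, requires_grad=True)
W1 = (torch.randn(block_size * n_emb, n_hidden) * 0.1).requires_grad_()
gamma = torch.ones(n_hidden, requires_grad=True)
beta  = torch.zeros(n_hidden, requires_grad=True)
W2 = (torch.randn(n_hidden, vocab) * 0.1).requires_grad_()
Xb = torch.randint(0, vocab, (B, block_size))
Yb = torch.randint(0, vocab, (B,))

# forward pass
emb = C[Xb].view(B, -1)                  # embedding lookup + flatten
h_pre = emb @ W1                         # matmul
mu = h_pre.mean(0, keepdim=True)
var = h_pre.var(0, keepdim=True, unbiased=False)
xhat = (h_pre - mu) / torch.sqrt(var + eps)
h_bn = gamma * xhat + beta               # BatchNorm
h = torch.tanh(h_bn)                     # tanh
logits = h @ W2                          # matmul
loss = F.cross_entropy(logits, Yb)       # cross-entropy (mean)
loss.backward()

# manual backward pass, one op at a time
with torch.no_grad():
    dlogits = F.softmax(logits, dim=1)
    dlogits[range(B), Yb] -= 1.0
    dlogits /= B
    dW2 = h.T @ dlogits
    dh = dlogits @ W2.T
    dh_bn = (1 - h ** 2) * dh            # tanh local derivative
    dgamma = (dh_bn * xhat).sum(0)
    dbeta = dh_bn.sum(0)
    dxhat = dh_bn * gamma
    inv_std = (var + eps) ** -0.5
    dh_pre = inv_std / B * (B * dxhat
                            - dxhat.sum(0, keepdim=True)
                            - xhat * (dxhat * xhat).sum(0, keepdim=True))
    dW1 = emb.T @ dh_pre
    demb = (dh_pre @ W1.T).view(B, block_size, n_emb)
    dC = torch.zeros_like(C)             # scatter-add grads back to rows of C
    dC.index_add_(0, Xb.view(-1), demb.view(-1, n_emb))

for manual, p in [(dW2, W2), (dgamma, gamma), (dbeta, beta), (dW1, W1), (dC, C)]:
    assert torch.allclose(manual, p.grad, atol=1e-5)
```

The embedding gradient is the one step with no matmul analogue: rows of C that appear multiple times in Xb must accumulate their gradients, which index_add_ handles.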
Hand-computed gradients for matrix multiply, batch normalization, and cross-entropy — all verified against PyTorch autograd.