PHASE 2 Deep Networks · Day 13 of 80 · makemore & GPT

Becoming a Backprop Ninja — Manual Tensor Backprop

Compute gradients by hand through every operation: matrix multiply, batch norm, cross-entropy. No autograd allowed.

The investor who truly understands derivatives pricing can price any exotic instrument from first principles. The engineer who can manually backpropagate through any operation has the same superpower: nothing in the automatic system is a black box. Today you earn that power. — Day 13 Principle, adapted from the Marks framework

I. Why Manual Backprop Matters

PyTorch’s autograd computes gradients automatically. So why do it by hand? Because understanding beats automation. When your model doesn’t converge, when gradients explode, when loss plateaus — you need to reason about gradient flow. That requires knowing exactly how gradients propagate through matmul, tanh, softmax, cross-entropy, and batch normalization.

II. Gradient Through Matrix Multiply

# Forward:
h = emb @ W + b

# Backward (given dL/dh = dh):
demb = dh @ W.T    # [B, emb_dim]
dW = emb.T @ dh    # [emb_dim, hidden]
db = dh.sum(0)     # [hidden]
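The three rules above can be checked end to end against autograd. Here is a minimal sketch; the shapes (batch 32, embedding dim 10, hidden dim 64) are illustrative, not taken from the source:

```python
import torch

torch.manual_seed(0)
B, emb_dim, hidden = 32, 10, 64          # illustrative shapes only
emb = torch.randn(B, emb_dim, requires_grad=True)
W = torch.randn(emb_dim, hidden, requires_grad=True)
b = torch.randn(hidden, requires_grad=True)

# Forward pass and a scalar loss so autograd has something to differentiate
h = emb @ W + b
loss = h.sum()
loss.backward()

# Manual backward: for loss = h.sum(), dL/dh is all ones
dh = torch.ones_like(h)
demb = dh @ W.T      # [B, emb_dim]
dW = emb.T @ dh      # [emb_dim, hidden]
db = dh.sum(0)       # [hidden]

print(torch.allclose(demb, emb.grad),
      torch.allclose(dW, W.grad),
      torch.allclose(db, b.grad))  # True True True
```

Note that `dW = emb.T @ dh` sums gradient contributions over the batch automatically, which is why no explicit `.sum(0)` is needed for the weight.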

III. Gradient Through BatchNorm

# Forward:
xhat = (x - mean) / torch.sqrt(var + eps)
out = gamma * xhat + beta

# Backward (given dout):
dgamma = (dout * xhat).sum(0)
dbeta = dout.sum(0)
dxhat = dout * gamma

# dvar, dmean, dx follow the chain rule through the normalization
n = x.shape[0]
dvar = (dxhat * (x - mean) * -0.5 * (var + eps)**-1.5).sum(0)
dmean = (-dxhat / torch.sqrt(var + eps)).sum(0)
dx = dxhat / torch.sqrt(var + eps) + dvar * 2 * (x - mean) / n + dmean / n

The Verification Pattern

After computing each gradient manually, compare it to PyTorch’s autograd result: torch.allclose(dW_manual, W.grad, atol=1e-5). If they don’t match, your manual derivation has a bug. This is the gold standard for verifying gradient correctness.
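The same pattern applies to cross-entropy, whose gradient reduces to softmax(logits) - one_hot(target). A minimal sketch with an illustrative batch of 8 and 5 classes (note that `F.cross_entropy` averages over the batch, so the manual gradient must be divided by the batch size):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, C = 8, 5                                # illustrative batch and class count
logits = torch.randn(B, C, requires_grad=True)
targets = torch.randint(0, C, (B,))

loss = F.cross_entropy(logits, targets)    # mean reduction over the batch
loss.backward()

# Manual gradient: (softmax(logits) - one_hot(targets)) / B
probs = F.softmax(logits.detach(), dim=1)
dlogits = probs.clone()
dlogits[range(B), targets] -= 1.0
dlogits /= B

print(torch.allclose(dlogits, logits.grad, atol=1e-6))  # True
```

If the comparison fails, widening `atol` slightly distinguishes a float32 precision gap from a genuine derivation bug: precision gaps close at ~1e-5, real bugs do not.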

IV. The Matrix — What Matters Today

The matrix crosses effort (Quick to Do / Slow but Worth It) against payoff (Builds Deep Intuition / Surface-Level Only):

Quick to Do × Builds Deep Intuition — 🎯 DO FIRST
Manually backprop through the forward pass: embedding lookup → matmul → BatchNorm → tanh → matmul → cross_entropy.

Quick to Do × Surface-Level Only — ⏭️ DO IF TIME
Verify every manual gradient against torch.autograd using allclose.

Slow but Worth It × Builds Deep Intuition — 🖐 DO CAREFULLY
Derive the cross-entropy gradient on paper first. The math is cleaner than you’d expect: softmax(logits) - one_hot(target).

Slow but Worth It × Surface-Level Only — 🚫 AVOID TODAY
Skipping any operation. Every single gradient must be computed manually. No shortcuts.

V. Today’s Deliverables

You are now a backprop ninja. You can trace gradient flow through any operation, diagnose where gradients vanish or explode, and verify any autograd implementation. This skill is rare and invaluable. Tomorrow: cross-entropy and softmax in depth. — Day 13 Closing Principle
Day 13 Notebook — Manual Tensor Backprop (Runnable Python)

Hand-computed gradients for matrix multiply, batch normalization, and cross-entropy — all verified against PyTorch autograd.