PHASE 2 Deep Networks · Day 13 of 80 · makemore & GPT

Becoming a Backprop Ninja — Manual Tensor Backprop

Compute gradients by hand through every operation: matrix multiply, batch norm, cross-entropy. No autograd allowed.

The investor who truly understands derivatives pricing can price any exotic instrument from first principles. The engineer who can manually backpropagate through any operation has the same superpower: nothing in the automatic system is a black box. Today you earn that power. — Day 13 Principle, adapted from the Marks framework

I. Why Manual Backprop Matters

PyTorch’s autograd computes gradients automatically. So why do it by hand? Because understanding beats automation. When your model doesn’t converge, when gradients explode, when loss plateaus — you need to reason about gradient flow. That requires knowing exactly how gradients propagate through matmul, tanh, softmax, cross-entropy, and batch normalization.

II. Gradient Through Matrix Multiply

# Forward:
h = emb @ W + b

# Backward (given dL/dh = dh):
demb = dh @ W.T    # [B, emb_dim]
dW = emb.T @ dh    # [emb_dim, hidden]
db = dh.sum(0)     # [hidden]
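The three rules above can be checked end to end against autograd. Here is a minimal sketch; the shapes (batch 32, embedding dim 10, hidden dim 64) are illustrative, not taken from the source:

```python
import torch

torch.manual_seed(0)
B, emb_dim, hidden = 32, 10, 64          # illustrative shapes only
emb = torch.randn(B, emb_dim, requires_grad=True)
W = torch.randn(emb_dim, hidden, requires_grad=True)
b = torch.randn(hidden, requires_grad=True)

# Forward pass and a scalar loss so autograd has something to differentiate
h = emb @ W + b
loss = h.sum()
loss.backward()

# Manual backward: for loss = h.sum(), dL/dh is all ones
dh = torch.ones_like(h)
demb = dh @ W.T      # [B, emb_dim]
dW = emb.T @ dh      # [emb_dim, hidden]
db = dh.sum(0)       # [hidden]

print(torch.allclose(demb, emb.grad),
      torch.allclose(dW, W.grad),
      torch.allclose(db, b.grad))  # True True True
```

Note that `dW = emb.T @ dh` sums gradient contributions over the batch automatically, which is why no explicit `.sum(0)` is needed for the weight.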

III. Gradient Through BatchNorm

# Forward:
xhat = (x - mean) / torch.sqrt(var + eps)
out = gamma * xhat + beta

# Backward (given dout):
dgamma = (dout * xhat).sum(0)
dbeta = dout.sum(0)
dxhat = dout * gamma

# dvar, dmean, dx follow the chain rule through the normalization
n = x.shape[0]
dvar = (dxhat * (x - mean) * -0.5 * (var + eps)**-1.5).sum(0)
dmean = (-dxhat / torch.sqrt(var + eps)).sum(0)
dx = dxhat / torch.sqrt(var + eps) + dvar * 2 * (x - mean) / n + dmean / n

The Verification Pattern

After computing each gradient manually, compare it to PyTorch’s autograd result: torch.allclose(dW_manual, W.grad, atol=1e-5). If they don’t match, your manual derivation has a bug. This is the gold standard for verifying gradient correctness.
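The same pattern applies to cross-entropy, whose gradient reduces to softmax(logits) - one_hot(target). A minimal sketch with an illustrative batch of 8 and 5 classes (note that `F.cross_entropy` averages over the batch, so the manual gradient must be divided by the batch size):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, C = 8, 5                                # illustrative batch and class count
logits = torch.randn(B, C, requires_grad=True)
targets = torch.randint(0, C, (B,))

loss = F.cross_entropy(logits, targets)    # mean reduction over the batch
loss.backward()

# Manual gradient: (softmax(logits) - one_hot(targets)) / B
probs = F.softmax(logits.detach(), dim=1)
dlogits = probs.clone()
dlogits[range(B), targets] -= 1.0
dlogits /= B

print(torch.allclose(dlogits, logits.grad, atol=1e-6))  # True
```

If the comparison fails, widening `atol` slightly distinguishes a float32 precision gap from a genuine derivation bug: precision gaps close at ~1e-5, real bugs do not.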

IV. The Matrix — What Matters Today

The matrix crosses effort (Quick to Do / Slow but Worth It) against payoff (Builds Deep Intuition / Surface-Level Only):

Quick to Do × Builds Deep Intuition — 🎯 DO FIRST
Manually backprop through the forward pass: embedding lookup → matmul → BatchNorm → tanh → matmul → cross_entropy.

Quick to Do × Surface-Level Only — ⏭️ DO IF TIME
Verify every manual gradient against torch.autograd using allclose.

Slow but Worth It × Builds Deep Intuition — 🖐 DO CAREFULLY
Derive the cross-entropy gradient on paper first. The math is cleaner than you’d expect: softmax(logits) - one_hot(target).

Slow but Worth It × Surface-Level Only — 🚫 AVOID TODAY
Skipping any operation. Every single gradient must be computed manually. No shortcuts.

V. Today’s Deliverables

You are now a backprop ninja. You can trace gradient flow through any operation, diagnose where gradients vanish or explode, and verify any autograd implementation. This skill is rare and invaluable. Tomorrow: cross-entropy and softmax in depth. — Day 13 Closing Principle
Day 13 Notebook — Manual Tensor Backprop (Runnable Python)

Hand-computed gradients for matrix multiply, batch normalization, and cross-entropy — all verified against PyTorch autograd.