I. The Two Directions — Data Down, Gradients Up
Think of the computation graph as a river system. In the forward pass, data flows downstream — from inputs to output, accumulating through operations. In the backward pass, gradients flow upstream — from the output back to each input, splitting at every junction. The question each gradient answers: “How much of the final result is my fault?”
II. The Only Two Local Rules You Need
Every backward pass, no matter how deep the network, decomposes into just two local operations applied in sequence. Master these and you understand all of backpropagation.
Why Only Two Rules?
Karpathy’s micrograd starts with just + and × because every other operation can be built from them. Subtraction a − b is just a + ((−1) × b). Division and exponentiation come later. If you deeply understand these two gradient rules and the chain rule, you understand the backward pass for any expression.
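The two rules can be sketched as micrograd-style _backward closures. This is a minimal illustrative Value class under my own assumptions, not the notebook's full implementation:

```python
class Value:
    """Minimal micrograd-style autograd node (illustrative sketch)."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None  # no-op for leaf nodes
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # Rule 1: + routes the output gradient to both inputs unchanged.
            self.grad += out.grad * 1.0
            other.grad += out.grad * 1.0
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # Rule 2 (the swap rule): each input's gradient is the
            # OTHER input's data, scaled by the output gradient.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out


# Manual backward pass for L = a*b + c, right to left:
a, b, c = Value(2.0), Value(-3.0), Value(10.0)
e = a * b          # -6.0
L = e + c          # 4.0
L.grad = 1.0       # seed dL/dL = 1
L._backward()      # + rule: e.grad = 1.0, c.grad = 1.0
e._backward()      # swap rule: a.grad = -3.0, b.grad = 2.0
```

Note that both closures accumulate with +=; this matters as soon as a variable feeds more than one operation.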
III. Step-by-Step Waterfall — Forward Mirrors Backward
Howard Marks teaches that risk analysis means tracing cause and effect through a chain. The waterfall below shows how the forward and backward passes mirror each other — one builds the expression, the other unwinds it.
IV. The Implementation — Hands-On Notebook
All the code for today lives in a clean, runnable Jupyter notebook. It covers:
- the enhanced Value class with _backward,
- manual gradient propagation step by step,
- PyTorch verification,
- the += accumulation trap,
- a complex multi-branch expression challenge.
day2_backward_pass.ipynb
6 sections · Value class + _backward · Manual backprop · PyTorch verification · Exercises
V. The Gradient Thermometer — Reading the Results
Marks reads a risk report by asking “what does this number mean for my decision?” Do the same with gradients. Each one has a direction (positive = increase L, negative = decrease L) and a magnitude (how strongly this input influences the output).
Reading this chart: if you were training a neural network and wanted to increase L, you’d nudge b upward (positive gradient) and a downward (negative gradient). The magnitudes tell you how much each nudge matters: a has 3× more leverage than c.
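The nudging logic can be checked numerically. The values below are hypothetical, chosen to match the signs and magnitudes described (a negative, b positive, a with 3× the leverage of c); the chart's actual numbers may differ:

```python
# Hypothetical inputs for L = a*b + c: a=2, b=-3, c=10 gives L = 4 with
# dL/da = b = -3, dL/db = a = 2, dL/dc = 1 (hand-derived).
a, b, c = 2.0, -3.0, 10.0
grad_a, grad_b, grad_c = b, a, 1.0

lr = 0.01  # small step size
# Nudge each input IN the direction of its gradient to increase L:
# a moves down (negative gradient), b and c move up.
a2, b2, c2 = a + lr * grad_a, b + lr * grad_b, c + lr * grad_c

L_before = a * b + c    # 4.0
L_after = a2 * b2 + c2  # slightly larger than 4.0
```

This is gradient ascent in miniature; flip the sign of the step and you get the usual descent used to decrease a loss.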
The += Accumulation Trap
When the same variable appears twice (e.g., L = a + a), its gradient is 2.0, not 1.0. Each usage of a contributes +1 to the gradient, and these accumulate. Using = instead of += silently overwrites the first contribution — producing wrong gradients that are extremely hard to debug. The notebook covers this in Part 5.
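The trap fits in a few lines. This sketch strips away the Value class entirely and just replays the two gradient writes that the + rule performs for L = a + a:

```python
class V:
    """Bare node with only the fields needed to show the trap."""
    def __init__(self, data):
        self.data, self.grad = data, 0.0

out_grad = 1.0   # dL/dL, seeded to 1

# L = a + a: the + rule fires once per usage of a.
a = V(3.0)
a.grad += out_grad   # contribution from the first usage
a.grad += out_grad   # contribution from the second usage
correct = a.grad     # 2.0

a = V(3.0)
a.grad = out_grad    # first usage
a.grad = out_grad    # plain = silently discards the first contribution
buggy = a.grad       # 1.0, and no error is raised anywhere
```

The buggy version still produces a plausible-looking number, which is exactly why this bug is so hard to spot in a larger graph.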
VI. The Matrix — What Matters Today
DO FIRST
Open the notebook. Add _backward to __add__ and __mul__. Run the manual backward pass. Verify every gradient against PyTorch.
DO IF TIME
Watch 3Blue1Brown’s chain rule video. It builds good visual intuition, but you’ll get more from coding it yourself first.
DO CAREFULLY
Part 6 of the notebook: L = (a*b + c) * (a + b). Draw it on paper first, predict gradients, then verify. This is where real understanding forms.
AVOID TODAY
Implementing topological sort or the full .backward() automation. That’s Day 3. Today is strictly about understanding local gradient rules.
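The “DO CAREFULLY” expression can be hand-derived by applying the two rules branch by branch. Here is a plain-Python sketch of that paper derivation with hypothetical inputs a=2, b=−3, c=10 (predict the numbers yourself before reading the comments):

```python
# Forward pass for L = (a*b + c) * (a + b), left to right.
a, b, c = 2.0, -3.0, 10.0
d = a * b + c   # 4.0
e = a + b       # -1.0
L = d * e       # -4.0

# Backward pass, right to left. Seed dL/dL = 1.
dL = 1.0
dd = e * dL     # swap rule at the final ×: d's grad is e's data -> -1.0
de = d * dL     # swap rule at the final ×: e's grad is d's data ->  4.0

# a and b each feed TWO branches (d and e), so their
# contributions must accumulate, exactly like += in the Value class.
da = b * dd + 1.0 * de   # via d (× swap rule) and via e (+ rule) -> 7.0
db = a * dd + 1.0 * de   # via d (× swap rule) and via e (+ rule) -> 2.0
dc = 1.0 * dd            # via d only (+ rule) -> -1.0
```

If your paper prediction disagrees, check first whether you accumulated both branches for a and b; forgetting one branch is the same mistake as the = versus += trap.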
VII. Today’s Deliverables
- _backward for +: both inputs get += out.grad × 1.0
- _backward for ×: each input gets += other.data × out.grad (the swap rule)
- Manual backward: seed L.grad = 1.0, call _backward() right-to-left, get all gradients
- PyTorch verify: rebuild the expression with requires_grad=True; all gradients must match exactly
- += trap: test a + a, confirm a.grad == 2.0, understand why
- Complex test: compute gradients for (a*b + c) * (a + b) on paper, then verify in the notebook
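The PyTorch verification step can be sketched as follows (assumes torch is installed; the inputs a=2, b=−3, c=10 are hypothetical values used throughout these examples):

```python
import torch

# Rebuild L = (a*b + c) * (a + b) as tensors that track gradients.
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(-3.0, requires_grad=True)
c = torch.tensor(10.0, requires_grad=True)

L = (a * b + c) * (a + b)
L.backward()   # seeds dL/dL = 1.0 and propagates automatically

# Autograd should reproduce the hand-derived values exactly:
# a.grad = 7.0, b.grad = 2.0, c.grad = -1.0
print(a.grad.item(), b.grad.item(), c.grad.item())
```

The match should be exact here, not approximate: the hand rules and autograd perform the same float operations in the same order for this small graph.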
Implements backpropagation with _backward closures, manual gradient computation, gradient accumulation demo, and PyTorch verification.