I. The Two Directions — Data Down, Gradients Up
Think of the computation graph as a river system. In the forward pass, data flows downstream — from inputs to output, accumulating through operations. In the backward pass, gradients flow upstream — from the output back to each input, splitting at every junction. The question each gradient answers: “How much of the final result is my fault?”
II. The Only Two Local Rules You Need
Every backward pass, no matter how deep the network, decomposes into just two local operations applied in sequence. Master these and you understand all of backpropagation.
Why Only Two Rules?
Karpathy’s micrograd starts with just + and × because every other operation can be built from them. Subtraction a − b is just a + ((−1) × b). Division and exponentiation come later. If you deeply understand these two gradient rules and the chain rule, you understand the backward pass for any expression.
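The two rules can be sketched as micrograd-style _backward closures. This is a minimal illustrative Value class under my own assumptions, not the notebook's full implementation:

```python
class Value:
    """Minimal micrograd-style autograd node (illustrative sketch)."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None  # no-op for leaf nodes
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # Rule 1: + routes the output gradient to both inputs unchanged.
            self.grad += out.grad * 1.0
            other.grad += out.grad * 1.0
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # Rule 2 (the swap rule): each input's gradient is the
            # OTHER input's data, scaled by the output gradient.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out


# Manual backward pass for L = a*b + c, right to left:
a, b, c = Value(2.0), Value(-3.0), Value(10.0)
e = a * b          # -6.0
L = e + c          # 4.0
L.grad = 1.0       # seed dL/dL = 1
L._backward()      # + rule: e.grad = 1.0, c.grad = 1.0
e._backward()      # swap rule: a.grad = -3.0, b.grad = 2.0
```

Note that both closures accumulate with +=; this matters as soon as a variable feeds more than one operation.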
III. Step-by-Step Waterfall — Forward Mirrors Backward
Howard Marks teaches that risk analysis means tracing cause and effect through a chain. The waterfall below shows how the forward and backward passes mirror each other — one builds the expression, the other unwinds it.
IV. The Implementation — Hands-On Notebook
All the code for today lives in a clean, runnable Jupyter notebook. It covers:
- the enhanced Value class with _backward,
- manual gradient propagation step by step,
- PyTorch verification,
- the += accumulation trap,
- a complex multi-branch expression challenge.
day2_backward_pass.ipynb
6 sections · Value class + _backward · Manual backprop · PyTorch verification · Exercises
V. The Gradient Thermometer — Reading the Results
Marks reads a risk report by asking “what does this number mean for my decision?” Do the same with gradients. Each one has a direction (positive = increase L, negative = decrease L) and a magnitude (how strongly this input influences the output).
Reading this chart: if you were training a neural network and wanted to increase L, you’d nudge b upward (positive gradient) and a downward (negative gradient). The magnitudes tell you how much each nudge matters: a has 3× more leverage than c.
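The nudging logic can be checked numerically. The values below are hypothetical, chosen to match the signs and magnitudes described (a negative, b positive, a with 3× the leverage of c); the chart's actual numbers may differ:

```python
# Hypothetical inputs for L = a*b + c: a=2, b=-3, c=10 gives L = 4 with
# dL/da = b = -3, dL/db = a = 2, dL/dc = 1 (hand-derived).
a, b, c = 2.0, -3.0, 10.0
grad_a, grad_b, grad_c = b, a, 1.0

lr = 0.01  # small step size
# Nudge each input IN the direction of its gradient to increase L:
# a moves down (negative gradient), b and c move up.
a2, b2, c2 = a + lr * grad_a, b + lr * grad_b, c + lr * grad_c

L_before = a * b + c    # 4.0
L_after = a2 * b2 + c2  # slightly larger than 4.0
```

This is gradient ascent in miniature; flip the sign of the step and you get the usual descent used to decrease a loss.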
The += Accumulation Trap
When the same variable appears twice (e.g., L = a + a), its gradient is 2.0, not 1.0. Each usage of a contributes +1 to the gradient, and these accumulate. Using = instead of += silently overwrites the first contribution — producing wrong gradients that are extremely hard to debug. The notebook covers this in Part 5.
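The trap fits in a few lines. This sketch strips away the Value class entirely and just replays the two gradient writes that the + rule performs for L = a + a:

```python
class V:
    """Bare node with only the fields needed to show the trap."""
    def __init__(self, data):
        self.data, self.grad = data, 0.0

out_grad = 1.0   # dL/dL, seeded to 1

# L = a + a: the + rule fires once per usage of a.
a = V(3.0)
a.grad += out_grad   # contribution from the first usage
a.grad += out_grad   # contribution from the second usage
correct = a.grad     # 2.0

a = V(3.0)
a.grad = out_grad    # first usage
a.grad = out_grad    # plain = silently discards the first contribution
buggy = a.grad       # 1.0, and no error is raised anywhere
```

The buggy version still produces a plausible-looking number, which is exactly why this bug is so hard to spot in a larger graph.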
VI. The Matrix — What Matters Today
DO FIRST
Open the notebook. Add _backward to __add__ and __mul__. Run the manual backward pass. Verify every gradient against PyTorch.
DO IF TIME
Watch 3Blue1Brown’s chain rule video. It builds good visual intuition, but you’ll get more from coding it yourself first.
DO CAREFULLY
Part 6 of the notebook: L = (a*b + c) * (a + b). Draw it on paper first, predict gradients, then verify. This is where real understanding forms.
AVOID TODAY
Implementing topological sort or the full .backward() automation. That’s Day 3. Today is strictly about understanding local gradient rules.
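The “DO CAREFULLY” expression can be hand-derived by applying the two rules branch by branch. Here is a plain-Python sketch of that paper derivation with hypothetical inputs a=2, b=−3, c=10 (predict the numbers yourself before reading the comments):

```python
# Forward pass for L = (a*b + c) * (a + b), left to right.
a, b, c = 2.0, -3.0, 10.0
d = a * b + c   # 4.0
e = a + b       # -1.0
L = d * e       # -4.0

# Backward pass, right to left. Seed dL/dL = 1.
dL = 1.0
dd = e * dL     # swap rule at the final ×: d's grad is e's data -> -1.0
de = d * dL     # swap rule at the final ×: e's grad is d's data ->  4.0

# a and b each feed TWO branches (d and e), so their
# contributions must accumulate, exactly like += in the Value class.
da = b * dd + 1.0 * de   # via d (× swap rule) and via e (+ rule) -> 7.0
db = a * dd + 1.0 * de   # via d (× swap rule) and via e (+ rule) -> 2.0
dc = 1.0 * dd            # via d only (+ rule) -> -1.0
```

If your paper prediction disagrees, check first whether you accumulated both branches for a and b; forgetting one branch is the same mistake as the = versus += trap.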
VII. Today’s Deliverables
- _backward for +: both inputs get += out.grad × 1.0
- _backward for ×: each input gets += other.data × out.grad (the swap rule)
- Manual backward: seed L.grad = 1.0, call _backward() right-to-left, get all gradients
- PyTorch verify: rebuild the expression with requires_grad=True; all gradients must match exactly
- += trap: test a + a, confirm a.grad == 2.0, understand why
- Complex test: compute gradients for (a*b + c) * (a + b) on paper, then verify in the notebook
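The PyTorch verification step can be sketched as follows (assumes torch is installed; the inputs a=2, b=−3, c=10 are hypothetical values used throughout these examples):

```python
import torch

# Rebuild L = (a*b + c) * (a + b) as tensors that track gradients.
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(-3.0, requires_grad=True)
c = torch.tensor(10.0, requires_grad=True)

L = (a * b + c) * (a + b)
L.backward()   # seeds dL/dL = 1.0 and propagates automatically

# Autograd should reproduce the hand-derived values exactly:
# a.grad = 7.0, b.grad = 2.0, c.grad = -1.0
print(a.grad.item(), b.grad.item(), c.grad.item())
```

The match should be exact here, not approximate: the hand rules and autograd perform the same float operations in the same order for this small graph.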
Implements backpropagation with _backward closures, manual gradient computation, gradient accumulation demo, and PyTorch verification.