PHASE 1 Foundations · Day 2 of 80 · Neural Networks & Backprop

The Backward Pass — How Blame Flows

Implement the chain rule. Compute gradients. Learn how a neural network traces responsibility backwards from output to every input.

Risk means more things can happen than will happen. A gradient is the mirror image of this: it tells you, of all the ways this output could have been different, exactly how much each input is responsible. The backward pass is a system for attributing blame with mathematical precision. — Day 2 Principle, adapted from the Marks framework

I. The Two Directions — Data Down, Gradients Up

Think of the computation graph as a river system. In the forward pass, data flows downstream — from inputs to output, accumulating through operations. In the backward pass, gradients flow upstream — from the output back to each input, splitting at every junction. The question each gradient answers: “How much of the final result is my fault?”

Exhibit A — The Same Expression, Two Perspectives
▶ FORWARD: COMPUTE VALUES (LEFT → RIGHT)
  a = 2.0          leaf input
  b = −3.0         leaf input
  d = a × b = −6.0
  c = 10.0         leaf input
  L = d + c = 4.0  ← output

◀ BACKWARD: COMPUTE GRADIENTS (RIGHT → LEFT)
  L.grad = 1.0     seed: dL/dL = 1
  d.grad = 1.0     + distributes: both children get G
  c.grad = 1.0     + distributes: both children get G
  a.grad = −3.0    × swaps: a gets b.data × G
  b.grad = 2.0     × swaps: b gets a.data × G

Each gradient answers: “If I nudge this input by ε, how much does L change?”
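The “nudge by ε” definition can be checked directly with plain arithmetic. This is a quick finite-difference sanity check (not part of the notebook): perturb each input of L = (a × b) + c by a tiny ε and measure how much L moves.

```python
# Finite-difference check for Exhibit A: the ratio ΔL/ε approximates
# each input's gradient for L = (a * b) + c.

def L(a, b, c):
    return a * b + c

eps = 1e-6
base = L(2.0, -3.0, 10.0)  # forward pass: 4.0

grad_a = (L(2.0 + eps, -3.0, 10.0) - base) / eps  # ≈ -3.0 (= b's value)
grad_b = (L(2.0, -3.0 + eps, 10.0) - base) / eps  # ≈  2.0 (= a's value)
grad_c = (L(2.0, -3.0, 10.0 + eps) - base) / eps  # ≈  1.0 (passes through +)

print(grad_a, grad_b, grad_c)
```

The numerical estimates match the analytic gradients in the exhibit, which is exactly the verification trick used whenever a hand-derived backward pass needs checking.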

II. The Only Two Local Rules You Need

Every backward pass, no matter how deep the network, decomposes into just two local operations applied in sequence. Master these and you understand all of backpropagation.

Exhibit B — The Two Gradient Rules, Contrasted
The Distributor — c = a + b (upstream gradient G flows into +)
  ∂(a+b)/∂a = 1  →  a.grad += 1 × G
  ∂(a+b)/∂b = 1  →  b.grad += 1 × G
  Both inputs receive the full upstream gradient. Addition is a gradient “splitter” — it copies G to both sides.

The Swapper — c = a × b (upstream gradient G flows into ×)
  ∂(a×b)/∂a = b  →  a.grad += b.data × G
  ∂(a×b)/∂b = a  →  b.grad += a.data × G
  Each input receives the other’s value × upstream gradient. Multiplication crosses the wires — a gets b’s value, b gets a’s.

Why Only Two Rules?

Karpathy’s micrograd starts with just + and × because every other operation can be built from them. Subtraction is a + (−1 × b). Division and exponentiation come later. If you deeply understand these two gradient rules and the chain rule, you understand the backward pass for any expression.
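The two rules translate almost line-for-line into code. Below is a minimal sketch of a micrograd-style Value class, assuming the _backward-closure design described above (the exact class in the notebook may differ in detail):

```python
# Minimal Value class sketch: each operation stores a _backward closure
# that applies exactly one of the two local gradient rules.

class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None  # leaf nodes propagate nothing
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # The Distributor: both inputs get the full upstream gradient
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # The Swapper: each input gets the OTHER's value × upstream gradient
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

# Manual use (no automation yet): call _backward from output toward inputs
a, b, c = Value(2.0), Value(-3.0), Value(10.0)
d = a * b
L = d + c
L.grad = 1.0     # seed: dL/dL = 1
L._backward()    # + distributes to d and c
d._backward()    # × swaps values for a and b
print(a.grad, b.grad, c.grad)  # -3.0 2.0 1.0
```

Note that `+=` is used inside every closure — the reason becomes clear in the accumulation-trap section below.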

III. Step-by-Step Waterfall — Forward Mirrors Backward

Howard Marks teaches that risk analysis means tracing cause and effect through a chain. The waterfall below shows how the forward and backward passes mirror each other — one builds the expression, the other unwinds it.

▶ Forward Pass (compute data)
Step 1
a = 2.0, b = -3.0, c = 10.0
Initialize leaf values
Step 2
d = a × b = −6.0
Multiply, store children=(a,b), op='*'
Step 3
L = d + c = 4.0
Add, store children=(d,c), op='+'
◀ Backward Pass (compute grads)
Step 1
L.grad = 1.0
Seed: “how much does L affect L?” → 1
Step 2
d.grad += 1.0  |  c.grad += 1.0
+ distributes upstream grad unchanged
Step 3
a.grad += (−3)×1  |  b.grad += (2)×1
× swaps: a gets b’s value, b gets a’s value
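The waterfall above can be replayed with plain floats, one step per line, to see that nothing beyond the two local rules is involved (variable names here are illustrative, not the notebook’s):

```python
# Forward pass: build the expression left to right
a, b, c = 2.0, -3.0, 10.0
d = a * b      # Step 2: -6.0, children=(a, b), op='*'
Lv = d + c     # Step 3:  4.0, children=(d, c), op='+'

# Backward pass: unwind right to left
L_grad = 1.0           # Step 1: seed, dL/dL = 1
d_grad = 1.0 * L_grad  # Step 2: + distributes upstream grad unchanged
c_grad = 1.0 * L_grad
a_grad = b * d_grad    # Step 3: × swaps — a gets b's value
b_grad = a * d_grad    #          × swaps — b gets a's value

print(a_grad, b_grad, c_grad)  # -3.0 2.0 1.0
```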

IV. The Implementation — Hands-On Notebook

All the code for today lives in a clean, runnable Jupyter notebook. It covers: the enhanced Value class with _backward, manual gradient propagation step-by-step, PyTorch verification, the += accumulation trap, and a complex multi-branch expression challenge.

📓 day2_backward_pass.ipynb

6 sections · Value class + _backward · Manual backprop · PyTorch verification · Exercises

V. The Gradient Thermometer — Reading the Results

Marks reads a risk report by asking “what does this number mean for my decision?” Do the same with gradients. Each one has a direction (positive = increase L, negative = decrease L) and a magnitude (how strongly this input influences the output).

Exhibit C — Gradient Magnitudes for L = (a × b) + c
  a.grad = −3.0   Strongest influence, negative — increase a → L drops
  b.grad = +2.0   Moderate influence, positive — increase b → L rises
  c.grad = +1.0   Weakest influence — c passes directly through +
  L.grad =  1.0   Always 1.0 — the output’s effect on itself

Reading this chart: If you were training a neural network and wanted to increase L, you’d nudge b upward (positive gradient) and a downward (negative gradient). The magnitudes tell you how much each nudge matters — a has 3× more leverage than c.
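That nudging logic can be verified in a few lines: apply one gradient-ascent step to each input and confirm L rises. (The step size lr = 0.1 is an arbitrary choice for illustration.)

```python
# One gradient-ascent step on L = (a * b) + c, using the gradients
# from Exhibit C. Moving each input WITH its gradient should raise L.

a, b, c = 2.0, -3.0, 10.0
ga, gb, gc = -3.0, 2.0, 1.0   # gradients: a.grad, b.grad, c.grad
lr = 0.1                      # illustrative step size

L_before = a * b + c          # 4.0

# nudge a downward (negative gradient), b and c upward (positive)
a2 = a + lr * ga
b2 = b + lr * gb
c2 = c + lr * gc

L_after = a2 * b2 + c2        # larger than 4.0
print(L_before, L_after)
```

Gradient descent, used for minimizing a loss, is the same step with the sign flipped.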

The += Accumulation Trap

When the same variable appears twice (e.g., L = a + a), its gradient is 2.0, not 1.0. Each usage of a contributes +1 to the gradient, and these accumulate. Using = instead of += silently overwrites the first contribution — producing wrong gradients that are extremely hard to debug. The notebook covers this in Part 5.
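The trap is easy to reproduce with bare chain-rule arithmetic, no Value class required:

```python
# The += accumulation trap, shown for L = a + a.
# The variable a is used twice, so its gradient is the SUM of both uses.

L_grad = 1.0        # seed: dL/dL = 1

a_grad = 0.0
a_grad += 1.0 * L_grad  # first use of a: + distributes G
a_grad += 1.0 * L_grad  # second use of a: contributions accumulate
print(a_grad)           # 2.0 — correct

a_grad_buggy = 0.0
a_grad_buggy = 1.0 * L_grad  # first contribution...
a_grad_buggy = 1.0 * L_grad  # ...silently overwritten by the second
print(a_grad_buggy)          # 1.0 — wrong, and hard to spot
```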

VI. The Matrix — What Matters Today

The matrix sorts today’s tasks along two axes: depth (Builds Deep Intuition vs. Surface-Level Only) and effort (Quick to Do vs. Slow but Worth It).

🎯 DO FIRST (quick · deep intuition)
Open the notebook. Add _backward to __add__ and __mul__. Run the manual backward pass. Verify every gradient against PyTorch.

⏭️ DO IF TIME (quick · surface-level)
Watch 3Blue1Brown’s chain rule video. Good visual intuition, but you’ll get more from coding it yourself first.

🖐 DO CAREFULLY (slow · deep intuition)
Part 6 of the notebook: L = (a*b + c) * (a + b). Draw it on paper first, predict gradients, then verify. This is where real understanding forms.

🚫 AVOID TODAY (slow · surface-level)
Implementing topological sort or the full .backward() automation. That’s Day 3. Today is strictly about understanding local gradient rules.

VII. Today’s Deliverables

The backward pass is not a new algorithm — it’s the chain rule you learned in calculus, applied systematically across a graph. Its power doesn’t come from complexity. It comes from automation across depth: the same two local rules, applied hundreds of times, train networks with billions of parameters. Today you learn the rules. Tomorrow, you automate them. — Day 2 Closing Principle
Day 2 Notebook — The Backward Pass & Chain Rule · Runnable Python

Implements backpropagation with _backward closures, manual gradient computation, gradient accumulation demo, and PyTorch verification.