I. Why Order Matters — The Dependency Problem
Yesterday you called _backward() by hand, right to left, on each node. That works for 5 nodes.
It does not work for 5 million. The question today: can we find the correct backward order
automatically? The answer is topological sort: an algorithm that orders nodes so every node
appears only after all the nodes it depends on. In graph theory, this is a solved problem. In neural networks, it is
the engine that makes training possible.
The Key Insight
Topological sort gives you the forward order. Reverse it and you get the backward order. This guarantees that when you compute a node’s gradient, all downstream gradients that feed into it have already been computed. The algorithm is a single DFS — ~15 lines of Python — and it is the entire “scheduler” behind PyTorch’s loss.backward().
II. The Algorithm — DFS with a Visited Set
The topological sort works by depth-first search. Start from the output node. Recurse into each child. After all children have been visited, add the current node to the list. The final list, reversed, gives the correct backward order. It is the same pattern as post-order traversal in tree algorithms.
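That post-order DFS can be sketched in a few lines. This is a minimal sketch, assuming micrograd-style naming (a `build_topo` helper and a `_prev` attribute holding each node's children); your own attribute names may differ:

```python
def build_topo(root):
    """Post-order DFS: a node is appended only after all of its
    children (inputs) have been appended."""
    topo = []
    visited = set()

    def visit(node):
        if node not in visited:
            visited.add(node)
            for child in node._prev:
                visit(child)
            topo.append(node)   # post-order: children first, then self

    visit(root)
    return topo  # forward order; iterate reversed(topo) for backward

```

The `visited` set is what makes this safe on graphs (not just trees): a node reused by two parents is appended exactly once.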
III. Before & After — Manual vs. Automated
Marks writes that the value of a system lies in what it automates away. Yesterday you wrote 5 lines
of manual backward calls. Today, a single L.backward() replaces them all — and scales to
any graph size.
IV. The Full Value Class — Complete Engine
With topological sort, _backward closures from Day 2, and operator overloading, the Value class
becomes a complete autograd engine. This 40-line class is functionally equivalent to the core of
torch.autograd — just operating on scalars instead of tensors.
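A sketch of what that complete class might look like, consistent with the description above. This mirrors the micrograd pattern (add, mul, tanh, and a topo-sort-driven backward()); it is not a verbatim copy of your Day 2 code, and it omits the extra operators covered later:

```python
import math

class Value:
    """Scalar autograd node: data, accumulated grad, and a closure
    that pushes grad back to the node's children."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad          # d(a+b)/da = 1
            other.grad += out.grad         # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t ** 2) * out.grad  # d tanh(x)/dx = 1 - tanh(x)^2
        out._backward = _backward
        return out

    def backward(self):
        # Build forward order via post-order DFS, then walk it reversed.
        topo, visited = [], set()
        def visit(node):
            if node not in visited:
                visited.add(node)
                for child in node._prev:
                    visit(child)
                topo.append(node)
        visit(self)
        self.grad = 1.0                  # seed: dL/dL = 1
        for node in reversed(topo):
            node._backward()

```

Everything from the last three days meets here: the closures carry the local derivatives, the topo sort schedules them, and `backward()` is the one public entry point.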
The Zero-Gradient Trap
If you call .backward() twice without resetting gradients, they accumulate. The second pass adds to the first pass’s gradients — giving you 2× the correct values. In real training loops, you must call optimizer.zero_grad() or manually zero .grad before each backward pass. PyTorch has the same behavior by design (it enables gradient accumulation for large batches).
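To see the trap concretely, a stripped-down node class (multiplication only, assumed names, not your full engine) is enough:

```python
class V:
    """Minimal stand-in for Value: multiplication only."""
    def __init__(self, data, children=()):
        self.data, self.grad = data, 0.0
        self._prev, self._backward = children, lambda: None

    def __mul__(self, other):
        out = V(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # += is the accumulation
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, seen = [], set()
        def visit(n):
            if n not in seen:
                seen.add(n)
                for child in n._prev:
                    visit(child)
                topo.append(n)
        visit(self)
        self.grad = 1.0
        for n in reversed(topo):
            n._backward()

a, b = V(3.0), V(4.0)
c = a * b
c.backward()   # a.grad is now 4.0 -- correct
c.backward()   # a.grad is now 8.0 -- doubled, because += never reset

```

The fix is one line before each pass: zero every node's `.grad` (or, in PyTorch, call `optimizer.zero_grad()`).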
V. The Matrix — What Matters Today
DO FIRST
Add backward() with topo sort to your Value class. Test it on the Day 2 expression. Verify all gradients match PyTorch.
DO IF TIME
Add __radd__, __rmul__, __neg__, __sub__ so you can write 2 * a and a - b naturally.
DO CAREFULLY
Add tanh() and exp() with correct _backward closures. These are the activation functions you’ll need for neurons tomorrow.
AVOID TODAY
Building neurons or training loops. The autograd engine must be bulletproof first. Tomorrow you build on top of it.
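The tanh() and exp() closures from the "DO CAREFULLY" item might look like the sketch below. It uses standalone functions over a hypothetical minimal node class so the chain-rule terms stay visible; in your engine these would be methods on Value:

```python
import math

class Node:
    """Minimal stand-in for Value: data, grad, and a backward hook."""
    def __init__(self, data, children=()):
        self.data, self.grad = data, 0.0
        self._prev, self._backward = children, lambda: None

def tanh(x):
    t = math.tanh(x.data)
    out = Node(t, (x,))
    def _backward():
        x.grad += (1 - t ** 2) * out.grad   # d/dx tanh(x) = 1 - tanh(x)^2
    out._backward = _backward
    return out

def exp(x):
    e = math.exp(x.data)
    out = Node(e, (x,))
    def _backward():
        x.grad += e * out.grad              # d/dx e^x = e^x: reuse the forward value
    out._backward = _backward
    return out

```

Note the pattern both share: the local derivative is computed from values already produced in the forward pass (`t`, `e`), captured by the closure, so backward does no redundant work.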
VI. Today’s Deliverables
- Topo sort: Implement build_topo() using DFS with a visited set
- backward(): Seed self.grad = 1.0, iterate reversed topo, call each _backward()
- One-line test: L.backward() — verify all gradients match Day 2's manual results exactly
- Reuse test: Call backward() twice — observe the accumulation bug, understand why
- Operator overloading: Add __radd__, __rmul__, __neg__, __sub__, __truediv__
- Activation: Implement tanh() with correct backward: self.grad += (1 - t**2) * out.grad
- PyTorch verify: Rebuild full expression in PyTorch, confirm every gradient matches
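Alongside the PyTorch comparison, a central finite difference is a useful second check that needs no install: for any scalar function, (f(x+h) - f(x-h)) / 2h approximates f'(x), and a large mismatch against your autograd gradient means a bug in a closure. A minimal sketch (the helper name `numeric_grad` is an assumption, not part of the deliverables above):

```python
def numeric_grad(f, x, h=1e-6):
    """Central difference: (f(x+h) - f(x-h)) / (2h) approximates f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Example: f(x) = x^2 at x = 3 should give f'(3) = 6
approx = numeric_grad(lambda x: x * x, 3.0)

```

Compare `approx` against the `.grad` your engine produces for the same expression; agreement to ~4 decimal places is typical at h = 1e-6.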
Once .backward() works, you never think about topological sort again —
just as a well-structured portfolio doesn’t require daily rebalancing. The infrastructure disappears, and you
focus on what matters: the architecture built on top of it. Tomorrow, that architecture is a neuron.
— Day 3 Closing Principle
Complete autograd engine: topological sort via DFS, automated backward(), the zero-gradient trap, and PyTorch comparison.