PHASE 1 Foundations · Day 3 of 80 · Neural Networks & Backprop

Topological Sort & the Full Autograd Engine

Automate the backward pass. Build the engine that knows the right order to propagate gradients — so you never call _backward() by hand again.

You can’t assess risk by only looking at the present. You have to trace the chain of dependencies — which dominos fall first, which fall last, and in what order. A topological sort does exactly this for a computation graph: it finds an ordering in which every node’s inputs are resolved before the node itself. Get the order wrong, and the entire system produces garbage. — Day 3 Principle, adapted from the Marks framework

I. Why Order Matters — The Dependency Problem

Yesterday you called _backward() by hand, right-to-left, on each node. That works for 5 nodes. It does not work for 5 million. The question today: can we find the correct backward order automatically? The answer is topological sort — an algorithm that orders the graph so every node appears only after all of its inputs. Reverse that order, and every node is visited before the nodes that feed into it — exactly what gradient propagation requires. In graph theory, this is a solved problem. In neural networks, it is the engine that makes training possible.

Exhibit A — Forward Order vs. Backward Order (Topological Sort)
[Diagram. Left: the computation graph — a and b feed a × node producing d; d and c feed a + node; the result flows to L. Forward order (left → right): a, b, c → d → e → L. Backward order (reverse topological), in three steps: L → e → d, c → a, b. Resulting gradients: L: 1.0, d: 1.0, c: 1.0, a: −3.0, b: 2.0.]
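The same post-order DFS can be sketched on a plain adjacency dict standing in for the graph in Exhibit A (the dict encoding and the helper name `topo_sort` here are illustrative, not part of the Value class):

```python
# Each node maps to the list of its inputs ("children" in autograd jargon).
graph = {
    "L": ["e"],
    "e": ["d", "c"],
    "d": ["a", "b"],
    "c": [], "a": [], "b": [],
}

def topo_sort(node, graph, visited=None, topo=None):
    if visited is None:
        visited, topo = set(), []
    if node not in visited:
        visited.add(node)
        for child in graph[node]:
            topo_sort(child, graph, visited, topo)
        topo.append(node)  # post-order: only after all inputs are placed
    return topo

order = topo_sort("L", graph)
print(order)                   # forward order: inputs before consumers
print(list(reversed(order)))   # backward order: starts at L
```

Every node lands in the list strictly after its inputs, so the reversed list always begins at L and never visits a node before the node that consumes it.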

The Key Insight

Topological sort gives you the forward order. Reverse it and you get the backward order. This guarantees that when you compute a node’s gradient, every node that consumes its output has already propagated its gradient back. The algorithm is a single DFS — ~15 lines of Python — and it is, conceptually, the same scheduler that runs behind PyTorch’s loss.backward().

II. The Algorithm — DFS with a Visited Set

The topological sort works by depth-first search. Start from the output node. Recurse into each child. After all children have been visited, add the current node to the list. The final list, reversed, gives the correct backward order. It is the same pattern as post-order traversal in tree algorithms.

def backward(self):
    # Step 1: Build topological order via DFS
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)  # post-order: add AFTER children
    build_topo(self)

    # Step 2: Seed and propagate
    self.grad = 1.0
    for v in reversed(topo):  # reversed = backward order
        v._backward()

III. Before & After — Manual vs. Automated

Marks writes that the value of a system lies in what it automates away. Yesterday you wrote 5 lines of manual backward calls. Today, a single L.backward() replaces them all — and scales to any graph size.

✗ Day 2: Manual

L.grad = 1.0
L._backward()
d._backward()
c._backward()
a._backward()  # fragile!

✓ Day 3: Automated

L.backward()  # that’s it.
# handles any graph
# any depth
# correct order always

IV. The Full Value Class — Complete Engine

With topological sort, the _backward closures from Day 2, and operator overloading, the Value class becomes a complete autograd engine. This ~40-line class implements the same core mechanism as torch.autograd — just operating on scalars instead of tensors.

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()
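A quick sanity check of the class, using inputs consistent with the gradients in Exhibit A (a = 2, b = −3; c = 10 is an arbitrary choice for illustration). The class is repeated here, condensed, so the snippet runs on its own:

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data, self.grad = data, 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# d = a * b, L = d + c — the Exhibit A shape
a, b, c = Value(2.0), Value(-3.0), Value(10.0)
d = a * b          # -6.0
L = d + c          # 4.0
L.backward()
print(a.grad, b.grad, d.grad, c.grad)  # -3.0 2.0 1.0 1.0
```

One call to L.backward(), and every gradient in the graph matches the exhibit — no manual ordering anywhere.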

The Zero-Gradient Trap

If you call .backward() twice without resetting gradients, they accumulate. The second pass adds to the first pass’s gradients — giving you 2× the correct values. In real training loops, you must call optimizer.zero_grad() or manually zero .grad before each backward pass. PyTorch has the same behavior by design (it enables gradient accumulation for large batches).
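The trap is easy to reproduce with a stripped-down, single-op version of today's Value class (condensed here so the snippet stands alone):

```python
class Value:
    def __init__(self, data, _children=()):
        self.data, self.grad = data, 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # note the +=
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._prev:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

x = Value(3.0)
y = x * Value(2.0)
y.backward()
print(x.grad)   # 2.0 — correct
y.backward()    # second pass WITHOUT zeroing
print(x.grad)   # 4.0 — accumulated: 2x the true gradient
```

The += in each _backward closure is what makes gradients accumulate; it is required for nodes used more than once in a graph, which is exactly why zeroing between passes is your job, not the engine's.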

V. The Matrix — What Matters Today

(Axes: quick to do vs. slow but worth it × builds deep intuition vs. surface-level only.)

🎯 DO FIRST — quick to do, builds deep intuition
Add backward() with topo sort to your Value class. Test it on the Day 2 expression. Verify all gradients match PyTorch.

⏭️ DO IF TIME — quick to do, surface-level only
Add __radd__, __rmul__, __neg__, __sub__ so you can write 2 * a and a - b naturally.

🖐 DO CAREFULLY — slow but worth it, builds deep intuition
Add tanh() and exp() with correct _backward closures. These are the activation functions you’ll need for neurons tomorrow.

🚫 AVOID TODAY — slow but worth it, surface-level only
Building neurons or training loops. The autograd engine must be bulletproof first. Tomorrow you build on top of it.
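One possible sketch of the "do if time" and "do carefully" items together — reflected operators plus tanh/exp with their local derivatives — assuming the Value class from Section IV (condensed here so the snippet is self-contained):

```python
import math

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data, self.grad = data, 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    # Reflected ops: Python calls these when the LEFT operand is a plain
    # number (e.g. 2 * a). Both + and * are commutative, so reuse them.
    __radd__ = __add__
    __rmul__ = __mul__

    def __neg__(self):           # -a  ==  a * -1
        return self * -1

    def __sub__(self, other):    # a - b  ==  a + (-b)
        return self + (-other)

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')
        def _backward():
            self.grad += (1 - t * t) * out.grad   # d/dx tanh(x) = 1 - tanh^2(x)
        out._backward = _backward
        return out

    def exp(self):
        out = Value(math.exp(self.data), (self,), 'exp')
        def _backward():
            self.grad += out.data * out.grad      # d/dx e^x = e^x (reuse out.data)
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._prev:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(1.0), Value(2.0)
y = (2 * a - b).exp().tanh()   # reads naturally thanks to the new operators
y.backward()
```

Note how exp's backward reuses out.data instead of recomputing math.exp — the forward pass already holds the value the derivative needs.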

VI. Today’s Deliverables

The best systems are invisible. Once .backward() works, you never think about topological sort again — just as a well-structured portfolio doesn’t require daily rebalancing. The infrastructure disappears, and you focus on what matters: the architecture built on top of it. Tomorrow, that architecture is a neuron. — Day 3 Closing Principle
Day 3 Notebook — Topological Sort & Full Autograd (runnable Python)

Complete autograd engine: topological sort via DFS, automated backward(), the zero-gradient trap, and PyTorch comparison.