I. Why Order Matters — The Dependency Problem
Yesterday you called _backward() by hand, right to left, on each node. That works for 5 nodes.
It does not work for 5 million. The question today: can we find the correct backward order
automatically? The answer is topological sort: an algorithm that orders nodes so every node
appears only after all the nodes it depends on. In graph theory, this is a solved problem. In neural networks, it is
the engine that makes training possible.
The Key Insight
Topological sort gives you the forward order. Reverse it and you get the backward order. This guarantees that when you compute a node’s gradient, all downstream gradients that feed into it have already been computed. The algorithm is a single DFS — ~15 lines of Python — and it is the entire “scheduler” behind PyTorch’s loss.backward().
II. The Algorithm — DFS with a Visited Set
The topological sort works by depth-first search. Start from the output node. Recurse into each child. After all children have been visited, add the current node to the list. The final list, reversed, gives the correct backward order. It is the same pattern as post-order traversal in tree algorithms.
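That post-order DFS can be sketched in a few lines. This is a minimal sketch, assuming micrograd-style naming (a `build_topo` helper and a `_prev` attribute holding each node's children); your own attribute names may differ:

```python
def build_topo(root):
    """Post-order DFS: a node is appended only after all of its
    children (inputs) have been appended."""
    topo = []
    visited = set()

    def visit(node):
        if node not in visited:
            visited.add(node)
            for child in node._prev:
                visit(child)
            topo.append(node)   # post-order: children first, then self

    visit(root)
    return topo  # forward order; iterate reversed(topo) for backward

```

The `visited` set is what makes this safe on graphs (not just trees): a node reused by two parents is appended exactly once.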
III. Before & After — Manual vs. Automated
Marks writes that the value of a system lies in what it automates away. Yesterday you wrote 5 lines
of manual backward calls. Today, a single L.backward() replaces them all — and scales to
any graph size.
IV. The Full Value Class — Complete Engine
With topological sort, _backward closures from Day 2, and operator overloading, the Value class
becomes a complete autograd engine. This 40-line class is functionally equivalent to the core of
torch.autograd — just operating on scalars instead of tensors.
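A sketch of what that complete class might look like, consistent with the description above. This mirrors the micrograd pattern (add, mul, tanh, and a topo-sort-driven backward()); it is not a verbatim copy of your Day 2 code, and it omits the extra operators covered later:

```python
import math

class Value:
    """Scalar autograd node: data, accumulated grad, and a closure
    that pushes grad back to the node's children."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad          # d(a+b)/da = 1
            other.grad += out.grad         # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t ** 2) * out.grad  # d tanh(x)/dx = 1 - tanh(x)^2
        out._backward = _backward
        return out

    def backward(self):
        # Build forward order via post-order DFS, then walk it reversed.
        topo, visited = [], set()
        def visit(node):
            if node not in visited:
                visited.add(node)
                for child in node._prev:
                    visit(child)
                topo.append(node)
        visit(self)
        self.grad = 1.0                  # seed: dL/dL = 1
        for node in reversed(topo):
            node._backward()

```

Everything from the last three days meets here: the closures carry the local derivatives, the topo sort schedules them, and `backward()` is the one public entry point.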
The Zero-Gradient Trap
If you call .backward() twice without resetting gradients, they accumulate. The second pass adds to the first pass’s gradients — giving you 2× the correct values. In real training loops, you must call optimizer.zero_grad() or manually zero .grad before each backward pass. PyTorch has the same behavior by design (it enables gradient accumulation for large batches).
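To see the trap concretely, a stripped-down node class (multiplication only, assumed names, not your full engine) is enough:

```python
class V:
    """Minimal stand-in for Value: multiplication only."""
    def __init__(self, data, children=()):
        self.data, self.grad = data, 0.0
        self._prev, self._backward = children, lambda: None

    def __mul__(self, other):
        out = V(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # += is the accumulation
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, seen = [], set()
        def visit(n):
            if n not in seen:
                seen.add(n)
                for child in n._prev:
                    visit(child)
                topo.append(n)
        visit(self)
        self.grad = 1.0
        for n in reversed(topo):
            n._backward()

a, b = V(3.0), V(4.0)
c = a * b
c.backward()   # a.grad is now 4.0 -- correct
c.backward()   # a.grad is now 8.0 -- doubled, because += never reset

```

The fix is one line before each pass: zero every node's `.grad` (or, in PyTorch, call `optimizer.zero_grad()`).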
V. The Matrix — What Matters Today
DO FIRST
Add backward() with topo sort to your Value class. Test it on the Day 2 expression. Verify all gradients match PyTorch.
DO IF TIME
Add __radd__, __rmul__, __neg__, __sub__ so you can write 2 * a and a - b naturally.
DO CAREFULLY
Add tanh() and exp() with correct _backward closures. These are the activation functions you’ll need for neurons tomorrow.
AVOID TODAY
Building neurons or training loops. The autograd engine must be bulletproof first. Tomorrow you build on top of it.
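The tanh() and exp() closures from the "DO CAREFULLY" item might look like the sketch below. It uses standalone functions over a hypothetical minimal node class so the chain-rule terms stay visible; in your engine these would be methods on Value:

```python
import math

class Node:
    """Minimal stand-in for Value: data, grad, and a backward hook."""
    def __init__(self, data, children=()):
        self.data, self.grad = data, 0.0
        self._prev, self._backward = children, lambda: None

def tanh(x):
    t = math.tanh(x.data)
    out = Node(t, (x,))
    def _backward():
        x.grad += (1 - t ** 2) * out.grad   # d/dx tanh(x) = 1 - tanh(x)^2
    out._backward = _backward
    return out

def exp(x):
    e = math.exp(x.data)
    out = Node(e, (x,))
    def _backward():
        x.grad += e * out.grad              # d/dx e^x = e^x: reuse the forward value
    out._backward = _backward
    return out

```

Note the pattern both share: the local derivative is computed from values already produced in the forward pass (`t`, `e`), captured by the closure, so backward does no redundant work.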
VI. Today’s Deliverables
- Topo sort: Implement build_topo() using DFS with a visited set
- backward(): Seed self.grad = 1.0, iterate reversed topo, call each _backward()
- One-line test: L.backward() — verify all gradients match Day 2's manual results exactly
- Reuse test: Call backward() twice — observe the accumulation bug, understand why
- Operator overloading: Add __radd__, __rmul__, __neg__, __sub__, __truediv__
- Activation: Implement tanh() with correct backward: self.grad += (1 - t**2) * out.grad
- PyTorch verify: Rebuild full expression in PyTorch, confirm every gradient matches
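Alongside the PyTorch comparison, a central finite difference is a useful second check that needs no install: for any scalar function, (f(x+h) - f(x-h)) / 2h approximates f'(x), and a large mismatch against your autograd gradient means a bug in a closure. A minimal sketch (the helper name `numeric_grad` is an assumption, not part of the deliverables above):

```python
def numeric_grad(f, x, h=1e-6):
    """Central difference: (f(x+h) - f(x-h)) / (2h) approximates f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Example: f(x) = x^2 at x = 3 should give f'(3) = 6
approx = numeric_grad(lambda x: x * x, 3.0)

```

Compare `approx` against the `.grad` your engine produces for the same expression; agreement to ~4 decimal places is typical at h = 1e-6.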
Once .backward() works, you never think about topological sort again —
just as a well-structured portfolio doesn’t require daily rebalancing. The infrastructure disappears, and you
focus on what matters: the architecture built on top of it. Tomorrow, that architecture is a neuron.
— Day 3 Closing Principle
Complete autograd engine: topological sort via DFS, automated backward(), the zero-gradient trap, and PyTorch comparison.