PHASE 1 Foundations · Day 5 of 80 · Neural Networks & Backprop

Training the MLP — Gradient Descent in Action

Close the loop: define a loss, compute gradients, update weights. Watch the network learn in real time.

In investing, the feedback loop is slow: you place a bet, wait months, and observe the outcome. In machine learning, the loop is immediate: forward pass → loss → backward pass → update → repeat. Gradient descent is the fastest feedback loop in all of applied mathematics. Today you feel it run. — Day 5 Principle, adapted from the Marks framework

I. The Training Loop — Four Steps, Infinite Repetition

Every neural network training loop in history — from a 41-parameter micrograd MLP to a trillion-parameter foundation model — follows the same four steps: (1) forward pass, (2) compute loss, (3) backward pass, (4) update parameters. Everything else is optimization around this core.

Exhibit A — The Training Loop: Forward → Loss → Backward → Update
1. FORWARD:  y_pred = model(x)
2. LOSS:     L = Σ(y − ŷ)²
3. BACKWARD: L.backward()
4. UPDATE:   p -= lr * p.grad
REPEAT (zero gradients first)
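The four steps can be run end to end even without micrograd. A minimal sketch (names illustrative, gradient computed by hand) fitting a single weight w so that w * x ≈ y, with squared-error loss:

```python
# Minimal four-step training loop on a single parameter.
# Fits w so that w * x ≈ y for the pair (x=3.0, y=6.0); the answer is w = 2.
x, y = 3.0, 6.0
w = 0.0        # parameter, "initialized" at zero
lr = 0.01      # learning rate

for step in range(200):
    # 1. FORWARD
    y_pred = w * x
    # 2. LOSS (squared error)
    loss = (y_pred - y) ** 2
    # 3. BACKWARD (dL/dw by hand: 2 * (y_pred - y) * x)
    grad = 2 * (y_pred - y) * x
    # 4. UPDATE (gradient descent step)
    w -= lr * grad

print(w)  # converges to ≈ 2.0
```

Same skeleton, one parameter instead of 41 — micrograd's only job is to automate step 3.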

II. The Code — Training a Toy MLP

We define a tiny dataset of 4 samples (Karpathy’s classic micrograd example), create an MLP, and run the training loop. In ~20 steps, loss drops from ≈5.0 to near zero. The network learns.

# Tiny dataset: 4 input-output pairs
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0]  # desired targets

model = MLP(3, [4, 4, 1])    # 3 inputs, two hidden layers of 4, 1 output

for k in range(20):
    # 1. Forward pass
    ypred = [model(x) for x in xs]

    # 2. Compute loss (sum of squared errors)
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

    # 3. Backward pass (zero grads first -- they accumulate!)
    for p in model.parameters():
        p.grad = 0.0
    loss.backward()

    # 4. Update parameters
    for p in model.parameters():
        p.data += -0.05 * p.grad  # lr = 0.05

    print(f"step {k}: loss = {loss.data:.4f}")

III. Watching the Loss Fall — The Descent Curve

Exhibit B — Loss vs. Training Step (Schematic)
[Schematic: loss (y-axis) vs. training step (x-axis). Loss starts at ≈5.0, falls steeply over the first ~5 steps, then converges gradually toward ≈0.01 past step 10 — the “hockey stick” of gradient descent.]

Learning Rate: The One Hyperparameter That Matters Most

Too high (>0.1) and loss oscillates or diverges. Too low (<0.001) and training crawls. The sweet spot for this toy MLP is lr ≈ 0.01–0.1. In real networks, learning rate schedules (warmup + cosine decay) are standard. But the principle is the same: step size controls stability vs. speed.
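The stability-vs-speed tradeoff is visible even on a one-parameter problem. A minimal sketch (not micrograd) running gradient descent on f(x) = x², whose gradient is 2x:

```python
# Gradient descent on f(x) = x**2, gradient f'(x) = 2x.
# The update is x <- x - lr * 2x = x * (1 - 2*lr), so:
#   |1 - 2*lr| < 1  -> converges (oscillating in sign if 1 - 2*lr < 0)
#   |1 - 2*lr| > 1  -> diverges
def descend(lr, steps=20, x0=5.0):
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # gradient descent step
    return x

print(descend(0.1))   # smooth convergence toward 0
print(descend(0.6))   # oscillates in sign but still shrinks toward 0
print(descend(1.1))   # diverges: |x| blows up
```

The same three regimes — convergence, oscillation, divergence — are what you will see when you sweep the learning rate on the toy MLP.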

The Zero-Grad Imperative

Forgetting p.grad = 0.0 before backward() is the single most common bug in deep learning. Gradients accumulate by default — it’s a feature for gradient accumulation across mini-batches, but a devastating bug if you forget to reset. PyTorch’s optimizer.zero_grad() exists solely for this reason.
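The bug is easy to reproduce in isolation. A minimal sketch with a hypothetical Param class that mimics the accumulate-by-default behavior (illustrative names, not the real micrograd or PyTorch API):

```python
class Param:
    """Hypothetical stand-in for an autograd parameter."""
    def __init__(self, data):
        self.data = data
        self.grad = 0.0

def backward(p, x):
    """Accumulate dL/dp for L = (p.data * x)**2, i.e. 2 * p.data * x * x."""
    p.grad += 2 * p.data * x * x  # note +=: gradients accumulate by default

p = Param(3.0)
backward(p, 1.0)
print(p.grad)  # 6.0 -- correct gradient

backward(p, 1.0)  # forgot to zero the grad!
print(p.grad)  # 12.0 -- stale gradient added on top: the bug

p.grad = 0.0      # the zero-grad step
backward(p, 1.0)
print(p.grad)  # 6.0 -- correct again
```

With the stale 12.0, the update step would move the parameter twice as far as the loss actually demands — training silently destabilizes rather than crashing.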

IV. The Matrix — What Matters Today

Axes: Builds Deep Intuition vs. Surface-Level Only × Quick to Do vs. Slow but Worth It.

🎯 DO FIRST (quick to do · builds deep intuition)
Implement the training loop on the 4-sample dataset. Run 20 steps. Print loss each step. Watch it converge.

⏭️ DO IF TIME (quick to do · surface-level only)
Experiment with different learning rates: 0.001, 0.01, 0.1, 1.0. Observe convergence, oscillation, divergence.

🖐 DO CAREFULLY (slow but worth it · builds deep intuition)
Print predictions after training: [model(x).data for x in xs]. Verify they are close to ys = [1, -1, -1, 1].

🚫 AVOID TODAY (slow · surface-level only)
Regularization, batch normalization, advanced optimizers. Train on the simplest possible example first. Complexity comes later.

V. Today’s Deliverables

The training loop is the heartbeat of machine learning. It is what separates a static function from a learning system. Today you felt the network go from random noise to correct predictions — not by being programmed, but by being nudged, one gradient step at a time. Tomorrow, you apply this to language itself. — Day 5 Closing Principle
Day 5 Notebook — Training the MLP with Gradient Descent Runnable Python

Full training loop: forward pass, MSE loss, backward, parameter update. Loss visualization and learning rate exploration.