PHASE 1 Foundations · Day 5 of 80 · Neural Networks & Backprop

Training the MLP — Gradient Descent in Action

Close the loop: define a loss, compute gradients, update weights. Watch the network learn in real time.

In investing, the feedback loop is slow: you place a bet, wait months, and observe the outcome. In machine learning, the loop is immediate: forward pass → loss → backward pass → update → repeat. Gradient descent is the fastest feedback loop in all of applied mathematics. Today you feel it run. — Day 5 Principle, adapted from the Marks framework

I. The Training Loop — Four Steps, Infinite Repetition

Every neural network training loop in history — from a 41-parameter micrograd MLP to a trillion-parameter foundation model — follows the same four steps: (1) forward pass, (2) compute loss, (3) backward pass, (4) update parameters. Everything else is optimization around this core.

Exhibit A — The Training Loop: Forward → Loss → Backward → Update
1. FORWARD:  y_pred = model(x)
2. LOSS:     L = Σ(y − ŷ)²
3. BACKWARD: L.backward()
4. UPDATE:   p -= lr * p.grad
REPEAT (zero gradients first)
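The four steps can be run end to end even without micrograd. A minimal sketch (names illustrative, gradient computed by hand) fitting a single weight w so that w * x ≈ y, with squared-error loss:

```python
# Minimal four-step training loop on a single parameter.
# Fits w so that w * x ≈ y for the pair (x=3.0, y=6.0); the answer is w = 2.
x, y = 3.0, 6.0
w = 0.0        # parameter, "initialized" at zero
lr = 0.01      # learning rate

for step in range(200):
    # 1. FORWARD
    y_pred = w * x
    # 2. LOSS (squared error)
    loss = (y_pred - y) ** 2
    # 3. BACKWARD (dL/dw by hand: 2 * (y_pred - y) * x)
    grad = 2 * (y_pred - y) * x
    # 4. UPDATE (gradient descent step)
    w -= lr * grad

print(w)  # converges to ≈ 2.0
```

Same skeleton, one parameter instead of 41 — micrograd's only job is to automate step 3.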

II. The Code — Training a Toy MLP

We define a tiny dataset of 4 samples (Karpathy’s classic micrograd example), create an MLP, and run the training loop. In ~20 steps, loss drops from ≈5.0 to near zero. The network learns.

# Tiny dataset: 4 input-output pairs
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0]  # desired targets

model = MLP(3, [4, 4, 1])    # 3 inputs, two hidden layers of 4, 1 output

for k in range(20):
    # 1. Forward pass
    ypred = [model(x) for x in xs]

    # 2. Compute loss (sum of squared errors)
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

    # 3. Backward pass (zero grads first -- they accumulate!)
    for p in model.parameters():
        p.grad = 0.0
    loss.backward()

    # 4. Update parameters
    for p in model.parameters():
        p.data += -0.05 * p.grad  # lr = 0.05

    print(f"step {k}: loss = {loss.data:.4f}")

III. Watching the Loss Fall — The Descent Curve

Exhibit B — Loss vs. Training Step (Schematic)
[Schematic: loss (y-axis) vs. training step (x-axis). Loss starts at ≈5.0, falls steeply over the first ~5 steps, then converges gradually toward ≈0.01 past step 10 — the “hockey stick” of gradient descent.]

Learning Rate: The One Hyperparameter That Matters Most

Too high (>0.1) and loss oscillates or diverges. Too low (<0.001) and training crawls. The sweet spot for this toy MLP is lr ≈ 0.01–0.1. In real networks, learning rate schedules (warmup + cosine decay) are standard. But the principle is the same: step size controls stability vs. speed.
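The stability-vs-speed tradeoff is visible even on a one-parameter problem. A minimal sketch (not micrograd) running gradient descent on f(x) = x², whose gradient is 2x:

```python
# Gradient descent on f(x) = x**2, gradient f'(x) = 2x.
# The update is x <- x - lr * 2x = x * (1 - 2*lr), so:
#   |1 - 2*lr| < 1  -> converges (oscillating in sign if 1 - 2*lr < 0)
#   |1 - 2*lr| > 1  -> diverges
def descend(lr, steps=20, x0=5.0):
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # gradient descent step
    return x

print(descend(0.1))   # smooth convergence toward 0
print(descend(0.6))   # oscillates in sign but still shrinks toward 0
print(descend(1.1))   # diverges: |x| blows up
```

The same three regimes — convergence, oscillation, divergence — are what you will see when you sweep the learning rate on the toy MLP.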

The Zero-Grad Imperative

Forgetting p.grad = 0.0 before backward() is the single most common bug in deep learning. Gradients accumulate by default — it’s a feature for gradient accumulation across mini-batches, but a devastating bug if you forget to reset. PyTorch’s optimizer.zero_grad() exists solely for this reason.
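The bug is easy to reproduce in isolation. A minimal sketch with a hypothetical Param class that mimics the accumulate-by-default behavior (illustrative names, not the real micrograd or PyTorch API):

```python
class Param:
    """Hypothetical stand-in for an autograd parameter."""
    def __init__(self, data):
        self.data = data
        self.grad = 0.0

def backward(p, x):
    """Accumulate dL/dp for L = (p.data * x)**2, i.e. 2 * p.data * x * x."""
    p.grad += 2 * p.data * x * x  # note +=: gradients accumulate by default

p = Param(3.0)
backward(p, 1.0)
print(p.grad)  # 6.0 -- correct gradient

backward(p, 1.0)  # forgot to zero the grad!
print(p.grad)  # 12.0 -- stale gradient added on top: the bug

p.grad = 0.0      # the zero-grad step
backward(p, 1.0)
print(p.grad)  # 6.0 -- correct again
```

With the stale 12.0, the update step would move the parameter twice as far as the loss actually demands — training silently destabilizes rather than crashing.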

IV. The Matrix — What Matters Today

Axes: Builds Deep Intuition vs. Surface-Level Only × Quick to Do vs. Slow but Worth It.

🎯 DO FIRST (quick to do · builds deep intuition)
Implement the training loop on the 4-sample dataset. Run 20 steps. Print loss each step. Watch it converge.

⏭️ DO IF TIME (quick to do · surface-level only)
Experiment with different learning rates: 0.001, 0.01, 0.1, 1.0. Observe convergence, oscillation, divergence.

🖐 DO CAREFULLY (slow but worth it · builds deep intuition)
Print predictions after training: [model(x).data for x in xs]. Verify they are close to ys = [1, -1, -1, 1].

🚫 AVOID TODAY (slow · surface-level only)
Regularization, batch normalization, advanced optimizers. Train on the simplest possible example first. Complexity comes later.

V. Today’s Deliverables

The training loop is the heartbeat of machine learning. It is what separates a static function from a learning system. Today you felt the network go from random noise to correct predictions — not by being programmed, but by being nudged, one gradient step at a time. Tomorrow, you apply this to language itself. — Day 5 Closing Principle
Day 5 Notebook — Training the MLP with Gradient Descent Runnable Python

Full training loop: forward pass, MSE loss, backward, parameter update. Loss visualization and learning rate exploration.