I. The Training Loop — Four Steps, Infinite Repetition
Every neural network training loop in history — from a 41-parameter micrograd MLP to a trillion-parameter foundation model — follows the same four steps: (1) forward pass, (2) compute loss, (3) backward pass, (4) update parameters. Everything else is optimization around this core.
II. The Code — Training a Toy MLP
We define a tiny dataset of 4 samples (Karpathy’s classic example), create an MLP, and run the training loop. In ~20 steps, loss drops from ∼5.0 to near zero. The network learns.
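The four-step loop can be sketched end to end in NumPy with manual backprop standing in for micrograd's autograd. The hidden width (4 units), initialization scale, learning rate, and step count below are illustrative assumptions, not the canonical settings:

```python
import numpy as np

# Karpathy's classic 4-sample dataset: 3 inputs -> 1 target in {-1, +1}.
xs = np.array([[ 2.0,  3.0, -1.0],
               [ 3.0, -1.0,  0.5],
               [ 0.5,  1.0,  1.0],
               [-1.0,  1.0, -1.0]])
ys = np.array([1.0, -1.0, -1.0, 1.0])

rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.5, size=(3, 4)), np.zeros(4)  # hidden layer
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)  # output layer

lr = 0.05
for step in range(300):
    # (1) forward pass
    h = np.tanh(xs @ W1 + b1)            # (4 samples, 4 hidden units)
    yout = np.tanh(h @ W2 + b2)[:, 0]    # (4,) predictions
    # (2) compute loss (MSE, summed over samples)
    loss = np.sum((yout - ys) ** 2)
    # (3) backward pass: manual chain rule (fresh grads each step, so no
    #     explicit zeroing needed here, unlike accumulate-by-default autograd)
    dy = 2.0 * (yout - ys) * (1.0 - yout ** 2)   # through MSE and output tanh
    dW2, db2 = h.T @ dy[:, None], dy.sum()
    dh = dy[:, None] @ W2.T * (1.0 - h ** 2)     # through hidden tanh
    dW1, db1 = xs.T @ dh, dh.sum(axis=0)
    # (4) update parameters: plain gradient descent
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    if step % 50 == 0:
        print(f"step {step:3d}  loss {loss:.4f}")
```

The same structure holds whether gradients come from manual chain-rule code, micrograd's `backward()`, or PyTorch's autograd; only step (3) changes.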
III. Watching the Loss Fall — The Descent Curve
Learning Rate: The One Hyperparameter That Matters Most
Too high (>0.1) and loss oscillates or diverges. Too low (<0.001) and training crawls. The sweet spot for this toy MLP is lr ≈ 0.01–0.1. In real networks, learning rate schedules (warmup + cosine decay) are standard. But the principle is the same: step size controls stability vs. speed.
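The warmup-plus-cosine schedule mentioned above can be sketched as a pure function of the step number. The base rate, warmup length, and decay-to-zero floor here are illustrative choices, not prescriptions:

```python
import math

def lr_at(step, max_steps, base_lr=3e-4, warmup_steps=100):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps            # warmup ramp
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
```

Warmup keeps early steps small while parameters are still random; the cosine tail shrinks the step size as the loss flattens, trading speed for stability exactly as described above.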
The Zero-Grad Imperative
Forgetting to reset p.grad = 0.0 before calling backward() is the single most common bug in deep learning. Gradients accumulate by default — a feature when you want gradient accumulation across mini-batches, but a devastating bug if you forget to reset. PyTorch's optimizer.zero_grad() exists for exactly this reason.
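The accumulation bug is easy to demonstrate with a toy stand-in for an autograd parameter (this Param class is illustrative, not micrograd's or PyTorch's actual API):

```python
class Param:
    """Toy stand-in for an autograd parameter: grads *accumulate* into .grad."""
    def __init__(self, data):
        self.data, self.grad = data, 0.0

def backward_mse(p, x, y):
    # Gradient of (p.data * x - y)**2 w.r.t. p, *added* (not assigned) into
    # p.grad, mirroring how micrograd and PyTorch accumulate gradients.
    p.grad += 2.0 * (p.data * x - y) * x

p = Param(1.0)
backward_mse(p, x=1.0, y=0.0)
print(p.grad)                    # 2.0 -- correct gradient

backward_mse(p, x=1.0, y=0.0)    # forgot to reset p.grad first
print(p.grad)                    # 4.0 -- stale gradient doubled; the update step would overshoot

p.grad = 0.0                     # the fix: zero before each backward pass
backward_mse(p, x=1.0, y=0.0)
print(p.grad)                    # 2.0 again
```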
IV. The Matrix — What Matters Today
DO FIRST
Implement the training loop on the 4-sample dataset. Run 20 steps. Print loss each step. Watch it converge.
DO IF TIME
Experiment with different learning rates: 0.001, 0.01, 0.1, 1.0. Observe convergence, oscillation, divergence.
DO CAREFULLY
Print predictions after training: [model(x).data for x in xs]. Verify they match ys = [1, -1, -1, 1].
AVOID TODAY
Regularization, batch normalization, advanced optimizers. Train on the simplest possible example first. Complexity comes later.
V. Today’s Deliverables
- Training loop: Implement the full 4-step loop from scratch
- Loss function: Mean squared error: sum((yout - yt)**2 for ...)
- Zero gradients: Reset all p.grad = 0.0 before each backward pass
- Learning rate: Test at least 3 different values and observe behavior
- Convergence: Achieve loss < 0.01 on the 4-sample dataset
- Predictions: Print final predictions and verify they match targets
Full training loop: forward pass, MSE loss, backward, parameter update. Loss visualization and learning rate exploration.