I. The Problem BatchNorm Solves
As training progresses, the distribution of each layer’s inputs shifts (because the previous layer’s
weights change). This internal covariate shift forces each layer to continuously re-adapt.
BatchNorm addresses this by normalizing each layer’s inputs to zero mean and unit variance using the current batch’s statistics, then applying a learned per-feature scale (γ) and shift (β) so the network can still represent any distribution it needs.
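The normalize-then-affine step can be sketched in a few lines (a minimal training-mode forward pass; `eps` and the function name are illustrative, not from the original):

```python
import torch

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch, features); normalize each feature over the batch dimension
    mean = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, keepdim=True, unbiased=False)  # biased (population) variance
    xhat = (x - mean) / torch.sqrt(var + eps)          # zero mean, unit variance per feature
    return gamma * xhat + beta                          # learned scale and shift

x = torch.randn(32, 10) * 3 + 5                         # badly scaled activations
out = batchnorm_forward(x, torch.ones(10), torch.zeros(10))
print(out.mean(dim=0).abs().max())                      # ~0: centered
print(out.var(dim=0, unbiased=False))                   # ~1 per feature
```

With γ = 1 and β = 0 this is pure normalization; during training γ and β are optimized like any other parameters.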
II. Train vs. Eval Mode
During training, BatchNorm uses per-batch statistics. During inference, it uses exponentially smoothed running statistics accumulated during training. Forgetting to switch to eval mode (model.eval()) is a common source of inference bugs — your model uses noisy batch stats instead of stable running stats.
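The two code paths can be sketched as follows (a from-scratch sketch; the momentum value 0.1 mirrors PyTorch's default, and all names here are illustrative):

```python
import torch

momentum = 0.1
running_mean = torch.zeros(10)
running_var = torch.ones(10)

def bn_train_step(x):
    """Training: normalize with batch stats, update the running averages."""
    global running_mean, running_var
    mean, var = x.mean(dim=0), x.var(dim=0, unbiased=False)
    # exponential moving average: new = (1 - momentum) * old + momentum * batch
    running_mean = (1 - momentum) * running_mean + momentum * mean
    running_var = (1 - momentum) * running_var + momentum * var
    return (x - mean) / torch.sqrt(var + 1e-5)

def bn_eval(x):
    """Inference: use stable running stats, independent of the current batch."""
    return (x - running_mean) / torch.sqrt(running_var + 1e-5)

# feed batches drawn from N(3, 2^2); the running stats converge toward them
for _ in range(200):
    bn_train_step(torch.randn(64, 10) * 2 + 3)
print(running_mean)   # ~3 per feature
print(running_var)    # ~4 per feature
```

If `bn_eval` were skipped and batch stats were used at inference, a batch of one example would normalize every feature to exactly zero, which is the failure mode `model.eval()` prevents.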
III. The Matrix — What Matters Today
DO FIRST
Implement BatchNorm1d from scratch. Insert it between linear layers in your MLP. Verify activations become centered.
DO IF TIME
Compare training curves with and without BatchNorm. Note the faster convergence and tolerance for higher learning rates.
DO CAREFULLY
Switch to eval mode and verify inference uses running stats. Confirm results differ from training mode batch stats.
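One way to run this check against PyTorch's built-in layer (a plain `nn.BatchNorm1d` stands in for the from-scratch version; the seed and batch sizes are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(10)

# a few training-mode passes populate running_mean / running_var
bn.train()
for _ in range(50):
    bn(torch.randn(64, 10) * 2 + 3)

x = torch.randn(64, 10) * 2 + 3
y_train = bn(x)        # normalized with this batch's stats
bn.eval()
y_eval = bn(x)         # same input, normalized with running stats
print((y_train - y_eval).abs().max())   # nonzero: the two modes genuinely differ
```

The difference shrinks as the running statistics converge to the data distribution, but it never reaches exactly zero on a finite batch.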
AVOID TODAY
LayerNorm (that’s the Transformer variant, coming in Phase 2). Today is strictly BatchNorm.
V. Today’s Deliverables
- BatchNorm1d: Implement from scratch with gamma, beta, running_mean, running_var
- Integration: Add BatchNorm between each linear layer in the makemore MLP
- Histograms: Show activation distributions before and after BatchNorm
- Train vs. eval: Demonstrate the difference between batch stats and running stats
- Learning rate: Show that BatchNorm allows 2-5× higher learning rates
- Loss improvement: Achieve lower dev NLL than Day 10’s best model
In short: a BatchNorm1d implementation, a before/after comparison, train vs. eval mode behavior, and running-statistics convergence.