PHASE 2 Deep Networks · Day 12 of 80 · makemore & GPT

Batch Normalization — Taming Internal Covariate Shift

Implement BatchNorm from scratch: running mean, running variance, gamma, beta. Understand why it makes deep networks trainable.

Diversification doesn’t just reduce risk — it changes the nature of what you’re managing. By normalizing each position relative to the portfolio mean, you create stability that allows for more aggressive strategy. BatchNorm does the same for neural networks: by normalizing each layer’s inputs, it creates the stability that allows for deeper architectures and higher learning rates. — Day 12 Principle, adapted from the Marks framework

I. The Problem BatchNorm Solves

As training progresses, the distribution of each layer’s inputs shifts (because the previous layer’s weights change). This internal covariate shift forces each layer to continuously re-adapt. BatchNorm fixes this by normalizing each layer’s inputs to zero mean and unit variance, then learning a scale (γ) and shift (β) per feature.

```python
import torch

class BatchNorm1d:
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.training = True
        # Learnable parameters
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)
        # Running stats for inference
        self.running_mean = torch.zeros(dim)
        self.running_var = torch.ones(dim)

    def __call__(self, x):
        if self.training:
            # Per-batch statistics
            xmean = x.mean(0, keepdim=True)
            xvar = x.var(0, keepdim=True)
        else:
            # Smoothed statistics accumulated during training
            xmean = self.running_mean
            xvar = self.running_var
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)  # normalize to mean 0, var 1
        self.out = self.gamma * xhat + self.beta          # learnable scale and shift
        if self.training:
            with torch.no_grad():
                # Exponential moving average of batch statistics
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
        return self.out
```

Train vs. Eval Mode

During training, BatchNorm uses per-batch statistics. During inference, it uses exponentially smoothed running statistics accumulated during training. Forgetting to switch to eval mode (model.eval()) is a common source of inference bugs — your model uses noisy batch stats instead of stable running stats.
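A minimal sketch of this pitfall, using PyTorch's built-in nn.BatchNorm1d (the batch shape here is arbitrary): the same input produces different outputs in train and eval mode, because the two modes normalize with different statistics.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)

# Train mode: normalizes with per-batch stats and updates the running stats
bn.train()
x = torch.randn(32, 4) * 3 + 5   # deliberately shifted, wide input
y_train = bn(x)

# Eval mode: normalizes with the accumulated running stats instead
bn.eval()
y_eval = bn(x)

# The outputs differ: after one batch the running stats are still far
# from this batch's own mean and variance
print(torch.allclose(y_train, y_eval))  # False
```

Calling model.eval() on a network recursively flips this flag on every BatchNorm layer inside it, which is why forgetting the call silently changes inference results.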

Exhibit A — Before vs. After BatchNorm: Activation Distribution
[Histograms: without BatchNorm the activations are shifted and wide; with BatchNorm they are centered at mean 0 with unit variance.]

III. The Matrix — What Matters Today

Axes: Quick to Do vs. Slow but Worth It; Builds Deep Intuition vs. Surface-Level Only.

Quick to Do

🎯 DO FIRST — Implement BatchNorm1d from scratch. Insert it between linear layers in your MLP. Verify activations become centered.

⏭️ DO IF TIME — Compare training curves with and without BatchNorm. Note the faster convergence and tolerance for higher learning rates.

Slow but Worth It

🖐 DO CAREFULLY — Switch to eval mode and verify inference uses running stats. Confirm results differ from training-mode batch stats.

🚫 AVOID TODAY — LayerNorm (that's the Transformer variant, coming later in the course). Today is strictly BatchNorm.
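One way to run the "verify activations become centered" check from the first task (a sketch; the layer sizes and input distribution are made up for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small MLP with BatchNorm between linear layers (hypothetical sizes)
net = nn.Sequential(
    nn.Linear(10, 64), nn.BatchNorm1d(64), nn.Tanh(),
    nn.Linear(64, 64), nn.BatchNorm1d(64), nn.Tanh(),
)

x = torch.randn(128, 10) * 5 + 2    # deliberately shifted, wide inputs
h = net[0](x)                        # raw pre-activations: shifted away from zero
hn = net[1](h)                       # after BatchNorm: centered, unit variance

print(h.mean().item())               # noticeably nonzero
print(hn.mean(0).abs().max().item()) # ~0 per feature
```

The same check on a deeper or wider network gives the same result: per-feature mean near 0 and variance near 1 after every BatchNorm layer, regardless of what the preceding linear layer does.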

V. Today’s Deliverables

BatchNorm was the breakthrough that made deep networks practical. Before it, training a 10-layer network was an art. After it, it became engineering. Tomorrow you become a backprop ninja — computing gradients by hand through every tensor operation. — Day 12 Closing Principle
Day 12 Notebook — Batch Normalization from Scratch (runnable Python)

BatchNorm1d implementation, before/after comparison, train vs eval mode, and running statistics convergence.
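The running-statistics convergence mentioned above can be sketched as follows, using PyTorch's built-in BatchNorm1d (the distribution parameters are arbitrary): after enough training batches, the exponential moving averages settle near the true mean and variance of the input distribution.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(1, momentum=0.1)
bn.train()

# Feed many batches drawn from a fixed distribution: mean 5, std 3
for _ in range(500):
    bn(torch.randn(64, 1) * 3 + 5)

# The running stats converge toward the true statistics
print(bn.running_mean.item(), bn.running_var.item())  # ≈ 5 and ≈ 9
```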