I. The Problem BatchNorm Solves
As training progresses, the distribution of each layer’s inputs shifts (because the previous layer’s
weights change). This internal covariate shift forces each layer to continuously re-adapt.
BatchNorm addresses this by normalizing each layer’s inputs to zero mean and unit variance using the current batch’s statistics, then applying a learned per-feature scale (γ) and shift (β) so the network can still represent any distribution it needs.
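The normalize-then-affine step can be sketched in a few lines (a minimal training-mode forward pass; `eps` and the function name are illustrative, not from the original):

```python
import torch

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch, features); normalize each feature over the batch dimension
    mean = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, keepdim=True, unbiased=False)  # biased (population) variance
    xhat = (x - mean) / torch.sqrt(var + eps)          # zero mean, unit variance per feature
    return gamma * xhat + beta                          # learned scale and shift

x = torch.randn(32, 10) * 3 + 5                         # badly scaled activations
out = batchnorm_forward(x, torch.ones(10), torch.zeros(10))
print(out.mean(dim=0).abs().max())                      # ~0: centered
print(out.var(dim=0, unbiased=False))                   # ~1 per feature
```

With γ = 1 and β = 0 this is pure normalization; during training γ and β are optimized like any other parameters.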
II. Train vs. Eval Mode
During training, BatchNorm uses per-batch statistics. During inference, it uses exponentially smoothed running statistics accumulated during training. Forgetting to switch to eval mode (model.eval()) is a common source of inference bugs — your model uses noisy batch stats instead of stable running stats.
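The two code paths can be sketched as follows (a from-scratch sketch; the momentum value 0.1 mirrors PyTorch's default, and all names here are illustrative):

```python
import torch

momentum = 0.1
running_mean = torch.zeros(10)
running_var = torch.ones(10)

def bn_train_step(x):
    """Training: normalize with batch stats, update the running averages."""
    global running_mean, running_var
    mean, var = x.mean(dim=0), x.var(dim=0, unbiased=False)
    # exponential moving average: new = (1 - momentum) * old + momentum * batch
    running_mean = (1 - momentum) * running_mean + momentum * mean
    running_var = (1 - momentum) * running_var + momentum * var
    return (x - mean) / torch.sqrt(var + 1e-5)

def bn_eval(x):
    """Inference: use stable running stats, independent of the current batch."""
    return (x - running_mean) / torch.sqrt(running_var + 1e-5)

# feed batches drawn from N(3, 2^2); the running stats converge toward them
for _ in range(200):
    bn_train_step(torch.randn(64, 10) * 2 + 3)
print(running_mean)   # ~3 per feature
print(running_var)    # ~4 per feature
```

If `bn_eval` were skipped and batch stats were used at inference, a batch of one example would normalize every feature to exactly zero, which is the failure mode `model.eval()` prevents.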
III. The Matrix — What Matters Today
DO FIRST
Implement BatchNorm1d from scratch. Insert it between linear layers in your MLP. Verify activations become centered.
DO IF TIME
Compare training curves with and without BatchNorm. Note the faster convergence and tolerance for higher learning rates.
DO CAREFULLY
Switch to eval mode and verify inference uses running stats. Confirm results differ from training mode batch stats.
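One way to run this check against PyTorch's built-in layer (a plain `nn.BatchNorm1d` stands in for the from-scratch version; the seed and batch sizes are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(10)

# a few training-mode passes populate running_mean / running_var
bn.train()
for _ in range(50):
    bn(torch.randn(64, 10) * 2 + 3)

x = torch.randn(64, 10) * 2 + 3
y_train = bn(x)        # normalized with this batch's stats
bn.eval()
y_eval = bn(x)         # same input, normalized with running stats
print((y_train - y_eval).abs().max())   # nonzero: the two modes genuinely differ
```

The difference shrinks as the running statistics converge to the data distribution, but it never reaches exactly zero on a finite batch.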
AVOID TODAY
LayerNorm (that’s the Transformer variant, coming in Phase 2). Today is strictly BatchNorm.
V. Today’s Deliverables
- BatchNorm1d: Implement from scratch with gamma, beta, running_mean, running_var
- Integration: Add BatchNorm between each linear layer in the makemore MLP
- Histograms: Show activation distributions before and after BatchNorm
- Train vs. eval: Demonstrate the difference between batch stats and running stats
- Learning rate: Show that BatchNorm allows 2-5× higher learning rates
- Loss improvement: Achieve lower dev NLL than Day 10’s best model
In short: a BatchNorm1d implementation, a before/after comparison, train vs. eval mode behavior, and running-statistics convergence.