PHASE 2 Deep Networks · Day 11 of 80 · makemore & GPT

Activations & Gradients — The Fragility of Deep Nets

Why deep networks break: vanishing and exploding gradients, dead neurons, and saturated activations. Diagnose before you cure.

Risk in a portfolio doesn’t come from what you see — it comes from what you don’t. A position that looks stable can have hidden leverage, correlated exposures, or tail risk that only manifests in crisis. Deep networks have the same hidden fragility: activations that saturate, gradients that vanish, neurons that die. Today you learn to see the invisible risks inside your network. — Day 11 Principle, adapted from the Marks framework

I. The Vanishing Gradient Problem

In a deep network, gradients flow backward through many layers. At each tanh layer, the gradient is multiplied by the local derivative, a factor typically well below 1. If that factor averages 0.7, then after 10 layers the gradient is scaled by 0.7^10 ≈ 0.028 — over 97% of it is gone. Early layers barely learn. This is the vanishing gradient problem, and it was the primary obstacle to training deep networks for decades.

Exhibit A — Gradient Magnitude Across Layers (tanh, no BatchNorm)
[Figure: per-layer gradient magnitude from output (L10) back to input; gradients shrink exponentially with depth, with layers L1–L4 near zero.]
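The shape of the exhibit can be reproduced in a few lines of PyTorch. This is a minimal sketch — the width, depth, batch size, and init scale here are illustrative assumptions, not values from the day's notebook:

```python
import torch

torch.manual_seed(42)
depth, width = 10, 100
x = torch.randn(32, width, requires_grad=True)

h = x
acts = []
for _ in range(depth):
    W = torch.randn(width, width) / width**0.5  # naive 1/sqrt(n) init
    h = torch.tanh(h @ W)
    h.retain_grad()  # keep gradients on intermediate activations
    acts.append(h)

acts[-1].sum().backward()
grad_norms = [a.grad.abs().mean().item() for a in acts]
# grad_norms[0] (earliest layer) is far smaller than grad_norms[-1] (output):
# each backward step multiplies by tanh's derivative, which is at most 1.
```

Printing `grad_norms` from output to input traces the same exponential decay the exhibit shows.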

II. Diagnosing Activations — The Histogram Test

The diagnostic tool: histogram the activations and gradients at each layer. Healthy activations are roughly Gaussian around zero. Saturated activations (all near ±1 for tanh) mean dead gradients. Activations clustered at zero mean the layer is doing nothing.

```python
# After the forward pass, check each layer's activations
for i, layer in enumerate(layers):
    h = layer.out  # activations stored during the forward pass
    print(f"layer {i}: mean={h.mean():.4f}, std={h.std():.4f}, "
          f"saturated={(h.abs() > 0.97).float().mean():.2f}")

# Healthy: mean ≈ 0, std ≈ 0.6, saturated < 5%
# Bad:     mean ≠ 0, std ≈ 1.0, saturated > 20%
```

The Dead Neuron Problem

With ReLU activation, any input with a negative pre-activation produces zero output and exactly zero gradient. A neuron whose pre-activation is negative for every example is dead — and once dead, it stays dead: no gradient means no update. In deep ReLU networks, up to 20-40% of neurons can die during training. Leaky ReLU (max(0.01x, x)) fixes this by providing a small gradient for negative inputs.
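A minimal sketch of the escape route Leaky ReLU provides: a unit whose pre-activations are all negative gets exactly zero gradient under ReLU, but a small nonzero one under Leaky ReLU (the input values here are made up for illustration):

```python
import torch
import torch.nn.functional as F

# All pre-activations negative: the worst case for ReLU
x = torch.tensor([-2.0, -0.5, -1.5], requires_grad=True)

F.relu(x).sum().backward()
relu_grad = x.grad.clone()   # zero everywhere: the unit cannot recover

x.grad = None                # reset before the second backward pass
F.leaky_relu(x, negative_slope=0.01).sum().backward()
leaky_grad = x.grad.clone()  # 0.01 everywhere: small but nonzero

print(relu_grad)   # tensor([0., 0., 0.])
print(leaky_grad)  # tensor([0.0100, 0.0100, 0.0100])
```

The 0.01 slope is tiny, but it keeps the weight update alive so the neuron can drift back into the active region.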

III. Weight Initialization — Kaiming vs. Xavier

Proper initialization keeps activations in a healthy range. Xavier/Glorot init works for sigmoid/tanh: scale weights by 1/√n_in. Kaiming/He init works for ReLU: scale by √(2/n_in). Wrong initialization + deep network = instant gradient death.

```python
import torch

fan_in, fan_out = 200, 200  # example layer shape

# Kaiming initialization for ReLU
W = torch.randn((fan_in, fan_out)) * (2 / fan_in)**0.5

# Xavier initialization for tanh
W = torch.randn((fan_in, fan_out)) * (1 / fan_in)**0.5

# This is what torch.nn.init.kaiming_normal_ does under the hood
```
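The claim can be checked with a small forward-pass experiment — a sketch with assumed sizes, comparing the two scales under ReLU:

```python
import torch

torch.manual_seed(0)
depth, width = 10, 200
x = torch.randn(1000, width)

def layer_stds(scale):
    """Forward pass through `depth` ReLU layers; record each layer's std."""
    h, stds = x, []
    for _ in range(depth):
        W = torch.randn(width, width) * scale
        h = torch.relu(h @ W)
        stds.append(h.std().item())
    return stds

xavier_stds  = layer_stds((1 / width) ** 0.5)  # Xavier scale: wrong for ReLU
kaiming_stds = layer_stds((2 / width) ** 0.5)  # Kaiming scale: right for ReLU
# Under ReLU, the Xavier-scaled activations roughly halve in variance at
# every layer and collapse toward zero; the Kaiming factor of 2 compensates
# for ReLU zeroing half the inputs, so the spread stays steady with depth.
```

Histogramming `h` at each layer (as in Section II) makes the same point visually: the Xavier run's histograms narrow to a spike at zero, the Kaiming run's stay wide.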

IV. The Matrix — What Matters Today

Quick to Do / Builds Deep Intuition: 🎯 DO FIRST
Build a 5-layer MLP. Histogram activations at each layer. Observe the vanishing gradient effect with bad init.

Quick to Do / Surface-Level Only: ⏭️ DO IF TIME
Compare tanh vs. ReLU vs. LeakyReLU activation histograms. See which maintains gradients best through depth.

Slow but Worth It / Builds Deep Intuition: 🖐 DO CAREFULLY
Apply Kaiming init and re-check histograms. The improvement should be dramatic — activations stay well-behaved.

Slow but Worth It / Surface-Level Only: 🚫 AVOID TODAY
Batch normalization. That's tomorrow's fix. Today is purely about understanding the problem before applying the solution.

V. Today’s Deliverables

You cannot fix what you cannot measure. Before applying batch normalization, residual connections, or any other technique, you must first see the problem. Activation and gradient histograms are your diagnostic tool. Tomorrow, you apply the cure: batch normalization. — Day 11 Closing Principle
Day 11 Notebook — Activations & Gradients (runnable Python)

Diagnosing vanishing gradients: activation histograms, saturation analysis, Kaiming vs Xavier initialization, and gradient flow visualization.