I. The Vanishing Gradient Problem
In a deep network, gradients flow backward through many layers. At each layer with tanh activation,
the gradient is multiplied by the local derivative, a factor in (0, 1). If that factor averages 0.7, then after 10 layers it shrinks to 0.7¹⁰ ≈ 0.028 —
the gradient is 97% gone. Early layers barely learn. This is the vanishing gradient problem,
and it was the primary obstacle to training deep networks for decades.
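To make the arithmetic concrete, here is a minimal sketch in plain Python of that multiplicative decay (the 0.7 per-layer factor is the illustrative value above, not a measured one):

```python
# Multiplicative gradient decay, plain Python. The 0.7 per-layer
# factor is the illustrative value from the text, not a measured one.
factor = 0.7
grad = 1.0
for _ in range(10):
    grad *= factor
print(f"gradient scale after 10 layers: {grad:.3f}")  # ~0.028, i.e. ~97% gone
```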
II. Diagnosing Activations — The Histogram Test
The diagnostic tool: histogram the activations and gradients at each layer. Healthy activations are roughly Gaussian around zero. Saturated activations (all near ±1 for tanh) mean dead gradients. Activations clustered at zero mean the layer is doing nothing.
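As one way to run the histogram test, here is a sketch assuming PyTorch; the depth, width, batch size, and dummy loss are illustrative. It collects per-layer activations, backpropagates, and summarizes activation spread, gradient spread, and saturation:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, width = 5, 100          # illustrative sizes
layers = [nn.Linear(width, width) for _ in range(depth)]

h, acts = torch.randn(256, width), []
for lin in layers:
    h = torch.tanh(lin(h))
    h.retain_grad()            # keep .grad on these non-leaf tensors
    acts.append(h)
h.sum().backward()             # dummy loss, just to populate gradients

for i, a in enumerate(acts):
    sat = (a.abs() > 0.97).float().mean().item()   # saturation fraction
    print(f"layer {i}: act std {a.std().item():.3f}, "
          f"grad std {a.grad.std().item():.2e}, saturated {sat:.0%}")
```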
The Dead Neuron Problem
With ReLU activation, a neuron whose pre-activation is negative outputs zero and passes back a gradient of exactly zero. If the pre-activation is negative for every input, the neuron is dead, and once dead, it stays dead: no gradient means no update. In deep ReLU networks, 20-40% of neurons can die during training. Leaky ReLU (max(0.01x, x)) fixes this by providing a small gradient for negative inputs.
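A minimal sketch of spotting dead units, again assuming PyTorch; the layer size and the always-off-in-this-batch criterion are illustrative simplifications:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
lin = nn.Linear(100, 100)                # illustrative layer
x = torch.randn(256, 100)
pre = lin(x)                             # pre-activations

# A unit counts as "dead" here if its pre-activation is <= 0 for the
# whole batch: ReLU then outputs zero and backpropagates zero gradient.
dead = (pre <= 0).all(dim=0).float().mean().item()
print(f"always-off units in this batch: {dead:.0%}")

relu_out = F.relu(pre)                # gradient is exactly 0 where pre < 0
leaky_out = F.leaky_relu(pre, 0.01)   # max(0.01x, x): a small gradient survives
```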
III. Weight Initialization — Kaiming vs. Xavier
Proper initialization keeps activations in a healthy range. Xavier/Glorot init works for
sigmoid/tanh: scale weights by 1/√n_in. Kaiming/He init works for ReLU:
scale by √(2/n_in); the extra factor of 2 compensates for ReLU zeroing roughly half of each layer's outputs. Wrong initialization + deep network = instant gradient death.
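A sketch of both schemes applied by hand, assuming PyTorch (which also ships nn.init.xavier_normal_ and nn.init.kaiming_normal_ built-ins); the fan sizes are illustrative:

```python
import math
import torch
import torch.nn as nn

n_in, n_out = 100, 100            # illustrative fan-in/fan-out
w = torch.empty(n_out, n_in)

# Xavier/Glorot for sigmoid/tanh: std = 1/sqrt(n_in)
nn.init.normal_(w, mean=0.0, std=1.0 / math.sqrt(n_in))

# Kaiming/He for ReLU: std = sqrt(2/n_in)
nn.init.normal_(w, mean=0.0, std=math.sqrt(2.0 / n_in))
```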
IV. The Matrix — What Matters Today
DO FIRST
Build a 5-layer MLP. Histogram activations at each layer. Observe the vanishing gradient effect with bad init (a combined sketch for the three DO tasks follows the matrix).
DO IF TIME
Compare tanh vs. ReLU vs. LeakyReLU activation histograms. See which maintains gradients best through depth.
DO CAREFULLY
Apply Kaiming init and re-check histograms. The improvement should be dramatic — activations stay well-behaved.
AVOID TODAY
Batch normalization. That’s tomorrow’s fix. Today is purely about understanding the problem before applying the solution.
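A combined sketch of the three DO tasks, assuming PyTorch; the width, batch size, and the deliberately large "bad" init std of 0.3 are illustrative choices:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_stds(act_fn, init_std, depth=5, width=100):
    """Per-layer gradient std after one forward/backward pass."""
    torch.manual_seed(0)
    layers = [nn.Linear(width, width) for _ in range(depth)]
    for lin in layers:
        nn.init.normal_(lin.weight, std=init_std)
        nn.init.zeros_(lin.bias)
    h, acts = torch.randn(256, width), []
    for lin in layers:
        h = act_fn(lin(h))
        h.retain_grad()
        acts.append(h)
    h.sum().backward()               # dummy loss
    return [a.grad.std().item() for a in acts]

width = 100
for name, fn in [("tanh", torch.tanh), ("relu", torch.relu),
                 ("leaky", lambda t: F.leaky_relu(t, 0.01))]:
    bad = grad_stds(fn, init_std=0.3)                    # deliberately too large
    good = grad_stds(fn, init_std=math.sqrt(2 / width))  # Kaiming std
    print(name, "bad init:", ["%.1e" % g for g in bad])
    print(name, "kaiming: ", ["%.1e" % g for g in good])
```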
V. Today’s Deliverables
- Activation histograms: Plot activations at each layer of a deep MLP (plotting sketch after this list)
- Gradient histograms: Plot gradient magnitudes at each layer
- Vanishing demo: Show gradients shrinking by 10× or more across 5 layers
- Initialization fix: Apply Kaiming/Xavier init and show improved gradient flow
- Saturation metric: Compute % of saturated neurons (|tanh activation| > 0.97) at each layer
- Comparison: tanh vs. ReLU gradient flow in a 5-layer network
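A sketch of the two histogram deliverables, assuming PyTorch and matplotlib; it rebuilds the small tanh MLP from the Section II sketch and plots the raw per-layer histograms:

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

torch.manual_seed(0)
depth, width = 5, 100            # illustrative sizes
layers = [nn.Linear(width, width) for _ in range(depth)]
h, acts = torch.randn(256, width), []
for lin in layers:
    h = torch.tanh(lin(h))
    h.retain_grad()              # keep gradients for plotting
    acts.append(h)
h.sum().backward()               # dummy loss

fig, axes = plt.subplots(2, depth, figsize=(3 * depth, 5))
for i, a in enumerate(acts):
    axes[0, i].hist(a.detach().numpy().ravel(), bins=50)
    axes[0, i].set_title(f"layer {i} activations")
    axes[1, i].hist(a.grad.numpy().ravel(), bins=50)
    axes[1, i].set_title(f"layer {i} gradients")
fig.tight_layout()
plt.show()
```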
Diagnosing vanishing gradients: activation histograms, saturation analysis, Kaiming vs Xavier initialization, and gradient flow visualization.