I. Softmax — From Logits to Probabilities
Softmax converts a vector of real numbers (logits) into a probability distribution: element i becomes
exp(z_i) / sum_j(exp(z_j)). Every output is strictly positive and the outputs sum to 1. The exponential
amplifies differences: a logit gap of 2.0 becomes a probability ratio of e^2 ≈ 7.4×.
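A minimal sketch of naive vs stable softmax in PyTorch (the function names here are illustrative, not from any library):

```python
import torch

def naive_softmax(z):
    # exp() overflows to inf for large logits (float32 tops out near exp(88))
    e = torch.exp(z)
    return e / e.sum()

def stable_softmax(z):
    # subtracting max(z) cancels out in the ratio but keeps exp() in range
    e = torch.exp(z - z.max())
    return e / e.sum()

z = torch.tensor([2.0, 0.0])
p = stable_softmax(z)
# the logit gap of 2.0 gives a probability ratio of exp(2) ≈ 7.39
```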
II. Cross-Entropy Loss — The Elegant Gradient
Cross-entropy loss for classification: L = -log(p[correct_class]). The gradient with respect
to logits has a beautiful closed form: dL/dz = softmax(z) - one_hot(target). This means the
gradient is simply “what the model predicted minus what should have been predicted.”
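This identity is easy to check numerically against autograd; a quick sketch (shapes and class index are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
z = torch.randn(5, requires_grad=True)   # logits for one example
target = torch.tensor([2])               # index of the correct class

loss = F.cross_entropy(z.unsqueeze(0), target)
loss.backward()

# closed-form gradient: softmax(z) - one_hot(target)
expected = torch.softmax(z.detach(), dim=0)
expected[target.item()] -= 1.0

assert torch.allclose(z.grad, expected, atol=1e-6)
```

A nice corollary: since the probabilities sum to 1 and the one-hot vector sums to 1, the gradient over the logits always sums to zero.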
Why F.cross_entropy Combines Everything
Computing exp, then log, then negating is numerically wasteful and dangerous (overflow/underflow). PyTorch's F.cross_entropy uses the log-sum-exp trick internally: log(softmax(z)) = z - log(sum(exp(z))). This is exact, stable, and fast. Always use it instead of manual softmax + log + NLL.
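A small demonstration of why this matters, using extreme logits chosen to force the overflow (the values are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1000.0, 0.0, -1000.0]])
target = torch.tensor([0])

# naive path: exp(1000) overflows to inf, so the ratio inf/inf is nan
e = torch.exp(logits)
naive_probs = e / e.sum(dim=1, keepdim=True)

# stable path: log(softmax(z)) = z - logsumexp(z), never exponentiating raw logits
log_probs = logits - torch.logsumexp(logits, dim=1, keepdim=True)
manual_loss = -log_probs[0, target[0]]

assert torch.allclose(manual_loss, F.cross_entropy(logits, target))
```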
III. Temperature & Sharpness
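Temperature rescales the logits before softmax: dividing by T < 1 exaggerates the gaps (the distribution sharpens toward one-hot), while T > 1 shrinks them (it flattens toward uniform). A quick sketch over the temperatures used below (the example logits are illustrative):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.0])

# T < 1 sharpens toward argmax; T > 1 softens toward uniform
for T in [0.1, 0.5, 1.0, 2.0, 10.0]:
    p = torch.softmax(logits / T, dim=0)
    print(f"T={T:>4}: {[round(x, 3) for x in p.tolist()]}")
```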
IV. The Matrix — What Matters Today
DO FIRST
Implement stable softmax. Derive the cross-entropy gradient on paper. Verify: dlogits = probs - one_hot.
DO IF TIME
Demonstrate numerical overflow in naive softmax with large logits (>100). Show the stable version handles it.
DO CAREFULLY
Experiment with temperature: plot softmax output at T=0.1, 0.5, 1.0, 2.0, 10.0. Understand the full spectrum.
AVOID TODAY
Label smoothing, focal loss, or other loss variants. Master the standard cross-entropy first.
V. Today’s Deliverables
- Stable softmax: Implement with max-subtraction trick
- Cross-entropy: Implement NLL manually, verify against F.cross_entropy
- Gradient derivation: Prove that dL/dz = softmax(z) - one_hot(y)
- Temperature: Demonstrate sharpening and softening of distributions
- Numerical test: Show overflow with naive softmax on logits > 100
- Integration: Replace manual softmax+NLL with F.cross_entropy in makemore
In short: naive vs stable softmax, temperature scaling visualization, the cross-entropy gradient derivation, and the beautiful (prediction - truth) formula.