PHASE 2 Deep Networks · Day 14 of 80 · makemore & GPT

Cross-Entropy, Softmax & Classification Gradients

Deep dive into the loss function that powers all of deep learning classification: its math, its numerical tricks, and its gradient.

Every investment decision is ultimately a classification: buy, hold, or sell. The confidence in that classification determines position sizing. In neural networks, softmax converts raw scores into confidences, and cross-entropy measures how wrong those confidences are. Today you understand both at the deepest level.— Day 14 Principle, adapted from the Marks framework

I. Softmax — From Logits to Probabilities

Softmax converts a vector of real numbers (logits) into a probability distribution: each element becomes exp(z_i) / sum_j(exp(z_j)). The result is always positive and sums to 1. The exponential amplifies differences: a logit gap of 2.0 becomes a probability ratio of e² ≈ 7.4×.

# Naive softmax (numerically unstable)
def softmax_naive(logits):
    counts = logits.exp()
    return counts / counts.sum(-1, keepdim=True)

# Stable softmax (subtract the max logit; softmax is shift-invariant)
def softmax_stable(logits):
    logits = logits - logits.max(-1, keepdim=True).values
    counts = logits.exp()
    return counts / counts.sum(-1, keepdim=True)
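The ≈7× claim above can be checked numerically. A minimal sketch using torch.softmax (which behaves like the stable version):

```python
import torch

logits = torch.tensor([2.0, 0.0])          # logit gap of 2.0
probs = torch.softmax(logits, dim=-1)
ratio = (probs[0] / probs[1]).item()       # the normalizer cancels, so this is exp(2)

print(probs.tolist())                      # always positive, sums to 1
print(round(ratio, 2))                     # ≈ 7.39
```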

II. Cross-Entropy Loss — The Elegant Gradient

Cross-entropy loss for classification: L = -log(p[correct_class]). The gradient with respect to logits has a beautiful closed form: dL/dz = softmax(z) - one_hot(target). This means the gradient is simply “what the model predicted minus what should have been predicted.”
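The closed form is easy to verify against autograd. One detail to keep in mind in this sketch: F.cross_entropy averages over the batch by default, so the analytic gradient is divided by the batch size N.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, C = 4, 10
logits = torch.randn(N, C, requires_grad=True)
targets = torch.randint(0, C, (N,))

# Autograd gradient
loss = F.cross_entropy(logits, targets)
loss.backward()

# Analytic gradient: (softmax(z) - one_hot(target)) / N  (mean reduction)
probs = F.softmax(logits.detach(), dim=-1)
manual = (probs - F.one_hot(targets, C).float()) / N

print(torch.allclose(logits.grad, manual, atol=1e-6))  # True
```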

Why F.cross_entropy Combines Everything

Computing exp, then log, then negating is numerically wasteful and dangerous (overflow/underflow). PyTorch’s F.cross_entropy uses the log-sum-exp trick internally: log(softmax(z)) = z - log(sum(exp(z))). This is exact, stable, and fast. Always use it instead of manual softmax + log + nll.
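A sketch of that identity: the fused loss equals -(z[target] - logsumexp(z)), averaged over the batch, and matches F.cross_entropy without ever exponentiating the raw logits.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))

# Manual, via the log-sum-exp identity: log_softmax(z) = z - logsumexp(z)
log_probs = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
manual = -log_probs[torch.arange(4), targets].mean()

fused = F.cross_entropy(logits, targets)
print(torch.allclose(manual, fused))  # True
```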

III. Temperature & Sharpness

# Temperature scaling: sharper or softer distributions
temperature = 0.5   # <1 = sharper (more confident)
probs = softmax(logits / temperature)

temperature = 2.0   # >1 = softer (more uniform)
probs = softmax(logits / temperature)

# As temperature → 0: argmax (deterministic)
# As temperature → ∞: uniform distribution
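A quick sketch of both limits on a fixed logit vector (values chosen arbitrarily for illustration):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1])

for T in [0.1, 1.0, 10.0]:
    probs = torch.softmax(logits / T, dim=-1)
    print(T, probs.tolist())

# Low T concentrates nearly all mass on the argmax;
# high T pushes every probability toward 1/3 (uniform).
```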

IV. The Matrix — What Matters Today

🎯 DO FIRST (Quick to Do · Builds Deep Intuition)

Implement stable softmax. Derive the cross-entropy gradient on paper. Verify: dlogits = probs - one_hot.

⏭️ DO IF TIME (Quick to Do · Surface-Level Only)

Demonstrate numerical overflow in naive softmax with large logits (>100). Show the stable version handles it.

🖐 DO CAREFULLY (Slow but Worth It · Builds Deep Intuition)

Experiment with temperature: plot softmax output at T=0.1, 0.5, 1.0, 2.0, 10.0. Understand the full spectrum.

🚫 AVOID TODAY (Slow but Worth It · Surface-Level Only)

Label smoothing, focal loss, or other loss variants. Master the standard cross-entropy first.
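The overflow demonstration from the DO IF TIME task can be sketched in a few lines:

```python
import torch

x = torch.tensor([1000.0, 0.0])

# Naive: exp(1000) overflows to inf, so the division produces inf/inf = nan
naive = x.exp() / x.exp().sum()

# Stable: torch.softmax subtracts the max logit internally before exponentiating
stable = torch.softmax(x, dim=-1)

print(naive.tolist())
print(stable.tolist())
```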

V. Today’s Deliverables

Cross-entropy and softmax are the foundation of every classification model from bigrams to GPT-4. You now understand them at the level of their gradients. Tomorrow: dilated causal convolutions and the WaveNet architecture.— Day 14 Closing Principle
Day 14 Notebook — Cross-Entropy, Softmax & Classification (runnable Python)

Naive vs stable softmax, temperature scaling visualization, cross-entropy gradient derivation, and the beautiful (prediction - truth) formula.