PHASE 2 Deep Networks · Day 14 of 80 · makemore & GPT

Cross-Entropy, Softmax & Classification Gradients

Deep dive into the loss function that powers all of deep learning classification: its math, its numerical tricks, and its gradient.

Every investment decision is ultimately a classification: buy, hold, or sell. The confidence in that classification determines position sizing. In neural networks, softmax converts raw scores into confidences, and cross-entropy measures how wrong those confidences are. Today you understand both at the deepest level.— Day 14 Principle, adapted from the Marks framework

I. Softmax — From Logits to Probabilities

Softmax converts a vector of real numbers (logits) into a probability distribution: each element becomes exp(z_i) / sum_j(exp(z_j)). The result is always positive and sums to 1. The exponential amplifies differences: a logit gap of 2.0 becomes a probability ratio of e² ≈ 7.4×.

# Naive softmax (numerically unstable)
def softmax_naive(logits):
    counts = logits.exp()
    return counts / counts.sum(-1, keepdim=True)

# Stable softmax (subtract the max logit; softmax is shift-invariant)
def softmax_stable(logits):
    logits = logits - logits.max(-1, keepdim=True).values
    counts = logits.exp()
    return counts / counts.sum(-1, keepdim=True)
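The ≈7× claim above can be checked numerically. A minimal sketch using torch.softmax (which behaves like the stable version):

```python
import torch

logits = torch.tensor([2.0, 0.0])          # logit gap of 2.0
probs = torch.softmax(logits, dim=-1)
ratio = (probs[0] / probs[1]).item()       # the normalizer cancels, so this is exp(2)

print(probs.tolist())                      # always positive, sums to 1
print(round(ratio, 2))                     # ≈ 7.39
```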

II. Cross-Entropy Loss — The Elegant Gradient

Cross-entropy loss for classification: L = -log(p[correct_class]). The gradient with respect to logits has a beautiful closed form: dL/dz = softmax(z) - one_hot(target). This means the gradient is simply “what the model predicted minus what should have been predicted.”
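The closed form is easy to verify against autograd. One detail to keep in mind in this sketch: F.cross_entropy averages over the batch by default, so the analytic gradient is divided by the batch size N.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, C = 4, 10
logits = torch.randn(N, C, requires_grad=True)
targets = torch.randint(0, C, (N,))

# Autograd gradient
loss = F.cross_entropy(logits, targets)
loss.backward()

# Analytic gradient: (softmax(z) - one_hot(target)) / N  (mean reduction)
probs = F.softmax(logits.detach(), dim=-1)
manual = (probs - F.one_hot(targets, C).float()) / N

print(torch.allclose(logits.grad, manual, atol=1e-6))  # True
```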

Why F.cross_entropy Combines Everything

Computing exp, then log, then negating is numerically wasteful and dangerous (overflow/underflow). PyTorch’s F.cross_entropy uses the log-sum-exp trick internally: log(softmax(z)) = z - log(sum(exp(z))). This is exact, stable, and fast. Always use it instead of manual softmax + log + nll.
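A sketch of that identity: the fused loss equals -(z[target] - logsumexp(z)), averaged over the batch, and matches F.cross_entropy without ever exponentiating the raw logits.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))

# Manual, via the log-sum-exp identity: log_softmax(z) = z - logsumexp(z)
log_probs = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
manual = -log_probs[torch.arange(4), targets].mean()

fused = F.cross_entropy(logits, targets)
print(torch.allclose(manual, fused))  # True
```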

III. Temperature & Sharpness

# Temperature scaling: sharper or softer distributions
temperature = 0.5   # <1 = sharper (more confident)
probs = softmax(logits / temperature)

temperature = 2.0   # >1 = softer (more uniform)
probs = softmax(logits / temperature)

# As temperature → 0: argmax (deterministic)
# As temperature → ∞: uniform distribution
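A quick sketch of both limits on a fixed logit vector (values chosen arbitrarily for illustration):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1])

for T in [0.1, 1.0, 10.0]:
    probs = torch.softmax(logits / T, dim=-1)
    print(T, probs.tolist())

# Low T concentrates nearly all mass on the argmax;
# high T pushes every probability toward 1/3 (uniform).
```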

IV. The Matrix — What Matters Today

🎯 DO FIRST (Quick to Do · Builds Deep Intuition)

Implement stable softmax. Derive the cross-entropy gradient on paper. Verify: dlogits = probs - one_hot.

⏭️ DO IF TIME (Quick to Do · Surface-Level Only)

Demonstrate numerical overflow in naive softmax with large logits (>100). Show the stable version handles it.

🖐 DO CAREFULLY (Slow but Worth It · Builds Deep Intuition)

Experiment with temperature: plot softmax output at T=0.1, 0.5, 1.0, 2.0, 10.0. Understand the full spectrum.

🚫 AVOID TODAY (Slow but Worth It · Surface-Level Only)

Label smoothing, focal loss, or other loss variants. Master the standard cross-entropy first.
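The overflow demonstration from the DO IF TIME task can be sketched in a few lines:

```python
import torch

x = torch.tensor([1000.0, 0.0])

# Naive: exp(1000) overflows to inf, so the division produces inf/inf = nan
naive = x.exp() / x.exp().sum()

# Stable: torch.softmax subtracts the max logit internally before exponentiating
stable = torch.softmax(x, dim=-1)

print(naive.tolist())
print(stable.tolist())
```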

V. Today’s Deliverables

Cross-entropy and softmax are the foundation of every classification model from bigrams to GPT-4. You now understand them at the level of their gradients. Tomorrow: dilated causal convolutions and the WaveNet architecture.— Day 14 Closing Principle
Day 14 Notebook — Cross-Entropy, Softmax & Classification (runnable Python)

Naive vs stable softmax, temperature scaling visualization, cross-entropy gradient derivation, and the beautiful (prediction - truth) formula.