PHASE 1 Foundations · Day 8 of 80 · Neural Networks & Backprop

Training Loops, Loss Functions & Evaluation Splits

The neural bigram model: replace counting with gradient descent. Split data into train/dev/test. Learn proper model evaluation.

You cannot evaluate a fund’s returns using the same data that chose the strategy. In-sample performance is not out-of-sample truth. The discipline of train/dev/test splits is the machine learning equivalent of this principle: you must evaluate on data the model has never seen. — Day 8 Principle, adapted from the Marks framework

I. From Counting to Gradient Descent

Yesterday’s bigram model counted character pairs and normalized. Today we replace counting with a neural network: a 27×27 weight matrix W, trained via gradient descent, that learns the same probabilities. The result is essentially identical numerically, but the method generalizes to any architecture.

Exhibit A — Neural Bigram: One-Hot → W → Softmax → Loss
one_hot(x) [B, 27] → @ W [27, 27] → logits → softmax → probs [B, 27] → -log(P[target]) → NLL loss → .backward()
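One detail worth internalizing before the training loop: because each row of one_hot(x) contains a single 1, the product xenc @ W simply selects a row of W — the logits for character i are row i of the matrix. A minimal standalone sketch (with a random W rather than the trained one):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
W = torch.randn(27, 27)

x = torch.tensor([5])                         # a single character index
xenc = F.one_hot(x, num_classes=27).float()   # [1, 27], a single 1 at position 5
logits = xenc @ W                             # [1, 27]

# The matrix product with a one-hot row is just row selection:
assert torch.allclose(logits[0], W[5])
```

This is why the neural bigram can match the counting model exactly: W plays the role of a table of log-counts, one row per context character.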

II. The Code — Neural Bigram with Gradient Descent

import torch
import torch.nn.functional as F

# Prepare dataset
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(stoi[ch1])
        ys.append(stoi[ch2])
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()

# Initialize weight matrix
W = torch.randn((27, 27), requires_grad=True)

# Training loop
for k in range(100):
    # Forward pass
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W                               # [N, 27]
    counts = logits.exp()                           # softmax numerator
    probs = counts / counts.sum(1, keepdim=True)

    # Loss: negative log-likelihood + regularization
    loss = -probs[torch.arange(num), ys].log().mean()
    loss += 0.01 * (W**2).mean()                    # L2 regularization

    # Backward + update
    W.grad = None
    loss.backward()
    W.data += -50 * W.grad                          # larger lr for 27x27 matrix

print(f"final loss: {loss.item():.4f}")             # ~2.46, matching counting approach
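One caveat before scaling this up: logits.exp() overflows to inf for large logits. Subtracting the row-wise max before exponentiating leaves the softmax unchanged (the shift cancels in the ratio) and keeps everything finite. A small sketch with hypothetical large logits:

```python
import torch

# Hypothetical large logits: naive exp() overflows to inf, giving nan probs
logits = torch.tensor([[1000.0, 1001.0, 999.0]])
naive = logits.exp() / logits.exp().sum(1, keepdim=True)

# Shifting by the row max is softmax-invariant and numerically safe
shifted = logits - logits.max(1, keepdim=True).values
stable = shifted.exp() / shifted.exp().sum(1, keepdim=True)

print(naive)   # all nan (inf / inf)
print(stable)  # a valid probability distribution
```

The 27×27 bigram rarely produces logits large enough to trip this, but the MLP coming tomorrow can, which is one reason libraries compute softmax and cross-entropy jointly.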

III. Train / Dev / Test Splits — The Discipline of Evaluation

Exhibit B — Data Split Strategy: 80% / 10% / 10%
TRAIN (80%): fit parameters. DEV (10%): tune hyperparameters. TEST (10%): final score (evaluated once).
import random

random.shuffle(words)
n1 = int(0.8 * len(words))
n2 = int(0.9 * len(words))
train_words = words[:n1]
dev_words = words[n1:n2]
test_words = words[n2:]
print(f"train: {len(train_words)}, dev: {len(dev_words)}, test: {len(test_words)}")
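Once the splits exist, evaluating on one is just the forward pass and NLL over a held-out word list, with no gradient step. A self-contained sketch — nll_on_split is a hypothetical helper, and in the real notebook stoi, W, and the word lists would come from the code above:

```python
import torch
import torch.nn.functional as F

def nll_on_split(split_words, W, stoi):
    """Negative log-likelihood of a bigram model W on a held-out word list."""
    xs, ys = [], []
    for w in split_words:
        chs = ['.'] + list(w) + ['.']
        for ch1, ch2 in zip(chs, chs[1:]):
            xs.append(stoi[ch1])
            ys.append(stoi[ch2])
    xs, ys = torch.tensor(xs), torch.tensor(ys)
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W
    probs = logits.exp() / logits.exp().sum(1, keepdim=True)
    return -probs[torch.arange(len(ys)), ys].log().mean().item()

# Toy usage with hypothetical data and an untrained W:
stoi = {ch: i for i, ch in enumerate('.abcdefghijklmnopqrstuvwxyz')}
W = torch.zeros(27, 27)   # zero logits -> uniform model, NLL = log(27) ≈ 3.2958
print(nll_on_split(['emma', 'ava'], W, stoi))
```

Note that the helper never calls .backward(): held-out data is for measurement only, which is the whole point of the split.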

The Loss Should Match

The neural bigram trained by gradient descent should converge to essentially the same NLL (∼2.45) as the counting approach from yesterday; the small L2 term adds only a slight offset. If it doesn’t, you have a bug. This equivalence is the proof that the neural method is working correctly: it has learned the same probability distribution through optimization.
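The equivalence can be checked directly on a toy dataset: with no regularization, gradient descent on the one-hot linear model converges to exactly the normalized counts. A minimal sketch over a hypothetical 3-symbol alphabet:

```python
import torch
import torch.nn.functional as F

# Hypothetical 3-symbol bigram data, 4 observations per input symbol
xs = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
ys = torch.tensor([0, 0, 1, 2, 0, 1, 1, 2, 0, 1, 2, 2])

# Counting approach: normalize pair counts
N = torch.zeros(3, 3)
for x, y in zip(xs, ys):
    N[x, y] += 1
P_counts = N / N.sum(1, keepdim=True)

# Gradient approach: no regularization, so the optimum matches counting
W = torch.zeros((3, 3), requires_grad=True)
xenc = F.one_hot(xs, num_classes=3).float()
for _ in range(1000):
    logits = xenc @ W
    probs = logits.exp() / logits.exp().sum(1, keepdim=True)
    loss = -probs[torch.arange(len(ys)), ys].log().mean()
    W.grad = None
    loss.backward()
    W.data += -5 * W.grad

P_learned = F.softmax(W.detach(), dim=1)
assert torch.allclose(P_learned, P_counts, atol=1e-2)
```

The same check on the full 27×27 model is today's primary deliverable; the toy version just makes the claim cheap to falsify.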

IV. The Matrix — What Matters Today

The matrix rates each task along two axes: payoff (Builds Deep Intuition vs. Surface-Level Only) and cost (Quick to Do vs. Slow but Worth It).

🎯 DO FIRST (deep intuition, quick)

Implement the neural bigram training loop. Verify that the final NLL matches the counting approach (∼2.45).

⏭️ DO IF TIME (quick)

Implement train/dev/test splits. Train on the train set, evaluate on dev. Confirm the two losses are close (for this simple model they should be).

🖐 DO CAREFULLY (deep intuition, slow but worth it)

Add L2 regularization (0.01 * (W**2).mean()). Understand how it pushes weights toward uniform probabilities when the data is sparse.

🚫 AVOID TODAY

Using nn.CrossEntropyLoss or any other PyTorch abstraction. Implement softmax and NLL manually to understand what they do.
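For the L2 item above, the effect is easiest to see in isolation: the penalty shrinks W toward zero, and zero logits mean uniform probabilities, so a larger coefficient pulls the learned distribution toward uniform. A toy sketch with hypothetical 3-symbol data, where symbol 0 is always followed by symbol 1:

```python
import torch
import torch.nn.functional as F

# Hypothetical sparse data: symbol 0 seen only twice, both times before symbol 1
xs = torch.tensor([0, 0])
ys = torch.tensor([1, 1])

def train(reg):
    """Train a tiny 3x3 bigram model with L2 coefficient `reg`."""
    W = torch.zeros((3, 3), requires_grad=True)
    for _ in range(1000):
        xenc = F.one_hot(xs, num_classes=3).float()
        logits = xenc @ W
        probs = logits.exp() / logits.exp().sum(1, keepdim=True)
        loss = -probs[torch.arange(2), ys].log().mean() + reg * (W**2).mean()
        W.grad = None
        loss.backward()
        W.data += -1.0 * W.grad
    return probs.detach()[0]       # learned P(next | symbol 0)

p_weak = train(0.01)
p_strong = train(1.0)
print(p_weak[1], p_strong[1])      # strong regularization sits closer to 1/3
assert p_strong[1] < p_weak[1]
```

With only two observations, the unregularized optimum would drive P(1 | 0) to 1.0 (and W to infinity); the penalty acts like the smoothing count added in the counting model.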

V. Today’s Deliverables

The neural bigram produces the same result as counting — but the framework is what matters. This framework scales. Tomorrow you replace the single weight matrix with an MLP, add embeddings, and suddenly the model can see more than one character of context. That is the leap from bigram to language model. — Day 8 Closing Principle
Day 8 Notebook — Training Loops, Loss Functions & Splits (runnable Python)

Neural bigram with cross-entropy, L2 regularization, train/dev/test splits, and comparison to count-based model.