PHASE 1 Foundations · Day 8 of 80 · Neural Networks & Backprop

Training Loops, Loss Functions & Evaluation Splits

The neural bigram model: replace counting with gradient descent. Split data into train/dev/test. Learn proper model evaluation.

You cannot evaluate a fund’s returns using the same data that chose the strategy. In-sample performance is not out-of-sample truth. The discipline of train/dev/test splits is the machine learning equivalent of this principle: you must evaluate on data the model has never seen. — Day 8 Principle, adapted from the Marks framework

I. From Counting to Gradient Descent

Yesterday’s bigram model counted character pairs and normalized. Today we replace counting with a neural network: a 27×27 weight matrix W, trained via gradient descent, that learns the same probabilities. The result is essentially identical numerically, but the method generalizes to any architecture.

Exhibit A — Neural Bigram: One-Hot → W → Softmax → Loss
one_hot(x) [B, 27] → @ W [27, 27] → logits → softmax → probs [B, 27] → -log(P[target]) → NLL loss → .backward()
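One detail worth internalizing before the training loop: because each row of one_hot(x) contains a single 1, the product xenc @ W simply selects a row of W — the logits for character i are row i of the matrix. A minimal standalone sketch (with a random W rather than the trained one):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
W = torch.randn(27, 27)

x = torch.tensor([5])                         # a single character index
xenc = F.one_hot(x, num_classes=27).float()   # [1, 27], a single 1 at position 5
logits = xenc @ W                             # [1, 27]

# The matrix product with a one-hot row is just row selection:
assert torch.allclose(logits[0], W[5])
```

This is why the neural bigram can match the counting model exactly: W plays the role of a table of log-counts, one row per context character.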

II. The Code — Neural Bigram with Gradient Descent

import torch
import torch.nn.functional as F

# Prepare dataset
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(stoi[ch1])
        ys.append(stoi[ch2])
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()

# Initialize weight matrix
W = torch.randn((27, 27), requires_grad=True)

# Training loop
for k in range(100):
    # Forward pass
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W                               # [N, 27]
    counts = logits.exp()                           # softmax numerator
    probs = counts / counts.sum(1, keepdim=True)

    # Loss: negative log-likelihood + regularization
    loss = -probs[torch.arange(num), ys].log().mean()
    loss += 0.01 * (W**2).mean()                    # L2 regularization

    # Backward + update
    W.grad = None
    loss.backward()
    W.data += -50 * W.grad                          # larger lr for 27x27 matrix

print(f"final loss: {loss.item():.4f}")             # ~2.46, matching counting approach
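One caveat before scaling this up: logits.exp() overflows to inf for large logits. Subtracting the row-wise max before exponentiating leaves the softmax unchanged (the shift cancels in the ratio) and keeps everything finite. A small sketch with hypothetical large logits:

```python
import torch

# Hypothetical large logits: naive exp() overflows to inf, giving nan probs
logits = torch.tensor([[1000.0, 1001.0, 999.0]])
naive = logits.exp() / logits.exp().sum(1, keepdim=True)

# Shifting by the row max is softmax-invariant and numerically safe
shifted = logits - logits.max(1, keepdim=True).values
stable = shifted.exp() / shifted.exp().sum(1, keepdim=True)

print(naive)   # all nan (inf / inf)
print(stable)  # a valid probability distribution
```

The 27×27 bigram rarely produces logits large enough to trip this, but the MLP coming tomorrow can, which is one reason libraries compute softmax and cross-entropy jointly.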

III. Train / Dev / Test Splits — The Discipline of Evaluation

Exhibit B — Data Split Strategy: 80% / 10% / 10%
TRAIN (80%): fit parameters. DEV (10%): tune hyperparameters. TEST (10%): final score (evaluated once).
import random

random.shuffle(words)
n1 = int(0.8 * len(words))
n2 = int(0.9 * len(words))
train_words = words[:n1]
dev_words = words[n1:n2]
test_words = words[n2:]
print(f"train: {len(train_words)}, dev: {len(dev_words)}, test: {len(test_words)}")
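Once the splits exist, evaluating on one is just the forward pass and NLL over a held-out word list, with no gradient step. A self-contained sketch — nll_on_split is a hypothetical helper, and in the real notebook stoi, W, and the word lists would come from the code above:

```python
import torch
import torch.nn.functional as F

def nll_on_split(split_words, W, stoi):
    """Negative log-likelihood of a bigram model W on a held-out word list."""
    xs, ys = [], []
    for w in split_words:
        chs = ['.'] + list(w) + ['.']
        for ch1, ch2 in zip(chs, chs[1:]):
            xs.append(stoi[ch1])
            ys.append(stoi[ch2])
    xs, ys = torch.tensor(xs), torch.tensor(ys)
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W
    probs = logits.exp() / logits.exp().sum(1, keepdim=True)
    return -probs[torch.arange(len(ys)), ys].log().mean().item()

# Toy usage with hypothetical data and an untrained W:
stoi = {ch: i for i, ch in enumerate('.abcdefghijklmnopqrstuvwxyz')}
W = torch.zeros(27, 27)   # zero logits -> uniform model, NLL = log(27) ≈ 3.2958
print(nll_on_split(['emma', 'ava'], W, stoi))
```

Note that the helper never calls .backward(): held-out data is for measurement only, which is the whole point of the split.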

The Loss Should Match

The neural bigram trained by gradient descent should converge to essentially the same NLL (∼2.45) as the counting approach from yesterday; the small L2 term adds only a slight offset. If it doesn’t, you have a bug. This equivalence is the proof that the neural method is working correctly: it has learned the same probability distribution through optimization.
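The equivalence can be checked directly on a toy dataset: with no regularization, gradient descent on the one-hot linear model converges to exactly the normalized counts. A minimal sketch over a hypothetical 3-symbol alphabet:

```python
import torch
import torch.nn.functional as F

# Hypothetical 3-symbol bigram data, 4 observations per input symbol
xs = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
ys = torch.tensor([0, 0, 1, 2, 0, 1, 1, 2, 0, 1, 2, 2])

# Counting approach: normalize pair counts
N = torch.zeros(3, 3)
for x, y in zip(xs, ys):
    N[x, y] += 1
P_counts = N / N.sum(1, keepdim=True)

# Gradient approach: no regularization, so the optimum matches counting
W = torch.zeros((3, 3), requires_grad=True)
xenc = F.one_hot(xs, num_classes=3).float()
for _ in range(1000):
    logits = xenc @ W
    probs = logits.exp() / logits.exp().sum(1, keepdim=True)
    loss = -probs[torch.arange(len(ys)), ys].log().mean()
    W.grad = None
    loss.backward()
    W.data += -5 * W.grad

P_learned = F.softmax(W.detach(), dim=1)
assert torch.allclose(P_learned, P_counts, atol=1e-2)
```

The same check on the full 27×27 model is today's primary deliverable; the toy version just makes the claim cheap to falsify.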

IV. The Matrix — What Matters Today

The matrix rates each task along two axes: payoff (Builds Deep Intuition vs. Surface-Level Only) and cost (Quick to Do vs. Slow but Worth It).

🎯 DO FIRST (deep intuition, quick)

Implement the neural bigram training loop. Verify that the final NLL matches the counting approach (∼2.45).

⏭️ DO IF TIME (quick)

Implement train/dev/test splits. Train on the train set, evaluate on dev. Confirm the two losses are close (for this simple model they should be).

🖐 DO CAREFULLY (deep intuition, slow but worth it)

Add L2 regularization (0.01 * (W**2).mean()). Understand how it pushes weights toward uniform probabilities when the data is sparse.

🚫 AVOID TODAY

Using nn.CrossEntropyLoss or any other PyTorch abstraction. Implement softmax and NLL manually to understand what they do.
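For the L2 item above, the effect is easiest to see in isolation: the penalty shrinks W toward zero, and zero logits mean uniform probabilities, so a larger coefficient pulls the learned distribution toward uniform. A toy sketch with hypothetical 3-symbol data, where symbol 0 is always followed by symbol 1:

```python
import torch
import torch.nn.functional as F

# Hypothetical sparse data: symbol 0 seen only twice, both times before symbol 1
xs = torch.tensor([0, 0])
ys = torch.tensor([1, 1])

def train(reg):
    """Train a tiny 3x3 bigram model with L2 coefficient `reg`."""
    W = torch.zeros((3, 3), requires_grad=True)
    for _ in range(1000):
        xenc = F.one_hot(xs, num_classes=3).float()
        logits = xenc @ W
        probs = logits.exp() / logits.exp().sum(1, keepdim=True)
        loss = -probs[torch.arange(2), ys].log().mean() + reg * (W**2).mean()
        W.grad = None
        loss.backward()
        W.data += -1.0 * W.grad
    return probs.detach()[0]       # learned P(next | symbol 0)

p_weak = train(0.01)
p_strong = train(1.0)
print(p_weak[1], p_strong[1])      # strong regularization sits closer to 1/3
assert p_strong[1] < p_weak[1]
```

With only two observations, the unregularized optimum would drive P(1 | 0) to 1.0 (and W to infinity); the penalty acts like the smoothing count added in the counting model.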

V. Today’s Deliverables

The neural bigram produces the same result as counting — but the framework is what matters. This framework scales. Tomorrow you replace the single weight matrix with an MLP, add embeddings, and suddenly the model can see more than one character of context. That is the leap from bigram to language model. — Day 8 Closing Principle
Day 8 Notebook — Training Loops, Loss Functions & Splits (runnable Python)

Neural bigram with cross-entropy, L2 regularization, train/dev/test splits, and comparison to count-based model.