PHASE 1 Foundations · Day 8 of 80 · Neural Networks & Backprop
Training Loops, Loss Functions & Evaluation Splits
The neural bigram model: replace counting with gradient descent. Split data into train/dev/test. Learn proper model evaluation.
You cannot evaluate a fund’s returns using the same data that chose the strategy. In-sample performance
is not out-of-sample truth. The discipline of train/dev/test splits is the machine learning equivalent of
this principle: you must evaluate on data the model has never seen.
— Day 8 Principle, adapted from the Marks framework
I. From Counting to Gradient Descent
Yesterday’s bigram model counted character pairs and normalized. Today we replace counting with a
neural network: a 27×27 weight matrix W, trained via gradient descent, that learns
the same probabilities. The result is numerically identical — but the method generalizes to any architecture.
Exhibit A — Neural Bigram: One-Hot → W → Softmax → Loss
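The pipeline in Exhibit A rests on one small fact worth verifying yourself: multiplying a one-hot row into W just selects row `ix` of W, so the "neural" forward pass is a differentiable table lookup. A minimal check (the index 5 is arbitrary):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
W = torch.randn((27, 27))
ix = 5

# One-hot encode the index, multiply into W
xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()

# The matmul result is exactly row ix of W
print(torch.allclose(xenc @ W, W[ix]))  # True
```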
II. The Code — Neural Bigram with Gradient Descent
import torch
import torch.nn.functional as F
# Prepare dataset — assumes `words` (list of names) and `stoi`
# (char → index mapping) from the count-based bigram model
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(stoi[ch1])
        ys.append(stoi[ch2])
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()

# Initialize weight matrix
W = torch.randn((27, 27), requires_grad=True)

# Training loop
for k in range(100):
    # Forward pass
    xenc = F.one_hot(xs, num_classes=27).float()  # [N, 27]
    logits = xenc @ W                             # [N, 27]
    counts = logits.exp()                         # softmax numerator
    probs = counts / counts.sum(1, keepdim=True)  # softmax normalization

    # Loss: negative log-likelihood + regularization
    loss = -probs[torch.arange(num), ys].log().mean()
    loss += 0.01 * (W**2).mean()                  # L2 regularization

    # Backward + update
    W.grad = None
    loss.backward()
    W.data += -50 * W.grad                        # larger lr for a 27x27 matrix

print(f"final loss: {loss.item():.4f}")  # ~2.46, matching counting approach
The neural bigram trained with gradient descent should converge to roughly the same NLL (∼2.45) as the counting approach from Day 6. If it doesn't, you have a bug. This equivalence is the proof that the neural method is working correctly: it has learned the same probability distribution through optimization rather than counting.
III. Train / Dev / Test Splits — The Discipline of Evaluation
With the model converging, the next discipline is evaluation. Shuffle the data, carve out an 80/10/10 train/dev/test split, fit only on the training set, and report loss on dev; the held-out test set is touched once, at the very end.
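An 80/10/10 split can be sketched as below. The `xs`/`ys` tensors here are random stand-ins for the dataset built earlier; the shuffling-then-slicing pattern is the part that carries over:

```python
import torch

torch.manual_seed(42)
xs = torch.randint(0, 27, (1000,))  # stand-in for the real xs
ys = torch.randint(0, 27, (1000,))  # stand-in for the real ys

n = xs.nelement()
perm = torch.randperm(n)   # shuffle indices before splitting
n1 = int(0.8 * n)          # end of train
n2 = int(0.9 * n)          # end of dev

Xtr, Ytr = xs[perm[:n1]], ys[perm[:n1]]
Xdev, Ydev = xs[perm[n1:n2]], ys[perm[n1:n2]]
Xte, Yte = xs[perm[n2:]], ys[perm[n2:]]

print(len(Xtr), len(Xdev), len(Xte))  # 800 100 100
```

Compute the training loss on `Xtr`/`Ytr` only; evaluate on `Xdev`/`Ydev` with `torch.no_grad()`. For a bigram model train and dev losses should be close, since a 27×27 matrix has little capacity to overfit.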
IV. The Matrix — What Matters Today
A 2×2 of effort (Quick to Do vs. Slow but Worth It) against payoff (Builds Deep Intuition vs. Surface-Level Only):
Quick to Do
🎯 DO FIRST — Implement the neural bigram training loop. Verify the final NLL matches the counting approach (∼2.45).
⏭️ DO IF TIME — Implement train/dev/test splits. Train on train, evaluate on dev. Confirm they're close (for this simple model).
Slow but Worth It
🖐 DO CAREFULLY — Add L2 regularization (0.01 * (W**2).mean()). Understand how it pushes weights toward uniform probabilities when the data is sparse.
🚫 AVOID TODAY — Using nn.CrossEntropyLoss or any PyTorch abstractions. Implement softmax and NLL manually to understand what they do.
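Once your manual softmax + NLL is written, it is worth a one-off sanity check that it agrees with the abstraction you are avoiding. A small sketch with random logits (the built-in is used here only for verification, not for training):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 27)
targets = torch.randint(0, 27, (5,))

# Manual softmax + NLL, exactly as in today's training loop
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
manual = -probs[torch.arange(5), targets].log().mean()

# PyTorch's fused log-softmax + NLL, for comparison only
builtin = F.cross_entropy(logits, targets)

print(torch.allclose(manual, builtin))  # True
```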
V. Today’s Deliverables
Neural bigram: One-hot → W → softmax → NLL loss → backward → update
Loss convergence: Train for 100+ steps, verify NLL ≈ 2.45
Data splits: Implement 80/10/10 train/dev/test split
Evaluation: Compute loss separately on train and dev sets
Regularization: Add L2 penalty and observe its smoothing effect
Sampling: Generate names from the trained neural model, compare quality with Day 6
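The sampling deliverable can be sketched as follows. The alphabet, `stoi`/`itos` mappings, and the random W are stand-ins (a trained W would produce name-like output); the loop structure is the point: start at `'.'`, softmax the current character's row, sample the next index, stop when `'.'` recurs.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: '.' at index 0, then 'a'..'z'
chars = ['.'] + [chr(i) for i in range(ord('a'), ord('z') + 1)]
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for c, i in stoi.items()}

torch.manual_seed(1)
W = torch.randn((27, 27))  # stand-in for the trained matrix

def sample_name(W):
    out, ix = [], 0                      # start at '.' (index 0)
    while True:
        logits = W[ix]                   # one-hot @ W is just row ix
        p = F.softmax(logits, dim=0)     # next-character distribution
        ix = torch.multinomial(p, num_samples=1).item()
        if ix == 0:                      # '.' ends the name
            return ''.join(out)
        out.append(itos[ix])

print(sample_name(W))
```

Run the same loop on the trained W and on the Day 6 count-based probabilities; since both encode the same distribution, the samples should be comparable in quality.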
The neural bigram produces the same result as counting — but the framework is what matters.
This framework scales. Tomorrow you replace the single weight matrix with an MLP, add embeddings, and
suddenly the model can see more than one character of context. That is the leap from bigram to language model.
— Day 8 Closing Principle
Day 8 Notebook — Training Loops, Loss Functions & Splits
Runnable Python
Neural bigram with cross-entropy, L2 regularization, train/dev/test splits, and comparison to count-based model.