PHASE 1 Foundations · Day 6 of 80 · Neural Networks & Backprop
Intro to Language Modeling — The Bigram Model
The simplest possible language model: predict the next character using only the current one. From counting pairs to generating names.
The best predictions come from understanding what has happened before. In markets, the simplest model is
“what did this asset do yesterday?” In language, it is “what letter usually follows this one?”
The bigram model is this idea made precise — a 27×27 lookup table that captures the statistical
structure of character-level English. Crude, but revelatory.
— Day 6 Principle, adapted from the Marks framework
I. What is Language Modeling?
A language model assigns probabilities to sequences. Given a prefix, it predicts what comes next.
The bigram model is the simplest version: P(next | current) — the probability of the next character
depends only on the current character. We train it on a dataset of names (Karpathy uses 32,000+
names from names.txt).
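Before building the full table, the core idea can be sketched in a few lines. This is a minimal illustration on a toy string (not the names dataset): estimate P(next | current) by counting adjacent character pairs.

```python
from collections import Counter

# Toy corpus: estimate P(next char | current char) by counting adjacent pairs.
text = "abracadabra"
pair_counts = Counter(zip(text, text[1:]))

# P('b' | 'a') = count('ab') / count of all pairs starting with 'a'
a_total = sum(c for (c1, _), c in pair_counts.items() if c1 == 'a')
p_b_given_a = pair_counts[('a', 'b')] / a_total
print(p_b_given_a)  # 2 of the 4 pairs starting with 'a' are 'ab' -> 0.5
```

The 27×27 matrix below is exactly this computation, done for every (current, next) pair at once.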
Exhibit A — Bigram Frequency Matrix (Sample: Characters a–e)
II. Building the Model — Count & Normalize
import torch

# Build bigram counts from names dataset
words = open('names.txt').read().splitlines()
chars = sorted(list(set(''.join(words))))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0
itos = {i: s for s, i in stoi.items()}

N = torch.zeros((27, 27), dtype=torch.int32)
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

# Normalize to probabilities (add-one smoothing)
P = (N + 1).float()
P = P / P.sum(dim=1, keepdim=True)
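A quick sanity check: every row of P should be a valid probability distribution. The sketch below reruns the same count-and-normalize pipeline on a tiny stand-in word list (since names.txt may not be available in every environment) and verifies the rows sum to 1.

```python
import torch

# Same pipeline as above, on a stand-in word list instead of names.txt,
# to confirm each row of P sums to 1 after normalization.
words = ["emma", "olivia", "ava"]
chars = sorted(set(''.join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0

N = torch.zeros((len(stoi), len(stoi)), dtype=torch.int32)
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

P = (N + 1).float()
P = P / P.sum(dim=1, keepdim=True)
print(torch.allclose(P.sum(dim=1), torch.ones(len(stoi))))  # True
```

The `keepdim=True` matters: it keeps the row sums shaped `(27, 1)` so broadcasting divides each row by its own total, not column by column.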
III. Sampling & Evaluating — Generate Names, Measure Quality
# Generate names by sampling from the bigram distribution
for _ in range(5):
    out = []
    ix = 0  # start token
    while True:
        p = P[ix]
        ix = torch.multinomial(p, num_samples=1).item()
        if ix == 0:
            break
        out.append(itos[ix])
    print(''.join(out))

# Evaluate with negative log-likelihood
log_likelihood = 0.0
n = 0
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        prob = P[stoi[ch1], stoi[ch2]]
        log_likelihood += torch.log(prob)
        n += 1

nll = -log_likelihood / n
print(f"avg NLL: {nll:.4f}")  # ~2.45
Why Negative Log-Likelihood?
NLL converts probabilities into a loss: high probability → low loss, low probability → high loss. A perfect model (always assigns probability 1.0 to the correct next char) has NLL = 0. A uniform random model (1/27 chance) has NLL = log(27) ≈ 3.30. Our bigram achieves ∼2.45 — significantly better than random, but far from perfect. This is the baseline every future model must beat.
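The two bounds quoted above can be checked in one line. A perfect model has NLL = -log(1.0) = 0; a uniform model assigns p = 1/27 to every next character, so its average NLL is -log(1/27) = log(27):

```python
import math

# Bounds on average NLL for a 27-character vocabulary:
perfect_nll = -math.log(1.0)       # perfect model: 0
uniform_nll = -math.log(1 / 27)    # uniform random model: log(27)
print(perfect_nll, round(uniform_nll, 2))  # 0.0 3.3
```

Any model scoring between 0 and 3.30 has learned something about the data; the bigram's ~2.45 sits comfortably inside that range.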
IV. The Matrix — What Matters Today
(Axes: Builds Deep Intuition vs. Surface-Level Only × Quick to Do vs. Slow but Worth It)

🎯 DO FIRST (Builds Deep Intuition · Quick to Do)
Build the 27×27 count matrix. Normalize to probabilities. Sample 10 names. Compute NLL.

⏭️ DO IF TIME (Surface-Level Only · Quick to Do)
Visualize the count matrix with plt.imshow(). The patterns reveal English phonotactics — which letters follow which.

🖐 DO CAREFULLY (Builds Deep Intuition · Slow but Worth It)
Understand smoothing: why adding 1 to all counts prevents log(0) and what it implies about rare character pairs.

🚫 AVOID TODAY (Surface-Level Only · Slow but Worth It)
Using neural networks for language modeling. Today is purely counting-based. The neural version comes on Day 9.
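The smoothing point in "DO CAREFULLY" is worth seeing concretely. A minimal sketch with hypothetical counts: an unseen bigram has count 0, so the raw estimate assigns it probability 0 and its log-likelihood term is undefined; add-one smoothing keeps it small but finite.

```python
import math

# Hypothetical counts for three possible next characters; the first pair
# was never seen in training.
counts = [0, 5, 3]
raw_p = counts[0] / sum(counts)         # 0.0 -- math.log(0.0) would raise an error
smoothed = [c + 1 for c in counts]
smooth_p = smoothed[0] / sum(smoothed)  # 1/11: small (the pair is rare) but nonzero
print(math.log(smooth_p))               # finite log-likelihood contribution
```

The implied assumption is that every character pair is possible, just unobserved — a reasonable prior for names, where the training set cannot cover all 729 pairs.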
V. Today’s Deliverables
Character mapping: Build stoi and itos for 27 characters (a–z plus the '.' boundary token)
Count matrix: Populate a 27×27 tensor from the names dataset
Probability matrix: Row-normalize with +1 smoothing
Sampling: Generate 10 names by sampling character by character
Evaluation: Compute average NLL over the full dataset (∼2.45)
Baseline: Record the NLL — every future model must beat this number
The bigram model is the “index fund” of language modeling: simple, well-understood, and surprisingly hard
to beat with naive approaches. It sets a floor. Everything from here — MLPs, RNNs, Transformers — is an
attempt to capture longer-range dependencies that bigrams miss. Tomorrow: tensors & broadcasting, the tools you need.
— Day 6 Closing Principle
Day 6 Notebook — The Bigram Language Model (Runnable Python)
Character-level bigram model: count matrix, probability normalization, name generation, and NLL evaluation.