PHASE 1 Foundations · Day 6 of 80 · Neural Networks & Backprop

Intro to Language Modeling — The Bigram Model

The simplest possible language model: predict the next character using only the current one. From counting pairs to generating names.

The best predictions come from understanding what has happened before. In markets, the simplest model is “what did this asset do yesterday?” In language, it is “what letter usually follows this one?” The bigram model is this idea made precise — a 27×27 lookup table that captures the statistical structure of character-level English. Crude, but revelatory. — Day 6 Principle, adapted from the Marks framework

I. What is Language Modeling?

A language model assigns probabilities to sequences. Given a prefix, it predicts what comes next. The bigram model is the simplest version: P(next | current) — the probability of the next character depends only on the current character. We train it on a dataset of names (Karpathy uses 32,000+ names from names.txt).
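Concretely, the bigram assumption factors a whole name into a product of per-character conditionals: P(".emma.") = P(e|.)·P(m|e)·P(m|m)·P(a|m)·P(.|a). A minimal sketch of that chain rule — the `probs` table below is hypothetical toy values, not counts from names.txt:

```python
# Toy bigram probabilities (hypothetical, NOT from names.txt)
probs = {
    ('.', 'e'): 0.05, ('e', 'm'): 0.03, ('m', 'm'): 0.02,
    ('m', 'a'): 0.39, ('a', '.'): 0.20,
}

def word_prob(word, probs):
    """Score a word as the product of its bigram probabilities."""
    chs = ['.'] + list(word) + ['.']   # wrap with start/end token
    p = 1.0
    for c1, c2 in zip(chs, chs[1:]):
        p *= probs[(c1, c2)]
    return p

print(word_prob('emma', probs))
```

Because probabilities multiply, longer names get vanishingly small scores — which is exactly why Section III evaluates in log space instead.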

Exhibit A — Bigram Frequency Matrix (Sample: Characters a–e)
current \ next |   .  |   a  |   b  |   c  |   d
.              |   0  | 4410 | 1306 | 1542 | 1690
a              |  556 |  556 |  541 |  470 | 1407
b              |  114 | 2093 |  168 |   0  |   0
Darker green = higher count. Row normalization → probability distribution.

II. Building the Model — Count & Normalize

import torch

# Build bigram counts from the names dataset
words = open('names.txt').read().splitlines()
chars = sorted(list(set(''.join(words))))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0                    # '.' marks both start and end of a name
itos = {i: s for s, i in stoi.items()}

N = torch.zeros((27, 27), dtype=torch.int32)
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

# Normalize each row to probabilities (add-1 smoothing avoids zeros)
P = (N + 1).float()
P = P / P.sum(dim=1, keepdim=True)
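One sanity check worth running after the normalization step: every row of P should be a proper probability distribution, and smoothing should have eliminated all zeros. A standalone sketch — a random count matrix stands in for the real one so the snippet runs without names.txt:

```python
import torch

torch.manual_seed(0)
N = torch.randint(0, 100, (27, 27))       # stand-in for the real bigram counts
P = (N + 1).float()                        # add-1 smoothing: no zero entries
P = P / P.sum(dim=1, keepdim=True)         # normalize each row

assert torch.allclose(P.sum(dim=1), torch.ones(27))  # every row sums to 1
assert (P > 0).all()                                 # smoothing removed zeros
print("rows are valid probability distributions")
```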

III. Sampling & Evaluating — Generate Names, Measure Quality

# Generate names by sampling from the bigram distribution
for _ in range(5):
    out = []
    ix = 0                       # start token '.'
    while True:
        p = P[ix]
        ix = torch.multinomial(p, num_samples=1).item()
        if ix == 0:              # end token '.' terminates the name
            break
        out.append(itos[ix])
    print(''.join(out))

# Evaluate with average negative log-likelihood
log_likelihood = 0.0
n = 0
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        prob = P[stoi[ch1], stoi[ch2]]
        log_likelihood += torch.log(prob)
        n += 1
nll = -log_likelihood / n
print(f"avg NLL: {nll:.4f}")     # ~2.45
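For what it's worth, the evaluation loop can also be written as one vectorized expression by gathering all (current, next) index pairs first and indexing P once. A self-contained sketch, using a three-name stand-in for names.txt:

```python
import torch

words = ['emma', 'olivia', 'ava']   # tiny stand-in for the real names.txt
chars = sorted(set(''.join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0
V = len(stoi)

# Count matrix + smoothing + normalization, as in Section II
N = torch.zeros((V, V), dtype=torch.int32)
for w in words:
    chs = ['.'] + list(w) + ['.']
    for c1, c2 in zip(chs, chs[1:]):
        N[stoi[c1], stoi[c2]] += 1
P = (N + 1).float()
P = P / P.sum(dim=1, keepdim=True)

# Gather all (current, next) index pairs, then index P in one shot
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for c1, c2 in zip(chs, chs[1:]):
        xs.append(stoi[c1])
        ys.append(stoi[c2])
xs, ys = torch.tensor(xs), torch.tensor(ys)

nll = -P[xs, ys].log().mean()    # same average NLL as the explicit loop
print(f"avg NLL: {nll.item():.4f}")
```

The `xs`/`ys` index tensors are exactly the (input, target) pairs the neural version will train on later, so this refactor pays off beyond speed.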

Why Negative Log-Likelihood?

NLL converts probabilities into a loss: high probability → low loss, low probability → high loss. A perfect model (always assigns probability 1.0 to the correct next char) has NLL = 0. A uniform random model (1/27 chance) has NLL = log(27) ≈ 3.30. Our bigram achieves ∼2.45 — significantly better than random, but far from perfect. This is the baseline every future model must beat.
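The two reference points in the paragraph above are easy to verify directly:

```python
import math

# Uniform random model: every next character gets probability 1/27
uniform_nll = -math.log(1 / 27)
print(f"{uniform_nll:.4f}")      # ≈ 3.2958

# Perfect model: probability 1.0 on the correct character, every time
perfect_nll = -math.log(1.0)
assert perfect_nll == 0.0
assert uniform_nll > 2.45        # the bigram model beats the uniform baseline
```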

IV. The Matrix — What Matters Today

Matrix axes: Quick to Do ↔ Slow but Worth It · Builds Deep Intuition ↔ Surface-Level Only

🎯

DO FIRST

Build the 27×27 count matrix. Normalize to probabilities. Sample 10 names. Compute NLL.

⏭️

DO IF TIME

Visualize the count matrix with plt.imshow(). The patterns reveal English phonotactics — which letters follow which.

🖐

DO CAREFULLY

Understand smoothing: why adding 1 to all counts prevents log(0) and what it implies about rare character pairs.

🚫

AVOID TODAY

Using neural networks for language modeling. Today is purely counting-based. The neural version comes on Day 9.
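For the DO IF TIME item, a minimal imshow sketch. Random counts stand in for the real matrix so it runs without names.txt; the annotated cell layout follows Karpathy's plot, and the Agg backend is set so it renders headless (drop that line to display interactively):

```python
import matplotlib
matplotlib.use('Agg')            # off-screen rendering; remove for interactive use
import matplotlib.pyplot as plt
import torch

torch.manual_seed(0)
N = torch.randint(0, 5000, (27, 27))   # stand-in for the real bigram counts
itos = {i: s for i, s in enumerate('.abcdefghijklmnopqrstuvwxyz')}

plt.figure(figsize=(16, 16))
plt.imshow(N, cmap='Blues')
for i in range(27):
    for j in range(27):
        # annotate each cell with its bigram pair and its count
        plt.text(j, i, itos[i] + itos[j], ha='center', va='bottom', color='gray')
        plt.text(j, i, str(N[i, j].item()), ha='center', va='top', color='gray')
plt.axis('off')
plt.savefig('bigram_counts.png')
```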
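For the DO CAREFULLY item, the effect of add-1 smoothing is easiest to see on a single row containing an unseen pair. A sketch with a toy three-symbol row of hypothetical counts:

```python
import torch

counts = torch.tensor([0, 9, 1], dtype=torch.float32)  # first pair never observed

raw = counts / counts.sum()                # unsmoothed: P = 0.0 for the unseen pair
smoothed = (counts + 1) / (counts + 1).sum()

print(raw)        # tensor([0.0000, 0.9000, 0.1000])
print(smoothed)   # tensor([0.0769, 0.7692, 0.1538])

# log(0) = -inf would blow up the NLL; smoothing keeps every log finite
assert torch.isinf(torch.log(raw)).any()
assert torch.isfinite(torch.log(smoothed)).all()
```

Note the trade-off: smoothing shaves probability mass off the common pairs and redistributes it to pairs never seen in training, so rare transitions are never declared impossible.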

V. Today’s Deliverables

The bigram model is the “index fund” of language modeling: simple, well-understood, and surprisingly hard to beat with naive approaches. It sets a floor. Everything from here — MLPs, RNNs, Transformers — is an attempt to capture longer-range dependencies that bigrams miss. Tomorrow: tensors & broadcasting, the tools you need. — Day 6 Closing Principle
Day 6 Notebook — The Bigram Language Model (runnable Python)

Character-level bigram model: count matrix, probability normalization, name generation, and NLL evaluation.