PHASE 1 Foundations · Day 6 of 80 · Neural Networks & Backprop
Intro to Language Modeling — The Bigram Model
The simplest possible language model: predict the next character using only the current one. From counting pairs to generating names.
The best predictions come from understanding what has happened before. In markets, the simplest model is
“what did this asset do yesterday?” In language, it is “what letter usually follows this one?”
The bigram model is this idea made precise — a 27×27 lookup table that captures the statistical
structure of character-level English. Crude, but revelatory.
— Day 6 Principle, adapted from the Marks framework
I. What is Language Modeling?
A language model assigns probabilities to sequences. Given a prefix, it predicts what comes next.
The bigram model is the simplest version: P(next | current) — the probability of the next character
depends only on the current character. We train it on a dataset of names (Karpathy uses 32,000+
names from names.txt).
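Before building the full table, the core idea can be sketched in a few lines. This is a minimal illustration on a toy string (not the names dataset): estimate P(next | current) by counting adjacent character pairs.

```python
from collections import Counter

# Toy corpus: estimate P(next char | current char) by counting adjacent pairs.
text = "abracadabra"
pair_counts = Counter(zip(text, text[1:]))

# P('b' | 'a') = count('ab') / count of all pairs starting with 'a'
a_total = sum(c for (c1, _), c in pair_counts.items() if c1 == 'a')
p_b_given_a = pair_counts[('a', 'b')] / a_total
print(p_b_given_a)  # 2 of the 4 pairs starting with 'a' are 'ab' -> 0.5
```

The 27×27 matrix below is exactly this computation, done for every (current, next) pair at once.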
Exhibit A — Bigram Frequency Matrix (Sample: Characters a–e)
II. Building the Model — Count & Normalize
import torch

# Build bigram counts from names dataset
words = open('names.txt').read().splitlines()
chars = sorted(list(set(''.join(words))))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0
itos = {i: s for s, i in stoi.items()}

N = torch.zeros((27, 27), dtype=torch.int32)
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

# Normalize to probabilities (add-one smoothing)
P = (N + 1).float()
P = P / P.sum(dim=1, keepdim=True)
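A quick sanity check: every row of P should be a valid probability distribution. The sketch below reruns the same count-and-normalize pipeline on a tiny stand-in word list (since names.txt may not be available in every environment) and verifies the rows sum to 1.

```python
import torch

# Same pipeline as above, on a stand-in word list instead of names.txt,
# to confirm each row of P sums to 1 after normalization.
words = ["emma", "olivia", "ava"]
chars = sorted(set(''.join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0

N = torch.zeros((len(stoi), len(stoi)), dtype=torch.int32)
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

P = (N + 1).float()
P = P / P.sum(dim=1, keepdim=True)
print(torch.allclose(P.sum(dim=1), torch.ones(len(stoi))))  # True
```

The `keepdim=True` matters: it keeps the row sums shaped `(27, 1)` so broadcasting divides each row by its own total, not column by column.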
III. Sampling & Evaluating — Generate Names, Measure Quality
# Generate names by sampling from the bigram distribution
for _ in range(5):
    out = []
    ix = 0  # start token
    while True:
        p = P[ix]
        ix = torch.multinomial(p, num_samples=1).item()
        if ix == 0:
            break
        out.append(itos[ix])
    print(''.join(out))

# Evaluate with negative log-likelihood
log_likelihood = 0.0
n = 0
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        prob = P[stoi[ch1], stoi[ch2]]
        log_likelihood += torch.log(prob)
        n += 1

nll = -log_likelihood / n
print(f"avg NLL: {nll:.4f}")  # ~2.45
Why Negative Log-Likelihood?
NLL converts probabilities into a loss: high probability → low loss, low probability → high loss. A perfect model (always assigns probability 1.0 to the correct next char) has NLL = 0. A uniform random model (1/27 chance) has NLL = log(27) ≈ 3.30. Our bigram achieves ∼2.45 — significantly better than random, but far from perfect. This is the baseline every future model must beat.
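The two bounds quoted above can be checked in one line. A perfect model has NLL = -log(1.0) = 0; a uniform model assigns p = 1/27 to every next character, so its average NLL is -log(1/27) = log(27):

```python
import math

# Bounds on average NLL for a 27-character vocabulary:
perfect_nll = -math.log(1.0)       # perfect model: 0
uniform_nll = -math.log(1 / 27)    # uniform random model: log(27)
print(perfect_nll, round(uniform_nll, 2))  # 0.0 3.3
```

Any model scoring between 0 and 3.30 has learned something about the data; the bigram's ~2.45 sits comfortably inside that range.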
IV. The Matrix — What Matters Today
(Axes: Builds Deep Intuition vs. Surface-Level Only × Quick to Do vs. Slow but Worth It)

🎯 DO FIRST (Builds Deep Intuition · Quick to Do)
Build the 27×27 count matrix. Normalize to probabilities. Sample 10 names. Compute NLL.

⏭️ DO IF TIME (Surface-Level Only · Quick to Do)
Visualize the count matrix with plt.imshow(). The patterns reveal English phonotactics — which letters follow which.

🖐 DO CAREFULLY (Builds Deep Intuition · Slow but Worth It)
Understand smoothing: why adding 1 to all counts prevents log(0) and what it implies about rare character pairs.

🚫 AVOID TODAY (Surface-Level Only · Slow but Worth It)
Using neural networks for language modeling. Today is purely counting-based. The neural version comes on Day 9.
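The smoothing point in "DO CAREFULLY" is worth seeing concretely. A minimal sketch with hypothetical counts: an unseen bigram has count 0, so the raw estimate assigns it probability 0 and its log-likelihood term is undefined; add-one smoothing keeps it small but finite.

```python
import math

# Hypothetical counts for three possible next characters; the first pair
# was never seen in training.
counts = [0, 5, 3]
raw_p = counts[0] / sum(counts)         # 0.0 -- math.log(0.0) would raise an error
smoothed = [c + 1 for c in counts]
smooth_p = smoothed[0] / sum(smoothed)  # 1/11: small (the pair is rare) but nonzero
print(math.log(smooth_p))               # finite log-likelihood contribution
```

The implied assumption is that every character pair is possible, just unobserved — a reasonable prior for names, where the training set cannot cover all 729 pairs.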
V. Today’s Deliverables
Character mapping: Build stoi and itos for 27 characters (a–z plus the '.' boundary token)
Count matrix: Populate a 27×27 tensor from the names dataset
Probability matrix: Row-normalize with +1 smoothing
Sampling: Generate 10 names by sampling character by character
Evaluation: Compute average NLL over the full dataset (∼2.45)
Baseline: Record the NLL — every future model must beat this number
The bigram model is the “index fund” of language modeling: simple, well-understood, and surprisingly hard
to beat with naive approaches. It sets a floor. Everything from here — MLPs, RNNs, Transformers — is an
attempt to capture longer-range dependencies that bigrams miss. Tomorrow: tensors & broadcasting, the tools you need.
— Day 6 Closing Principle
Day 6 Notebook — The Bigram Language Model (Runnable Python)
Character-level bigram model: count matrix, probability normalization, name generation, and NLL evaluation.