The resolution at which you measure determines what you can see. Character-level is too fine; word-level too coarse. BPE finds the Goldilocks granularity automatically.
— Day 21 Principle
I. Why Tokenization Matters
Tokenization converts text into a sequence of integer ids. BPE (Byte Pair Encoding) starts from individual bytes and iteratively merges the most frequent adjacent pair into a new token. GPT-2 uses a vocabulary of 50,257 tokens; GPT-4's tokenizer uses roughly 100K.
```python
# BPE algorithm (simplified)
vocab = {idx: bytes([idx]) for idx in range(256)}  # base vocab: all 256 byte values
for i in range(num_merges):
    pair = most_frequent_pair(ids)   # most frequent adjacent pair in the corpus
    ids = merge(ids, pair, 256 + i)  # replace every occurrence with a new token id
    vocab[256 + i] = vocab[pair[0]] + vocab[pair[1]]  # record the new token's bytes
```
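The two helpers the loop relies on can be sketched as follows; `most_frequent_pair` and `merge` are the names assumed by the snippet above, and the sample string is illustrative:

```python
from collections import Counter

def most_frequent_pair(ids):
    # Count every adjacent pair and return the most common one
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    # Replace each non-overlapping occurrence of `pair` with `new_id`
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))
pair = most_frequent_pair(ids)       # (97, 97), i.e. the byte pair "aa"
ids = merge(ids, pair, 256)
```

One merge shrinks the sequence wherever the pair occurred; repeating this `num_merges` times builds the vocabulary bottom-up.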
Tokenization Is the First Bottleneck
Bad tokenization wastes context window on subword fragments. Good tokenization compresses common patterns into single tokens, giving the model more effective context.
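To make the compression claim concrete, here is a minimal sketch that learns a few merges on a repetitive sample string and measures raw bytes per BPE token; the `train_bpe` name and the sample text are illustrative, not from the source:

```python
from collections import Counter

def train_bpe(ids, num_merges):
    # Greedily learn merge rules, as in the loop above
    for new_id in range(256, 256 + num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

text = "the theory of the thing " * 10   # hypothetical sample text
raw = list(text.encode("utf-8"))
compressed = train_bpe(raw, 20)
ratio = len(raw) / len(compressed)       # > 1: common patterns became single tokens
```

On repetitive text the ratio climbs quickly, which is exactly the "more effective context" win described above.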
V. Deliverables
- BPE from scratch: Count, merge, build vocab
- Encode/decode: Text → ids → text roundtrip
- Compare: Character-level vs BPE compression ratio
- GPT-2 tokenizer: Load and compare with your implementation
- Edge cases: Unicode, numbers, whitespace handling
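The encode/decode roundtrip deliverable can be sketched as follows, assuming a `merges` dict of `(pair) -> new_id` recorded in creation order and the `vocab` dict built in the training loop; both function names are illustrative:

```python
def encode(text, merges):
    # Apply learned merges in the order they were created
    # (dicts preserve insertion order in Python 3.7+)
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

def decode(ids, vocab):
    # Concatenate each token's byte sequence, then decode UTF-8
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

# Tiny example: one merge rule for the byte pair "hi"
vocab = {idx: bytes([idx]) for idx in range(256)}
vocab[256] = vocab[104] + vocab[105]
merges = {(104, 105): 256}
assert decode(encode("hihi", merges), vocab) == "hihi"
```

Decoding at the byte level (with `errors="replace"` as a guard) is what makes the Unicode edge cases in the last deliverable tractable: any text is just bytes.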
The tokenizer is the model’s lens on language. Tomorrow: preparing data for LLM training.
— Day 21 Closing