The resolution at which you measure determines what you can see. Character-level is too fine; word-level too coarse. BPE finds the Goldilocks granularity automatically.
— Day 21 Principle
I. Why Tokenization Matters
Tokenization converts text into a sequence of integer ids. BPE (Byte Pair Encoding) starts from individual bytes and iteratively merges the most frequent adjacent pair into a new token. GPT-2 uses a vocabulary of 50,257 tokens; GPT-4's tokenizer uses roughly 100K.
```python
# BPE algorithm (simplified)
vocab = {idx: bytes([idx]) for idx in range(256)}  # base vocab: all 256 byte values
for i in range(num_merges):
    pair = most_frequent_pair(ids)   # most frequent adjacent pair in the corpus
    ids = merge(ids, pair, 256 + i)  # replace every occurrence with a new token id
    vocab[256 + i] = vocab[pair[0]] + vocab[pair[1]]  # record the new token's bytes
```
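The two helpers the loop relies on can be sketched as follows; `most_frequent_pair` and `merge` are the names assumed by the snippet above, and the sample string is illustrative:

```python
from collections import Counter

def most_frequent_pair(ids):
    # Count every adjacent pair and return the most common one
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    # Replace each non-overlapping occurrence of `pair` with `new_id`
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))
pair = most_frequent_pair(ids)       # (97, 97), i.e. the byte pair "aa"
ids = merge(ids, pair, 256)
```

One merge shrinks the sequence wherever the pair occurred; repeating this `num_merges` times builds the vocabulary bottom-up.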
Tokenization Is the First Bottleneck
Bad tokenization wastes context window on subword fragments. Good tokenization compresses common patterns into single tokens, giving the model more effective context.
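To make the compression claim concrete, here is a minimal sketch that learns a few merges on a repetitive sample string and measures raw bytes per BPE token; the `train_bpe` name and the sample text are illustrative, not from the source:

```python
from collections import Counter

def train_bpe(ids, num_merges):
    # Greedily learn merge rules, as in the loop above
    for new_id in range(256, 256 + num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

text = "the theory of the thing " * 10   # hypothetical sample text
raw = list(text.encode("utf-8"))
compressed = train_bpe(raw, 20)
ratio = len(raw) / len(compressed)       # > 1: common patterns became single tokens
```

On repetitive text the ratio climbs quickly, which is exactly the "more effective context" win described above.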
V. Deliverables
- BPE from scratch: Count, merge, build vocab
- Encode/decode: Text → ids → text roundtrip
- Compare: Character-level vs BPE compression ratio
- GPT-2 tokenizer: Load and compare with your implementation
- Edge cases: Unicode, numbers, whitespace handling
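The encode/decode roundtrip deliverable can be sketched as follows, assuming a `merges` dict of `(pair) -> new_id` recorded in creation order and the `vocab` dict built in the training loop; both function names are illustrative:

```python
def encode(text, merges):
    # Apply learned merges in the order they were created
    # (dicts preserve insertion order in Python 3.7+)
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

def decode(ids, vocab):
    # Concatenate each token's byte sequence, then decode UTF-8
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

# Tiny example: one merge rule for the byte pair "hi"
vocab = {idx: bytes([idx]) for idx in range(256)}
vocab[256] = vocab[104] + vocab[105]
merges = {(104, 105): 256}
assert decode(encode("hihi", merges), vocab) == "hihi"
```

Decoding at the byte level (with `errors="replace"` as a guard) is what makes the Unicode edge cases in the last deliverable tractable: any text is just bytes.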
The tokenizer is the model’s lens on language. Tomorrow: preparing data for LLM training.
— Day 21 Closing