PHASE 3 LLM Architecture · Day 21 of 80 · Raschka LLMs From Scratch

Tokenization — Byte Pair Encoding from Scratch

Build BPE tokenizer from scratch: count pairs, merge most frequent, build vocabulary.

The resolution at which you measure determines what you can see. Character-level is too fine; word-level too coarse. BPE finds the Goldilocks granularity automatically.
— Day 21 Principle

I. Why Tokenization Matters

Tokenization converts text to integer IDs. Byte-level BPE starts from the 256 raw byte values and iteratively merges the most frequent adjacent pair into a new token, growing the vocabulary one merge at a time. GPT-2's vocabulary has 50,257 tokens; GPT-4's tokenizer uses roughly 100K.

# BPE algorithm (simplified)
vocab = {idx: bytes([idx]) for idx in range(256)}    # base vocab: all 256 bytes
for i in range(num_merges):
    pair = most_frequent_pair(ids)                   # most common adjacent pair
    ids = merge(ids, pair, 256 + i)                  # replace it with a new token id
    vocab[256 + i] = vocab[pair[0]] + vocab[pair[1]] # new token = concatenated bytes
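The snippet above leaves `most_frequent_pair` and `merge` undefined. A minimal sketch of both, plus a tiny training run on a toy string (the helper names and the corpus are illustrative, not from the original):

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent (id, id) pair in the sequence."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Train three merges on a toy byte sequence.
ids = list("aaabdaaabac".encode("utf-8"))
vocab = {idx: bytes([idx]) for idx in range(256)}
num_merges = 3
for i in range(num_merges):
    pair = most_frequent_pair(ids)
    ids = merge(ids, pair, 256 + i)
    vocab[256 + i] = vocab[pair[0]] + vocab[pair[1]]
```

After training, each new id in `vocab` decodes back to the byte string it stands for, so `decode` is just concatenating `vocab[id]` over the sequence.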

Tokenization Is the First Bottleneck

Bad tokenization wastes context window on subword fragments. Good tokenization compresses common patterns into single tokens, giving the model more effective context.

V. Deliverables

The tokenizer is the model's lens on language. Tomorrow: preparing data for LLM training.
— Day 21 Closing