Understanding a system layer by layer is like reading a company’s balance sheet line by line. No shortcuts, no abstractions. Pure comprehension.
— Day 25 Principle
I. GPT-2 Architecture
GPT-2 Small: 12 layers, 12 attention heads, d_model=768, FFN inner dimension=3072, vocab=50257, context length=1024; 124M parameters (with weight tying). Raschka builds every component from scratch in PyTorch.
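The hyperparameters above map directly onto the config dict listed in the deliverables. A sketch using the key names the GPTModel code reads from cfg; the drop_rate and qkv_bias values are assumed defaults from Raschka's GPT_CONFIG_124M, not stated above:

```python
# GPT-2 Small hyperparameters as a config dict.
# Key names match the cfg["..."] lookups in GPTModel.
# drop_rate and qkv_bias are assumed book defaults.
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # BPE vocabulary size
    "context_length": 1024,  # maximum sequence length
    "emb_dim": 768,          # d_model
    "n_heads": 12,           # attention heads per layer
    "n_layers": 12,          # transformer blocks
    "drop_rate": 0.1,        # dropout probability
    "qkv_bias": False,       # no bias on the Q/K/V projections
}
```

Passing a dict instead of hard-coding dimensions makes it trivial to scale the same class up to GPT-2 Medium/Large/XL by swapping configs.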
import torch
import torch.nn as nn

# TransformerBlock and LayerNorm are the from-scratch modules built earlier.
class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        # in_idx: (batch, seq_len) token ids
        seq_len = in_idx.shape[1]
        x = self.tok_emb(in_idx) + self.pos_emb(
            torch.arange(seq_len, device=in_idx.device))
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        return self.out_head(x)  # (batch, seq_len, vocab_size) logits
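The parameter-count deliverable can be verified by arithmetic alone. A sketch of the count, assuming qkv_bias=False (no bias on the Q/K/V projections) but biases on the attention output projection and the FFN layers:

```python
V, L, d, ff, T = 50257, 12, 768, 3072, 1024  # vocab, layers, d_model, FFN inner, context

tok_emb = V * d                       # token embedding table
pos_emb = T * d                       # learned positional embeddings

attn = 3 * d * d + (d * d + d)        # Q/K/V (no bias) + output projection (with bias)
ffn = (d * ff + ff) + (ff * d + d)    # two linear layers with biases
norms = 2 * (2 * d)                   # two LayerNorms, each with scale + shift
block = attn + ffn + norms

final_norm = 2 * d
out_head = V * d                      # untied output projection, no bias

total = tok_emb + pos_emb + L * block + final_norm + out_head
print(f"untied: {total:,}")           # 163,009,536
print(f"tied:   {total - out_head:,}")  # 124,412,160 -> the quoted "124M"
```

Without weight tying the model has about 163M parameters; sharing the output head with the embedding table removes the V × d output matrix and yields the quoted 124M.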
Weight Tying
GPT-2 ties the output projection weights to the token embedding weights: out_head and tok_emb share a single (vocab_size × emb_dim) matrix. This removes roughly 38.6M parameters and enforces consistency between the input and output token representations.
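In PyTorch, tying is a one-line reassignment of the embedding's weight Parameter to the output head. A minimal sketch with toy dimensions (the real model uses 50257 × 768):

```python
import torch.nn as nn

vocab_size, emb_dim = 100, 16  # toy dimensions for illustration

tok_emb = nn.Embedding(vocab_size, emb_dim)
out_head = nn.Linear(emb_dim, vocab_size, bias=False)

# nn.Linear stores its weight as (out_features, in_features) = (vocab, emb),
# the same shape as the embedding table, so the Parameter can be shared directly.
out_head.weight = tok_emb.weight

assert out_head.weight is tok_emb.weight  # one tensor; gradients from both paths accumulate on it
```

Because the two modules now hold the same Parameter object, the shared matrix is counted once, and updates from the embedding lookup and the output projection both flow into it.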
V. Deliverables
- Full GPT-2 implementation
- Config dict for hyperparameters
- Weight tying
- Parameter count verification
- Forward pass trace
- Load OpenAI weights
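The forward-pass trace deliverable amounts to following tensor shapes through the model. A pure-arithmetic sketch, assuming a batch size of 2 and a sequence length of 4 for illustration:

```python
B, T = 2, 4          # assumed batch size and sequence length
d, V, L = 768, 50257, 12

trace = [("token ids", (B, T))]
trace.append(("tok_emb + pos_emb", (B, T, d)))    # embedding lookup + broadcast add
for i in range(L):
    trace.append((f"trf_block {i}", (B, T, d)))   # transformer blocks preserve shape
trace.append(("final_norm", (B, T, d)))
trace.append(("out_head logits", (B, T, V)))      # one logit per vocab token per position

for name, shape in trace:
    print(f"{name:20s} {shape}")
```

The key observation is that every transformer block maps (B, T, d) to (B, T, d); only the embedding entry and the output head change the trailing dimension.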
You now understand GPT-2 at every level. Tomorrow: pre-training setup.
— Day 25 Closing