PHASE 3 LLM Architecture · Day 25 of 80 · Raschka LLMs From Scratch

The Full GPT-2 Architecture — Layer by Layer

Raschka’s complete GPT-2 implementation: every layer, every parameter, every design decision.

Understanding a system layer by layer is like reading a company’s balance sheet line by line. No shortcuts, no abstractions. Pure comprehension. — Day 25 Principle

I. GPT-2 Architecture

GPT-2 Small: 12 transformer layers, 12 attention heads, d_model=768, FFN inner dimension 3072, vocabulary of 50,257 tokens, context length 1024. 124M parameters. Raschka builds every component from scratch in PyTorch.
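The 124M figure follows directly from the config above. A minimal sketch of the arithmetic (assuming, as in Raschka's code, no bias on the Q/K/V projections, biases elsewhere, and tied input/output embeddings; the function name is for illustration):

```python
def gpt2_param_count(vocab=50257, ctx=1024, d=768, n_layers=12,
                     tie_weights=True):
    emb = vocab * d + ctx * d                    # token + positional embeddings
    attn = 3 * d * d + (d * d + d)               # Q/K/V (no bias) + out proj (bias)
    ffn = (d * 4 * d + 4 * d) + (4 * d * d + d)  # d -> 4d -> d, both with bias
    norms = 2 * (2 * d)                          # two LayerNorms (scale + shift)
    blocks = n_layers * (attn + ffn + norms)
    final_norm = 2 * d
    head = 0 if tie_weights else vocab * d       # separate output projection
    return emb + blocks + final_norm + head

print(gpt2_param_count())                   # 124,412,160 with weight tying
print(gpt2_param_count(tie_weights=False))  # 163,009,536 with a separate head
```

The gap between the two results is exactly one 50,257 × 768 matrix, which is what weight tying (discussed below) saves.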

```python
import torch
import torch.nn as nn

# TransformerBlock and LayerNorm are the components built earlier in the chapter.

class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(
            torch.arange(seq_len, device=in_idx.device))
        x = self.drop_emb(tok_embeds + pos_embeds)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        return self.out_head(x)  # logits: (batch, seq_len, vocab_size)
```

Weight Tying

GPT-2 ties the output projection weights to the token embedding weights. This removes an entire vocab_size × emb_dim matrix (about 38.6M parameters for GPT-2 Small) and enforces consistency between the input and output token representations. Note that the GPTModel above keeps a separate out_head, which is why it totals roughly 163M parameters; tying it to tok_emb recovers the official 124M count.
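In PyTorch, tying amounts to sharing one Parameter between the two modules, which works because nn.Embedding and a bias-free nn.Linear both store a (vocab_size, emb_dim) matrix. A minimal sketch (standalone modules, not the full GPTModel):

```python
import torch.nn as nn

vocab_size, emb_dim = 50257, 768
tok_emb = nn.Embedding(vocab_size, emb_dim)
out_head = nn.Linear(emb_dim, vocab_size, bias=False)

# Both weights have shape (vocab_size, emb_dim), so they can share storage.
out_head.weight = tok_emb.weight

# One tensor referenced twice: gradients from both uses accumulate into it.
assert out_head.weight.data_ptr() == tok_emb.weight.data_ptr()
print(f"Parameters saved: {vocab_size * emb_dim:,}")  # 38,597,376
```

Because the same tensor serves both roles, an optimizer step updates the embedding and the output projection together.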

V. Deliverables

You now understand GPT-2 at every level. Tomorrow: pre-training setup. — Day 25 Closing