PHASE 2 Deep Networks · Day 18 of 80 · makemore & GPT

The Transformer Block — LayerNorm, Residuals, FFN

Assemble the complete block: attention + FFN + residuals + LayerNorm. The building block of GPT.

Residual connections create redundancy in information flow, like diversified revenue streams. LayerNorm stabilizes activation statistics. Together they make networks of any depth trainable. — Day 18 Principle

I. The Block

Pre-norm Transformer block: LayerNorm → Attention → Residual, LayerNorm → FFN → Residual.

import torch.nn as nn

class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.sa = MultiHeadAttention(n_head, n_embd // n_head)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # pre-norm: normalize, attend, add back
        x = x + self.ffwd(self.ln2(x))  # pre-norm: normalize, FFN, add back
        return x

class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(0.2),
        )

    def forward(self, x):
        return self.net(x)
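A quick shape check confirms the key property of the block: it maps (B, T, C) to (B, T, C), so blocks stack freely. This sketch substitutes PyTorch's built-in nn.MultiheadAttention for the course's own MultiHeadAttention (an assumption, and it omits the causal mask) purely to make the test self-contained:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    # Pre-norm block; nn.MultiheadAttention stands in for the custom
    # MultiHeadAttention built earlier in the course (no causal mask here).
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ffwd(self.ln2(x))
        return x

x = torch.randn(2, 8, 64)                              # (batch, time, n_embd)
blocks = nn.Sequential(*[Block(64, 4) for _ in range(4)])
print(blocks(x).shape)                                 # torch.Size([2, 8, 64])
```

Because input and output shapes match, "stack 4 blocks" is literally nn.Sequential of four copies.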

Why 4x Expansion in FFN?

The FFN expands the representation by 4x, applies a nonlinearity, then contracts back. This wider intermediate space gives each position room for nonlinear computation. GPT-2 small (n_embd = 768) uses a 3072-dim inner layer.
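The parameter cost of that expansion is easy to count by hand. For the GPT-2 small numbers above (768-dim embeddings, 4x inner width):

```python
n_embd = 768
inner = 4 * n_embd                 # 3072: the "wider thinking space"
w_in = n_embd * inner + inner      # Linear(768, 3072): weights + biases
w_out = inner * n_embd + n_embd    # Linear(3072, 768): weights + biases
print(inner, w_in + w_out)         # 3072 4722432
```

So a single FFN holds about 4.7M parameters, typically the majority of each block's weights.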

IV. The Matrix

Deep Intuition vs. Surface Only

Quick 🎯
DO FIRST: Implement the Block. Stack 4 blocks. Train.
IF TIME: Add Dropout to attention and FFN.

Slow 🖐
CAREFULLY: Remove residuals. Observe the training failure.
🚫 AVOID: Flash attention or KV caching.
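The "remove residuals" exercise can be previewed in miniature: measure the gradient reaching the input through a deep stack of plain layers, with and without the skip connection. This is a sketch with a hypothetical helper name and toy tanh layers, not the full Transformer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def input_grad_norm(use_residual, depth=20, dim=64):
    # A deep stack of small nonlinear layers; hypothetical toy setup.
    layers = nn.ModuleList(
        [nn.Sequential(nn.Linear(dim, dim), nn.Tanh()) for _ in range(depth)]
    )
    x = torch.randn(8, dim, requires_grad=True)
    h = x
    for layer in layers:
        h = h + layer(h) if use_residual else layer(h)
    h.sum().backward()
    return x.grad.norm().item()

# The residual path carries gradient straight to the input via the identity;
# the plain stack's gradient shrinks multiplicatively with depth.
print(input_grad_norm(True), input_grad_norm(False))
```

The residual version keeps the input gradient at a healthy magnitude; the plain stack's collapses toward zero, which is exactly the training failure the exercise asks you to observe.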

V. Today’s Deliverables

The Transformer block is the Lego brick of modern AI. Tomorrow: the full GPT. — Day 18 Closing