Residual connections create redundancy in information flow, like diversified revenue streams. LayerNorm stabilizes activations. Together they make any depth trainable. — Day 18 Principle
I. The Block
Pre-norm Transformer block: LayerNorm → Attention → Residual, LayerNorm → FFN → Residual.
class Block(nn.Module):
    """Transformer block: communication (attention) then computation (FFN)."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.sa = MultiHeadAttention(n_head, n_embd // n_head)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # Pre-norm: normalize the input to each sub-layer, then add it back.
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd), nn.Dropout(0.2))

    def forward(self, x):
        return self.net(x)
Why 4x Expansion in FFN?
The FFN expands the embedding dimension by 4x, applies a nonlinearity, then contracts back. This wider intermediate space gives each position room for nonlinear computation, applied independently at every position. GPT-2 (768-dim embeddings) uses a 3072-dim inner layer.
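A quick parameter count makes the 4x expansion concrete. A sketch using GPT-2's 768/3072 dims from above; including bias terms is an assumption:

```python
n_embd = 768          # GPT-2 embedding dim
inner = 4 * n_embd    # the wider "thinking space"

# Two linear layers, expand then contract, each with a bias vector
params_up = n_embd * inner + inner      # 768*3072 + 3072
params_down = inner * n_embd + n_embd   # 3072*768 + 768

print(inner)                    # 3072
print(params_up + params_down)  # 4722432 parameters per FFN
```

At ~4.7M parameters per FFN, the feed-forward layers dominate the parameter budget of each block.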
IV. The Matrix

| | Deep Intuition | Surface Only |
| --- | --- | --- |
| **Quick** | 🎯 DO FIRST: Implement Block. Stack 4 blocks. Train. | ⏭ IF TIME: Add Dropout to attention and FFN. |
| **Slow** | 🖐 CAREFULLY: Remove residuals. Observe training failure. | 🚫 AVOID: Flash attention or KV caching. |
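The "remove residuals" ablation can be previewed analytically before training anything: in a deep linear stack, the end-to-end Jacobian is the product of per-layer Jacobians, and the identity term contributed by each residual is what keeps that product from collapsing. A pure-NumPy sketch; the depth, width, and per-layer weight scale are illustrative choices, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, d = 20, 64
# Random layer weights scaled so each has spectral norm roughly 0.4
Ws = [rng.normal(0, 0.2 / np.sqrt(d), (d, d)) for _ in range(depth)]

def jacobian_norm(use_residual):
    # End-to-end Jacobian of a deep linear stack:
    # without residuals each layer contributes W; with residuals, I + W.
    J = np.eye(d)
    for W in Ws:
        layer = W + (np.eye(d) if use_residual else 0)
        J = layer @ J
    return np.linalg.norm(J)

print(jacobian_norm(False))  # tiny: gradients vanish through 20 layers
print(jacobian_norm(True))   # O(1): the identity path preserves signal
```

Without residuals the product of 20 sub-unit-gain layers shrinks gradients toward zero; the residual's identity path gives gradients a multiplication-free route to every layer, which is why removing residuals makes deep training fail.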
V. Today’s Deliverables
- Block with pre-norm architecture
- FeedForward with 4x expansion
- LayerNorm understanding
- Residual connections
- Stack 4 blocks and train
- Ablation: remove components
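Putting the deliverables together: a runnable sketch of "stack 4 blocks and train." The MultiHeadAttention here is an assumed minimal causal implementation (the real class comes from an earlier day and may differ); the toy regression objective, block_size, seed, and hyperparameters are illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    # Assumed minimal causal self-attention stand-in.
    def __init__(self, n_head, head_size, block_size=32):
        super().__init__()
        n_embd = n_head * head_size
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)
        self.proj = nn.Linear(n_embd, n_embd)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_head, T, head_size)
        q, k, v = (t.view(B, T, self.n_head, -1).transpose(1, 2) for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) * k.size(-1) ** -0.5
        att = att.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # causal mask
        att = F.softmax(att, dim=-1)
        out = (att @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(out)

class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd), nn.Dropout(0.2))

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.sa = MultiHeadAttention(n_head, n_embd // n_head)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

torch.manual_seed(0)
n_embd, n_head = 64, 4
model = nn.Sequential(*[Block(n_embd, n_head) for _ in range(4)])  # stack 4 blocks

# A few steps on a toy regression target, just to confirm gradients flow.
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
x = torch.randn(2, 8, n_embd)
target = torch.randn(2, 8, n_embd)
for step in range(5):
    loss = F.mse_loss(model(x), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

model.eval()  # disable dropout for the final check
print(model(x).shape)  # torch.Size([2, 8, 64]) -- shape preserved through the stack
```

Because every block maps (B, T, n_embd) to (B, T, n_embd), stacking is just `nn.Sequential`; that shape invariance, guaranteed by the residual additions, is what makes the block a composable unit.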
The Transformer block is the Lego brick of modern AI. Tomorrow: the full GPT. — Day 18 Closing