Residual connections create redundancy in information flow, like diversified revenue streams. LayerNorm stabilizes activations. Together they make any depth trainable. — Day 18 Principle
I. The Block
Pre-norm Transformer block: LayerNorm → Attention → Residual, LayerNorm → FFN → Residual.
class Block(nn.Module):
    """Transformer block: communication (attention) then computation (FFN)."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.sa = MultiHeadAttention(n_head, n_embd // n_head)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # Pre-norm: normalize the input to each sub-layer, then add it back.
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd), nn.Dropout(0.2))

    def forward(self, x):
        return self.net(x)
Why 4x Expansion in FFN?
The FFN expands the embedding dimension by 4x, applies a nonlinearity, then contracts back. This wider intermediate space gives each position room for nonlinear computation, applied independently at every position. GPT-2 (768-dim embeddings) uses a 3072-dim inner layer.
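A quick parameter count makes the 4x expansion concrete. A sketch using GPT-2's 768/3072 dims from above; including bias terms is an assumption:

```python
n_embd = 768          # GPT-2 embedding dim
inner = 4 * n_embd    # the wider "thinking space"

# Two linear layers, expand then contract, each with a bias vector
params_up = n_embd * inner + inner      # 768*3072 + 3072
params_down = inner * n_embd + n_embd   # 3072*768 + 768

print(inner)                    # 3072
print(params_up + params_down)  # 4722432 parameters per FFN
```

At ~4.7M parameters per FFN, the feed-forward layers dominate the parameter budget of each block.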
IV. The Matrix

| | Deep Intuition | Surface Only |
| --- | --- | --- |
| **Quick** | 🎯 DO FIRST: Implement Block. Stack 4 blocks. Train. | ⏭ IF TIME: Add Dropout to attention and FFN. |
| **Slow** | 🖐 CAREFULLY: Remove residuals. Observe training failure. | 🚫 AVOID: Flash attention or KV caching. |
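The "remove residuals" ablation can be previewed analytically before training anything: in a deep linear stack, the end-to-end Jacobian is the product of per-layer Jacobians, and the identity term contributed by each residual is what keeps that product from collapsing. A pure-NumPy sketch; the depth, width, and per-layer weight scale are illustrative choices, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, d = 20, 64
# Random layer weights scaled so each has spectral norm roughly 0.4
Ws = [rng.normal(0, 0.2 / np.sqrt(d), (d, d)) for _ in range(depth)]

def jacobian_norm(use_residual):
    # End-to-end Jacobian of a deep linear stack:
    # without residuals each layer contributes W; with residuals, I + W.
    J = np.eye(d)
    for W in Ws:
        layer = W + (np.eye(d) if use_residual else 0)
        J = layer @ J
    return np.linalg.norm(J)

print(jacobian_norm(False))  # tiny: gradients vanish through 20 layers
print(jacobian_norm(True))   # O(1): the identity path preserves signal
```

Without residuals the product of 20 sub-unit-gain layers shrinks gradients toward zero; the residual's identity path gives gradients a multiplication-free route to every layer, which is why removing residuals makes deep training fail.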
V. Today’s Deliverables
- Block with pre-norm architecture
- FeedForward with 4x expansion
- LayerNorm understanding
- Residual connections
- Stack 4 blocks and train
- Ablation: remove components
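Putting the deliverables together: a runnable sketch of "stack 4 blocks and train." The MultiHeadAttention here is an assumed minimal causal implementation (the real class comes from an earlier day and may differ); the toy regression objective, block_size, seed, and hyperparameters are illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    # Assumed minimal causal self-attention stand-in.
    def __init__(self, n_head, head_size, block_size=32):
        super().__init__()
        n_embd = n_head * head_size
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)
        self.proj = nn.Linear(n_embd, n_embd)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_head, T, head_size)
        q, k, v = (t.view(B, T, self.n_head, -1).transpose(1, 2) for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) * k.size(-1) ** -0.5
        att = att.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # causal mask
        att = F.softmax(att, dim=-1)
        out = (att @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(out)

class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd), nn.Dropout(0.2))

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.sa = MultiHeadAttention(n_head, n_embd // n_head)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

torch.manual_seed(0)
n_embd, n_head = 64, 4
model = nn.Sequential(*[Block(n_embd, n_head) for _ in range(4)])  # stack 4 blocks

# A few steps on a toy regression target, just to confirm gradients flow.
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
x = torch.randn(2, 8, n_embd)
target = torch.randn(2, 8, n_embd)
for step in range(5):
    loss = F.mse_loss(model(x), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

model.eval()  # disable dropout for the final check
print(model(x).shape)  # torch.Size([2, 8, 64]) -- shape preserved through the stack
```

Because every block maps (B, T, n_embd) to (B, T, n_embd), stacking is just `nn.Sequential`; that shape invariance, guaranteed by the residual additions, is what makes the block a composable unit.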
The Transformer block is the Lego brick of modern AI. Tomorrow: the full GPT. — Day 18 Closing