PHASE 2 Deep Networks · Day 16 of 80 · makemore & GPT

Self-Attention from Scratch

The mechanism powering all modern LLMs: queries, keys, values, and scaled dot-product attention.

In a portfolio, every position is evaluated relative to every other. Self-attention does this for tokens: each computes relevance to all others, then aggregates accordingly. — Day 16 Principle

I. The Attention Formula

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Each token produces a query, a key, and a value vector. The dot product of a query with every key measures relevance; softmax normalizes those scores into weights that sum to 1; the weighted sum of values aggregates information from the relevant tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, C = 4, 8, 32      # batch, time (sequence length), channels
head_size = 16

key   = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

x = torch.randn(B, T, C)
k = key(x)              # [B, T, 16]
q = query(x)            # [B, T, 16]
v = value(x)            # [B, T, 16]

wei = q @ k.transpose(-2, -1) * head_size**-0.5   # [B, T, T] scaled scores
tril = torch.tril(torch.ones(T, T))               # lower-triangular causal mask
wei = wei.masked_fill(tril == 0, float('-inf'))   # block attention to the future
wei = F.softmax(wei, dim=-1)                      # each row sums to 1
out = wei @ v                                     # [B, T, 16] weighted aggregation
```

II. Why Scale by √d_k?

Without scaling, dot products of unit-variance vectors have variance ≈ d_k, so their magnitude grows with head dimension and pushes softmax toward near-one-hot extremes with vanishing gradients. Dividing by √d_k keeps the score variance near 1.0, so softmax stays informative.
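A quick sanity check of this claim, as a sketch with random unit-variance vectors (not part of the lesson code): the unscaled dot products have variance close to d_k, and dividing by √d_k brings it back to roughly 1.

```python
import torch

torch.manual_seed(0)
d_k = 16
q = torch.randn(1000, d_k)   # unit-variance queries
k = torch.randn(1000, d_k)   # unit-variance keys

raw    = (q * k).sum(dim=-1)   # 1000 unscaled dot products
scaled = raw * d_k**-0.5       # divide by sqrt(d_k)

print(raw.var().item())        # ≈ d_k = 16
print(scaled.var().item())     # ≈ 1.0
```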

III. The Causal Mask

A token at position t must not attend to future tokens, or the language model could cheat by reading the answer. Filling the upper triangle of the score matrix with -inf before softmax makes e^(-inf) = 0, so future positions receive exactly zero attention weight.
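A minimal demonstration of the mask on its own (a standalone sketch with random scores, separate from the lesson code): after masking and softmax, each row is a valid distribution over only the current and past positions.

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.randn(T, T)                        # raw attention scores
tril = torch.tril(torch.ones(T, T))               # 1s on and below the diagonal
masked = scores.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(masked, dim=-1)

print(wei)
# row t has nonzero weights only at positions <= t, and every row sums to 1
```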

IV. The Matrix

| | Deep Intuition | Surface Only |
|---|---|---|
| **Quick** 🎯 | **DO FIRST:** Single-head self-attention with causal mask. | **IF TIME:** Visualize attention weights as a heatmap. |
| **Slow** 🖐 | **CAREFULLY:** Compare to simple averaging. See why learned weighting wins. | 🚫 **AVOID:** Multi-head attention. Master single-head first. |
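For the "compare to simple averaging" exercise, a minimal sketch (my own illustration, using q = k = x for simplicity rather than learned projections): uniform averaging fixes every past token's weight at 1/(t+1), while attention produces data-dependent weights over the same causal positions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, C = 8, 16
x = torch.randn(T, C)
tril = torch.tril(torch.ones(T, T))

# Simple averaging: each position takes the uniform mean of itself and the past.
avg_wei = tril / tril.sum(dim=1, keepdim=True)    # row t: 1/(t+1) on positions <= t
avg_out = avg_wei @ x

# Attention: data-dependent scores replace the fixed uniform weights.
scores = x @ x.transpose(-2, -1) * C**-0.5        # q = k = x, purely for illustration
scores = scores.masked_fill(tril == 0, float('-inf'))
att_wei = F.softmax(scores, dim=-1)
att_out = att_wei @ x

# Both weight matrices are causal and row-normalized, but only att_wei
# depends on the content of x.
print(avg_wei[3])   # four equal weights of 0.25, then zeros
print(att_wei[3])   # nonuniform weights over the first four positions
```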

V. Today’s Deliverables

Self-attention is the atom of the Transformer. Tomorrow: multiple heads. — Day 16 Closing