In a portfolio, every position is evaluated relative to every other. Self-attention does this for tokens: each computes relevance to all others, then aggregates accordingly. — Day 16 Principle
I. The Attention Formula
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Each token produces a query, a key, and a value. The query–key dot product measures relevance; softmax normalizes the scores into weights; the weighted sum of values aggregates the result.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, C = 4, 8, 32   # batch, time (sequence length), channels (embedding dim)
head_size = 16
x = torch.randn(B, T, C)
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)    # [B, T, 16]
q = query(x)  # [B, T, 16]
v = value(x)  # [B, T, 16]
wei = q @ k.transpose(-2, -1) * head_size**-0.5  # [B, T, T] scaled relevance scores
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))  # mask out future positions
wei = F.softmax(wei, dim=-1)                     # each row sums to 1
out = wei @ v                                    # [B, T, 16] weighted aggregation
II. Why Scale by √d_k?
Without scaling, the variance of the dot products grows linearly with d_k, pushing softmax toward one-hot extremes and starving gradients. Dividing by √d_k keeps the variance near 1, so the softmax stays informative.
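A quick sanity check of this claim (a standalone sketch; the dimension and sample count are arbitrary): measure the variance of raw versus scaled dot products between unit-variance vectors.

```python
import torch

torch.manual_seed(0)
d_k = 256                       # illustrative head dimension
q = torch.randn(1000, d_k)      # unit-variance queries
k = torch.randn(1000, d_k)      # unit-variance keys

raw = (q * k).sum(dim=-1)       # 1000 dot products; variance grows with d_k
scaled = raw * d_k**-0.5        # divide by √d_k

print(raw.var().item())         # ≈ 256, i.e. ≈ d_k
print(scaled.var().item())      # ≈ 1
```

Feeding the raw scores into softmax would make one logit dominate almost every row; the scaled scores produce soft, trainable distributions.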
III. The Causal Mask
The token at position t must not attend to future tokens. The lower-triangular mask sets future scores to -inf, which softmax maps to exactly zero attention weight.
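The mask is easiest to see on a tiny example (a standalone sketch with T=4 and all scores equal):

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.zeros(T, T)                            # pretend all scores are equal
tril = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(tril == 0, float('-inf'))  # hide the future
wei = F.softmax(scores, dim=-1)
print(wei)
# row t attends uniformly over positions 0..t; future positions get weight 0
# row 0: [1, 0, 0, 0]   row 3: [0.25, 0.25, 0.25, 0.25]
```

Note that exp(-inf) = 0, so masked positions contribute nothing to the weighted sum, not merely a small amount.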
IV. The Matrix
Deep Intuition
Surface Only
Quick
🎯
DO FIRST
Single-head self-attention with causal mask.
⏭
IF TIME
Visualize attention weights as heatmap.
Slow
🖐
CAREFULLY
Compare to simple averaging. See why learned weighting wins.
🚫
AVOID
Multi-head attention. Master single-head first.
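The "compare to simple averaging" exercise above can be sketched as follows (a standalone toy; the raw x @ x.T score stands in for q @ k.T, with no learned projections): uniform averaging gives every past token equal weight, while attention makes the weights depend on content.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 8
x = torch.randn(T, 4)

# Simple averaging: every past token gets equal weight, regardless of content.
tril = torch.tril(torch.ones(T, T))
avg_wei = tril / tril.sum(dim=-1, keepdim=True)

# Attention-style: weights depend on the data, so relevant tokens dominate.
scores = x @ x.T                                      # stand-in for q @ k.T
scores = scores.masked_fill(tril == 0, float('-inf'))
att_wei = F.softmax(scores, dim=-1)

print(avg_wei[3])  # uniform over positions 0..3: [0.25, 0.25, 0.25, 0.25, 0, ...]
print(att_wei[3])  # non-uniform, shaped by token content
```

Both obey causality (zero weight on the future); only the second can learn *which* past tokens matter, which is the entire payoff of Q and K.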
V. Today’s Deliverables
- Q,K,V projections
- Dot-product attention with scaling
- Causal mask
- Weighted aggregation
- Shape verification
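The last deliverable can be made concrete with a self-contained check (a sketch reusing the same dimensions as the walkthrough above, not a canonical test suite):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, C, head_size = 4, 8, 32, 16
x = torch.randn(B, T, C)
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

wei = query(x) @ key(x).transpose(-2, -1) * head_size**-0.5
wei = wei.masked_fill(torch.tril(torch.ones(T, T)) == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ value(x)

assert out.shape == (B, T, head_size)                     # aggregation shape
assert torch.allclose(wei.sum(dim=-1), torch.ones(B, T))  # rows are distributions
assert wei[0, 0, 1:].sum().item() == 0.0                  # token 0 sees only itself
```

If all three assertions pass, the projections, scaling, mask, and aggregation are wired correctly end to end.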
Self-attention is the atom of the Transformer. Tomorrow: multiple heads. — Day 16 Closing