PHASE 2 Deep Networks · Day 16 of 80 · makemore & GPT

Self-Attention from Scratch

The mechanism powering all modern LLMs: queries, keys, values, and scaled dot-product attention.

In a portfolio, every position is evaluated relative to every other. Self-attention does this for tokens: each computes relevance to all others, then aggregates accordingly. — Day 16 Principle

I. The Attention Formula

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Each token produces a query, a key, and a value vector. The dot product of a query with every key measures relevance; softmax normalizes those scores into weights that sum to 1; the weighted sum of values aggregates information from the relevant tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, C = 4, 8, 32      # batch, time (sequence length), channels
head_size = 16

key   = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

x = torch.randn(B, T, C)
k = key(x)              # [B, T, 16]
q = query(x)            # [B, T, 16]
v = value(x)            # [B, T, 16]

wei = q @ k.transpose(-2, -1) * head_size**-0.5   # [B, T, T] scaled scores
tril = torch.tril(torch.ones(T, T))               # lower-triangular causal mask
wei = wei.masked_fill(tril == 0, float('-inf'))   # block attention to the future
wei = F.softmax(wei, dim=-1)                      # each row sums to 1
out = wei @ v                                     # [B, T, 16] weighted aggregation
```

II. Why Scale by √d_k?

Without scaling, dot products of unit-variance vectors have variance ≈ d_k, so their magnitude grows with head dimension and pushes softmax toward near-one-hot extremes with vanishing gradients. Dividing by √d_k keeps the score variance near 1.0, so softmax stays informative.
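A quick sanity check of this claim, as a sketch with random unit-variance vectors (not part of the lesson code): the unscaled dot products have variance close to d_k, and dividing by √d_k brings it back to roughly 1.

```python
import torch

torch.manual_seed(0)
d_k = 16
q = torch.randn(1000, d_k)   # unit-variance queries
k = torch.randn(1000, d_k)   # unit-variance keys

raw    = (q * k).sum(dim=-1)   # 1000 unscaled dot products
scaled = raw * d_k**-0.5       # divide by sqrt(d_k)

print(raw.var().item())        # ≈ d_k = 16
print(scaled.var().item())     # ≈ 1.0
```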

III. The Causal Mask

A token at position t must not attend to future tokens, or the language model could cheat by reading the answer. Filling the upper triangle of the score matrix with -inf before softmax makes e^(-inf) = 0, so future positions receive exactly zero attention weight.
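A minimal demonstration of the mask on its own (a standalone sketch with random scores, separate from the lesson code): after masking and softmax, each row is a valid distribution over only the current and past positions.

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.randn(T, T)                        # raw attention scores
tril = torch.tril(torch.ones(T, T))               # 1s on and below the diagonal
masked = scores.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(masked, dim=-1)

print(wei)
# row t has nonzero weights only at positions <= t, and every row sums to 1
```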

IV. The Matrix

| | Deep Intuition | Surface Only |
|---|---|---|
| **Quick** 🎯 | **DO FIRST:** Single-head self-attention with causal mask. | **IF TIME:** Visualize attention weights as a heatmap. |
| **Slow** 🖐 | **CAREFULLY:** Compare to simple averaging. See why learned weighting wins. | 🚫 **AVOID:** Multi-head attention. Master single-head first. |
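For the "compare to simple averaging" exercise, a minimal sketch (my own illustration, using q = k = x for simplicity rather than learned projections): uniform averaging fixes every past token's weight at 1/(t+1), while attention produces data-dependent weights over the same causal positions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, C = 8, 16
x = torch.randn(T, C)
tril = torch.tril(torch.ones(T, T))

# Simple averaging: each position takes the uniform mean of itself and the past.
avg_wei = tril / tril.sum(dim=1, keepdim=True)    # row t: 1/(t+1) on positions <= t
avg_out = avg_wei @ x

# Attention: data-dependent scores replace the fixed uniform weights.
scores = x @ x.transpose(-2, -1) * C**-0.5        # q = k = x, purely for illustration
scores = scores.masked_fill(tril == 0, float('-inf'))
att_wei = F.softmax(scores, dim=-1)
att_out = att_wei @ x

# Both weight matrices are causal and row-normalized, but only att_wei
# depends on the content of x.
print(avg_wei[3])   # four equal weights of 0.25, then zeros
print(att_wei[3])   # nonuniform weights over the first four positions
```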

V. Today’s Deliverables

Self-attention is the atom of the Transformer. Tomorrow: multiple heads. — Day 16 Closing