Transformers &
Self-Attention

How they actually work — from scratch
→ think first, then build up from fundamentals
"The animal didn't cross ... it ..." — what does "it" refer to? Attention needs to reach all the way back to "animal".
Old way (RNN): process left→right, one word at a time. By the time you reach "it", the context has faded.
New way (Transformer): every word looks at every other word. All at once.
The key insight: instead of sequential processing,
let every token directly attend to every other token simultaneously.
1
"cat" (word) → tokenize → 2368 (token ID) → embedding lookup (512 dims) → [0.23, −0.71, 0.08, 0.94, −0.35, 0.12, ...] → + position encoding (sin/cos) → input vector
* "Dog bites man" ≠ "Man bites dog" — order matters, so we encode position.
Key: Each word becomes a vector (list of ~512 numbers). Similar words end up nearby in this space. "cat" ≈ "kitten" ≈ "feline". The model learns this during training.
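A minimal NumPy sketch of this step, with a random embedding table standing in for learned weights (the vocabulary size and the token ID 2368 are illustrative, following the slide):

```python
import numpy as np

d_model = 512          # embedding width from the slide
vocab_size = 50257     # GPT-2-style vocabulary size (illustrative)

rng = np.random.default_rng(0)
embedding_table = rng.normal(scale=0.02, size=(vocab_size, d_model))

def sinusoidal_position(pos, d_model):
    """Classic sin/cos positional encoding: each dimension oscillates
    at a different frequency, so every position gets a unique pattern."""
    i = np.arange(d_model // 2)
    angles = pos / (10000 ** (2 * i / d_model))
    enc = np.zeros(d_model)
    enc[0::2] = np.sin(angles)   # even dims get sin
    enc[1::2] = np.cos(angles)   # odd dims get cos
    return enc

token_id = 2368    # "cat" in the slide's (hypothetical) tokenizer
position = 3       # where the token sits in the sequence
x = embedding_table[token_id] + sinusoidal_position(position, d_model)
print(x.shape)     # a single 512-dim input vector
```

Addition (not concatenation) is the standard trick: position information rides along in the same 512 dimensions as meaning.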
2
Every token asks: "which other tokens should I pay attention to?"
Focus token: "it". Attention weights for "it" (sum to 1.0):

  The      0.02
  animal   0.72   ← high attention!
  didn't   0.08
  cross
  tired    0.18
Self-attention: each token creates a weighted combination of ALL other tokens.
The weights tell you what to "look at". They're learned from data.
3
For each token, we create THREE vectors from the input vector x:

  Q = x · W_Q   (Query: "What am I looking for?")
  K = x · W_K   (Key:   "What do I have?")
  V = x · W_V   (Value: "My actual information")

Computing the attention score between two tokens:

  score = (Q_i · K_j) / √d

The dot product measures how similar Q and K are; dividing by √d keeps gradients stable during training.

Then softmax to get probabilities:

  softmax([score_1, score_2, ..., score_n]) = [0.02, 0.72, 0.08, 0.05, 0.18, 0.05, ...]
  (all positive, sum to 1.0)

Finally, a weighted sum of Values:

  output = 0.02·V_1 + 0.72·V_animal + 0.08·V_3 + ...
  → a blend of all value vectors, weighted by relevance
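The three-step computation above, sketched in NumPy for a single focus token; Q, K, and V are random stand-ins and the shapes are toy values, not a real model's:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())   # subtract max for numerical stability
    return e / e.sum()

d = 64                         # per-token vector dimension (toy value)
n = 6                          # number of tokens in the sequence
rng = np.random.default_rng(1)
K = rng.normal(size=(n, d))    # one Key vector per token
V = rng.normal(size=(n, d))    # one Value vector per token
q = rng.normal(size=d)         # Query for the focus token ("it")

scores = K @ q / np.sqrt(d)    # Q_i · K_j / sqrt(d), for every token j
weights = softmax(scores)      # all positive, sum to 1.0
output = weights @ V           # blend of all Value vectors, weighted by relevance

assert np.isclose(weights.sum(), 1.0)
```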
4
Attention matrix (rows = Queries, who is attending; columns = Keys, what is attended to):

            The   animal  didn't   it   tired
  The       .65    .12     .10    .08    .05
  animal    .10    .72     .09    .05    .04
  didn't    .08    .28     .48    .10    .06
  it        .02    .72     .08    .05    .18   ← "it" row
  tired     .05    .32     .38    .15    .10

(In the original figure, darker cells mean higher attention.)

This is the attention matrix. It's an N×N grid where N = number of tokens.
Each row = one token's attention across all others.
Each row sums to 1.0 (it's a probability distribution from softmax).
Darker = more attention. Look at the "it" row — it's heavily attending to "animal".
The math in one line:

Attention(Q,K,V) = softmax(QKᵀ / √d) · V

QKᵀ = the whole N×N score matrix at once
Important: This runs in parallel for all tokens simultaneously. Not sequentially. That's why transformers are fast on GPUs.
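That one-line formula written out in NumPy; the row-wise softmax is what makes each row of the attention matrix a probability distribution, and everything happens in one batched matrix multiply rather than a loop over tokens:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) . V, computed for all tokens at once."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # the whole N x N score matrix
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(2)
N, d = 5, 64                     # toy sizes: 5 tokens, 64 dims
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))

out, A = attention(Q, K, V)
assert A.shape == (N, N)                  # one row per token
assert np.allclose(A.sum(axis=1), 1.0)    # each row sums to 1.0
```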
5
input → [Head 1: syntax/grammar] [Head 2: coreference ("it"→"animal")] [Head 3: subject-verb relationships] [Head 4: long-range dependencies] · · · → Concat + Linear projection → output (same shape as input!)
Why multiple heads? Each head learns a different type of relationship. One head might learn who refers to whom. Another learns subject-verb agreement. Another learns long-range dependencies.
GPT-3 uses 96 attention heads per layer.
Each head gets a slice of the embedding dimension and specializes in different patterns.
The model figures out what to specialize in — nobody programs that in.
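A sketch of multi-head attention under the "each head gets a slice of the embedding dimension" scheme described above; the weight matrices are random stand-ins and the head count and widths are toy values:

```python
import numpy as np

def multi_head_attention(x, W_Q, W_K, W_V, W_O, n_heads):
    """Split d_model across heads, attend per head, concat, project."""
    N, d_model = x.shape
    d_head = d_model // n_heads
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)   # this head's slice of dims
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        scores -= scores.max(axis=-1, keepdims=True)
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)          # row-wise softmax
        heads.append(w @ V[:, sl])                  # each head's output
    return np.concatenate(heads, axis=-1) @ W_O     # concat + linear projection

rng = np.random.default_rng(3)
N, d_model, n_heads = 5, 512, 8
x = rng.normal(size=(N, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(scale=0.02, size=(d_model, d_model))
                      for _ in range(4))

out = multi_head_attention(x, W_Q, W_K, W_V, W_O, n_heads)
assert out.shape == x.shape   # same shape as the input, as the slide says
```

Real implementations batch the per-head loop into one reshaped matrix multiply, but the slicing here makes the "each head sees d_model / n_heads dims" idea explicit.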
6
input x → Multi-Head Attention → Add & LayerNorm → Feed-Forward Network (Linear → GELU → Linear) → Add & LayerNorm → output z
(skip/residual connections carry each sub-layer's input into its "Add"; the whole block repeats × N layers)

Feed-Forward Network (FFN):
Two linear layers with a nonlinearity, applied to each token independently.
Expands to 4× the width, then back down. Stores "knowledge" about concepts.

Residual / skip connections:
Output = input + what-we-learned. This means:
— gradients flow easily during training
— early layers don't get "destroyed"

Layer Norm:
Keeps the values in a healthy numerical range after each step. Makes training stable.

The key formula:

output = LayerNorm(x + Sublayer(x))

This is applied twice per block — once for attention, once for FFN.
Stack N of these blocks and you have a transformer.
GPT-2: 48 blocks. GPT-3: 96 blocks. Deeper = more abstract reasoning.
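The block above, sketched in NumPy using the post-norm formula output = LayerNorm(x + Sublayer(x)); the attention sub-layer is passed in as a function (here just an identity stand-in, so the residual/norm/FFN wiring is the focus), and the FFN expands to 4× the width:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def transformer_block(x, attn, W1, b1, W2, b2):
    """output = LayerNorm(x + Sublayer(x)), applied twice per block."""
    x = layer_norm(x + attn(x))          # attention sub-layer + residual + norm
    ffn = gelu(x @ W1 + b1) @ W2 + b2    # expand to 4x width, then back down
    return layer_norm(x + ffn)           # FFN sub-layer + residual + norm

rng = np.random.default_rng(4)
N, d = 5, 512
x = rng.normal(size=(N, d))
W1 = rng.normal(scale=0.02, size=(d, 4 * d)); b1 = np.zeros(4 * d)
W2 = rng.normal(scale=0.02, size=(4 * d, d)); b2 = np.zeros(d)
identity_attn = lambda v: v    # stand-in for real multi-head attention

z = transformer_block(x, identity_attn, W1, b1, W2, b2)
assert z.shape == x.shape      # blocks preserve shape, so they stack freely
```

Shape preservation is the point: because every block maps (N, d) → (N, d), you can stack 48 or 96 of them without any glue code.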
7
The full pipeline:
Tokenize: words → IDs ("cat" → 2368)
Embed: IDs → vectors, + position
N × Transformer Block: Multi-Head Attn + FFN + LayerNorm + residuals (the expensive part)
Linear + Softmax: project to vocab
Next token: "mat"
Training: compare the predicted token to the real token → compute loss → backprop → repeat × billions.
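The final projection step can be sketched as follows; the hidden state and projection matrix are random stand-ins, so the chosen token is meaningless, but the mechanics (logits → softmax → pick) are the point:

```python
import numpy as np

rng = np.random.default_rng(5)
vocab_size, d_model = 50257, 512
h = rng.normal(size=d_model)   # final hidden state of the last token
W_vocab = rng.normal(scale=0.02, size=(d_model, vocab_size))

logits = h @ W_vocab                            # one score per vocabulary entry
logits -= logits.max()                          # numerical stability
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the whole vocab
next_token = int(np.argmax(probs))              # greedy pick of the next token

assert np.isclose(probs.sum(), 1.0)
```

Real decoders usually sample from `probs` (with temperature, top-k, etc.) rather than always taking the argmax.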
The transformer doesn't process tokens one-by-one.
It processes the entire sequence in parallel.
That's why it's so much faster than RNNs.
What the model learns:
All the weight matrices — W_Q, W_K, W_V, W_O in every head, plus the FFN weights. That's it. The architecture is fixed; the weights are learned.
GPT-3: 96 layers, each with 96 attention heads — about 175 billion learned parameters in total.
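A back-of-the-envelope check on that number, assuming GPT-3's embedding width of 12,288 and the standard ~12·d² parameters per layer (4·d² for the attention projections W_Q, W_K, W_V, W_O; 8·d² for the d → 4d → d FFN):

```python
# Rough parameter count for a GPT-3-sized model (ignores embeddings and biases).
d_model = 12288            # GPT-3's embedding width
n_layers = 96
per_layer = 12 * d_model ** 2    # 4*d^2 attention + 8*d^2 FFN
total = n_layers * per_layer
print(f"{total / 1e9:.0f} billion")   # ≈ 174 billion, close to the quoted 175B
```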
8
Attention = every token looks at every other.
Q·K = score: how relevant is token j to token i?
V = content: a weighted blend of all values.
Stack × N: each layer learns something more abstract.

Attention(Q,K,V) = softmax(QKᵀ / √d) · V
What the model actually learns to do with attention:
Early layers: local syntax, adjacent word patterns.
Middle layers: coreference, entity relationships ("it" → "animal").
Late layers: abstract semantic reasoning, task-specific patterns.
Analogy: Attention is like knowing which pages in every book in the library to flip to before answering a question — and doing it for every word, simultaneously, every forward pass.
9