The key insight: instead of sequential processing,
let every token directly attend to every other token simultaneously.
2. Step 0 — Words become Vectors (the prerequisite)
Key: Each word becomes a vector (list of ~512 numbers). Similar words end up nearby in this space. "cat" ≈ "kitten" ≈ "feline". The model learns this during training.
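The "nearby in this space" idea can be sketched with cosine similarity. The vectors below are hand-picked toy values, not learned embeddings, and are 4-dimensional rather than ~512 so they fit on screen:

```python
import numpy as np

# Toy 4-dimensional "embeddings" (hand-picked for illustration;
# a real model learns ~512-dim vectors during training).
emb = {
    "cat":    np.array([0.90, 0.80, 0.10, 0.00]),
    "kitten": np.array([0.85, 0.75, 0.20, 0.05]),
    "car":    np.array([0.10, 0.00, 0.90, 0.80]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["cat"], emb["kitten"]))  # near 1.0 — similar words
print(cosine(emb["cat"], emb["car"]))     # much smaller — unrelated words
```

The exact numbers don't matter; what matters is that related words score higher than unrelated ones.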
3. Self-Attention — the Core Idea (the whole game)
Self-attention: each token creates a weighted combination of ALL other tokens.
The weights tell you what to "look at". They're learned from data.
4. The Q, K, V Mechanism (how attention weights are computed)
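A minimal sketch of the mechanism: each token's embedding is projected three ways — a query ("what am I looking for?"), a key ("what do I contain?"), and a value ("what do I contribute?"). The matrices here are random stand-ins; in a real model W_Q, W_K, W_V are learned:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 8  # 4 tokens, 8-dim embeddings (toy sizes)

X = rng.normal(size=(N, d))       # token embeddings
W_Q = rng.normal(size=(d, d))     # learned in a real model; random here
W_K = rng.normal(size=(d, d))
W_V = rng.normal(size=(d, d))

Q = X @ W_Q                       # queries: what each token looks for
K = X @ W_K                       # keys: what each token offers to match on
V = X @ W_V                       # values: what each token contributes

# Query-key dot products give one compatibility score per token pair.
scores = Q @ K.T / np.sqrt(d)     # shape (N, N)
```

Softmaxing each row of `scores` yields the attention weights used in the next section.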
5. The Attention Matrix (all tokens × all tokens)
This is the attention matrix. It's an N×N grid where N = number of tokens.
Each row = one token's attention across all others.
Each row sums to 1.0 (it's a probability distribution from softmax).
Darker = more attention. Look at the "it" row — it's heavily attending to "animal".
The math in one line:
Attention(Q,K,V) = softmax(QKᵀ / √d) · V
QKᵀ = the whole N×N score matrix at once
Important: This runs in parallel for all tokens simultaneously. Not sequentially. That's why transformers are fast on GPUs.
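The one-line formula above can be written out directly. This is a NumPy sketch with random inputs, just to show that the whole N×N matrix is computed in one shot and that every row softmaxes to 1.0:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q·Kᵀ / √d) · V for all tokens at once."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # whole N×N score matrix
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)              # softmax per row
    return w @ V, w                                    # output + attention matrix

rng = np.random.default_rng(1)
N, d = 5, 16
Q, K, V = [rng.normal(size=(N, d)) for _ in range(3)]  # random stand-ins

out, A = attention(Q, K, V)
print(A.sum(axis=1))  # every row sums to 1.0 — a probability distribution
```

Note there is no loop over tokens: one matrix multiply scores every pair, which is exactly what maps well onto a GPU.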
6. Multi-Head Attention (looking through multiple lenses)
Why multiple heads? Each head learns a different type of relationship. One head might learn who refers to whom. Another learns subject-verb agreement. Another learns long-range dependencies.
GPT-3 uses 96 attention heads.
Each head gets a slice of the embedding dimension and specializes in different patterns. The model figures out what to specialize in — nobody programs that in.
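The "slice of the embedding dimension" is just a reshape. This sketch shows only the dimension bookkeeping — real implementations also apply learned projections before the split and concatenate the heads afterward:

```python
import numpy as np

N, d_model, n_heads = 6, 512, 8
d_head = d_model // n_heads   # each head works in a 64-dim slice

rng = np.random.default_rng(2)
x = rng.normal(size=(N, d_model))

# Split the embedding into per-head slices: (heads, tokens, d_head).
# Each of the 8 heads then runs its own independent attention.
heads = x.reshape(N, n_heads, d_head).transpose(1, 0, 2)
print(heads.shape)  # (8, 6, 64)
```

Because the heads operate on disjoint slices, adding heads doesn't add compute beyond the single-head case — it partitions it.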
7. One Transformer Block (the repeating unit)
Feed-Forward Network (FFN):
Two linear layers with a nonlinearity. Each token processed independently.
Expands to 4× the width then back down. Stores "knowledge" about concepts.
Residual / Skip connections:
Output = input + what-we-learned. This means:
— gradients flow easily during training
— early layers don't get "destroyed"
Layer Norm:
Keeps the values in a healthy numerical range after each step. Makes training stable.
The key formula:
output = LayerNorm(x + Sublayer(x))
This is applied twice per block — once for attention, once for FFN.
Stack N of these blocks and you have a transformer.
GPT-2 (largest version): 48 blocks. GPT-3: 96 blocks. Deeper stacks tend to capture more abstract patterns.
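The whole block above can be sketched in a few lines. The attention sublayer here is a deliberately dumb placeholder (uniform averaging) so the sketch stays self-contained — the point is the shape of the block: residual add, then LayerNorm, twice, with shape in = shape out so blocks stack:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's vector to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def ffn(x, W1, b1, W2, b2):
    """Expand to 4× width, ReLU, project back down — per token."""
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

def toy_attention(x):
    """Placeholder for self-attention: uniform average over tokens."""
    return np.full((len(x), len(x)), 1.0 / len(x)) @ x

def block(x, ffn_params):
    x = layer_norm(x + toy_attention(x))   # output = LayerNorm(x + Sublayer(x))
    x = layer_norm(x + ffn(x, *ffn_params))  # ...applied once more for the FFN
    return x

rng = np.random.default_rng(3)
N, d = 4, 8
x = rng.normal(size=(N, d))
params = (rng.normal(size=(d, 4 * d)) * 0.1, np.zeros(4 * d),
          rng.normal(size=(4 * d, d)) * 0.1, np.zeros(d))

y = block(x, params)
print(y.shape)  # (4, 8) — same shape out as in, so you can stack N blocks
```

This is the post-norm arrangement matching the formula above; many modern models instead normalize before each sublayer (pre-norm), but the residual structure is the same.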
8. The Big Picture (all together now)
The transformer doesn't process tokens one-by-one.
It processes the entire sequence in parallel. That's why it trains so much faster than RNNs.
What the model learns:
All the weight matrices — W_Q, W_K, W_V, W_O in every head, plus the FFN weights. That's it. The architecture is fixed; the weights are learned.
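As a rough sanity check that these matrices really are "it": counting just the attention and FFN weight matrices at GPT-3 scale (embedding width 12288, 96 blocks, per the public GPT-3 paper; biases and embedding tables ignored) lands close to the famous 175B parameter count:

```python
d_model = 12288               # GPT-3 175B embedding width
attn = 4 * d_model ** 2       # W_Q, W_K, W_V, W_O (all heads combined)
ffn = 2 * 4 * d_model ** 2    # two linear layers with 4× expansion
per_block = attn + ffn
total = per_block * 96        # 96 blocks

print(f"{total / 1e9:.0f}B")  # ≈174B — the rest is embeddings and biases
```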
What the model actually learns to do with attention:
Early layers: local syntax, adjacent word patterns.
Middle layers: coreference, entity relationships ("it" → "animal").
Late layers: abstract semantic reasoning, task-specific patterns.
Analogy: Attention is like knowing which pages in every book in the library to flip to before answering a question — and doing it for every word, simultaneously, every forward pass.