Transformers &
Self-Attention

How they actually work — from scratch
→ think first, then build up from fundamentals
"The animal didn't cross ... it ..." — what does "it" refer to? Attention needs to reach all the way back to "animal".
Old way (RNN): process left→right, one word at a time. By the time you reach "it", the context has faded.
New way (Transformer): every word looks at every other word. All at once.
The key insight: instead of sequential processing,
let every token directly attend to every other token simultaneously.
1
"cat" (word) → tokenize → 2368 (token ID) → embedding lookup (512 dims) → [0.23, −0.71, 0.08, 0.94, −0.35, 0.12, ...] → + position encoding (sin/cos) → input vector
* "Dog bites man" ≠ "Man bites dog" — order matters, so we encode position.
Key: Each word becomes a vector (list of ~512 numbers). Similar words end up nearby in this space. "cat" ≈ "kitten" ≈ "feline". The model learns this during training.
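A minimal NumPy sketch of this step, with a random embedding table standing in for learned weights (the vocabulary size and the token ID 2368 are illustrative, following the slide):

```python
import numpy as np

d_model = 512          # embedding width from the slide
vocab_size = 50257     # GPT-2-style vocabulary size (illustrative)

rng = np.random.default_rng(0)
embedding_table = rng.normal(scale=0.02, size=(vocab_size, d_model))

def sinusoidal_position(pos, d_model):
    """Classic sin/cos positional encoding: each dimension oscillates
    at a different frequency, so every position gets a unique pattern."""
    i = np.arange(d_model // 2)
    angles = pos / (10000 ** (2 * i / d_model))
    enc = np.zeros(d_model)
    enc[0::2] = np.sin(angles)   # even dims get sin
    enc[1::2] = np.cos(angles)   # odd dims get cos
    return enc

token_id = 2368    # "cat" in the slide's (hypothetical) tokenizer
position = 3       # where the token sits in the sequence
x = embedding_table[token_id] + sinusoidal_position(position, d_model)
print(x.shape)     # a single 512-dim input vector
```

Addition (not concatenation) is the standard trick: position information rides along in the same 512 dimensions as meaning.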
2
Every token asks: "which other tokens should I pay attention to?"
Focus token: "it". Attention weights for "it" (sum to 1.0):

  The      0.02
  animal   0.72   ← high attention!
  didn't   0.08
  cross
  tired    0.18
Self-attention: each token creates a weighted combination of ALL other tokens.
The weights tell you what to "look at". They're learned from data.
3
For each token, we create THREE vectors from the input vector x:

  Q = x · W_Q   (Query: "What am I looking for?")
  K = x · W_K   (Key:   "What do I have?")
  V = x · W_V   (Value: "My actual information")

Computing the attention score between two tokens:

  score = (Q_i · K_j) / √d

The dot product measures how similar Q and K are; dividing by √d keeps gradients stable during training.

Then softmax to get probabilities:

  softmax([score_1, score_2, ..., score_n]) = [0.02, 0.72, 0.08, 0.05, 0.18, 0.05, ...]
  (all positive, sum to 1.0)

Finally, a weighted sum of Values:

  output = 0.02·V_1 + 0.72·V_animal + 0.08·V_3 + ...
  → a blend of all value vectors, weighted by relevance
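The three-step computation above, sketched in NumPy for a single focus token; Q, K, and V are random stand-ins and the shapes are toy values, not a real model's:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())   # subtract max for numerical stability
    return e / e.sum()

d = 64                         # per-token vector dimension (toy value)
n = 6                          # number of tokens in the sequence
rng = np.random.default_rng(1)
K = rng.normal(size=(n, d))    # one Key vector per token
V = rng.normal(size=(n, d))    # one Value vector per token
q = rng.normal(size=d)         # Query for the focus token ("it")

scores = K @ q / np.sqrt(d)    # Q_i · K_j / sqrt(d), for every token j
weights = softmax(scores)      # all positive, sum to 1.0
output = weights @ V           # blend of all Value vectors, weighted by relevance

assert np.isclose(weights.sum(), 1.0)
```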
4
Attention matrix (rows = Queries, who is attending; columns = Keys, what is attended to):

            The   animal  didn't   it   tired
  The       .65    .12     .10    .08    .05
  animal    .10    .72     .09    .05    .04
  didn't    .08    .28     .48    .10    .06
  it        .02    .72     .08    .05    .18   ← "it" row
  tired     .05    .32     .38    .15    .10

(In the original figure, darker cells mean higher attention.)

This is the attention matrix. It's an N×N grid where N = number of tokens.
Each row = one token's attention across all others.
Each row sums to 1.0 (it's a probability distribution from softmax).
Darker = more attention. Look at the "it" row — it's heavily attending to "animal".
The math in one line:

Attention(Q,K,V) = softmax(QKᵀ / √d) · V

QKᵀ = the whole N×N score matrix at once
Important: This runs in parallel for all tokens simultaneously. Not sequentially. That's why transformers are fast on GPUs.
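That one-line formula written out in NumPy; the row-wise softmax is what makes each row of the attention matrix a probability distribution, and everything happens in one batched matrix multiply rather than a loop over tokens:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) . V, computed for all tokens at once."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # the whole N x N score matrix
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(2)
N, d = 5, 64                     # toy sizes: 5 tokens, 64 dims
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))

out, A = attention(Q, K, V)
assert A.shape == (N, N)                  # one row per token
assert np.allclose(A.sum(axis=1), 1.0)    # each row sums to 1.0
```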
5
input → [Head 1: syntax/grammar] [Head 2: coreference ("it"→"animal")] [Head 3: subject-verb relationships] [Head 4: long-range dependencies] · · · → Concat + Linear projection → output (same shape as input!)
Why multiple heads? Each head learns a different type of relationship. One head might learn who refers to whom. Another learns subject-verb agreement. Another learns long-range dependencies.
GPT-3 uses 96 attention heads per layer.
Each head gets a slice of the embedding dimension and specializes in different patterns.
The model figures out what to specialize in — nobody programs that in.
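A sketch of multi-head attention under the "each head gets a slice of the embedding dimension" scheme described above; the weight matrices are random stand-ins and the head count and widths are toy values:

```python
import numpy as np

def multi_head_attention(x, W_Q, W_K, W_V, W_O, n_heads):
    """Split d_model across heads, attend per head, concat, project."""
    N, d_model = x.shape
    d_head = d_model // n_heads
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)   # this head's slice of dims
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        scores -= scores.max(axis=-1, keepdims=True)
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)          # row-wise softmax
        heads.append(w @ V[:, sl])                  # each head's output
    return np.concatenate(heads, axis=-1) @ W_O     # concat + linear projection

rng = np.random.default_rng(3)
N, d_model, n_heads = 5, 512, 8
x = rng.normal(size=(N, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(scale=0.02, size=(d_model, d_model))
                      for _ in range(4))

out = multi_head_attention(x, W_Q, W_K, W_V, W_O, n_heads)
assert out.shape == x.shape   # same shape as the input, as the slide says
```

Real implementations batch the per-head loop into one reshaped matrix multiply, but the slicing here makes the "each head sees d_model / n_heads dims" idea explicit.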
6
input x → Multi-Head Attention → Add & LayerNorm → Feed-Forward Network (Linear → GELU → Linear) → Add & LayerNorm → output z
(skip/residual connections carry each sub-layer's input into its "Add"; the whole block repeats × N layers)

Feed-Forward Network (FFN):
Two linear layers with a nonlinearity, applied to each token independently.
Expands to 4× the width, then back down. Stores "knowledge" about concepts.

Residual / skip connections:
Output = input + what-we-learned. This means:
— gradients flow easily during training
— early layers don't get "destroyed"

Layer Norm:
Keeps the values in a healthy numerical range after each step. Makes training stable.

The key formula:

output = LayerNorm(x + Sublayer(x))

This is applied twice per block — once for attention, once for FFN.
Stack N of these blocks and you have a transformer.
GPT-2: 48 blocks. GPT-3: 96 blocks. Deeper = more abstract reasoning.
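The block above, sketched in NumPy using the post-norm formula output = LayerNorm(x + Sublayer(x)); the attention sub-layer is passed in as a function (here just an identity stand-in, so the residual/norm/FFN wiring is the focus), and the FFN expands to 4× the width:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def transformer_block(x, attn, W1, b1, W2, b2):
    """output = LayerNorm(x + Sublayer(x)), applied twice per block."""
    x = layer_norm(x + attn(x))          # attention sub-layer + residual + norm
    ffn = gelu(x @ W1 + b1) @ W2 + b2    # expand to 4x width, then back down
    return layer_norm(x + ffn)           # FFN sub-layer + residual + norm

rng = np.random.default_rng(4)
N, d = 5, 512
x = rng.normal(size=(N, d))
W1 = rng.normal(scale=0.02, size=(d, 4 * d)); b1 = np.zeros(4 * d)
W2 = rng.normal(scale=0.02, size=(4 * d, d)); b2 = np.zeros(d)
identity_attn = lambda v: v    # stand-in for real multi-head attention

z = transformer_block(x, identity_attn, W1, b1, W2, b2)
assert z.shape == x.shape      # blocks preserve shape, so they stack freely
```

Shape preservation is the point: because every block maps (N, d) → (N, d), you can stack 48 or 96 of them without any glue code.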
7
The full pipeline:
Tokenize: words → IDs ("cat" → 2368)
Embed: IDs → vectors, + position
N × Transformer Block: Multi-Head Attn + FFN + LayerNorm + residuals (the expensive part)
Linear + Softmax: project to vocab
Next token: "mat"
Training: compare the predicted token to the real token → compute loss → backprop → repeat × billions.
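The final projection step can be sketched as follows; the hidden state and projection matrix are random stand-ins, so the chosen token is meaningless, but the mechanics (logits → softmax → pick) are the point:

```python
import numpy as np

rng = np.random.default_rng(5)
vocab_size, d_model = 50257, 512
h = rng.normal(size=d_model)   # final hidden state of the last token
W_vocab = rng.normal(scale=0.02, size=(d_model, vocab_size))

logits = h @ W_vocab                            # one score per vocabulary entry
logits -= logits.max()                          # numerical stability
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the whole vocab
next_token = int(np.argmax(probs))              # greedy pick of the next token

assert np.isclose(probs.sum(), 1.0)
```

Real decoders usually sample from `probs` (with temperature, top-k, etc.) rather than always taking the argmax.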
The transformer doesn't process tokens one-by-one.
It processes the entire sequence in parallel.
That's why it's so much faster than RNNs.
What the model learns:
All the weight matrices — W_Q, W_K, W_V, W_O in every head, plus the FFN weights. That's it. The architecture is fixed; the weights are learned.
GPT-3: 96 layers, each with 96 attention heads — about 175 billion learned parameters in total.
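A back-of-the-envelope check on that number, assuming GPT-3's embedding width of 12,288 and the standard ~12·d² parameters per layer (4·d² for the attention projections W_Q, W_K, W_V, W_O; 8·d² for the d → 4d → d FFN):

```python
# Rough parameter count for a GPT-3-sized model (ignores embeddings and biases).
d_model = 12288            # GPT-3's embedding width
n_layers = 96
per_layer = 12 * d_model ** 2    # 4*d^2 attention + 8*d^2 FFN
total = n_layers * per_layer
print(f"{total / 1e9:.0f} billion")   # ≈ 174 billion, close to the quoted 175B
```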
8
Attention = every token looks at every other.
Q·K = score: how relevant is token j to token i?
V = content: a weighted blend of all values.
Stack × N: each layer learns something more abstract.

Attention(Q,K,V) = softmax(QKᵀ / √d) · V
What the model actually learns to do with attention:
Early layers: local syntax, adjacent word patterns.
Middle layers: coreference, entity relationships ("it" → "animal").
Late layers: abstract semantic reasoning, task-specific patterns.
Analogy: Attention is like knowing which pages in every book in the library to flip to before answering a question — and doing it for every word, simultaneously, every forward pass.
9