PHASE 2 Deep Networks · Day 19 of 80 · makemore & GPT

Building GPT from Scratch

Assemble token embeddings, positional embeddings, Transformer blocks, LayerNorm, and output projection into a complete GPT.

Building a firm from scratch means assembling research, trading, risk, compliance. Each well-understood individually. The art is in assembly. Today you assemble GPT.— Day 19 Principle

I. The Full GPT

import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)              # final LayerNorm before the head
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        # token + positional embeddings: (B, T, n_embd)
        # arange must live on idx's device or this breaks on GPU
        x = self.token_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))
        x = self.blocks(x)
        logits = self.lm_head(self.ln_f(x))           # (B, T, vocab_size)
        if targets is None:
            loss = None
        else:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

    def generate(self, idx, max_new):
        for _ in range(max_new):
            logits, _ = self(idx[:, -block_size:])    # crop to the last block_size tokens
            probs = F.softmax(logits[:, -1, :], dim=-1)   # next-token distribution
            idx = torch.cat((idx, torch.multinomial(probs, 1)), dim=1)
        return idx
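The sampling step inside generate can be exercised in isolation. A minimal sketch, with a random logits tensor standing in for the model's last-position output (the vocab size of 65 matches the config below):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 65
idx = torch.zeros(1, 3, dtype=torch.long)   # a running (B, T) context of token ids
logits = torch.randn(1, vocab_size)         # stand-in for logits[:, -1, :]
probs = F.softmax(logits, dim=-1)           # (1, vocab_size), sums to 1
next_id = torch.multinomial(probs, 1)       # (1, 1) sampled token id
idx = torch.cat((idx, next_id), dim=1)      # context grows by one token per step
print(idx.shape)  # torch.Size([1, 4])
```

Repeating this loop max_new times is all generation is: sample, append, re-run the model on the (cropped) longer context.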

~10M Parameters

n_embd=384, n_head=6, n_layer=6, block_size=256, vocab=65: ~10.8M params. GPT-2 small: 124M. Same architecture, different scale.
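A back-of-envelope tally reproduces that figure. The layout below assumes Karpathy-style blocks (bias-free per-head Q/K/V projections, a 4× MLP, two LayerNorms per block), which is an assumption about the Block class, not shown here:

```python
n_embd, n_head, n_layer, block_size, vocab = 384, 6, 6, 256, 65

token_emb = vocab * n_embd                                 # 24,960
pos_emb   = block_size * n_embd                            # 98,304
qkv   = 3 * n_embd * n_embd                                # 442,368 (bias-free heads)
proj  = n_embd * n_embd + n_embd                           # 147,840 (output projection)
ffn   = 2 * (n_embd * 4 * n_embd) + 4 * n_embd + n_embd    # 1,181,568 (4x MLP, biases)
lns   = 2 * 2 * n_embd                                     # two LayerNorms per block
block = qkv + proj + ffn + lns                             # 1,773,312 per block
ln_f    = 2 * n_embd
lm_head = n_embd * vocab + vocab

total = token_emb + pos_emb + n_layer * block + ln_f + lm_head
print(total)  # 10788929 ≈ 10.8M
```

Note the blocks dominate: six of them account for ~10.6M of the ~10.8M total, with embeddings and the head contributing the rest.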

IV. The Matrix

Quick · Deep Intuition · 🎯 DO FIRST: Assemble the full GPT. Print the param count. Run a forward-pass test.

Quick · Surface Only · IF TIME: Load the Shakespeare dataset.

Slow · Deep Intuition · 🖐 CAREFULLY: Trace shapes through the full forward pass.

Slow · Surface Only · 🚫 AVOID: Training. Assembly first.
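The shape trace can be done standalone with untrained layers. Dimensions come from the config above; the Transformer blocks are omitted here since they preserve shape, mapping (B, T, n_embd) to (B, T, n_embd):

```python
import torch
import torch.nn as nn

B, T, n_embd, vocab, block_size = 4, 8, 384, 65, 256
idx = torch.randint(vocab, (B, T))                        # (B, T) token ids
tok = nn.Embedding(vocab, n_embd)(idx)                    # (B, T, n_embd)
pos = nn.Embedding(block_size, n_embd)(torch.arange(T))   # (T, n_embd), broadcasts over B
x = tok + pos                                             # (B, T, n_embd)
# ... n_layer Transformer blocks, each keeping (B, T, n_embd) ...
logits = nn.Linear(n_embd, vocab)(nn.LayerNorm(n_embd)(x))  # (B, T, vocab)
print(logits.shape)  # torch.Size([4, 8, 65])
```

The one broadcast worth pausing on: the positional embedding is (T, n_embd) and stretches across the batch dimension when added to the (B, T, n_embd) token embeddings.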

V. Today’s Deliverables

You have built GPT. The actual architecture. Tomorrow you train it on Shakespeare.— Day 19 Closing