PHASE 1 Foundations · Day 9 of 80 · Neural Networks & Backprop

MLP Language Model — Bengio et al. 2003

The paper that launched neural language models: embed the previous context tokens, pass them through a hidden layer, and predict the next one. Bengio et al. worked with words; today's build applies the same architecture to characters, using the 3 previous characters to beat the bigram baseline.

Context is everything. A single data point tells you nothing — you need the trend, the neighborhood, the history. The bigram model sees one character. The MLP sees three. That small increase in context produces a dramatic improvement in prediction quality. This is the central lesson of Bengio 2003, and it foreshadows the entire Transformer revolution: more context = better predictions. — Day 9 Principle, adapted from the Marks framework

I. Architecture — Embeddings + Hidden Layer + Softmax

The MLP language model has three components: (1) an embedding table C that maps each character to a dense vector, (2) a hidden layer with tanh activation, (3) an output layer that produces logits over the 27-character vocabulary. The context window (block size) is 3 characters.

Exhibit A — MLP Language Model Architecture (Bengio 2003)
[Diagram: context (3 chars: char t-3, char t-2, char t-1) → embedding lookup in C[27, emb_dim], concatenated to shape [3 × emb_dim] → hidden layer (W1 @ x + b1, tanh activation) → output layer (W2 @ h + b2, logits [27]) → softmax P(next).]
Total parameters: 27×2 + (6×100 + 100) + (100×27 + 27) = 3481. From 729 (bigram) to 3481 params, NLL drops from 2.45 to ~2.15.
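The parameter count in Exhibit A can be checked with a few lines of arithmetic (a quick sanity sketch, using the same hyperparameters as the code below):

```python
vocab_size, emb_dim, block_size, n_hidden = 27, 2, 3, 100

n_C  = vocab_size * emb_dim                         # embedding table: 27×2 = 54
n_W1 = emb_dim * block_size * n_hidden + n_hidden   # hidden layer: 6×100 + 100 = 700
n_W2 = n_hidden * vocab_size + vocab_size           # output layer: 100×27 + 27 = 2727

total = n_C + n_W1 + n_W2
print(total)  # 3481
```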

II. The Code — Full Implementation

import torch
import torch.nn.functional as F

block_size = 3    # context length
emb_dim = 2       # embedding dimensions
n_hidden = 100    # hidden layer neurons
vocab_size = 27

# Parameters
C  = torch.randn((vocab_size, emb_dim))
W1 = torch.randn((emb_dim * block_size, n_hidden)) * 0.2
b1 = torch.randn(n_hidden) * 0.01
W2 = torch.randn((n_hidden, vocab_size)) * 0.01
b2 = torch.randn(vocab_size) * 0
parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True

# Training loop (Xtr, Ytr: context/target tensors built from the dataset beforehand)
for i in range(200000):
    # Mini-batch
    ix = torch.randint(0, Xtr.shape[0], (32,))
    # Forward
    emb = C[Xtr[ix]]                                  # [32, 3, 2]
    h = torch.tanh(emb.view(-1, emb_dim * block_size) @ W1 + b1)
    logits = h @ W2 + b2                              # [32, 27]
    loss = F.cross_entropy(logits, Ytr[ix])
    # Backward
    for p in parameters:
        p.grad = None
    loss.backward()
    # Update: step decay at 100k iterations
    lr = 0.1 if i < 100000 else 0.01
    for p in parameters:
        p.data += -lr * p.grad
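Once trained, names come from feeding the model its own output one character at a time. A minimal sketch (assumes the trained tensors above and an `itos` index-to-character mapping with index 0 as the end token; both names are this sketch's assumptions, not fixed by the text):

```python
import torch
import torch.nn.functional as F

def sample_name(C, W1, b1, W2, b2, itos, block_size=3):
    """Autoregressively sample one name from the trained MLP."""
    context = [0] * block_size                         # start with all end-tokens
    out = []
    while True:
        emb = C[torch.tensor([context])]               # [1, block_size, emb_dim]
        h = torch.tanh(emb.view(1, -1) @ W1 + b1)      # [1, n_hidden]
        logits = h @ W2 + b2                           # [1, vocab_size]
        probs = F.softmax(logits, dim=1)
        ix = torch.multinomial(probs, num_samples=1).item()
        if ix == 0:                                    # end token terminates the name
            return ''.join(out)
        out.append(itos[ix])
        context = context[1:] + [ix]                   # slide the context window
```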

Embedding = Learned Representation

The embedding table C maps each character to a 2D point. After training, similar characters cluster together. Plot C in 2D and you’ll see vowels group, consonants group, and the special token sits alone. The network discovered phonetic structure through gradient descent alone.
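A sketch of that plot (assumes the trained `C` from above and an `itos` index-to-character mapping; saves to a file so it runs headless, but you can drop the `Agg` backend for interactive use):

```python
import matplotlib
matplotlib.use('Agg')            # headless backend; omit when working interactively
import matplotlib.pyplot as plt
import torch

def plot_embeddings(C, itos, path='embeddings.png'):
    """Scatter each character at its learned 2D embedding coordinates."""
    xy = C.detach()
    plt.figure(figsize=(6, 6))
    plt.scatter(xy[:, 0], xy[:, 1], s=150, alpha=0.3)
    for i in range(xy.shape[0]):
        plt.text(xy[i, 0].item(), xy[i, 1].item(), itos[i],
                 ha='center', va='center')
    plt.grid(True)
    plt.savefig(path)
    plt.close()
```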

F.cross_entropy Combines Three Steps

F.cross_entropy(logits, targets) fuses softmax, log, and NLL into one numerically stable operation. Never compute logits.exp() / sum manually in production code: exp() overflows for large logits. Internally, PyTorch subtracts the maximum logit before exponentiating, which leaves the probabilities unchanged but keeps every intermediate value finite. Always use F.cross_entropy.
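The overflow, and the max-subtraction fix, can be seen with plain Python floats (a sketch of the log-sum-exp trick, not PyTorch's actual implementation):

```python
import math

logits = [1000.0, 999.0, 998.0]

# Naive softmax: math.exp(1000.0) overflows a float64
try:
    naive = [math.exp(l) for l in logits]
except OverflowError:
    naive = None  # overflow: the naive route fails outright

# Stable softmax: subtracting the max shifts every exponent into range
# without changing the resulting probabilities
m = max(logits)
exps = [math.exp(l - m) for l in logits]
total = sum(exps)
probs = [e / total for e in exps]
print(probs)  # roughly [0.665, 0.245, 0.090]
```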

IV. The Matrix — What Matters Today

(2×2 grid; axes: Builds Deep Intuition vs. Surface-Level Only, Quick to Do vs. Slow but Worth It)

🎯 DO FIRST (Quick to Do)
Build the dataset: context window of 3 characters → target. Train the MLP. Achieve NLL < 2.2 on the dev set.

⏭️ DO IF TIME
Visualize the learned 2D embedding C. Plot each character at its learned coordinates. Look for clusters.

🖐 DO CAREFULLY (Slow but Worth It)
Vary emb_dim (2, 10, 30) and n_hidden (50, 100, 300). Track train vs. dev loss. Find the overfitting boundary.

🚫 AVOID TODAY
Using nn.Module or any PyTorch high-level API. Keep the raw tensor math. Abstractions come later.

V. Today’s Deliverables

Bengio 2003 proved that neural networks could model language. The names generated by today's MLP are noticeably better than the bigram's output: more name-like, with better structure. The gap from 2.45 to ~2.15 NLL doesn't sound large, but NLL is a log-scale measure: each 0.1-nat drop multiplies perplexity by e^(-0.1), roughly a 10% reduction, so successive drops compound into dramatically better samples. Tomorrow you fine-tune: embeddings, learning rate schedules, and hyperparameter search. — Day 9 Closing Principle
Day 9 Notebook — MLP Language Model (Bengio 2003) Runnable Python

Full MLP language model: learned embeddings, context windows, mini-batch training, name generation, and loss curves.