PHASE 1 Foundations · Day 9 of 80 · Neural Networks & Backprop
MLP Language Model — Bengio et al. 2003
The paper that launched neural language models: look at 3 previous characters, embed them, pass through a hidden layer, predict the next one. Beat the bigram baseline.
Context is everything. A single data point tells you nothing — you need the trend, the neighborhood,
the history. The bigram model sees one character. The MLP sees three. That small increase in context
produces a dramatic improvement in prediction quality. This is the central lesson of Bengio 2003, and
it foreshadows the entire Transformer revolution: more context = better predictions.
— Day 9 Principle, adapted from the Marks framework
I. Architecture — Embeddings + Hidden Layer + Softmax
The MLP language model has three components: (1) an embedding table C that maps each character
to a dense vector, (2) a hidden layer with tanh activation, (3) an output layer
that produces logits over the 27-character vocabulary. The context window (block size) is 3 characters.
Exhibit A — MLP Language Model Architecture (Bengio 2003)
II. The Code — Full Implementation
import torch
import torch.nn.functional as F

block_size = 3   # context length
emb_dim = 2      # embedding dimensions
n_hidden = 100   # hidden layer neurons
vocab_size = 27

# Parameters
C = torch.randn((vocab_size, emb_dim))
W1 = torch.randn((emb_dim * block_size, n_hidden)) * 0.2
b1 = torch.randn(n_hidden) * 0.01
W2 = torch.randn((n_hidden, vocab_size)) * 0.01
b2 = torch.randn(vocab_size) * 0
parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True

# Training loop (Xtr, Ytr are the training contexts/targets built in the dataset step)
for i in range(200000):
    # Mini-batch
    ix = torch.randint(0, Xtr.shape[0], (32,))
    # Forward
    emb = C[Xtr[ix]]  # [32, 3, 2]
    h = torch.tanh(emb.view(-1, emb_dim * block_size) @ W1 + b1)
    logits = h @ W2 + b2  # [32, 27]
    loss = F.cross_entropy(logits, Ytr[ix])
    # Backward
    for p in parameters:
        p.grad = None
    loss.backward()
    # Update
    lr = 0.1 if i < 100000 else 0.01
    for p in parameters:
        p.data += -lr * p.grad
III. Key Insights — Embeddings & Cross-Entropy
Embedding = Learned Representation
The embedding table C maps each character to a 2D point. After training, similar characters cluster together. Plot C in 2D and you’ll see vowels group, consonants group, and the special token sits alone. The network discovered phonetic structure through gradient descent alone.
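A minimal plotting sketch of this idea. The trained C and the itos index→character map come from today's notebook; here they are stubbed with random stand-ins so the snippet runs on its own (so the points will not cluster until you substitute the trained table):

```python
import torch
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Stand-ins for the trained artifacts: a random [27, 2] embedding table
# and an index->character map ('.' end token plus 'a'..'z').
C = torch.randn((27, 2))
itos = {i: ch for i, ch in enumerate('.' + 'abcdefghijklmnopqrstuvwxyz')}

plt.figure(figsize=(6, 6))
plt.scatter(C[:, 0], C[:, 1], s=200)
for i in range(C.shape[0]):
    # Label each point with its character to spot vowel/consonant clusters
    plt.text(C[i, 0].item(), C[i, 1].item(), itos[i],
             ha='center', va='center', color='white')
plt.grid(True)
plt.savefig('embeddings.png')
```

With the trained table swapped in, the vowel cluster is usually the first structure you can spot by eye.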
F.cross_entropy Combines Three Steps
F.cross_entropy(logits, targets) computes softmax + log + NLL in one numerically stable operation. Never manually compute logits.exp() / sum in production code — it overflows for large logits. Always use F.cross_entropy.
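A quick sketch of the equivalence and of the overflow failure mode (random logits and targets, mean reduction, which is F.cross_entropy's default):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 27)
targets = torch.randint(0, 27, (32,))

# Manual pipeline: softmax -> log -> negative log-likelihood
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
manual = -probs[torch.arange(32), targets].log().mean()

fused = F.cross_entropy(logits, targets)
print(torch.allclose(manual, fused, atol=1e-6))  # True: same value for moderate logits

# The naive version overflows once logits get large; the fused op does not,
# because it subtracts the per-row max before exponentiating.
big = logits + 1000
print(torch.isinf(big.exp()).any().item())                    # True: exp() blows up to inf
print(torch.isfinite(F.cross_entropy(big, targets)).item())   # True: still finite
```

Note that shifting every logit by the same constant leaves the softmax unchanged, which is exactly why the internal max-subtraction trick is safe.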
IV. The Matrix — What Matters Today
Rows: Quick to Do / Slow but Worth It · Columns: Builds Deep Intuition / Surface-Level Only
🎯 DO FIRST (quick · deep intuition)
Build the dataset: context window of 3 characters → target. Train the MLP. Achieve NLL < 2.2 on dev set.
⏭️ DO IF TIME (quick · surface-level)
Visualize the learned 2D embedding C. Plot each character at its learned coordinates. Look for clusters.
🖐 DO CAREFULLY (slow · deep intuition)
Vary emb_dim (2, 10, 30) and n_hidden (50, 100, 300). Track train vs. dev loss. Find the overfitting boundary.
🚫 AVOID TODAY (slow · surface-level)
Using nn.Module or any PyTorch high-level API. Keep the raw tensor math. Abstractions come later.
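The dataset-building step above can be sketched as follows. The words list here is a hypothetical three-name stand-in for the real names file (so the vocabulary below is smaller than the full 27); the sliding-window logic is the part that carries over:

```python
import torch

# Toy word list standing in for the real names dataset
words = ['emma', 'olivia', 'ava']
chars = sorted(set(''.join(words)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi['.'] = 0  # '.' marks both start-of-name padding and end-of-name
itos = {i: ch for ch, i in stoi.items()}

block_size = 3
X, Y = [], []
for w in words:
    context = [0] * block_size          # start with '...' padding
    for ch in w + '.':
        ix = stoi[ch]
        X.append(context)               # 3-character context window
        Y.append(ix)                    # next character to predict
        context = context[1:] + [ix]    # slide the window by one
X = torch.tensor(X)
Y = torch.tensor(Y)
print(X.shape, Y.shape)  # torch.Size([16, 3]) torch.Size([16])
```

Each word of length n contributes n + 1 examples, because the final target is the end token '.' itself.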
V. Today’s Deliverables
Dataset: Build (X, Y) pairs with block_size=3 context windows
Embedding: Learn C[27, emb_dim] via gradient descent
MLP: Hidden layer with tanh, output layer with cross_entropy loss
Mini-batching: Sample random 32-element batches per step
Beat bigram: Achieve dev NLL < 2.2 (bigram baseline: 2.45)
Sample names: Generate 20 names from the trained MLP model
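For the sampling deliverable, one possible sketch: feed the model its own output one character at a time until it emits the end token. The parameters below are untrained random stand-ins (same shapes and init scales as the training code), so this version prints noise; run it with the trained C, W1, b1, W2, b2 to get name-like strings:

```python
import torch

block_size, emb_dim, n_hidden, vocab_size = 3, 2, 100, 27
g = torch.Generator().manual_seed(42)
# Untrained stand-ins for the trained parameters
C = torch.randn((vocab_size, emb_dim), generator=g)
W1 = torch.randn((emb_dim * block_size, n_hidden), generator=g) * 0.2
b1 = torch.randn(n_hidden, generator=g) * 0.01
W2 = torch.randn((n_hidden, vocab_size), generator=g) * 0.01
b2 = torch.randn(vocab_size, generator=g) * 0
itos = {i: ch for i, ch in enumerate('.' + 'abcdefghijklmnopqrstuvwxyz')}

names = []
for _ in range(20):
    context = [0] * block_size  # start from '...' padding
    out = []
    while True:
        emb = C[torch.tensor([context])]               # [1, 3, 2]
        h = torch.tanh(emb.view(1, -1) @ W1 + b1)      # same forward pass as training
        logits = h @ W2 + b2
        probs = torch.softmax(logits, dim=1)
        ix = torch.multinomial(probs, num_samples=1, generator=g).item()
        context = context[1:] + [ix]                   # slide the context window
        if ix == 0:                                    # index 0 ('.') ends the name
            break
        out.append(itos[ix])
    names.append(''.join(out))
print(names)
```

The forward pass is identical to training; the only new piece is torch.multinomial, which samples from the softmax distribution instead of taking an argmax.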
Bengio 2003 proved that neural networks could model language. The names generated by today's MLP are
noticeably better than the bigram's output — more name-like, with better structure. The gap from 2.45 to
~2.15 NLL doesn't sound large, but NLL is a log-scale metric: each 0.1 drop cuts perplexity by a factor of
e^0.1 ≈ 1.11, a compounding improvement in prediction quality. Tomorrow you fine-tune: embeddings,
learning rate schedules, and hyperparameter search.
— Day 9 Closing Principle
Day 9 Notebook — MLP Language Model (Bengio 2003) · Runnable Python
Full MLP language model: learned embeddings, context windows, mini-batch training, name generation, and loss curves.