PHASE 1 Foundations · Day 9 of 80 · Neural Networks & Backprop

MLP Language Model — Bengio et al. 2003

The paper that launched neural language models: embed the previous context tokens, pass them through a hidden layer, and predict the next one. Bengio et al. worked with words; today's build applies the same architecture to characters, using the 3 previous characters to beat the bigram baseline.

Context is everything. A single data point tells you nothing — you need the trend, the neighborhood, the history. The bigram model sees one character. The MLP sees three. That small increase in context produces a dramatic improvement in prediction quality. This is the central lesson of Bengio 2003, and it foreshadows the entire Transformer revolution: more context = better predictions. — Day 9 Principle, adapted from the Marks framework

I. Architecture — Embeddings + Hidden Layer + Softmax

The MLP language model has three components: (1) an embedding table C that maps each character to a dense vector, (2) a hidden layer with tanh activation, (3) an output layer that produces logits over the 27-character vocabulary. The context window (block size) is 3 characters.

Exhibit A — MLP Language Model Architecture (Bengio 2003)
[Diagram: context (3 chars: char t-3, char t-2, char t-1) → embedding lookup in C[27, emb_dim], concatenated to shape [3 × emb_dim] → hidden layer (W1 @ x + b1, tanh activation) → output layer (W2 @ h + b2, logits [27]) → softmax P(next).]
Total parameters: 27×2 + (6×100 + 100) + (100×27 + 27) = 3481. From 729 (bigram) to 3481 params, NLL drops from 2.45 to ~2.15.
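The parameter count in Exhibit A can be checked with a few lines of arithmetic (a quick sanity sketch, using the same hyperparameters as the code below):

```python
vocab_size, emb_dim, block_size, n_hidden = 27, 2, 3, 100

n_C  = vocab_size * emb_dim                         # embedding table: 27×2 = 54
n_W1 = emb_dim * block_size * n_hidden + n_hidden   # hidden layer: 6×100 + 100 = 700
n_W2 = n_hidden * vocab_size + vocab_size           # output layer: 100×27 + 27 = 2727

total = n_C + n_W1 + n_W2
print(total)  # 3481
```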

II. The Code — Full Implementation

import torch
import torch.nn.functional as F

block_size = 3    # context length
emb_dim = 2       # embedding dimensions
n_hidden = 100    # hidden layer neurons
vocab_size = 27

# Parameters
C  = torch.randn((vocab_size, emb_dim))
W1 = torch.randn((emb_dim * block_size, n_hidden)) * 0.2
b1 = torch.randn(n_hidden) * 0.01
W2 = torch.randn((n_hidden, vocab_size)) * 0.01
b2 = torch.randn(vocab_size) * 0
parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True

# Training loop (Xtr, Ytr: context/target tensors built from the dataset beforehand)
for i in range(200000):
    # Mini-batch
    ix = torch.randint(0, Xtr.shape[0], (32,))
    # Forward
    emb = C[Xtr[ix]]                                  # [32, 3, 2]
    h = torch.tanh(emb.view(-1, emb_dim * block_size) @ W1 + b1)
    logits = h @ W2 + b2                              # [32, 27]
    loss = F.cross_entropy(logits, Ytr[ix])
    # Backward
    for p in parameters:
        p.grad = None
    loss.backward()
    # Update: step decay at 100k iterations
    lr = 0.1 if i < 100000 else 0.01
    for p in parameters:
        p.data += -lr * p.grad
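Once trained, names come from feeding the model its own output one character at a time. A minimal sketch (assumes the trained tensors above and an `itos` index-to-character mapping with index 0 as the end token; both names are this sketch's assumptions, not fixed by the text):

```python
import torch
import torch.nn.functional as F

def sample_name(C, W1, b1, W2, b2, itos, block_size=3):
    """Autoregressively sample one name from the trained MLP."""
    context = [0] * block_size                         # start with all end-tokens
    out = []
    while True:
        emb = C[torch.tensor([context])]               # [1, block_size, emb_dim]
        h = torch.tanh(emb.view(1, -1) @ W1 + b1)      # [1, n_hidden]
        logits = h @ W2 + b2                           # [1, vocab_size]
        probs = F.softmax(logits, dim=1)
        ix = torch.multinomial(probs, num_samples=1).item()
        if ix == 0:                                    # end token terminates the name
            return ''.join(out)
        out.append(itos[ix])
        context = context[1:] + [ix]                   # slide the context window
```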

Embedding = Learned Representation

The embedding table C maps each character to a 2D point. After training, similar characters cluster together. Plot C in 2D and you’ll see vowels group, consonants group, and the special token sits alone. The network discovered phonetic structure through gradient descent alone.
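A sketch of that plot (assumes the trained `C` from above and an `itos` index-to-character mapping; saves to a file so it runs headless, but you can drop the `Agg` backend for interactive use):

```python
import matplotlib
matplotlib.use('Agg')            # headless backend; omit when working interactively
import matplotlib.pyplot as plt
import torch

def plot_embeddings(C, itos, path='embeddings.png'):
    """Scatter each character at its learned 2D embedding coordinates."""
    xy = C.detach()
    plt.figure(figsize=(6, 6))
    plt.scatter(xy[:, 0], xy[:, 1], s=150, alpha=0.3)
    for i in range(xy.shape[0]):
        plt.text(xy[i, 0].item(), xy[i, 1].item(), itos[i],
                 ha='center', va='center')
    plt.grid(True)
    plt.savefig(path)
    plt.close()
```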

F.cross_entropy Combines Three Steps

F.cross_entropy(logits, targets) fuses softmax, log, and NLL into one numerically stable operation. Never compute logits.exp() / sum manually in production code: exp() overflows for large logits. Internally, PyTorch subtracts the maximum logit before exponentiating, which leaves the probabilities unchanged but keeps every intermediate value finite. Always use F.cross_entropy.
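The overflow, and the max-subtraction fix, can be seen with plain Python floats (a sketch of the log-sum-exp trick, not PyTorch's actual implementation):

```python
import math

logits = [1000.0, 999.0, 998.0]

# Naive softmax: math.exp(1000.0) overflows a float64
try:
    naive = [math.exp(l) for l in logits]
except OverflowError:
    naive = None  # overflow: the naive route fails outright

# Stable softmax: subtracting the max shifts every exponent into range
# without changing the resulting probabilities
m = max(logits)
exps = [math.exp(l - m) for l in logits]
total = sum(exps)
probs = [e / total for e in exps]
print(probs)  # roughly [0.665, 0.245, 0.090]
```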

IV. The Matrix — What Matters Today

(2×2 grid; axes: Builds Deep Intuition vs. Surface-Level Only, Quick to Do vs. Slow but Worth It)

🎯 DO FIRST (Quick to Do)
Build the dataset: context window of 3 characters → target. Train the MLP. Achieve NLL < 2.2 on the dev set.

⏭️ DO IF TIME
Visualize the learned 2D embedding C. Plot each character at its learned coordinates. Look for clusters.

🖐 DO CAREFULLY (Slow but Worth It)
Vary emb_dim (2, 10, 30) and n_hidden (50, 100, 300). Track train vs. dev loss. Find the overfitting boundary.

🚫 AVOID TODAY
Using nn.Module or any PyTorch high-level API. Keep the raw tensor math. Abstractions come later.

V. Today’s Deliverables

Bengio 2003 proved that neural networks could model language. The names generated by today's MLP are noticeably better than the bigram's output: more name-like, with better structure. The gap from 2.45 to ~2.15 NLL doesn't sound large, but NLL is a log-scale measure: each 0.1-nat drop multiplies perplexity by e^(-0.1), roughly a 10% reduction, so successive drops compound into dramatically better samples. Tomorrow you fine-tune: embeddings, learning rate schedules, and hyperparameter search. — Day 9 Closing Principle
Day 9 Notebook — MLP Language Model (Bengio 2003) Runnable Python

Full MLP language model: learned embeddings, context windows, mini-batch training, name generation, and loss curves.