PHASE 1 Foundations · Day 10 of 80 · Neural Networks & Backprop

Embeddings, Learning Rate Schedules & Hyperparameters

Squeeze the last drop of performance from the MLP: tune embedding dimension, hidden size, learning rate decay, and block size. Phase 1 finale.

The difference between a good fund and a great fund is not the strategy — it is the calibration. Position sizing, rebalancing frequency, risk limits. In neural networks, the equivalent calibrations are hyperparameters: embedding dimension, hidden layer size, learning rate schedule, context length. Today you learn to tune them systematically. — Day 10 Principle, adapted from the Marks framework

I. Learning Rate Finder — The Log-Space Sweep

Before training, sweep the learning rate from 10⁻³ to 10⁰ on a log scale. Plot loss vs. learning rate. The optimal initial LR is just before the loss starts increasing — typically where the curve is steepest downward. This 30-second experiment saves hours of wasted training.

Exhibit A — Learning Rate Finder: Loss vs. LR (Log Scale)
[Chart: loss vs. learning rate on a log scale — loss falls as LR rises from 10⁻³, bottoms out at the sweet spot near lr ≈ 0.1, then diverges. Left region: too slow; right region: divergence.]
# Learning rate finder: sweep exponents -3..0, record the loss at each lr
lre = torch.linspace(-3, 0, 1000)
lrs = 10 ** lre
lri, lossi = [], []
for i in range(1000):
    # ... forward pass, loss computation ...
    for p in parameters:
        p.grad = None          # zero grads before each backward pass
    loss.backward()
    lr = lrs[i]
    for p in parameters:
        p.data += -lr * p.grad
    lri.append(lre[i])
    lossi.append(loss.item())
# Plot: plt.plot(lri, lossi) — pick the lr where the curve is steepest downward
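The "pick where the curve is steepest" step can also be automated. A minimal sketch with NumPy — the smoothing window and the toy loss curve below are illustrative assumptions, not part of the lesson's code:

```python
import numpy as np

def pick_lr(lre, lossi, smooth=50):
    """Return the learning rate where the (smoothed) loss curve
    descends most steeply. lre: swept exponents; lossi: losses."""
    # Moving-average smoothing -- raw per-step losses are noisy
    sm = np.convolve(lossi, np.ones(smooth) / smooth, mode="valid")
    lre = np.asarray(lre)[: len(sm)]
    slopes = np.diff(sm) / np.diff(lre)   # d(loss) / d(log10 lr)
    return 10 ** lre[np.argmin(slopes)]   # most negative slope

# Toy sweep: loss falls along a cosine ramp, steepest near 10^-1.5
lre = np.linspace(-3, 0, 1000)
lossi = np.cos(np.pi * (lre + 3) / 3)
best = pick_lr(lre, lossi)
```

On a real sweep you would pass in the `lri` and `lossi` lists recorded above; the smoothing matters there because per-batch losses jump around.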

II. Hyperparameter Sweep — Embedding & Hidden Dimensions

The key hyperparameters for the MLP language model are: embedding dimension (how many features per character), hidden layer size (capacity of the nonlinear transform), block size (how many previous characters to look at), and batch size (samples per gradient step).
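The parameter counts these knobs imply can be checked by hand. A sketch assuming the Bengio-style MLP architecture and a 27-character vocabulary (26 letters plus a terminator token, as in the names dataset) — both are assumptions about the setup, not stated in this section:

```python
def mlp_param_count(vocab=27, emb_dim=2, hidden=100, block=3):
    """Parameter count for a Bengio-style MLP:
    embedding table + hidden layer (W1, b1) + output layer (W2, b2)."""
    C  = vocab * emb_dim            # embedding table
    W1 = block * emb_dim * hidden   # concatenated embeddings -> hidden
    b1 = hidden
    W2 = hidden * vocab             # hidden -> logits
    b2 = vocab
    return C + W1 + b1 + W2 + b2

baseline = mlp_param_count(emb_dim=2, hidden=100, block=3)   # ~3.5K
wider    = mlp_param_count(emb_dim=10, hidden=200, block=3)  # ~11.7K
longer   = mlp_param_count(emb_dim=10, hidden=200, block=8)  # ~22K
```

These totals line up with the sweep table below: the hidden-layer weight matrix `block * emb_dim * hidden` dominates, which is why growing the context multiplies the parameter count.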

Config           emb_dim   hidden   block   params   dev NLL
Baseline               2      100       3     3.5K      2.17
Wider embed           10      200       3    11.7K      2.08
Longer context        10      200       8      22K      2.03
Overfitting           30      500       8     140K      2.10 ↑

The Overfitting Signal

When train loss keeps dropping but dev loss starts rising, the model is memorizing training data rather than learning general patterns. The gap between train and dev loss is the overfitting signal. The “best” model is the one with the lowest dev loss, not the lowest train loss.
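The selection rule is mechanical enough to sketch in a few lines — pick the checkpoint with the lowest dev loss and measure the train/dev gap there. The toy loss curves below are made up for illustration:

```python
def select_best(train_losses, dev_losses):
    """Return the index of the checkpoint with the lowest dev loss,
    plus the train/dev gap there -- the overfitting signal."""
    best = min(range(len(dev_losses)), key=lambda i: dev_losses[i])
    gap = dev_losses[best] - train_losses[best]
    return best, gap

# Toy curves: train keeps falling, dev bottoms out and then rises
train = [2.5, 2.2, 2.0, 1.8, 1.6, 1.4]
dev   = [2.6, 2.3, 2.1, 2.05, 2.10, 2.20]
step, gap = select_best(train, dev)
# step == 3: dev loss bottoms out even though train keeps improving
```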

III. Learning Rate Decay — Start Fast, Finish Precise

The standard pattern: start with a high learning rate (0.1) for fast initial learning, then decay to a lower rate (0.01) for fine convergence. In practice, this is implemented as a simple step schedule or the more sophisticated cosine annealing that modern LLMs use.

# Step decay: high lr for the first 100k steps, low lr for the rest
for i in range(200000):
    # ... forward + backward ...
    lr = 0.1 if i < 100000 else 0.01
    for p in parameters:
        p.data += -lr * p.grad

# The pattern in modern LLMs: cosine annealing with warmup
# lr = max_lr * 0.5 * (1 + cos(pi * step / total_steps))
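The cosine-with-warmup pattern mentioned in the comment can be written as a standalone schedule function. A hedged sketch — the linear warmup length and the `min_lr` floor are illustrative choices, not values from the lesson:

```python
import math

def lr_at(step, total_steps, max_lr=0.1, min_lr=0.01, warmup=1000):
    """Learning rate at a given step: linear warmup to max_lr,
    then cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup          # linear ramp-up
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Usage: replace the `lr = 0.1 if i < 100000 else 0.01` line above with `lr = lr_at(i, 200000)` to get a smooth decay instead of one abrupt step.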

IV. The Matrix — What Matters Today

Axes: Quick to Do vs. Slow but Worth It · Builds Deep Intuition vs. Surface-Level Only.

🎯 DO FIRST (quick, builds deep intuition): Run the LR finder. Find the optimal initial learning rate. Implement step decay. Beat your Day 9 result.

⏭️ DO IF TIME (quick, surface-level only): Try block_size = 4, 5, 8. More context helps — but at some point the hidden layer becomes the bottleneck.

🖐 DO CAREFULLY (slow but worth it, builds deep intuition): Sweep emb_dim × hidden_size. Track train vs. dev loss for each. Find the configuration with the lowest dev loss.

🚫 AVOID TODAY: Automated hyperparameter search (Optuna, Ray Tune). Do it manually — build intuition for how each knob affects the model.

V. Today’s Deliverables

Phase 1 is complete. You built an autograd engine, a neuron, an MLP, a bigram model, and a neural language model — all from scratch. You understand backprop, gradient descent, softmax, cross-entropy, embeddings, and data splits at a level that most practitioners never reach. Phase 2 begins with the hard parts: activations, batch normalization, and deep network training. The foundation you’ve built will hold. — Day 10 Closing Principle · End of Phase 1
Day 10 Notebook — Embeddings, LR Schedules & Hyperparameters Runnable Python

Learning rate finder, step decay schedule, embedding visualization, and hyperparameter summary.