I. Learning Rate Finder — The Log-Space Sweep
Before training, sweep the learning rate from 10⁻³ to 10⁰ on a log scale. Plot loss vs. learning rate. The optimal initial LR is just before the loss starts increasing, typically where the curve is steepest downward. This 30-second experiment saves hours of wasted training.
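A minimal sketch of the sweep, using a toy quadratic loss as a stand-in for the model's NLL (the `quad` function and the 20-step budget are illustrative assumptions, not part of the original recipe):

```python
import numpy as np

def lr_finder(loss_and_grad, w0, lrs, steps=20):
    """Run a few SGD steps from the same init at each candidate LR
    and record the final loss for each."""
    finals = []
    for lr in lrs:
        w = w0.copy()
        for _ in range(steps):
            _, g = loss_and_grad(w)
            w = w - lr * g
        finals.append(loss_and_grad(w)[0])
    return np.array(finals)

# Toy stand-in for the model: loss = 0.5*||w||^2, gradient = w.
def quad(w):
    return 0.5 * np.sum(w * w), w

lrs = np.logspace(-3, 1, 20)        # 1e-3 .. 1e1 on a log scale
finals = lr_finder(quad, np.ones(5), lrs)
best = lrs[np.argmin(finals)]       # sweet spot: just before loss blows up
```

In practice you would plot `finals` against `lrs` on a log x-axis and read off the elbow; the largest LRs visibly diverge, which is exactly the "loss starts increasing" signal.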
II. Hyperparameter Sweep — Embedding & Hidden Dimensions
The key hyperparameters for the MLP language model are: embedding dimension (how many features per character), hidden layer size (capacity of the nonlinear transform), block size (how many previous characters to look at), and batch size (samples per gradient step).
| Config | emb_dim | hidden | block | params | dev NLL |
|---|---|---|---|---|---|
| Baseline | 2 | 100 | 3 | 3.5K | 2.17 |
| Wider embed | 10 | 200 | 3 | 11.7K | 2.08 |
| Longer context | 10 | 200 | 8 | 22K | 2.03 |
| Overfitting | 30 | 500 | 8 | 140K | 2.10 ↑ |
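A sweep like the table above can be driven by a simple nested loop. Here `train_and_eval` is a hypothetical stand-in for your actual training loop, assumed to return `(train_nll, dev_nll)` for a given configuration:

```python
from itertools import product

def sweep(train_and_eval, emb_dims=(2, 10, 30), hiddens=(100, 200, 500)):
    """Train one model per (emb_dim, hidden) pair and return the
    configuration with the lowest *dev* loss, not train loss."""
    results = []
    for emb_dim, hidden in product(emb_dims, hiddens):
        train_nll, dev_nll = train_and_eval(emb_dim, hidden)
        results.append((emb_dim, hidden, train_nll, dev_nll))
    return min(results, key=lambda r: r[3])   # select on dev NLL
```

Selecting on dev loss is the whole point: the biggest config will usually win on train loss while losing on dev, as the last table row shows.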
The Overfitting Signal
When train loss keeps dropping but dev loss starts rising, the model is memorizing training data rather than learning general patterns. The gap between train and dev loss is the overfitting signal. The “best” model is the one with the lowest dev loss, not the lowest train loss.
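One way to automate this check is to watch the two loss curves during training and flag the point where dev loss has been rising while train loss keeps falling. A sketch (the `patience` parameter is an assumption, not from the original):

```python
def detect_overfit(train_losses, dev_losses, patience=3):
    """Return the first eval step at which dev loss has risen for
    `patience` consecutive evals while train loss kept falling,
    or None if no such point exists."""
    rises = 0
    for t in range(1, len(dev_losses)):
        if dev_losses[t] > dev_losses[t - 1] and train_losses[t] < train_losses[t - 1]:
            rises += 1
            if rises >= patience:
                return t
        else:
            rises = 0
    return None
```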
III. Learning Rate Decay — Start Fast, Finish Precise
The standard pattern: start with a high learning rate (0.1) for fast initial learning, then decay to a lower rate (0.01) for precise final convergence. In practice this is implemented as a simple step schedule, or as the more sophisticated cosine annealing that modern LLMs use.
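Both schedules can be sketched in a few lines; the 0.1 → 0.01 values match the text, and the midpoint switch for step decay is the simplest possible choice:

```python
import math

def step_decay(step, total_steps, lr_hi=0.1, lr_lo=0.01):
    """Step schedule: high LR for the first half, low LR after."""
    return lr_hi if step < total_steps // 2 else lr_lo

def cosine_anneal(step, total_steps, lr_max=0.1, lr_min=0.0):
    """Cosine annealing: smooth decay from lr_max down to lr_min."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))
```

Inside the training loop you would call the schedule each step, e.g. `lr = step_decay(i, max_steps)` before the parameter update.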
IV. The Matrix — What Matters Today
DO FIRST
Run the LR finder. Find the optimal initial learning rate. Implement step decay. Beat your Day 9 result.
DO IF TIME
Try block_size = 4, 5, 8. More context helps — but at some point the hidden layer becomes the bottleneck.
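Changing `block_size` only touches the dataset construction: each example is the previous `block_size` character indices, padded with a start token. A makemore-style sketch, assuming a `stoi` mapping where `'.'` is token 0:

```python
def build_dataset(words, stoi, block_size):
    """Sliding-window contexts: predict each character from the
    block_size characters before it ('.' = 0 pads the start)."""
    X, Y = [], []
    for w in words:
        context = [0] * block_size
        for ch in w + '.':
            X.append(list(context))
            Y.append(stoi[ch])
            context = context[1:] + [stoi[ch]]   # slide the window
    return X, Y
```

Rerunning this with `block_size` of 4, 5, and 8 regenerates the datasets for the comparison; everything downstream of the embedding lookup stays the same.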
DO CAREFULLY
Sweep emb_dim × hidden_size. Track train vs. dev loss for each. Find the configuration with lowest dev loss.
AVOID TODAY
Automated hyperparameter search (Optuna, Ray Tune). Do it manually — build intuition for how each knob affects the model.
V. Today’s Deliverables
- LR finder: Sweep 10⁻³ to 10⁰, plot loss vs. LR, identify the sweet spot
- LR schedule: Implement step decay (0.1 → 0.01 at midpoint)
- Hyperparameter sweep: Test at least 4 configurations of emb_dim × hidden_size
- Best model: Achieve the lowest possible dev NLL (<2.10 target)
- Overfitting check: Identify one configuration where dev loss > train loss by >0.05
- Phase 1 summary: Log all NLL baselines: bigram (2.45), MLP baseline (2.17), tuned MLP (your best)
Artifacts to produce: learning rate finder, step decay schedule, embedding visualization, and hyperparameter summary.