PHASE 1 Foundations · Day 10 of 80 · Neural Networks & Backprop

Embeddings, Learning Rate Schedules & Hyperparameters

Squeeze the last drop of performance from the MLP: tune embedding dimension, hidden size, learning rate decay, and block size. Phase 1 finale.

The difference between a good fund and a great fund is not the strategy — it is the calibration. Position sizing, rebalancing frequency, risk limits. In neural networks, the equivalent calibrations are hyperparameters: embedding dimension, hidden layer size, learning rate schedule, context length. Today you learn to tune them systematically. — Day 10 Principle, adapted from the Marks framework

I. Learning Rate Finder — The Log-Space Sweep

Before training, sweep the learning rate from 10⁻³ to 10⁰ on a log scale. Plot loss vs. learning rate. The optimal initial LR is just before the loss starts increasing — typically where the curve is steepest downward. This 30-second experiment saves hours of wasted training.

Exhibit A — Learning Rate Finder: Loss vs. LR (Log Scale)
[Chart: loss vs. learning rate on a log scale — loss falls as LR rises from 10⁻³, bottoms out at the sweet spot near lr ≈ 0.1, then diverges. Left region: too slow; right region: divergence.]
# Learning rate finder: sweep exponents -3..0, record the loss at each lr
lre = torch.linspace(-3, 0, 1000)
lrs = 10 ** lre
lri, lossi = [], []
for i in range(1000):
    # ... forward pass, loss computation ...
    for p in parameters:
        p.grad = None          # zero grads before each backward pass
    loss.backward()
    lr = lrs[i]
    for p in parameters:
        p.data += -lr * p.grad
    lri.append(lre[i])
    lossi.append(loss.item())
# Plot: plt.plot(lri, lossi) — pick the lr where the curve is steepest downward
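The "pick where the curve is steepest" step can also be automated. A minimal sketch with NumPy — the smoothing window and the toy loss curve below are illustrative assumptions, not part of the lesson's code:

```python
import numpy as np

def pick_lr(lre, lossi, smooth=50):
    """Return the learning rate where the (smoothed) loss curve
    descends most steeply. lre: swept exponents; lossi: losses."""
    # Moving-average smoothing -- raw per-step losses are noisy
    sm = np.convolve(lossi, np.ones(smooth) / smooth, mode="valid")
    lre = np.asarray(lre)[: len(sm)]
    slopes = np.diff(sm) / np.diff(lre)   # d(loss) / d(log10 lr)
    return 10 ** lre[np.argmin(slopes)]   # most negative slope

# Toy sweep: loss falls along a cosine ramp, steepest near 10^-1.5
lre = np.linspace(-3, 0, 1000)
lossi = np.cos(np.pi * (lre + 3) / 3)
best = pick_lr(lre, lossi)
```

On a real sweep you would pass in the `lri` and `lossi` lists recorded above; the smoothing matters there because per-batch losses jump around.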

II. Hyperparameter Sweep — Embedding & Hidden Dimensions

The key hyperparameters for the MLP language model are: embedding dimension (how many features per character), hidden layer size (capacity of the nonlinear transform), block size (how many previous characters to look at), and batch size (samples per gradient step).
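The parameter counts these knobs imply can be checked by hand. A sketch assuming the Bengio-style MLP architecture and a 27-character vocabulary (26 letters plus a terminator token, as in the names dataset) — both are assumptions about the setup, not stated in this section:

```python
def mlp_param_count(vocab=27, emb_dim=2, hidden=100, block=3):
    """Parameter count for a Bengio-style MLP:
    embedding table + hidden layer (W1, b1) + output layer (W2, b2)."""
    C  = vocab * emb_dim            # embedding table
    W1 = block * emb_dim * hidden   # concatenated embeddings -> hidden
    b1 = hidden
    W2 = hidden * vocab             # hidden -> logits
    b2 = vocab
    return C + W1 + b1 + W2 + b2

baseline = mlp_param_count(emb_dim=2, hidden=100, block=3)   # ~3.5K
wider    = mlp_param_count(emb_dim=10, hidden=200, block=3)  # ~11.7K
longer   = mlp_param_count(emb_dim=10, hidden=200, block=8)  # ~22K
```

These totals line up with the sweep table below: the hidden-layer weight matrix `block * emb_dim * hidden` dominates, which is why growing the context multiplies the parameter count.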

Config           emb_dim   hidden   block   params   dev NLL
Baseline               2      100       3     3.5K      2.17
Wider embed           10      200       3    11.7K      2.08
Longer context        10      200       8      22K      2.03
Overfitting           30      500       8     140K      2.10 ↑

The Overfitting Signal

When train loss keeps dropping but dev loss starts rising, the model is memorizing training data rather than learning general patterns. The gap between train and dev loss is the overfitting signal. The “best” model is the one with the lowest dev loss, not the lowest train loss.
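The selection rule is mechanical enough to sketch in a few lines — pick the checkpoint with the lowest dev loss and measure the train/dev gap there. The toy loss curves below are made up for illustration:

```python
def select_best(train_losses, dev_losses):
    """Return the index of the checkpoint with the lowest dev loss,
    plus the train/dev gap there -- the overfitting signal."""
    best = min(range(len(dev_losses)), key=lambda i: dev_losses[i])
    gap = dev_losses[best] - train_losses[best]
    return best, gap

# Toy curves: train keeps falling, dev bottoms out and then rises
train = [2.5, 2.2, 2.0, 1.8, 1.6, 1.4]
dev   = [2.6, 2.3, 2.1, 2.05, 2.10, 2.20]
step, gap = select_best(train, dev)
# step == 3: dev loss bottoms out even though train keeps improving
```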

III. Learning Rate Decay — Start Fast, Finish Precise

The standard pattern: start with a high learning rate (0.1) for fast initial learning, then decay to a lower rate (0.01) for fine convergence. In practice, this is implemented as a simple step schedule or the more sophisticated cosine annealing that modern LLMs use.

# Step decay: high lr for the first 100k steps, low lr for the rest
for i in range(200000):
    # ... forward + backward ...
    lr = 0.1 if i < 100000 else 0.01
    for p in parameters:
        p.data += -lr * p.grad

# The pattern in modern LLMs: cosine annealing with warmup
# lr = max_lr * 0.5 * (1 + cos(pi * step / total_steps))
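The cosine-with-warmup pattern mentioned in the comment can be written as a standalone schedule function. A hedged sketch — the linear warmup length and the `min_lr` floor are illustrative choices, not values from the lesson:

```python
import math

def lr_at(step, total_steps, max_lr=0.1, min_lr=0.01, warmup=1000):
    """Learning rate at a given step: linear warmup to max_lr,
    then cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup          # linear ramp-up
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Usage: replace the `lr = 0.1 if i < 100000 else 0.01` line above with `lr = lr_at(i, 200000)` to get a smooth decay instead of one abrupt step.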

IV. The Matrix — What Matters Today

Axes: Quick to Do vs. Slow but Worth It · Builds Deep Intuition vs. Surface-Level Only.

🎯 DO FIRST (quick, builds deep intuition): Run the LR finder. Find the optimal initial learning rate. Implement step decay. Beat your Day 9 result.

⏭️ DO IF TIME (quick, surface-level only): Try block_size = 4, 5, 8. More context helps — but at some point the hidden layer becomes the bottleneck.

🖐 DO CAREFULLY (slow but worth it, builds deep intuition): Sweep emb_dim × hidden_size. Track train vs. dev loss for each. Find the configuration with the lowest dev loss.

🚫 AVOID TODAY: Automated hyperparameter search (Optuna, Ray Tune). Do it manually — build intuition for how each knob affects the model.

V. Today’s Deliverables

Phase 1 is complete. You built an autograd engine, a neuron, an MLP, a bigram model, and a neural language model — all from scratch. You understand backprop, gradient descent, softmax, cross-entropy, embeddings, and data splits at a level that most practitioners never reach. Phase 2 begins with the hard parts: activations, batch normalization, and deep network training. The foundation you’ve built will hold. — Day 10 Closing Principle · End of Phase 1
Day 10 Notebook — Embeddings, LR Schedules & Hyperparameters Runnable Python

Learning rate finder, step decay schedule, embedding visualization, and hyperparameter summary.