A hierarchical thesis stacks micro, meso, and macro analysis. WaveNet uses the same principle: pairs merge in a tree, so the receptive field grows exponentially. — Day 15 Principle
I. Hierarchical Merge
Instead of concatenating all context characters into one flat vector, merge them in pairs hierarchically. Three levels of pairwise merging cover an 8-character context (2·2·2) while adding a nonlinearity at every level.
# WaveNet-style hierarchical model
# X: [B, 8] character indices; C: [vocab_size, emb_dim] embedding table
emb = C[X]                       # [B, 8, emb_dim]
x = emb.view(B, 4, emb_dim * 2)  # pair up adjacent characters
x = torch.tanh(x @ W1 + b1)      # [B, 4, n_hidden]
x = x.view(B, 2, n_hidden * 2)   # pair up adjacent pairs
x = torch.tanh(x @ W2 + b2)      # [B, 2, n_hidden]
x = x.view(B, n_hidden * 2)      # final merge: the whole 8-char context
logits = x @ W3 + b3             # [B, vocab_size]
II. Exponential Receptive Field
Each merge level doubles the receptive field, so three levels cover 2x2x2 = 8 characters. Parameter count grows roughly linearly with depth while context grows exponentially.
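The scaling claim can be sanity-checked with a quick count. This sketch uses an illustrative hidden width and counts one [n_hidden*2, n_hidden] weight matrix plus bias per level (the first level's input width is emb_dim*2 in practice, approximated here for simplicity):

```python
n_hidden = 68  # hypothetical hidden width

for levels in range(1, 6):
    receptive_field = 2 ** levels  # context characters covered at the output
    # one merge matrix + bias per level; linear in depth
    params = levels * (n_hidden * 2 * n_hidden + n_hidden)
    print(f"{levels} levels: {receptive_field:3d} chars, {params} params")
```

Each extra level doubles the context for a fixed-size increment in parameters; a flat MLP would instead need its first layer to widen with the full context length.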
IV. The Matrix
| | Deep Intuition | Surface Only |
|---|---|---|
| **Quick** | 🎯 DO FIRST: Implement hierarchical merge with block_size=8. Beat the flat MLP. | ⏭ IF TIME: Add BatchNorm between levels. |
| **Slow** | 🖐 CAREFULLY: Trace character influence through the merge levels. | 🚫 AVOID: Full dilated convolutions. |
V. Today’s Deliverables
- Hierarchical model: 3-level merge
- Beat flat MLP dev NLL
- BatchNorm between levels
- Receptive field trace
- Sampling from hierarchical model
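For the sampling deliverable, a sketch like the following could work. It assumes the model is callable on a [1, block_size] index tensor and that index 0 is the end-of-word token (both assumptions):

```python
import torch

@torch.no_grad()
def sample(model, block_size=8, end_token=0, max_len=30):
    """Sample one word by sliding a block_size window over generated tokens."""
    model.eval()  # use running BatchNorm statistics, not batch statistics
    context = [end_token] * block_size
    out = []
    for _ in range(max_len):
        logits = model(torch.tensor([context]))  # [1, vocab_size]
        probs = torch.softmax(logits, dim=-1)
        ix = torch.multinomial(probs, num_samples=1).item()
        if ix == end_token:
            break
        out.append(ix)
        context = context[1:] + [ix]  # slide the window forward
    return out
```

Calling `model.eval()` first matters once BatchNorm is in the stack: sampling feeds batches of size 1, where per-batch statistics are undefined.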
The hierarchical structure previews Transformers. Tomorrow: self-attention. — Day 15 Closing