The first few moves in chess determine the game. Weight initialization and learning rate warmup determine whether training converges or collapses. — Day 26 Principle
I. Weight Initialization
GPT-2 uses a modified initialization scheme: a normal distribution with std=0.02 for most layers, with the residual projections additionally scaled by 1/sqrt(2N), where N is the number of layers (each layer contributes two residual additions: attention and MLP).
```python
import torch
import torch.nn as nn

def _init_weights(module):
    # Linear layers: small normal init, zero bias.
    if isinstance(module, nn.Linear):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    # Embeddings get the same std=0.02 normal init.
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
```
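The 1/sqrt(2N) residual scaling mentioned above is not in the snippet itself; one common way to wire it in is to tag residual-projection layers with an attribute and check for it at init time. The sketch below assumes a hypothetical `SCALE_INIT` flag and an `N_LAYERS` constant; the attribute name and the toy `Block` are illustrative, not part of the original note.

```python
import torch
import torch.nn as nn

N_LAYERS = 12  # hypothetical depth; GPT-2 small uses 12 blocks

class Block(nn.Module):
    """Minimal block showing where the residual-scaling flag goes."""
    def __init__(self, d_model=768):
        super().__init__()
        self.fc = nn.Linear(d_model, 4 * d_model)
        self.proj = nn.Linear(4 * d_model, d_model)
        # Tag the projection that feeds back into the residual stream.
        self.proj.SCALE_INIT = True

def _init_weights(module):
    if isinstance(module, nn.Linear):
        std = 0.02
        # Each of the N layers adds two residual branches (attention + MLP),
        # so flagged projections are scaled by 1/sqrt(2N).
        if getattr(module, "SCALE_INIT", False):
            std *= (2 * N_LAYERS) ** -0.5
        torch.nn.init.normal_(module.weight, mean=0.0, std=std)

block = Block()
block.apply(_init_weights)
```

The scaling keeps the variance of the residual stream roughly constant with depth: without it, summing 2N residual branches grows activation variance linearly in N.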
II. Learning Rate Warmup + Cosine Decay
```python
import math

def get_lr(step, warmup_steps, max_steps, max_lr, min_lr):
    # Linear warmup from 0 to max_lr over warmup_steps.
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # Past max_steps, hold at the floor (otherwise the cosine would rise again).
    if step >= max_steps:
        return min_lr
    # Cosine decay from max_lr down to min_lr.
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * decay_ratio))
```
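A quick sanity check of the schedule at its boundary points helps catch off-by-one bugs. The snippet restates the schedule so it runs standalone; the hyperparameter values (`warmup_steps=10`, `max_steps=100`, `max_lr=6e-4`, `min_lr=6e-5`) are illustrative assumptions, not settings from this note.

```python
import math

def get_lr(step, warmup_steps=10, max_steps=100, max_lr=6e-4, min_lr=6e-5):
    # Same schedule as above: linear warmup, cosine decay, then a floor.
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    if step >= max_steps:
        return min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * decay_ratio))

print(get_lr(5))    # halfway through warmup: max_lr / 2 = 3e-4
print(get_lr(10))   # warmup just ended: peak, 6e-4
print(get_lr(55))   # midpoint of decay: (max_lr + min_lr) / 2 = 3.3e-4
print(get_lr(100))  # at the floor: min_lr = 6e-5
```

The midpoint check works because cos(π/2) = 0, so the cosine term contributes exactly half of the (max_lr − min_lr) range.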
III. Deliverables
- Weight init with std=0.02
- Residual scaling
- LR warmup
- Cosine decay
- AdamW config
- Gradient clipping
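The AdamW configuration and gradient clipping listed above have no code in this note; a minimal sketch of one training step follows. The hyperparameters (betas=(0.9, 0.95), weight_decay=0.1, clip norm 1.0) follow common GPT-2/3 practice and are assumptions here, as are the stand-in model and data.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for the real GPT model

# Assumed GPT-style optimizer settings, not values fixed by this note.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,                # peak LR fed to the warmup/decay schedule
    betas=(0.9, 0.95),      # slower second-moment decay than the 0.999 default
    eps=1e-8,
    weight_decay=0.1,       # decoupled weight decay
)

# One step: backward, clip the global gradient norm, then update.
x = torch.randn(4, 8)
loss = model(x).pow(2).mean()
loss.backward()
norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```

Clipping by global norm (rather than per-parameter) preserves the direction of the update while bounding its magnitude, which guards against loss spikes from rare bad batches.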
Proper setup is 80% of successful training. Tomorrow: actually pre-training. — Day 26 Closing