The first few moves in chess determine the game. Weight initialization and learning rate warmup determine whether training converges or collapses. — Day 26 Principle
I. Weight Initialization
GPT-2 uses a modified initialization scheme: a normal distribution with std=0.02 for most layers, with the residual projections additionally scaled by 1/sqrt(2N), where N is the number of layers (each layer contributes two residual additions: attention and MLP).
```python
import torch
import torch.nn as nn

def _init_weights(module):
    # Linear layers: small normal init, zero bias.
    if isinstance(module, nn.Linear):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    # Embeddings get the same std=0.02 normal init.
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
```
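The 1/sqrt(2N) residual scaling mentioned above is not in the snippet itself; one common way to wire it in is to tag residual-projection layers with an attribute and check for it at init time. The sketch below assumes a hypothetical `SCALE_INIT` flag and an `N_LAYERS` constant; the attribute name and the toy `Block` are illustrative, not part of the original note.

```python
import torch
import torch.nn as nn

N_LAYERS = 12  # hypothetical depth; GPT-2 small uses 12 blocks

class Block(nn.Module):
    """Minimal block showing where the residual-scaling flag goes."""
    def __init__(self, d_model=768):
        super().__init__()
        self.fc = nn.Linear(d_model, 4 * d_model)
        self.proj = nn.Linear(4 * d_model, d_model)
        # Tag the projection that feeds back into the residual stream.
        self.proj.SCALE_INIT = True

def _init_weights(module):
    if isinstance(module, nn.Linear):
        std = 0.02
        # Each of the N layers adds two residual branches (attention + MLP),
        # so flagged projections are scaled by 1/sqrt(2N).
        if getattr(module, "SCALE_INIT", False):
            std *= (2 * N_LAYERS) ** -0.5
        torch.nn.init.normal_(module.weight, mean=0.0, std=std)

block = Block()
block.apply(_init_weights)
```

The scaling keeps the variance of the residual stream roughly constant with depth: without it, summing 2N residual branches grows activation variance linearly in N.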
II. Learning Rate Warmup + Cosine Decay
```python
import math

def get_lr(step, warmup_steps, max_steps, max_lr, min_lr):
    # Linear warmup from 0 to max_lr over warmup_steps.
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # Past max_steps, hold at the floor (otherwise the cosine would rise again).
    if step >= max_steps:
        return min_lr
    # Cosine decay from max_lr down to min_lr.
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * decay_ratio))
```
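A quick sanity check of the schedule at its boundary points helps catch off-by-one bugs. The snippet restates the schedule so it runs standalone; the hyperparameter values (`warmup_steps=10`, `max_steps=100`, `max_lr=6e-4`, `min_lr=6e-5`) are illustrative assumptions, not settings from this note.

```python
import math

def get_lr(step, warmup_steps=10, max_steps=100, max_lr=6e-4, min_lr=6e-5):
    # Same schedule as above: linear warmup, cosine decay, then a floor.
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    if step >= max_steps:
        return min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * decay_ratio))

print(get_lr(5))    # halfway through warmup: max_lr / 2 = 3e-4
print(get_lr(10))   # warmup just ended: peak, 6e-4
print(get_lr(55))   # midpoint of decay: (max_lr + min_lr) / 2 = 3.3e-4
print(get_lr(100))  # at the floor: min_lr = 6e-5
```

The midpoint check works because cos(π/2) = 0, so the cosine term contributes exactly half of the (max_lr − min_lr) range.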
III. Deliverables
- Weight init with std=0.02
- Residual scaling
- LR warmup
- Cosine decay
- AdamW config
- Gradient clipping
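The AdamW configuration and gradient clipping listed above have no code in this note; a minimal sketch of one training step follows. The hyperparameters (betas=(0.9, 0.95), weight_decay=0.1, clip norm 1.0) follow common GPT-2/3 practice and are assumptions here, as are the stand-in model and data.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for the real GPT model

# Assumed GPT-style optimizer settings, not values fixed by this note.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,                # peak LR fed to the warmup/decay schedule
    betas=(0.9, 0.95),      # slower second-moment decay than the 0.999 default
    eps=1e-8,
    weight_decay=0.1,       # decoupled weight decay
)

# One step: backward, clip the global gradient norm, then update.
x = torch.randn(4, 8)
loss = model(x).pow(2).mean()
loss.backward()
norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```

Clipping by global norm (rather than per-parameter) preserves the direction of the update while bounding its magnitude, which guards against loss spikes from rare bad batches.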
Proper setup is 80% of successful training. Tomorrow: actually pre-training. — Day 26 Closing