PHASE 3 LLM Architecture · Day 26 of 80 · Raschka LLMs From Scratch

Pre-training Setup — Weight Init & LR Warmup

Initialize weights correctly and implement the learning rate schedule that makes large model training stable.

The first few moves in chess determine the game. Weight initialization and learning rate warmup determine whether training converges or collapses. — Day 26 Principle

I. Weight Initialization

GPT-2 uses a modified initialization: normal with std=0.02 for most layers, with the residual projections scaled by 1/sqrt(2N), where N is the number of transformer blocks (each block contributes two projections into the residual stream, hence the 2N).

```python
import torch
import torch.nn as nn

def _init_weights(module):
    # GPT-2-style init: N(0, 0.02) for linear and embedding weights, zero biases
    if isinstance(module, nn.Linear):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
```

Apply it recursively to every submodule with `model.apply(_init_weights)`.
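The snippet above applies the plain std=0.02 init but not the 1/sqrt(2N) scaling for residual projections. A minimal sketch of how that scaling could be added, assuming the residual output projections carry an `out_proj` name suffix (an assumption about your model's naming, not something from the text):

```python
import math
import torch
import torch.nn as nn

def init_residual_scaled(model: nn.Module, num_layers: int) -> None:
    """GPT-2-style init: std=0.02 everywhere, with residual projections
    scaled down by 1/sqrt(2 * num_layers)."""
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            std = 0.02
            # Assumption: residual-stream projections are named "...out_proj"
            if name.endswith("out_proj"):
                std = 0.02 / math.sqrt(2 * num_layers)
            torch.nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
```

The intuition: each of the N blocks adds two projections (attention output and MLP output) into the residual stream, so without the scaling the variance of the residual activations would grow with depth.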

II. Learning Rate Warmup + Cosine Decay

```python
import math

def get_lr(step, warmup_steps, max_steps, max_lr, min_lr):
    # Linear warmup from 0 to max_lr
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # Hold at the floor once decay has finished (the cosine would
    # otherwise climb back up for step > max_steps)
    if step >= max_steps:
        return min_lr
    # Cosine decay from max_lr down to min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * decay_ratio))
```
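The schedule can be sanity-checked standalone. The hyperparameter values below are illustrative assumptions, not values from the text, and the function is repeated (with a clamp past max_steps added as a safeguard) so the sketch runs on its own:

```python
import math

def get_lr(step, warmup_steps, max_steps, max_lr, min_lr):
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    if step >= max_steps:          # added clamp: hold at the floor
        return min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * decay_ratio))

# Illustrative hyperparameters (assumptions)
WARMUP, MAX_STEPS = 100, 1000
MAX_LR, MIN_LR = 6e-4, 6e-5

# The three regimes of the schedule:
assert get_lr(0, WARMUP, MAX_STEPS, MAX_LR, MIN_LR) == 0.0       # warmup starts at 0
assert math.isclose(get_lr(WARMUP, WARMUP, MAX_STEPS, MAX_LR, MIN_LR), MAX_LR)  # peak
assert get_lr(MAX_STEPS, WARMUP, MAX_STEPS, MAX_LR, MIN_LR) == MIN_LR           # floor

# In a training loop the value is written into the optimizer each step:
# for step in range(MAX_STEPS):
#     lr = get_lr(step, WARMUP, MAX_STEPS, MAX_LR, MIN_LR)
#     for group in optimizer.param_groups:
#         group["lr"] = lr
```

Since each step gets its rate from `step` alone, the schedule is stateless, which makes resuming from a checkpoint trivial.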

V. Deliverables

Proper setup is 80% of successful training. Tomorrow: the actual pre-training run. — Day 26 Closing