Markets are the ultimate reward model: they aggregate millions of human preferences into prices. RLHF does the same for language models: it aggregates human preferences into a signal that guides the model. — Day 33 Principle
I. The RLHF Pipeline
Three stages: (1) supervised fine-tuning (SFT) on instruction data; (2) training a reward model on human preference pairs (chosen vs. rejected responses); (3) optimizing the SFT model against the reward model with PPO, while a KL penalty keeps the policy close to the SFT baseline.
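Stage (2) is typically trained with a pairwise Bradley-Terry loss: the reward model should score the chosen response above the rejected one. A minimal sketch in plain Python (the function names here are illustrative, not from any particular library):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen, r_rejected):
    # Bradley-Terry pairwise loss: -log P(chosen beats rejected).
    # Shrinks as the score margin (r_chosen - r_rejected) grows.
    return -math.log(sigmoid(r_chosen - r_rejected))
```

With a zero margin the loss is log 2 (a coin flip); a larger positive margin drives it toward zero, which is exactly the gradient signal that separates chosen from rejected scores.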
# Reward model: same transformer backbone as the policy, with a scalar output head
reward = reward_model(prompt, response)  # one scalar score per response
# Simplified PPO objective: maximize reward, penalize drift from the frozen SFT reference
loss = -reward + beta * kl_divergence(policy_logprobs, reference_logprobs)
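The KL term can be made concrete over next-token distributions. A numeric sketch in plain Python, with illustrative names and made-up probabilities:

```python
import math

def kl_divergence(p, q):
    # KL(p || q) for two discrete distributions given as probability lists.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ppo_penalized_loss(reward, policy_probs, ref_probs, beta=0.1):
    # Simplified objective: maximize reward while staying near the reference.
    return -reward + beta * kl_divergence(policy_probs, ref_probs)

policy    = [0.7, 0.2, 0.1]  # next-token distribution after RL updates
reference = [0.5, 0.3, 0.2]  # frozen SFT model's distribution
loss = ppo_penalized_loss(reward=1.0, policy_probs=policy, ref_probs=reference)
```

Raising beta makes the loss punish the same drift more, which is the knob that trades reward hacking against staying close to the SFT baseline.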
V. Deliverables
- Reward model training
- Preference pairs
- PPO basics
- KL penalty
- Reference model
- RLHF pipeline
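The deliverables above fit into one loop. As a toy stand-in (not a real implementation), let the "policy" be a single scalar, the reward a synthetic peaked function, and a quadratic distance to the reference play the role of the KL anchor; this shows how beta interpolates between reward-seeking and staying at the SFT baseline:

```python
def rlhf_step(mu, ref, beta, lr=0.1, target=2.0):
    # Toy reward: peaked at `target`, so grad of reward w.r.t. mu is -2*(mu - target).
    grad_reward = -2.0 * (mu - target)
    # Quadratic stand-in for the KL penalty, anchored at the reference value.
    grad_penalty = 2.0 * (mu - ref)
    return mu + lr * (grad_reward - beta * grad_penalty)

def train(beta, steps=500, ref=0.0):
    mu = ref  # start from the SFT baseline
    for _ in range(steps):
        mu = rlhf_step(mu, ref, beta)
    return mu
```

With beta = 0 the policy goes all the way to the reward optimum; with beta = 1 it settles halfway between the reference and the optimum (the fixed point is (target + beta * ref) / (1 + beta)).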
RLHF aligns models with human values. Tomorrow: a simpler alternative. — Day 33 Closing