Markets are the ultimate reward model: they aggregate millions of human preferences into prices. RLHF does the same for language models: it aggregates human preferences into a signal that guides the model. — Day 33 Principle
I. The RLHF Pipeline
Three stages: (1) supervised fine-tuning (SFT) on instruction data; (2) training a reward model on human preference pairs (chosen vs. rejected responses); (3) optimizing the SFT model against the reward model with PPO, while a KL penalty keeps the policy close to the SFT baseline.
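Stage (2) is typically trained with a pairwise Bradley-Terry loss: the reward model should score the chosen response above the rejected one. A minimal sketch in plain Python (the function names here are illustrative, not from any particular library):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen, r_rejected):
    # Bradley-Terry pairwise loss: -log P(chosen beats rejected).
    # Shrinks as the score margin (r_chosen - r_rejected) grows.
    return -math.log(sigmoid(r_chosen - r_rejected))
```

With a zero margin the loss is log 2 (a coin flip); a larger positive margin drives it toward zero, which is exactly the gradient signal that separates chosen from rejected scores.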
# Reward model: same transformer backbone as the policy, with a scalar output head
reward = reward_model(prompt, response)  # one scalar score per response
# Simplified PPO objective: maximize reward, penalize drift from the frozen SFT reference
loss = -reward + beta * kl_divergence(policy_logprobs, reference_logprobs)
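The KL term can be made concrete over next-token distributions. A numeric sketch in plain Python, with illustrative names and made-up probabilities:

```python
import math

def kl_divergence(p, q):
    # KL(p || q) for two discrete distributions given as probability lists.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ppo_penalized_loss(reward, policy_probs, ref_probs, beta=0.1):
    # Simplified objective: maximize reward while staying near the reference.
    return -reward + beta * kl_divergence(policy_probs, ref_probs)

policy    = [0.7, 0.2, 0.1]  # next-token distribution after RL updates
reference = [0.5, 0.3, 0.2]  # frozen SFT model's distribution
loss = ppo_penalized_loss(reward=1.0, policy_probs=policy, ref_probs=reference)
```

Raising beta makes the loss punish the same drift more, which is the knob that trades reward hacking against staying close to the SFT baseline.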
V. Deliverables
- Reward model training
- Preference pairs
- PPO basics
- KL penalty
- Reference model
- RLHF pipeline
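The deliverables above fit into one loop. As a toy stand-in (not a real implementation), let the "policy" be a single scalar, the reward a synthetic peaked function, and a quadratic distance to the reference play the role of the KL anchor; this shows how beta interpolates between reward-seeking and staying at the SFT baseline:

```python
def rlhf_step(mu, ref, beta, lr=0.1, target=2.0):
    # Toy reward: peaked at `target`, so grad of reward w.r.t. mu is -2*(mu - target).
    grad_reward = -2.0 * (mu - target)
    # Quadratic stand-in for the KL penalty, anchored at the reference value.
    grad_penalty = 2.0 * (mu - ref)
    return mu + lr * (grad_reward - beta * grad_penalty)

def train(beta, steps=500, ref=0.0):
    mu = ref  # start from the SFT baseline
    for _ in range(steps):
        mu = rlhf_step(mu, ref, beta)
    return mu
```

With beta = 0 the policy goes all the way to the reward optimum; with beta = 1 it settles halfway between the reference and the optimum (the fixed point is (target + beta * ref) / (1 + beta)).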
RLHF aligns models with human values. Tomorrow: a simpler alternative. — Day 33 Closing