PHASE 3 LLM Architecture · Day 33 of 80 · Raschka LLMs From Scratch

RLHF — Reward Models & PPO Basics

Reinforcement Learning from Human Feedback: train a reward model, then optimize the LLM policy with PPO.

Markets are the ultimate reward model: they aggregate millions of human preferences into prices. RLHF does the same for language: it aggregates human preferences into a signal that guides the model. — Day 33 Principle

I. The RLHF Pipeline

Three stages: (1) supervised fine-tuning (SFT) on instruction data; (2) train a reward model on human preference pairs (chosen vs. rejected responses); (3) optimize the SFT model against the reward model using PPO, with a KL penalty that keeps the policy close to the SFT reference.
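Stage 2 can be made concrete with the standard pairwise (Bradley-Terry) loss: the reward model is trained so that the chosen response scores higher than the rejected one. A minimal numeric sketch, assuming hypothetical scalar scores rather than a real model:

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).

    Small when the reward model already ranks the chosen response above
    the rejected one; large when it ranks them the wrong way around.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss depends only on the margin between the two scores:
print(round(pairwise_loss(2.0, 0.0), 4))  # 0.1269 (model agrees with the label)
print(round(pairwise_loss(0.0, 2.0), 4))  # 2.1269 (model disagrees)
```

Because only the margin matters, reward scores have no absolute scale; this is one reason the later PPO stage needs a KL anchor rather than trusting raw reward magnitudes.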

```python
# Reward model: same architecture as the LLM, with a scalar output head
reward = reward_model(response)  # scalar score

# PPO objective: maximize reward while staying close to the SFT reference
loss = -reward + beta * KL(policy || reference)
```
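The objective above combines two pieces that PPO implementations handle separately: a per-token KL penalty folded into the reward, and the clipped surrogate objective that limits each policy update. A minimal numeric sketch (function names are illustrative, not from the book):

```python
def kl_shaped_reward(reward: float, logp_policy: float, logp_ref: float,
                     beta: float = 0.1) -> float:
    """Reward minus a penalty for drifting from the frozen SFT reference.

    logp_policy - logp_ref is a per-token Monte Carlo estimate of the KL term.
    """
    return reward - beta * (logp_policy - logp_ref)

def ppo_clip_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO clipped surrogate: pessimistic min of unclipped and clipped terms.

    ratio is pi_new(a|s) / pi_old(a|s); clipping to [1-eps, 1+eps] removes
    the incentive to move the policy far from the data-collecting policy.
    """
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return min(unclipped, clipped)

# Clipping caps the gain from pushing the ratio beyond 1 +/- eps:
print(ppo_clip_objective(1.5, 1.0))   # 1.2, not 1.5
print(ppo_clip_objective(0.5, -1.0))  # -0.8, the pessimistic choice
```

The `min` makes the objective pessimistic: the policy can never profit from a ratio outside the clip range, which is what keeps PPO updates stable without a full trust-region solve.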

V. Deliverables

RLHF aligns models with human values. Tomorrow: a simpler alternative. — Day 33 Closing