PHASE 3 LLM Architecture · Day 34 of 80 · Raschka LLMs From Scratch

DPO — Direct Preference Optimization

Skip the reward model: DPO directly optimizes the policy on preference data with a simple loss function.

The simplest solution that works is the best solution. DPO eliminates the reward model and PPO complexity while achieving comparable alignment quality. — Day 34 Principle

I. DPO vs RLHF

DPO shows that the RLHF objective can be reformulated as a simple classification loss on preference pairs. No reward model needed, no PPO needed. Just a modified cross-entropy loss.
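In symbols: given a prompt x with a chosen completion y_w and a rejected completion y_l, a trainable policy π_θ, and a frozen reference model π_ref, the DPO loss is

$$\mathcal{L}_{\text{DPO}} = -\log \sigma\!\left(\beta \left[\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right]\right)$$

where σ is the sigmoid and β controls how strongly the policy is pushed away from the reference (playing the role of the KL penalty in RLHF). The loss shrinks as the policy raises the likelihood of the chosen answer relative to the rejected one, measured against the reference model.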

```python
# DPO loss (simplified)
log_ratio_chosen = log_prob(policy, chosen) - log_prob(ref, chosen)
log_ratio_rejected = log_prob(policy, rejected) - log_prob(ref, rejected)
loss = -F.logsigmoid(beta * (log_ratio_chosen - log_ratio_rejected))
```
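To make the snippet above concrete, here is a minimal self-contained sketch of the same computation in plain Python, with the sequence log-probabilities passed in as toy numbers (the values and the `dpo_loss` helper name are illustrative, not from the source):

```python
import math

def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)), mirroring F.logsigmoid.
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen: float, ref_chosen: float,
             policy_rejected: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    # Log-ratios of policy vs. frozen reference on each completion.
    log_ratio_chosen = policy_chosen - ref_chosen
    log_ratio_rejected = policy_rejected - ref_rejected
    # Classification-style loss on the preference pair.
    return -log_sigmoid(beta * (log_ratio_chosen - log_ratio_rejected))

# Toy log-probs: the policy already prefers the chosen answer slightly
# more than the reference does, so the margin is positive.
loss = dpo_loss(policy_chosen=-12.0, ref_chosen=-14.0,
                policy_rejected=-15.0, ref_rejected=-13.0, beta=0.1)
```

Note that the loss depends only on the two log-ratios, which is why no separate reward model is needed: the implicit reward is β times the policy/reference log-ratio.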

Why DPO Won

DPO is simpler to implement, more stable to train, and achieves comparable results to RLHF. It has become the default alignment method for most open-source LLMs.

V. Deliverables

DPO simplifies alignment dramatically. Tomorrow: how to evaluate all of this. — Day 34 Closing