The simplest solution that works is the best solution. DPO eliminates the reward model and PPO complexity while achieving comparable alignment quality.
— Day 34 Principle
I. DPO vs RLHF
Direct Preference Optimization (DPO) shows that the RLHF objective can be reformulated as a simple binary classification loss on preference pairs. No reward model needed, no PPO needed. Just a modified cross-entropy loss on the policy's log-probabilities relative to a frozen reference model.
# DPO loss (simplified pseudocode)
import torch.nn.functional as F

# log_prob(model, y): summed token log-probs of response y under model
log_ratio_chosen = log_prob(policy, chosen) - log_prob(ref, chosen)
log_ratio_rejected = log_prob(policy, rejected) - log_prob(ref, rejected)
# beta controls how far the policy may drift from the reference model
loss = -F.logsigmoid(beta * (log_ratio_chosen - log_ratio_rejected))
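The pseudocode above can be made concrete. Below is a minimal runnable sketch in PyTorch; `sequence_log_prob` and `dpo_loss` are illustrative names, not a library API, and the helper assumes you already have per-token logits and label masks for each response.

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(logits, labels, mask):
    # Sum the log-probs of the label tokens under the model's logits,
    # zeroing out masked positions (prompt tokens, padding).
    # logits: (batch, seq, vocab); labels, mask: (batch, seq)
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logp * mask).sum(-1)

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    # DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).
    # The reference log-probs are computed with torch.no_grad() in practice.
    margin = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    return -F.logsigmoid(beta * margin).mean()
```

As a sanity check: when the policy and reference assign identical log-probs, the margin is zero and the loss is `-log(0.5) ≈ 0.693`, the same as an untrained binary classifier.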
Why DPO Won
DPO is simpler to implement, more stable to train (no reward-model drift, no PPO hyperparameter sensitivity), and achieves results comparable to PPO-based RLHF. It has become the default alignment method for most open-source LLMs.
V. Deliverables
- DPO loss implementation
- Reference model log probs
- Beta tuning
- Comparison to RLHF
- Preference data format
- Training loop
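For the preference data format deliverable, the standard shape is a prompt with one preferred and one dispreferred response. A minimal sketch (field names follow common convention but are illustrative):

```python
# One preference pair as typically stored (JSONL rows with these keys).
pair = {
    "prompt": "Explain gravity to a child.",
    "chosen": "Gravity is an invisible pull that keeps you on the ground.",
    "rejected": "Gravity: F = G * m1 * m2 / r**2.",
}

# A training batch is just a list of such pairs; each step tokenizes
# prompt+chosen and prompt+rejected, scores both under the policy and
# the frozen reference model, and applies the DPO loss.
batch = [pair]
assert set(pair) == {"prompt", "chosen", "rejected"}
```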
DPO simplifies alignment dramatically. Tomorrow: how to evaluate all of this.
— Day 34 Closing