The simplest solution that works is the best solution. DPO eliminates the reward model and PPO complexity while achieving comparable alignment quality.
— Day 34 Principle
I. DPO vs RLHF
Direct Preference Optimization (DPO) shows that the RLHF objective can be reformulated as a simple binary classification loss on preference pairs. No reward model needed, no PPO needed. Just a modified cross-entropy loss on the policy's log-probabilities relative to a frozen reference model.
# DPO loss (simplified pseudocode)
import torch.nn.functional as F

# log_prob(model, y): summed token log-probs of response y under model
log_ratio_chosen = log_prob(policy, chosen) - log_prob(ref, chosen)
log_ratio_rejected = log_prob(policy, rejected) - log_prob(ref, rejected)
# beta controls how far the policy may drift from the reference model
loss = -F.logsigmoid(beta * (log_ratio_chosen - log_ratio_rejected))
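The pseudocode above can be made concrete. Below is a minimal runnable sketch in PyTorch; `sequence_log_prob` and `dpo_loss` are illustrative names, not a library API, and the helper assumes you already have per-token logits and label masks for each response.

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(logits, labels, mask):
    # Sum the log-probs of the label tokens under the model's logits,
    # zeroing out masked positions (prompt tokens, padding).
    # logits: (batch, seq, vocab); labels, mask: (batch, seq)
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logp * mask).sum(-1)

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    # DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).
    # The reference log-probs are computed with torch.no_grad() in practice.
    margin = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    return -F.logsigmoid(beta * margin).mean()
```

As a sanity check: when the policy and reference assign identical log-probs, the margin is zero and the loss is `-log(0.5) ≈ 0.693`, the same as an untrained binary classifier.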
Why DPO Won
DPO is simpler to implement, more stable to train (no reward-model drift, no PPO hyperparameter sensitivity), and achieves results comparable to PPO-based RLHF. It has become the default alignment method for most open-source LLMs.
V. Deliverables
- DPO loss implementation
- Reference model log probs
- Beta tuning
- Comparison to RLHF
- Preference data format
- Training loop
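For the preference data format deliverable, the standard shape is a prompt with one preferred and one dispreferred response. A minimal sketch (field names follow common convention but are illustrative):

```python
# One preference pair as typically stored (JSONL rows with these keys).
pair = {
    "prompt": "Explain gravity to a child.",
    "chosen": "Gravity is an invisible pull that keeps you on the ground.",
    "rejected": "Gravity: F = G * m1 * m2 / r**2.",
}

# A training batch is just a list of such pairs; each step tokenizes
# prompt+chosen and prompt+rejected, scores both under the policy and
# the frozen reference model, and applies the DPO loss.
batch = [pair]
assert set(pair) == {"prompt", "chosen", "rejected"}
```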
DPO simplifies alignment dramatically. Tomorrow: how to evaluate all of this.
— Day 34 Closing