Technique · alignment

Direct Preference Optimization (DPO)

Aligning language models to preference data by directly optimizing a closed-form loss on policy-to-reference likelihood ratios, eliminating the separate reward model and RL loop of RLHF.
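To make the description concrete, here is a minimal sketch of the per-example DPO loss as it is usually stated: the negative log-sigmoid of a scaled difference between the policy-to-reference log-ratios of the preferred and rejected responses. The function names, argument names, and the default `beta=0.1` are illustrative assumptions, not taken from this page. The inputs are assumed to be summed token log-probabilities of each full response under the trainable policy and the frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; controls how far the policy may drift
) -> float:
    """Per-example DPO loss for one (chosen, rejected) preference pair.

    Each argument is the log-probability of a whole response, summed over
    its tokens, under either the policy or the frozen reference model.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # The scaled margin plays the role of an implicit reward difference,
    # so no separate reward model is trained.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)
```

When the policy equals the reference, the margin is zero and the loss is log 2. The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference. In practice this would be computed over batches with an autodiff framework; the scalar version above only illustrates the arithmetic.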

Origin: Stanford, 2023-05
Also known as: DPO

Products deploying: 0

Avg research → prod: n/a

First commercial deploy: n/a

Deployment timeline

No verified deployments yet in our tracked product set.

Prior art

Reinforcement Learning from Human Feedback (RLHF)

Techniques built on this