Technique · alignment
Direct Preference Optimization (DPO)
Aligns language models to preference data by directly optimizing a closed-form objective over policy-to-reference likelihood ratios, eliminating the separate reward model and RL loop of RLHF.
Products deploying: 0
Avg research → prod: —
First commercial deploy: —
Deployment timeline
No verified deployments yet in our tracked product set.
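As an illustration of the closed-form likelihood-ratio objective named in the summary, here is a minimal sketch of the per-example DPO loss in plain Python. The function name, the argument names, and the default `beta=0.1` are assumptions made for this example, not part of the tracker entry. The inputs are assumed to be summed log-probabilities of the chosen and rejected responses under the trained policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen ratio - rejected ratio)).

    All names and the default beta are illustrative assumptions.
    """
    # Log likelihood ratios of policy to reference for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin; larger means the policy favors the
    # chosen response more strongly than the reference does.
    margin = beta * (chosen_ratio - rejected_ratio)

    # -log(sigmoid(margin)) equals softplus(-margin); branch on the sign
    # so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference, both ratios are zero and the loss is log 2. Shifting probability toward the chosen response lowers the loss, and no reward model or sampling loop appears anywhere in the computation.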