Technique · alignment
Deep RL from Human Preferences
Learning reward functions from pairwise human comparisons of agent behavior rather than from hand-coded rewards. The direct precursor to RLHF.
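As a rough illustration of what "learning from pairwise comparisons" can mean in practice, the sketch below scores two behavior segments with a Bradley-Terry style model: the probability that a human prefers segment A is a logistic function of the difference in summed predicted rewards, and the reward model is trained by minimizing cross-entropy against the human's choice. The function names (`preference_probability`, `preference_loss`) and the plain-list representation of per-step rewards are assumptions made for this example, not part of any particular library.

```python
import math


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def preference_probability(rewards_a, rewards_b):
    """Modeled probability that segment A is preferred over segment B.

    rewards_a / rewards_b are hypothetical per-step outputs of a reward
    model for each segment; segments are compared by their summed reward.
    """
    return _sigmoid(sum(rewards_a) - sum(rewards_b))


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy between the modeled preference and the human label.

    label is 1.0 if the human preferred A, 0.0 if they preferred B,
    and 0.5 if they judged the segments equal.
    """
    p = preference_probability(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


# Example: the reward model currently rates segment A higher than B.
seg_a = [0.5, 0.7, 0.9]
seg_b = [0.1, 0.2, 0.3]
loss_if_human_agrees = preference_loss(seg_a, seg_b, 1.0)
loss_if_human_disagrees = preference_loss(seg_a, seg_b, 0.0)
```

In this sketch the loss is small when the human's choice matches the reward model's ranking and large when it does not, so gradient steps on the reward model's parameters would pull its predictions toward the human judgments. The learned reward then stands in for a hand-coded one during ordinary RL training.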
Products deploying: 0
Avg research → prod: n/a
First commercial deploy: n/a
Deployment timeline
No verified deployments yet in our tracked product set.