Technique · training
Rejection Sampling Fine-Tuning
Samples multiple completions per prompt, scores them with a reward model, and fine-tunes on the highest-scoring samples. It is a simpler alternative to PPO and was used in Llama 2.
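The sample, score, and select loop described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Llama 2 implementation: the function name `rejection_sample`, its parameters `num_samples` and `top_k`, and the toy `generate` and `reward` callables are all hypothetical stand-ins for a real policy model and reward model. The final fine-tuning step is left out; the sketch only builds the dataset that step would consume.

```python
import random
from typing import Callable, Dict, List


def rejection_sample(
    prompts: List[str],
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    num_samples: int = 4,
    top_k: int = 1,
) -> List[Dict[str, str]]:
    """Keep the top_k highest-reward completions for each prompt."""
    dataset = []
    for prompt in prompts:
        # 1. Sample several candidate completions from the current policy.
        candidates = [generate(prompt) for _ in range(num_samples)]
        # 2. Score each candidate with the reward model and rank them.
        ranked = sorted(candidates, key=lambda c: reward(prompt, c), reverse=True)
        # 3. Keep only the best candidates as supervised fine-tuning targets.
        for completion in ranked[:top_k]:
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset


# Toy stand-ins (assumptions): a random "policy" and a length-based "reward".
rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    return prompt + " " + " ".join(["word"] * rng.randint(1, 10))


def toy_reward(prompt: str, completion: str) -> float:
    return float(len(completion))


sft_data = rejection_sample(["Explain RLHF"], toy_generate, toy_reward)
print(sft_data)
```

In a real pipeline the returned prompt and completion pairs would be passed to an ordinary supervised fine-tuning run, and the whole loop could be repeated with the updated model.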
Products deploying: 0
Avg research → prod: n/a
First commercial deploy: n/a
Deployment timeline
No verified deployments yet in our tracked product set.