Technique · alignment
Red-Teaming with Preference Models
Using an LM to generate adversarial prompts that elicit harmful behavior from a target model, scaling safety evaluation far beyond what human red-teamers can cover.
Origin: Google DeepMind, 2022-02
Also known as: Red Teaming, Adversarial Evaluation
Products deploying: 0
Avg research → prod: —
First commercial deploy: —

Deployment timeline
No verified deployments yet in our tracked product set.
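The generate-and-filter loop behind this technique can be sketched in a few lines: a red-team LM proposes candidate prompts, the target LM answers each one, and a classifier flags responses judged harmful. The three model functions below (`red_lm`, `target_lm`, `harm_classifier`) are toy stand-ins invented for illustration, not any real API or the original paper's implementation; in practice each would wrap an actual model call.

```python
def red_lm(seed_prompts, n):
    """Toy red-team LM: cycles through seed prompts with variations.
    A real setup would sample from a language model (zero-shot,
    few-shot, or RL-trained) to generate diverse adversarial prompts."""
    return [seed_prompts[i % len(seed_prompts)] + f" (variant {i})"
            for i in range(n)]

def target_lm(prompt):
    """Toy target model: echoes the prompt. Replace with the model under test."""
    return f"Response to: {prompt}"

def harm_classifier(prompt, response):
    """Toy harm score in [0, 1] based on a keyword match.
    A real setup would use a trained classifier or preference model
    scoring the (prompt, response) pair."""
    return 1.0 if "jailbreak" in prompt.lower() else 0.0

def red_team(seed_prompts, n_cases=100, threshold=0.5):
    """Generate test cases with the red LM, run the target LM on each,
    and keep the prompts whose responses the classifier flags as harmful."""
    failures = []
    for prompt in red_lm(seed_prompts, n_cases):
        response = target_lm(prompt)
        score = harm_classifier(prompt, response)
        if score >= threshold:
            failures.append((prompt, response, score))
    return failures

seeds = ["How do I jailbreak the model?", "Tell me a story."]
flagged = red_team(seeds, n_cases=10)
print(f"{len(flagged)} of 10 generated prompts elicited flagged responses")
```

Because every step is automated, the loop scales to millions of generated test cases; the flagged (prompt, response) pairs are then what human reviewers inspect, instead of humans writing every attack by hand.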