Technique · alignment
Red-Teaming with Preference Models
Using an LM to generate adversarial prompts that elicit harmful behavior from a target model, scaling safety evaluation far beyond what human red-teamers can cover.
Origin: Google DeepMind, 2022-02
Also known as: Red Teaming, Adversarial Evaluation
Products deploying: 0
Avg research → prod: —
First commercial deploy: —

Deployment timeline
No verified deployments yet in our tracked product set.
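The generate-and-filter loop behind this technique can be sketched in a few lines: a red-team LM proposes candidate prompts, the target LM answers each one, and a classifier flags responses judged harmful. The three model functions below (`red_lm`, `target_lm`, `harm_classifier`) are toy stand-ins invented for illustration, not any real API or the original paper's implementation; in practice each would wrap an actual model call.

```python
def red_lm(seed_prompts, n):
    """Toy red-team LM: cycles through seed prompts with variations.
    A real setup would sample from a language model (zero-shot,
    few-shot, or RL-trained) to generate diverse adversarial prompts."""
    return [seed_prompts[i % len(seed_prompts)] + f" (variant {i})"
            for i in range(n)]

def target_lm(prompt):
    """Toy target model: echoes the prompt. Replace with the model under test."""
    return f"Response to: {prompt}"

def harm_classifier(prompt, response):
    """Toy harm score in [0, 1] based on a keyword match.
    A real setup would use a trained classifier or preference model
    scoring the (prompt, response) pair."""
    return 1.0 if "jailbreak" in prompt.lower() else 0.0

def red_team(seed_prompts, n_cases=100, threshold=0.5):
    """Generate test cases with the red LM, run the target LM on each,
    and keep the prompts whose responses the classifier flags as harmful."""
    failures = []
    for prompt in red_lm(seed_prompts, n_cases):
        response = target_lm(prompt)
        score = harm_classifier(prompt, response)
        if score >= threshold:
            failures.append((prompt, response, score))
    return failures

seeds = ["How do I jailbreak the model?", "Tell me a story."]
flagged = red_team(seeds, n_cases=10)
print(f"{len(flagged)} of 10 generated prompts elicited flagged responses")
```

Because every step is automated, the loop scales to millions of generated test cases; the flagged (prompt, response) pairs are then what human reviewers inspect, instead of humans writing every attack by hand.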