AI Research · Breakthrough · Score: 91

OpenAI: Small RL doses on 'beneficial traits' improve 44 of 53 safety benchmarks

OpenAI trained a model via RL on beneficial traits, improving 44 of 53 safety benchmarks. The method differs from Anthropic's constitution-based approach and makes models resistant to harmful steering.

Ala SMITH & AI Research Desk · 1d ago · 3 min read · AI-Generated

Source: the-decoder.com (via the_decoder)

Does training AI models on 'beneficial traits' like truthfulness generalize across domains and improve safety?

OpenAI researchers trained models via RL on realistic conversations targeting traits like truthfulness and corrigibility, improving 44 of 53 safety benchmarks. The approach, which differs from Anthropic's constitutional method, also made models resistant to harmful fine-tuning and adversarial prompts.

TL;DR

OpenAI trained models on truthfulness, corrigibility, and fairness via RL.
The model improved on 44 of 53 safety benchmarks across domains.
The approach differs from Anthropic's constitution-based method.

Key facts

Model improved on 44 of 53 safety benchmarks.
Training on health data improved non-health evaluations like reward hacking.
Adversarial prompts had far less effect on the beneficial-trait model than on the baseline.
Method differs from Anthropic's constitutional approach.
Researchers call it 'selective persistence' — resists harmful steering.

OpenAI has published a new alignment technique that uses small doses of reinforcement learning (RL) on specific behavioral traits — truthfulness, epistemic humility, corrigibility, transparency in reasoning, fairness, and concern for human well-being — to make models safer across domains. According to The Decoder, the researchers trained the model on realistic conversations covering healthcare, education, science, law, and engineering, mixing only a small share of this 'beneficial trait' data into the regular RL post-training pipeline.
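To make the idea of mixing "only a small share" of trait data into regular post-training concrete, here is a minimal Python sketch. It is illustrative only and not OpenAI's pipeline: the function name `mix_batches`, the placeholder data, and the 10% share are all assumptions, since the article does not report the actual proportion or implementation.

```python
import random

def mix_batches(regular, trait, trait_share, batch_size, seed=0):
    """Yield training batches in which a fixed share of items are trait examples.

    All names and the share value are hypothetical; the source article gives
    no figure for how much 'beneficial trait' data was mixed in.
    """
    rng = random.Random(seed)
    n_trait = round(batch_size * trait_share)   # trait items per batch
    n_regular = batch_size - n_trait            # remaining slots for regular data
    while True:
        batch = rng.sample(regular, n_regular) + rng.sample(trait, n_trait)
        rng.shuffle(batch)                      # avoid a fixed position for trait items
        yield batch

# Placeholder prompts standing in for real RL training conversations.
regular_data = [("regular", i) for i in range(100)]
trait_data = [("trait", i) for i in range(20)]

batch = next(mix_batches(regular_data, trait_data, trait_share=0.1, batch_size=20))
trait_count = sum(1 for kind, _ in batch if kind == "trait")
print(f"{trait_count} of {len(batch)} items are trait examples")
```

Under these assumed settings each batch of 20 contains 2 trait examples; the reported finding is that even a small share of this kind shifted behavior on benchmarks well outside the trained domains.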

Generalization across domains

The model improved on 44 of 53 independent benchmarks measuring deception, honesty, sycophancy, reward hacking, and behavior in health and mental health scenarios. Training on health data alone also improved non-health evaluations such as reward hacking and deception detection. The reverse also held: training without any health or science data still boosted performance on health benchmarks. The researchers conclude that RL training reinforces basic behavioral patterns that carry over across domains.

Resistance to adversarial steering

Adversarial prompts that badly destabilized the baseline model had far less effect on the beneficial-trait model. Harmful fine-tuning was also less able to erode the trained traits. The model stayed just as steerable for helpful instructions as before. The researchers call this 'selective persistence' — the model resists harmful steering without losing useful flexibility.

A different path than Anthropic

OpenAI's method differs sharply from Anthropic's alignment approach. OpenAI relies on empirically measurable behavioral traits reinforced through RL in realistic scenarios, and it leans heavily on benchmarks: 44 of 53 evaluations show improvements that generalize across domains and evaluation methods. Anthropic works from an explicit 'Claude constitution,' a written values document that serves as the top-level guide for training and behavior. Its approach is more principles-based: the model is supposed to understand why certain behaviors are desired, grounded in constitutional texts and high-quality training examples. No direct comparison of the two approaches exists yet.

What this means for alignment research

The finding that small RL doses on desired traits generalize across domains is notable because prior work has shown that misalignment from training on problematic behavior in one domain can spread to other areas. OpenAI's result suggests the reverse also works — good behavior generalizes just as broadly. This could have implications for how AI companies structure their safety training pipelines, potentially reducing the need for exhaustive domain-specific safety data.

Key Takeaways

OpenAI trained a model via RL on beneficial traits, improving 44 of 53 safety benchmarks.
The method differs from Anthropic's constitution-based approach and makes models resistant to harmful steering.

What to watch

Watch for a direct benchmark comparison between OpenAI's RL-based approach and Anthropic's constitutional method. No such comparison exists yet, but both labs are likely to publish follow-up work. Also watch for whether OpenAI integrates this technique into GPT-5.3-Codex-Spark or GPT-5.5 Instant.

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

This is a significant result for alignment research because it demonstrates that small RL doses on desired traits can generalize across domains — a finding that runs counter to the belief that safety training must be domain-specific. The 'selective persistence' property, where the model resists harmful steering without losing helpful steerability, is particularly novel. However, the lack of a direct comparison to Anthropic's constitutional method is a gap. The paper's reliance on benchmarks also raises questions about how well these results translate to real-world adversarial pressure. The key question is whether this approach scales to frontier models like GPT-5.5 Instant and whether the traits generalize to more complex, open-ended scenarios.

#alignment #ai safety #reinforcement learning #openai
