
Baidu's RLVR Method Boosts Open-Ended Reasoning by 3.29 Points on 14B Model

Baidu researchers extended RLVR (reinforcement learning with verifiable rewards) to subjective tasks like writing by reformulating them as verifiable multiple-choice questions. This approach improved a 14B reasoning model by an average of 3.29 points across seven open-ended benchmarks compared to a standard RLHF baseline.

Gala Smith & AI Research Desk · 9h ago · 6 min read · AI-Generated
Baidu's RLVR Method Unlocks Reliable RL Training for Subjective AI Tasks

Reinforcement learning has powered breakthroughs in domains with clear, verifiable rewards—like solving math problems or writing code—where a single "correct" answer exists. But for open-ended, subjective tasks like creative writing or nuanced reasoning, where there's no single right output, RL has struggled. The reward signal becomes fuzzy, and training becomes unstable.

A new paper from Baidu researchers, "Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation," presents a clever workaround. Instead of asking a reward model "Is this response correct?" for a subjective answer, they ask it "Which of these two responses is better?" This simple reformulation turns open-ended evaluation into a clean, binary choice, creating a reliable reward signal that RL can effectively optimize.

What the Researchers Built: RLVR

The method builds on Reinforcement Learning with Verifiable Rewards (RLVR), which has previously been applied mainly to tasks with a single checkable answer. The paper's innovation is not a more sophisticated reward model architecture, but a fundamental change in the training task presented to the model.

During training, the system is shown a prompt and two candidate completions: one that is preferred (e.g., a more creative, coherent, or helpful response) and one that is rejected. The model's objective is to learn to identify the better response. This creates a contrastive verification habit—the model learns the process of comparing and selecting superior outputs, rather than just imitating a static notion of "good" taste from a dataset.
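The pair-to-choice construction can be sketched as follows. The field names and prompt template are illustrative, not the paper's format; the key property is that the correct letter is fixed at construction time, so the reward can be checked exactly with a string comparison:

```python
import random

def make_choice_item(prompt, preferred, rejected, rng=random.Random(0)):
    """Turn one preference pair into a verifiable multiple-choice item.

    Because we know which response is preferred when we build the item,
    verifying the model's answer needs no learned judge at check time.
    """
    options = [preferred, rejected]
    rng.shuffle(options)  # randomize A/B so position carries no signal
    answer = "A" if options[0] == preferred else "B"
    question = (
        f"Prompt: {prompt}\n\n"
        f"Response A:\n{options[0]}\n\n"
        f"Response B:\n{options[1]}\n\n"
        "Which response is better? Answer with A or B."
    )
    return {"question": question, "answer": answer}

def verify(model_output, item):
    """Binary, exactly checkable reward: 1 if the pick matches, else 0."""
    pick = model_output.strip().upper()[:1]
    return 1.0 if pick == item["answer"] else 0.0
```

Randomizing the A/B order matters: without it, the policy could learn a positional shortcut instead of actually comparing the responses.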

Key Results: A Clear Benchmark Lift

The team evaluated their method on a 14-billion-parameter reasoning model across seven diverse open-ended benchmarks, including creative writing, subjective QA, and reasoning tasks. They compared RLVR against a matched baseline trained with standard Reinforcement Learning from Human Feedback (RLHF).

  Method            Delta vs. baseline   Training signal
  RLHF Baseline     0.00 (reference)     Direct scoring of single responses
  RLVR (Proposed)   +3.29 points         Binary choice between response pairs
  DPO Ablation      Below baseline       Same preference data; confirms the benefit comes from contrastive verification, not the data alone

The 3.29-point average improvement is significant, indicating the method's robustness across different types of open-ended tasks. The poor performance of a Direct Preference Optimization (DPO) ablation—which uses the same preference data but a different objective—suggests the gain is specific to the verification-based training dynamic RLVR introduces.

How It Works: From Fuzzy Grading to Clean Choices

  1. Data Preparation: For a given prompt, annotators or a teacher model create pairs of responses: a preferred one and a dispreferred one.
  2. Training Reformulation: Instead of training a reward model to assign an absolute score to a single response (e.g., "this poem is 7/10"), the model is trained to perform a binary classification: "Given Prompt P and Responses A & B, which is better?"
  3. RL Fine-Tuning: This binary-choice reward model then provides a clean, verifiable reward signal for a reinforcement learning algorithm (like PPO) to fine-tune the language model. The policy learns to generate responses that are more likely to be chosen as the "better" option in a comparison.
  4. Avoiding Collapse: The authors identified a critical failure mode: training solely on choice tasks can lead the model to produce unnaturally short, conservative responses that are easy to verify as "better" than bad ones. To counter this, they mix in a small amount of standard RLHF objective, which helps maintain natural output length and overall utility.
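The core of step 3 can be caricatured with a REINFORCE-style update on a single two-way choice. This toy (hand-rolled softmax, fixed baseline, no PPO clipping, no text generation) is not the paper's implementation; it only shows how the binary verification outcome can serve as the entire reward signal:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_step(logits, correct_idx, rng, lr=0.5):
    """One REINFORCE update on a two-way choice (A vs. B).

    The reward is the verifiable 0/1 signal: did the sampled pick match
    the known-better response?  For a softmax policy, grad log pi(a)
    with respect to logit i is one_hot(a)[i] - probs[i].
    """
    probs = softmax(logits)
    action = 0 if rng.random() < probs[0] else 1
    reward = 1.0 if action == correct_idx else 0.0
    baseline = 0.5  # fixed baseline to reduce gradient variance
    adv = reward - baseline
    for i in range(2):
        grad = (1.0 if i == action else 0.0) - probs[i]
        logits[i] += lr * adv * grad
    return logits, reward
```

Iterating this update drives the policy toward the known-better option; a production setup would run PPO or a similar algorithm over full text generations rather than a two-logit toy.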

Why It Matters: A More General Lesson for Alignment

The paper's strongest claim extends beyond a single technique. It demonstrates that reasoning and alignment can be improved by replacing fuzzy, absolute scoring with structured comparison. This is a more general lesson for AI alignment: teaching models how to judge may be more effective and scalable than teaching them what to output.

By framing learning as a process of verification, RLVR moves closer to how humans improve at subjective tasks—through critique and comparative analysis, not by memorizing a "correct" template. This approach could make RL a more universally applicable tool for aligning AI systems with complex, human-like values.

gentic.news Analysis

This work from Baidu's research team arrives amid an industry-wide scramble to improve the "reasoning" and nuanced instruction-following capabilities of large language models beyond simple fact recall. As we covered in our analysis of DeepSeek's latest reasoning models, the frontier has shifted from pure scale to sophisticated training methodologies that teach models how to think. RLVR fits squarely into this trend, offering a novel way to inject reliable learning signals into the murky domain of subjective quality.

The paper's finding that DPO underperforms is particularly noteworthy. It suggests that the field's recent heavy reliance on preference optimization techniques, while powerful, might not be the final answer for cultivating advanced competencies like judgment and comparative reasoning. RLVR's success hints at a hybrid future where RL and contrastive learning are more deeply integrated.

Furthermore, Baidu's continued investment in fundamental RL research, as seen here, signals its commitment to remaining competitive in the foundational model race, despite the dominant narrative focusing on U.S.-based labs. This follows a pattern of increased technical publications from Chinese AI giants like Alibaba Cloud and 01.AI, all aiming to carve out unique methodological advantages. RLVR represents a credible, exportable innovation that could influence training pipelines globally, not just within Ernie models.

Frequently Asked Questions

What is RLVR?

RLVR (reinforcement learning with verifiable rewards) is a training paradigm that rewards a model only for outputs that can be checked exactly. The Baidu paper extends it to open-ended, subjective tasks (like writing a poem) by reformulating them as multiple-choice questions: instead of scoring a single output, the model learns by repeatedly choosing the better of two responses, creating a cleaner reward signal for reinforcement learning algorithms to use.

How is RLVR different from standard RLHF?

Standard Reinforcement Learning from Human Feedback (RLHF) typically involves training a reward model to assign a scalar score (e.g., 0-10) to a single model response. RLVR changes the task: the reward model is trained to make a binary choice between two responses. This turns a fuzzy grading problem into a more reliable verification task, which the paper shows leads to better performance on open-ended benchmarks.

Why did the researchers need to mix in RLHF?

The team discovered that training a model exclusively on binary choice tasks had a side effect: it drove the model to produce very short, safe responses. A short answer is easier to verify as "better" than a terrible long one. To prevent this collapse in output quality and length, they added a small component of standard RLHF training to the overall objective, which preserved natural, useful response characteristics.
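One simple way to realize such a mixture is batch-level task interleaving, sketched below. The fraction and helper names are illustrative assumptions, not details from the paper:

```python
import random

def sample_training_task(rng, choice_items, rlhf_prompts, rlhf_frac=0.1):
    """Interleave a small fraction of standard RLHF prompts among the
    binary-choice items, so the policy keeps practicing full-length,
    natural responses instead of collapsing to short, safe ones.

    rlhf_frac is an illustrative knob; the right value would have to be
    tuned against the length-collapse behavior described above.
    """
    if rng.random() < rlhf_frac:
        return ("rlhf", rng.choice(rlhf_prompts))
    return ("choice", rng.choice(choice_items))
```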

What models and benchmarks were used?

The core experiments were conducted on a 14-billion-parameter reasoning model. The method was evaluated across seven different open-ended benchmarks designed to test creative writing, subjective question-answering, and reasoning. RLVR achieved an average improvement of 3.29 points over a strong RLHF baseline trained on the same model and data.


Source: "Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation" (arXiv:2511.02463)


AI Analysis

This paper makes a subtle but important contribution to the alignment toolkit. Its core insight—that verification is a more learnable and reliable task than absolute scoring for subjective domains—has legs. It reframes the alignment problem from "distill human judgment into a score" to "teach the model to perform the judgment process itself." This is conceptually closer to processes like Constitutional AI or self-critique, but implemented through a straightforward reformulation of the RL training loop.

The significant underperformance of the DPO ablation is the most technically telling result. It indicates the benefit isn't merely from the data format (preference pairs), but from the specific verification-based objective function. This challenges the assumption that all preference-based methods are roughly equivalent given the same data. It suggests the training dynamic—actively comparing and selecting—imparts a different, and for reasoning tasks, superior inductive bias.

Practitioners should note the caveat: the method in isolation leads to short-output collapse, necessitating a hybrid objective. This isn't a flaw but a lesson in the complexity of optimizing for multiple, sometimes competing, qualities (e.g., correctness vs. verbosity). The successful hybrid approach mirrors trends in model merging and multi-objective optimization, highlighting that the next generation of capable models will likely be products of carefully balanced, composite training regimens.
