SDAR gates self-distillation within GRPO to stabilize multi-turn LLM agent training. The method yields a +9.4% gain on ALFWorld and improvements on WebShop and Search-QA across Qwen2.5 and Qwen3 models.
Key facts
- SDAR yields +9.4% on the ALFWorld benchmark.
- Improvements also on WebShop and Search-QA.
- Method gates self-distillation inside GRPO.
- Evaluated on Qwen2.5 and Qwen3 models.
- Addresses instability in multi-turn agent training.
Self-Distilled Agentic Reinforcement Learning (SDAR) tackles the instability that plagues multi-turn LLM agent training. The core innovation: gating self-distillation signals inside GRPO (Group Relative Policy Optimization), a reinforcement learning framework. [According to @HuggingPapers] the approach produced a +9.4% improvement on the ALFWorld benchmark, with additional gains on WebShop and Search-QA.
The unique take here is that SDAR treats self-distillation not as a standalone regularization trick but as a gated signal within an existing RL loop. Prior work often added distillation as an auxiliary loss or separate phase — SDAR embeds it directly into GRPO's advantage estimation, effectively letting the model decide when to trust its own prior behavior. This avoids the catastrophic forgetting and reward hacking that commonly destabilize multi-turn agent training.
How SDAR Works
SDAR operates within the GRPO framework. In standard GRPO, the model generates multiple responses for a given state, and each response's advantage is computed relative to the group (its reward normalized against the group's mean). SDAR adds a self-distillation term that gates — i.e., selectively applies — the teacher signal from the model's own prior policy. The gating mechanism prevents the distillation signal from dominating the RL objective, preserving exploration while still providing a stabilizing baseline.
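A minimal sketch of what this could look like in PyTorch, assuming a KL-form distillation term and an advantage-based gate; the function name, gate rule, threshold, and coefficient are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def grpo_sdar_loss(logits, prior_logits, actions, rewards,
                   gate_threshold=0.0, distill_coef=0.1):
    """Hypothetical GRPO loss with a gated self-distillation term.

    logits:       (G, T, V) current-policy logits for G sampled trajectories
    prior_logits: (G, T, V) logits from the frozen prior policy (the "teacher")
    actions:      (G, T)    sampled token ids
    rewards:      (G,)      scalar episode rewards for the group
    """
    # Group-relative advantage: normalize each reward against the group (standard GRPO).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)              # (G,)

    log_probs = F.log_softmax(logits, dim=-1)
    action_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # (G, T)

    # Policy-gradient term weighted by the group-relative advantage.
    pg_loss = -(adv.unsqueeze(-1) * action_logp).mean()

    # Self-distillation term: per-token KL from the current policy to the frozen prior.
    prior_log_probs = F.log_softmax(prior_logits.detach(), dim=-1)         # teacher gets no gradients
    kl_per_token = (log_probs.exp() * (log_probs - prior_log_probs)).sum(-1)  # (G, T)

    # Assumed gate: apply distillation only to trajectories whose advantage falls
    # below the threshold, so the teacher stabilizes poor rollouts without
    # suppressing exploration on successful ones.
    gate = (adv < gate_threshold).float().unsqueeze(-1)                    # (G, 1)
    distill_loss = (gate * kl_per_token).mean()

    return pg_loss + distill_coef * distill_loss
```

Gating on advantage is only one plausible instantiation; the design point is that distillation is applied selectively inside the RL objective rather than as a uniform auxiliary loss.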
Evaluation was conducted on three benchmarks: ALFWorld (embodied task completion), WebShop (e-commerce navigation), and Search-QA (multi-step information retrieval). The +9.4% on ALFWorld is the headline number. The paper reports consistent improvements across both Qwen2.5 and Qwen3 model families, indicating the technique is architecture-agnostic within the Qwen lineage. [The arXiv preprint] does not disclose compute budgets or training hyperparameters, so reproducibility details remain thin.
Why This Matters
Multi-turn agent training remains brittle. Most RL-based agent systems either collapse into repetitive behavior or fail to generalize across environments. SDAR's gated self-distillation provides a relatively simple fix — one that could be adopted by any team already using GRPO. The approach is especially relevant for coding agents (e.g., SWE-agent, CodeAct) and browser-automation agents (e.g., WebGPT successors) where multi-turn trajectories are long and reward signals sparse.
Limitations
SDAR's gains are demonstrated only on Qwen2.5 and Qwen3 models. Whether the technique transfers to Llama 4, Claude, or GPT-5-class models is an open question. The paper also does not compare against alternative stabilization methods such as PPO with KL penalty, RLOO, or REINFORCE variants. The +9.4% on ALFWorld is impressive, but ALFWorld is a simulated environment — real-world agent tasks introduce additional noise from API latency, tool failures, and ambiguous user queries.
What to watch
Watch for an open-source code release with training hyperparameters and a comparison against PPO + KL penalty. If SDAR is adopted by agent frameworks like LangGraph or CrewAI within 90 days, it signals a shift from custom RL training recipes toward standardized agentic RL pipelines.