
HeRL Framework Uses Hindsight Experience to Improve RL Exploration for LLMs, Boosts GSM8K by 4.1%
Researchers propose HeRL, a reinforcement learning framework that reuses failed trajectories as in-context guidance to improve LLM exploration. The method reports an absolute accuracy gain of 4.1 percentage points on GSM8K over PPO baselines.
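
To make "failed trajectories as in-context guidance" concrete, here is a minimal sketch of one way such a prompt might be assembled. It assumes earlier failed attempts are simply prepended to the question as text. The function name `build_hindsight_prompt`, the `max_hints` parameter, and the prompt layout are hypothetical illustrations, not details taken from HeRL.

```python
from typing import List


def build_hindsight_prompt(
    question: str, failed_attempts: List[str], max_hints: int = 2
) -> str:
    """Prepend recent failed attempts to a question as in-context guidance.

    Hypothetical helper: the prompt layout is an illustrative assumption,
    not the format used by HeRL.
    """
    # With no usable hints, fall back to the plain question prompt.
    if max_hints <= 0 or not failed_attempts:
        return f"Question: {question}\nAnswer:"

    # Keep only the most recent failures so the prompt stays short.
    hints = failed_attempts[-max_hints:]

    lines = ["Earlier attempts that did not reach the correct answer:"]
    for i, attempt in enumerate(hints, start=1):
        lines.append(f"Attempt {i}: {attempt}")
    lines.append("Avoid repeating these mistakes.")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example: two failed rollouts become guidance for the next sample.
    print(build_hindsight_prompt("What is 2 + 3?", ["2 + 3 = 6", "2 + 3 = 4"]))
```

In an RL loop, a prompt like this would be used when sampling new rollouts for a question the policy has already failed on. How the resulting trajectories are scored and fed into the policy update is left out here, since the summary does not describe it.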
