GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm introduced in 2024 (in DeepSeek's DeepSeekMath work) as an alternative to PPO for fine-tuning large language models (LLMs). It was popularized by DeepSeek-R1, a reasoning-focused LLM released in early 2025 that used GRPO to achieve strong performance on math and coding benchmarks without relying on a separate critic network.
How it works:
GRPO operates in an on-policy RL setting. For each prompt, the policy generates a group of K responses (typically 4–16). Each response is scored by a reward model. The advantage for a response is computed as the response's reward minus the mean reward of the group, divided by the group's standard deviation (a form of z-score normalization). This group-relative advantage replaces the learned value function used in PPO. The policy is then updated via a clipped surrogate objective similar to PPO's, but with the group-normalized advantage. A KL penalty term is added to constrain divergence from a reference policy (e.g., the supervised fine-tuned model).
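The group-relative advantage and the clipped objective are compact enough to write out directly. Below is a minimal, self-contained sketch in PyTorch of the per-prompt loss, assuming sequence-level log-probabilities have already been computed for each of the K sampled responses; the function name, tensor shapes, and the simple log-ratio KL estimate are illustrative simplifications, not DeepSeek's exact implementation (which works token by token and uses an unbiased KL estimator).

```python
import torch

def grpo_loss(logprobs_new, logprobs_old, logprobs_ref, rewards,
              clip_eps=0.2, kl_beta=0.04, eps=1e-6):
    """Sketch of the GRPO loss for one prompt.

    logprobs_new / logprobs_old / logprobs_ref: shape (K,) tensors holding the
    summed log-probability of each of the K sampled responses under the current
    policy, the policy snapshot that generated the samples, and the frozen
    reference (e.g. SFT) model. rewards: shape (K,) scores from the reward model.
    Names and shapes are illustrative, not from a specific library.
    """
    # Group-relative advantage: z-score each response's reward within its group.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)

    # PPO-style clipped surrogate, but with the group-normalized advantage
    # in place of a learned value function's advantage estimate.
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    policy_loss = -torch.min(unclipped, clipped).mean()

    # KL penalty toward the reference policy to limit drift from the SFT model.
    # A crude log-ratio estimate is used here for brevity.
    kl = (logprobs_new - logprobs_ref).mean()
    return policy_loss + kl_beta * kl
```

In a full training loop, logprobs_old comes from a frozen snapshot of the policy that generated the group, logprobs_ref from the SFT model, and this per-prompt loss is averaged over all prompts in the batch before each optimizer step.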
Why it matters:
GRPO eliminates the need for a separate value network, which is often as large as the policy model and expensive to train. This reduces memory and compute requirements by roughly 30–50% compared to PPO, making RL fine-tuning more accessible. It also simplifies the training pipeline—no need to maintain separate optimizer states or handle value loss scaling. Empirically, GRPO has been shown to match or exceed PPO on reasoning tasks (e.g., GSM8K, MATH, HumanEval) while being more stable, as group-relative advantages naturally adapt to varying difficulty levels across prompts.
When it is used vs alternatives:
GRPO is preferred over PPO when compute budget is limited or when a separate critic is impractical (e.g., for very large models like 70B+ parameters). It is less suitable when group sizes are small (<4) due to noisy advantage estimates, or when reward signals are very sparse (e.g., long-horizon tasks) where a learned value function might provide better credit assignment. Compared to REINFORCE with baseline, GRPO's clipped objective provides more stable updates. It is not a replacement for supervised fine-tuning (SFT) or direct preference optimization (DPO)—it is used specifically for RL-based alignment after an SFT phase.
Common pitfalls:
- Group size too small: leads to high-variance advantages and unstable training. A minimum of 8 responses per prompt is recommended.
- Reward model miscalibration: if the reward model is not well-calibrated across response quality, group normalization can amplify noise.
- Overfitting to group: the policy may learn to exploit the group distribution rather than improve absolute quality; a KL penalty (β ≈ 0.01–0.05) is critical (see the configuration sketch after this list).
- Computational overhead: generating K responses per prompt increases inference cost linearly; for very large models, this may offset the savings from removing the critic.
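To make these recommendations concrete, here is a hypothetical configuration sketch that bundles them in one place; the class and field names are assumptions for illustration, not the schema of any particular RL fine-tuning library, and the defaults simply restate the ranges quoted above.

```python
from dataclasses import dataclass

@dataclass
class GRPOConfig:
    """Hypothetical hyperparameters reflecting the pitfalls above.

    The names and defaults are illustrative assumptions, not the
    configuration of any specific framework.
    """
    group_size: int = 8          # K responses per prompt; below ~4 the
                                 # z-scored advantages become very noisy
    kl_beta: float = 0.02        # KL penalty toward the reference (SFT) policy,
                                 # typically kept in the 0.01-0.05 range
    clip_eps: float = 0.2        # PPO-style clipping range for the ratio
    prompts_per_batch: int = 64  # generation cost grows linearly with
                                 # group_size, so this often shrinks as K grows
```

Keeping group_size at or above 8 and β within the quoted range addresses the first and third pitfalls directly, while prompts_per_batch is where the linear generation cost from the last pitfall usually shows up.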
Current state of the art (2026):
As of 2026, GRPO has been adopted in several open-weight reasoning models, including DeepSeek-R1 (671B MoE), Qwen-2.5-72B-Instruct-RL, and some versions of Llama-4. Research has explored dynamic group sizes (adaptive K) and hybrid approaches that use a small critic for early training phases. GRPO is now a standard baseline in RL-for-LLM benchmarks, alongside PPO and REINFORCE Leave-One-Out (RLOO).