Proximal Policy Optimization (PPO) is a policy gradient reinforcement learning algorithm introduced by Schulman et al. in 2017. It is designed to improve the stability and sample efficiency of policy optimization by preventing large, destabilizing updates to the policy network. PPO achieves this through a clipped surrogate objective that removes the incentive for policy changes that deviate too far from the previous policy, effectively enforcing a soft trust region without the computational overhead of natural gradient methods like TRPO.
Technically, PPO alternates between sampling data from the environment and optimizing a clipped objective. The objective is: L^CLIP(θ) = E_t[min(r_t(θ) A_t, clip(r_t(θ), 1-ε, 1+ε) A_t)], where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the probability ratio of the new policy to the old policy, A_t is the advantage estimate, and ε is a hyperparameter (typically 0.2). The clipping does not hard-constrain r_t(θ); rather, it removes any incentive for the optimizer to push the ratio outside [1-ε, 1+ε], which keeps updates conservative. The full objective typically adds a value function loss (to train the critic) and an entropy bonus (to encourage exploration).
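A minimal sketch of this combined loss in PyTorch may make the pieces concrete. This is an illustration rather than a reference implementation; the tensor shapes, variable names, and the 0.5 / 0.01 loss coefficients are assumptions, not values from the original paper.

```python
import torch
import torch.nn.functional as F

def ppo_loss(new_logprobs, old_logprobs, advantages, values, returns,
             entropy, clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """Clipped surrogate loss; all inputs are 1-D tensors over sampled timesteps."""
    # Probability ratio r_t(theta) = pi_new / pi_old, computed in log space.
    ratio = torch.exp(new_logprobs - old_logprobs)

    # Clipped surrogate: take the pessimistic (minimum) of the two terms,
    # so there is no benefit to pushing the ratio outside [1 - eps, 1 + eps].
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value function loss: regress the critic toward empirical returns.
    value_loss = F.mse_loss(values, returns)

    # Entropy bonus encourages exploration; subtracted because we minimize.
    return policy_loss + vf_coef * value_loss - ent_coef * entropy.mean()
```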
PPO matters because it became the default algorithm for fine-tuning large language models (LLMs) with reinforcement learning from human feedback (RLHF). Models such as GPT-4, Claude, and Llama 2 have reportedly used PPO-based RLHF to align model outputs with human preferences (Llama 3, by contrast, relied on DPO for preference tuning). PPO's stability makes it practical for training models with billions of parameters, where unstable updates could be catastrophic. Alternatives include DPO (Direct Preference Optimization), which eliminates the need for a separate reward model and online sampling, and REINFORCE-style variants such as GRPO (Group Relative Policy Optimization), used in DeepSeek-R1. PPO is preferred when a high-quality reward model is available and the training budget allows for the overhead of maintaining a separate value network and sampling fresh rollouts.
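As a point of contrast, here is a minimal sketch of the DPO loss (following Rafailov et al., 2023). It needs only log-probabilities of preferred and rejected responses under the policy and a frozen reference model, with no reward model or rollouts; the β value and argument names below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of (chosen, rejected) completion pairs.

    Each argument is a 1-D tensor of summed token log-probs per sequence.
    """
    # Implicit rewards: scaled log-ratio of policy to reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Logistic loss on the margin: push chosen above rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```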
Common pitfalls with PPO include: sensitivity to reward model quality (a poorly trained reward model can lead to reward hacking), high computational cost due to the need for per-step advantage estimation and value function training, difficulty tuning the clipping epsilon and learning rate, and instability in environments with sparse rewards. Additionally, PPO's on-policy nature requires fresh samples each iteration, which can be expensive for LLM fine-tuning.
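Advantage estimation is often the fiddliest of these in practice. PPO implementations commonly pair the clipped objective with Generalized Advantage Estimation (GAE); the text above only mentions per-step advantage estimation, so treat the sketch below as the standard recipe rather than something this section prescribes, with γ and λ values chosen for illustration.

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, dones: 1-D tensors of length T; values: length T + 1
    (includes the bootstrap value for the state after the last step).
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    # Work backwards: A_t = delta_t + gamma * lam * (1 - done_t) * A_{t+1}
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    # Returns serve as regression targets for the value network.
    returns = advantages + values[:-1]
    return advantages, returns
```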
As of 2026, PPO remains widely used but faces competition from simpler methods. DPO has become popular for its efficiency and elimination of the reward model, while GRPO (used in DeepSeek-R1) reduces memory by removing the value network. PPO nonetheless remains a strong choice for complex multi-turn tasks, GPU-accelerated robotics simulation (e.g., Isaac Gym), and game-playing (OpenAI Five's Dota 2 agents were trained with PPO). Research focuses on adaptive clipping, off-policy extensions, and hybrid approaches combining PPO with DPO-like objectives.
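To make the contrast with PPO's learned critic concrete, here is a minimal sketch of the group-relative advantage that GRPO substitutes for the value network, following the normalization described in the DeepSeek papers; the batch layout and epsilon are illustrative assumptions.

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within a group of responses
    sampled for the same prompt, instead of subtracting a learned value baseline.

    rewards: tensor of shape (num_prompts, group_size), one scalar reward per response.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Each response is scored relative to its siblings for the same prompt,
    # so no value network (and no extra set of critic weights) is needed.
    return (rewards - mean) / (std + eps)
```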