Reinforcement Learning from Human Feedback (RLHF) is a multi-stage training paradigm that fine-tunes a pretrained language model (or other generative model) to produce outputs that better align with human values, preferences, and safety criteria. It is the core technique behind models like OpenAI's GPT-4, Anthropic's Claude, and Google's Gemini.
How it works (technically): RLHF typically involves three phases:
1. Supervised fine-tuning (SFT): A pretrained base model is first fine-tuned on high-quality demonstrations (e.g., human-written responses) to establish a baseline of helpful behavior.
2. Reward model training: A separate reward model (usually initialized from the SFT model) is trained on a dataset of human comparisons. For each prompt, humans rank two or more model outputs. The reward model learns to predict which output a human would prefer, producing a scalar reward score. Modern reward models often use a Bradley-Terry preference framework and are trained with a binary cross-entropy loss over pairwise comparisons (a code sketch of this loss follows the list).
3. Policy optimization via PPO: The language model (now called the policy) is updated using Proximal Policy Optimization (PPO) to maximize the expected reward from the reward model, while a KL-divergence penalty prevents the policy from diverging too far from the SFT model (to avoid reward hacking and preserve fluency). The PPO update uses a clipped surrogate objective, and the reward model's output is typically normalized per batch; both pieces are sketched after the list.
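To make the reward-model objective in step 2 concrete, here is a minimal PyTorch sketch of the Bradley-Terry pairwise loss. The function name and the dummy scores are illustrative assumptions; in practice the scores come from a scalar head on top of the SFT backbone applied to tokenized prompt-plus-response pairs.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(chosen_rewards: torch.Tensor,
                             rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss for reward model training.

    chosen_rewards / rejected_rewards: (batch,) scalar scores the reward model
    assigns to the human-preferred and dispreferred responses for the same prompt.
    Minimizing this is the binary cross-entropy of predicting the human choice.
    """
    # Under Bradley-Terry, P(chosen preferred) = sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy scores for illustration only:
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, 1.1])
print(pairwise_preference_loss(chosen, rejected).item())
```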
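And a corresponding sketch of step 3: the clipped surrogate loss and the KL-shaped reward, again in PyTorch. Tensor shapes, the `clip_eps` and `kl_coef` defaults, and the function names are illustrative assumptions; per-batch reward normalization and advantage estimation (e.g., GAE) are omitted.

```python
import torch

def ppo_clipped_loss(new_logprobs: torch.Tensor,
                     old_logprobs: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """PPO clipped surrogate objective, negated so it can be minimized.

    All arguments are per-token tensors of shape (batch, seq) over response tokens.
    """
    ratio = torch.exp(new_logprobs - old_logprobs)  # pi_new / pi_old per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def kl_shaped_reward(rm_score: torch.Tensor,
                     policy_logprobs: torch.Tensor,
                     sft_logprobs: torch.Tensor,
                     kl_coef: float = 0.1) -> torch.Tensor:
    """Sequence reward: reward-model score minus a KL penalty toward the SFT model.

    rm_score has shape (batch,); the log-prob tensors have shape (batch, seq).
    """
    per_token_kl = policy_logprobs - sft_logprobs
    return rm_score - kl_coef * per_token_kl.sum(dim=-1)

# Dummy example (real inputs come from the policy, the frozen SFT model, and the reward model):
b, t = 2, 5
print(ppo_clipped_loss(torch.randn(b, t), torch.randn(b, t), torch.randn(b, t)).item())
```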
Why it matters: RLHF directly addresses the misalignment problem — the gap between a model's statistical objectives (e.g., next-token prediction) and desirable human outcomes (helpfulness, honesty, harmlessness). It enables models to refuse harmful requests, reduce bias, and generate more coherent, context-aware responses. Without RLHF, even large SFT-only models often produce plausible-sounding but incorrect or toxic outputs.
When it is used vs. alternatives: RLHF is the dominant alignment method for large-scale conversational agents (100B+ parameters). Alternatives include:
- Direct Preference Optimization (DPO): A simpler, RL-free method that directly optimizes the policy on preference pairs without a separate reward model. DPO is often easier to tune and less computationally expensive, but may not scale as well to complex reward structures (see the loss sketch after this list).
- Constitutional AI (CAI): Used by Anthropic, CAI uses a set of written principles and self-critique to supervise the model, reducing reliance on human raters.
- Rejection sampling / Best-of-N: A cheaper alternative that samples multiple outputs and selects the one with the highest reward model score, but does not update the policy (a few-line sketch follows the DPO example below).
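As a rough illustration of how DPO removes the explicit reward model, here is a minimal PyTorch version of its loss (Rafailov et al., 2023). Each argument is assumed to be a (batch,)-shaped tensor of summed response log-probabilities under the trainable policy or the frozen reference (SFT) model; the dummy call at the end is illustrative only.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """DPO loss over a batch of preference pairs.

    Each argument is a (batch,) tensor of summed log-probabilities of the chosen
    or rejected response under the policy or the frozen reference model.
    beta controls how strongly the policy is pulled toward the preferences.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # The log-ratio difference plays the role of an implicit reward margin.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Dummy example (real log-probs come from scoring tokenized responses):
lp = lambda: torch.randn(4)
print(dpo_loss(lp(), lp(), lp(), lp()).item())
```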
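Best-of-N is simple enough to sketch in a few lines; `generate_fn` and `reward_fn` are hypothetical callables wrapping the policy and the reward model.

```python
def best_of_n(prompt: str, generate_fn, reward_fn, n: int = 16) -> str:
    """Sample n candidate responses and return the one the reward model scores highest.

    generate_fn(prompt) -> str and reward_fn(prompt, response) -> float are
    hypothetical callables; the policy itself is never updated.
    """
    candidates = [generate_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward_fn(prompt, c))
```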
Common pitfalls:
- Reward hacking: The policy learns to exploit the reward model (e.g., generating verbose or sycophantic responses) instead of genuinely improving. Mitigated by KL regularization and careful reward model training.
- Reward model overfitting: The reward model may memorize human raters' idiosyncrasies or spurious correlations. Using diverse, high-quality human data and periodic evaluation on held-out sets is critical.
- Mode collapse: PPO can narrow the policy's output distribution, reducing diversity. Techniques like entropy bonuses and temperature scaling help (an entropy-bonus sketch follows this list).
- Scalability: Collecting high-quality human preference data is expensive and slow. For a model like GPT-4, tens of thousands of preference pairs are used; scaling to more nuanced tasks remains an active research area.
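As an illustration of the entropy-bonus mitigation mentioned under mode collapse, the sketch below computes a per-token policy entropy term that can be added to the PPO objective (i.e., subtracted from the loss). The coefficient value is an illustrative assumption.

```python
import torch

def entropy_bonus(logits: torch.Tensor, coef: float = 0.01) -> torch.Tensor:
    """Mean per-token entropy of the policy, scaled by coef.

    logits: (batch, seq, vocab). Adding this term to the objective discourages
    the output distribution from collapsing onto a few high-reward responses.
    """
    probs = torch.softmax(logits, dim=-1)
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(probs * log_probs).sum(dim=-1)  # (batch, seq)
    return coef * entropy.mean()
```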
Current state of the art (2026): RLHF remains the backbone of frontier models, but hybrid approaches are emerging. OpenAI's o3 and o4 series are reported to use a variant called "process-supervised RLHF" that rewards intermediate reasoning steps rather than only final answers. Anthropic's Claude 4 employs a combination of RLHF and CAI. There is growing interest in RL from AI Feedback (RLAIF), where an AI (e.g., a larger model) generates preference labels, reducing human cost. Research also focuses on multi-objective RLHF to balance helpfulness, harmlessness, and honesty. The KL-penalty coefficient in PPO is now often tuned adaptively, for example based on online reward model uncertainty.
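A common starting point for adaptive KL tuning is the proportional controller of Ziegler et al. (2019), sketched below: it raises the penalty coefficient when the measured policy/SFT KL exceeds a target and lowers it otherwise. Driving the target or coefficient from reward-model uncertainty, as described above, would replace the fixed `target_kl` and is not shown; the default values are illustrative assumptions.

```python
class AdaptiveKLController:
    """Proportional controller for the KL-penalty coefficient (Ziegler et al., 2019)."""

    def __init__(self, init_coef: float = 0.1, target_kl: float = 6.0, horizon: int = 10_000):
        self.coef = init_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl: float, n_steps: int) -> float:
        # Relative error, clipped to keep each adjustment conservative.
        error = max(min(observed_kl / self.target_kl - 1.0, 0.2), -0.2)
        self.coef *= 1.0 + error * n_steps / self.horizon
        return self.coef

# Illustrative usage: after each PPO batch, pass the measured KL and batch size.
controller = AdaptiveKLController()
kl_coef = controller.update(observed_kl=8.0, n_steps=256)
```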