Technique · alignment
Reinforcement Learning from Human Feedback (RLHF)
A three-stage recipe (supervised fine-tuning → reward model trained on human comparisons → PPO) that aligns language-model outputs with human preferences. InstructGPT is the canonical reference.
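As a rough illustration of stage two of the recipe (the reward model fit to human comparisons), here is a minimal sketch of the pairwise Bradley–Terry loss used in the InstructGPT paper: the model is pushed to score the human-preferred completion above the rejected one. The `reward_model_loss` helper and the toy scalar rewards are illustrative assumptions, not code from any of the systems listed below.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss: maximize log sigmoid(r_chosen - r_rejected)
    so the reward model ranks the human-preferred completion higher."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: scalar rewards for a batch of four comparison pairs (hypothetical values).
chosen = torch.tensor([1.2, 0.7, 0.3, 2.1])
rejected = torch.tensor([0.4, 0.9, -0.5, 1.0])
print(reward_model_loss(chosen, rejected))  # loss shrinks as chosen outscores rejected
```

Stage three then optimizes the policy against this learned reward with PPO, typically with a KL penalty keeping the policy close to the SFT model.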
Products deploying: 3
Avg research → prod: 4y
First commercial deploy: 4y
Deployment timeline
- GPT-5.2 Pro · high confidence
Deployed 2026-02-17 · Velocity 4y
“OpenAI's alignment approach for flagship models is built on RLHF, as documented for GPT-4 and previous models.”
- GPT-5.3 · medium confidence
Deployed 2026-02-26 · Velocity 4y
“OpenAI pioneered RLHF with InstructGPT; GPT-5.3 continues this alignment approach.”
- DeepSeek-R1 · high confidence
Deployed 2026-03-17 · Velocity 4y
“Trained via reinforcement learning from human feedback to align with preferences.”