gentic.news — AI News Intelligence Platform

Technique · alignment

Reinforcement Learning from Human Feedback (RLHF)

A three-stage recipe (SFT → reward model from human comparisons → PPO) that aligns LM outputs with human preferences. InstructGPT is the canonical reference.
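The stages are easiest to see in code. Below is a minimal sketch, assuming a PyTorch-style setup, of the two objectives specific to this recipe: the stage-2 pairwise (Bradley-Terry) comparison loss for the reward model, and the stage-3 KL-penalized reward that PPO optimizes. Names, shapes, and the kl_coef value are illustrative assumptions, not InstructGPT's actual implementation.

```python
# Sketch of the RLHF objectives described above (illustrative, not InstructGPT's code).
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> torch.Tensor:
    """Stage 2: pairwise (Bradley-Terry) loss over human comparisons,
    -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

def kl_shaped_reward(rm_score: torch.Tensor,
                     logprob_policy: torch.Tensor,
                     logprob_ref: torch.Tensor,
                     kl_coef: float = 0.1) -> torch.Tensor:
    """Stage 3: the scalar reward PPO maximizes -- the reward-model score
    minus a KL penalty that keeps the policy close to the SFT reference."""
    return rm_score - kl_coef * (logprob_policy - logprob_ref)

# Toy usage: scores the reward model assigned to the human-preferred ("chosen")
# and dispreferred ("rejected") completions of the same prompts.
chosen = torch.tensor([1.2, 0.3, 0.8])
rejected = torch.tensor([0.4, 0.5, -0.1])
print(reward_model_loss(chosen, rejected))  # lower when chosen scores exceed rejected
print(kl_shaped_reward(torch.tensor(0.9), torch.tensor(-12.0), torch.tensor(-11.5)))
```

The KL term is what distinguishes stage 3 from naive reward maximization: it penalizes the policy for drifting far from the supervised fine-tuned model, which limits reward hacking and keeps outputs fluent.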

Origin: OpenAI, 2022-03 · Read origin paper → · Also known as: RLHF, Human Feedback RL
Products deploying: 3 · Avg research → prod: 4y · First commercial deploy: 4y

Deployment timeline

  1. GPT-5.2 Pro

    Deployed 2026-02-17 · Velocity 4y

    OpenAI's alignment approach for flagship models is built on RLHF, as documented for GPT-4 and previous models.

Confidence: high
  2. GPT-5.3

    Deployed 2026-02-26 · Velocity 4y

    OpenAI pioneered RLHF with InstructGPT; GPT-5.3 continues this alignment approach.

Confidence: medium
  3. DeepSeek-R1

    Deployed 2026-03-17 · Velocity 4y

Trained with reinforcement learning from human feedback to align its outputs with human preferences.

Confidence: high