The Deployment Atlas
When AI research reaches production.
For every foundational AI technique of the modern era — transformers, RLHF, FlashAttention, Constitutional AI, speculative decoding, DPO, MoE — we track the origin paper, the first commercial deployment, and the velocity between. Every edge is sourced. Every claim is evidenced. The full dataset is free and open.
Technique × product pairs, each with sourced evidence.
Typical lag from origin paper to first commercial deploy.
Hand-curated, with a single origin paper each. No false 1:1 paper-to-product claims.
Fastest deploy ever
Llama 4 Maverick shipped YaRN RoPE Context Extension in 583 days.
Slowest deploy
Kimi K2.6 shipped Mixture of Experts (Sparse MoE for LLMs) 9 years after the origin paper.
Every canonical technique
Grouped by category. Click any card for origin paper, deployment timeline, and prior art.
agents · 3 techniques
ReAct (Reason + Act)
Princeton / Google · 2022-10
An agent pattern that interleaves reasoning traces with tool-use actions, using each observation to refine the next reasoning step.
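The loop itself is simple to sketch. A minimal illustration in Python, where `mock_model`, the `lookup` tool, and the `Thought:/Action:/Observation:` text format are simplified stand-ins, not the paper's exact setup:

```python
def mock_model(history: str) -> str:
    # Hypothetical stand-in for a real LM: emits a thought plus action, or an answer.
    if "Observation: 42" in history:
        return "Answer: 42"
    return "Thought: I need to look that up.\nAction: lookup[the answer]"

TOOLS = {"lookup": lambda query: "42"}  # hypothetical tool registry

def react_loop(question: str, max_steps: int = 5) -> str:
    history = f"Question: {question}"
    for _ in range(max_steps):
        step = mock_model(history)              # reason and/or pick an action
        history += "\n" + step
        if step.startswith("Answer:"):
            return step[len("Answer:"):].strip()
        if "Action:" in step:
            call = step.split("Action:", 1)[1].strip()   # e.g. "lookup[the answer]"
            name, arg = call.rstrip("]").split("[", 1)
            history += f"\nObservation: {TOOLS[name](arg)}"  # feed result back in
    return ""
```

Each observation is appended to the transcript, so the next model call reasons over everything seen so far.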
Toolformer (Tool Use)
Meta AI · 2023-02
Self-supervised approach where an LM learns when and how to call external APIs by generating and filtering its own tool-use demonstrations.
Reflexion
Northeastern / MIT · 2023-03
Agent framework that converts environment feedback into verbal self-reflection stored in memory, improving performance across trials without weight updates.
alignment · 9 techniques
Deep RL from Human Preferences
OpenAI · 2017-06
Learning reward functions from pairwise human comparisons rather than hand-coded rewards. The direct precursor to RLHF.
Red-Teaming with Preference Models
Google DeepMind · 2022-02
Using an LM to generate adversarial prompts that elicit harmful behavior, scaling safety evaluation far beyond human red-teaming.
Reinforcement Learning from Human Feedback (RLHF)
OpenAI · 2022-03
A three-stage recipe (SFT → reward model from human comparisons → PPO) that aligns LM outputs with human preferences. InstructGPT is the canonical reference.
Constitutional AI
Anthropic · 2022-12
Training harmless assistants using a written constitution of principles and an AI-generated critique/revision loop rather than human labels for every case.
Direct Preference Optimization (DPO)
Stanford · 2023-05
Aligning LMs to preference data by directly optimizing a closed-form likelihood ratio, eliminating the reward model and RL loop of RLHF.
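The per-pair objective is compact enough to sketch directly from summed token log-probabilities. Function and argument names below are illustrative, not the paper's code:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))),
    where each term is a summed token log-prob under the policy / reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Raising the chosen completion's likelihood relative to the reference (and lowering the rejected one's) shrinks the loss, with no reward model or RL loop in sight.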
RLAIF (Reinforcement Learning from AI Feedback)
Google · 2023-09
Using an off-the-shelf LLM to generate preference labels, scaling preference learning without human annotators.
Identity Preference Optimization (IPO)
Google DeepMind · 2023-10
A preference-optimization variant that adds an explicit regularizer to avoid DPO's tendency to overfit the preference data.
Self-Rewarding Language Models
Meta AI · 2024-01
Iterative alignment where the LM judges its own outputs using an LLM-as-a-judge prompt, removing human-labeled preferences from the loop.
KTO (Kahneman-Tversky Optimization)
Contextual AI · 2024-02
Alignment method, inspired by prospect theory, that treats individual completions as binary good/bad signals, so no preference pairs are needed.
architecture · 7 techniques
Mixture of Experts (Sparse MoE for LLMs)
Google · 2017-01
An architecture where a router activates only a subset of expert sub-networks per token, scaling parameter count without proportional compute cost.
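A minimal sketch of top-k routing, the core of sparse MoE. Names and shapes are illustrative; a real router scores batched hidden states and experts are full feed-forward blocks:

```python
import math

def top_k_route(logits: list, k: int = 2) -> list:
    """Pick the k highest-scoring experts and softmax-normalize their gates."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    z = [math.exp(logits[i]) for i in top]
    total = sum(z)
    return [(i, w / total) for i, w in zip(top, z)]

def moe_forward(x: float, experts: list, router_logits: list, k: int = 2) -> float:
    """Run only the routed experts and mix their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))
```

Only k of the experts execute per token, which is exactly how parameter count grows without a matching growth in per-token compute.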
Transformer Self-Attention
Google · 2017-06
A sequence-to-sequence architecture that replaces recurrence with scaled dot-product attention, enabling parallel training and long-range context modeling.
Rotary Position Embedding (RoPE)
Zhuiyi Technology · 2021-04
A relative-position encoding that rotates query/key vectors in complex space, giving transformers better length extrapolation than absolute sinusoidal embeddings.
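The rotation itself is a few lines. A sketch assuming one frequency per (even, odd) dimension pair, as in the paper; the function name is ours:

```python
import math

def rope_rotate(vec: list, pos: int, base: float = 10000.0) -> list:
    """Rotate consecutive (even, odd) pairs of a query/key vector by a
    position-dependent angle, one frequency per pair."""
    out = []
    for i in range(0, len(vec), 2):
        theta = pos * base ** (-i / len(vec))   # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out
```

The useful property: the dot product between a rotated query at position m and a rotated key at position n depends only on the offset m - n, which is what makes the encoding relative.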
Grouped-Query Attention (GQA)
Google · 2023-05
An attention variant in which groups of query heads share a single key/value head, reducing KV-cache memory at minimal quality loss.
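The head-sharing pattern reduces to integer arithmetic. A sketch with hypothetical head counts (32 query heads sharing 8 KV heads):

```python
def kv_head_for(q_head: int, n_q_heads: int = 32, n_kv_heads: int = 8) -> int:
    """Map a query head to its shared KV head: consecutive groups of
    n_q_heads // n_kv_heads query heads read the same key/value head."""
    return q_head // (n_q_heads // n_kv_heads)
```

With these counts, the KV cache is 4x smaller than full multi-head attention, while avoiding the quality drop of collapsing to a single KV head (multi-query attention).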
YaRN RoPE Context Extension
Nous Research · 2023-08
A method to extend RoPE-based models to much longer contexts via frequency-dependent interpolation, with minimal fine-tuning data.
Mamba / Selective State Space Models
CMU / Princeton · 2023-12
A state-space sequence model with input-dependent selection that matches Transformer quality with linear-time inference and no fixed context window.
Mixture of Depths
Google DeepMind · 2024-04
A technique letting tokens skip transformer layers when unnecessary, allocating compute adaptively based on token importance.
inference · 8 techniques
FlashAttention
Stanford · 2022-05
A tiled, IO-aware attention kernel that computes exact attention in linear memory by keeping each tile's intermediates in SRAM instead of materializing the full attention matrix.
Continuous Batching
Seoul National University · 2022-07
A scheduling technique that adds/removes requests at the iteration level rather than the batch level, dramatically increasing throughput for LLM serving.
INT8 Weight Quantization for LLMs
University of Washington · 2022-08
Row-wise and vector-wise INT8 quantization with outlier detection that enables zero-degradation 8-bit inference of LLMs.
GPTQ Quantization
ISTA · 2022-10
Post-training quantization to 3-4 bits using second-order information, enabling 175B-parameter LLMs to run inference on a single GPU.
Speculative Decoding
Google · 2022-11
An inference technique where a small draft model proposes tokens and a large model verifies them in parallel, yielding 2-3x speedup without quality loss.
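A sketch of one draft-and-verify round, using the greedy-acceptance variant for clarity (the paper's rejection-sampling rule preserves the target distribution exactly). `draft_next` and `target_next` are stand-in next-token functions:

```python
def speculative_step(draft_next, target_next, prefix: list, k: int = 4) -> list:
    """One round of speculative decoding: draft proposes, target verifies."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # 2. The target model checks every position (a single parallel pass in practice).
    accepted, ctx = [], list(prefix)
    for t in proposal:
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break
    # 3. On mismatch (or full acceptance), the target supplies one more token,
    #    so every round emits at least one guaranteed-correct token.
    accepted.append(target_next(ctx))
    return accepted
```

When the draft agrees often, each expensive target pass yields several tokens instead of one, which is where the 2-3x speedup comes from.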
AWQ (Activation-Aware Weight Quantization)
MIT · 2023-06
4-bit weight quantization that preserves salient weights based on activation magnitudes, matching GPTQ quality with faster inference.
PagedAttention (vLLM)
UC Berkeley · 2023-09
A memory-management scheme for KV cache modeled on OS paging, eliminating fragmentation and enabling high-throughput serving.
StreamingLLM (Attention Sinks)
MIT · 2023-09
A sliding-window attention pattern with preserved initial tokens ("sinks") that enables indefinite streaming generation without quality collapse.
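Which KV-cache positions survive is easy to sketch. An illustration with hypothetical defaults of 4 sink tokens and an 8-token recent window:

```python
def streaming_cache(tokens: list, n_sink: int = 4, window: int = 8) -> list:
    """Return the cache positions StreamingLLM-style eviction would keep:
    the first n_sink tokens plus a sliding window of the most recent ones."""
    kept = list(range(min(n_sink, len(tokens))))
    start = max(n_sink, len(tokens) - window)
    kept += list(range(start, len(tokens)))
    return kept
```

Keeping the initial "sink" tokens is the whole trick: attention mass concentrates there, and evicting them is what causes the quality collapse of a plain sliding window.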
interpretability · 1 technique
multimodal · 6 techniques
Vision Transformer (ViT)
Google · 2020-10
Applying a standard Transformer directly to sequences of image patches, establishing Transformers as the dominant image-recognition backbone.
CLIP (Contrastive Language-Image Pretraining)
OpenAI · 2021-02
Dual-encoder model trained on 400M image-caption pairs to align image and text embeddings, enabling zero-shot visual classification.
Latent Diffusion
LMU Munich / RunwayML · 2021-12
Diffusion performed in a compressed VAE latent space, making high-resolution image generation tractable on consumer GPUs.
Flamingo (Cross-Attention VLMs)
Google DeepMind · 2022-04
Cross-attention layers interleaved into a frozen LLM that attend to vision features, enabling few-shot visual question answering.
Whisper (Robust Speech Recognition)
OpenAI · 2022-12
Encoder-decoder Transformer trained on 680k hours of weakly-supervised multilingual speech, setting new robustness benchmarks across accents and noise.
LLaVA (Visual Instruction Tuning)
University of Wisconsin · 2023-04
Projecting CLIP features into an LLM's token space via a simple projector + instruction tuning on GPT-4-generated visual conversations.
reasoning · 6 techniques
Chain-of-Thought Prompting
Google · 2022-01
A prompting technique that elicits step-by-step reasoning by showing exemplars that include intermediate reasoning steps.
Self-Consistency
Google · 2022-03
Sample multiple CoT completions and take the majority-vote answer, substantially improving reasoning accuracy.
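The whole method is sampling plus a majority vote. A sketch where `sample_answer` stands in for running one stochastic CoT completion and extracting its final answer:

```python
from collections import Counter

def self_consistency(sample_answer, n: int = 10) -> str:
    """Sample n chain-of-thought answers and return the majority-vote answer."""
    votes = Counter(sample_answer() for _ in range(n))
    return votes.most_common(1)[0][0]
```

Marginalizing over reasoning paths this way filters out the occasional bad chain that a single greedy decode would commit to.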
Zero-Shot Chain-of-Thought
University of Tokyo · 2022-05
Eliciting step-by-step reasoning without few-shot exemplars, simply by appending a phrase like "let's think step by step".
Tree of Thoughts
Princeton / Google DeepMind · 2023-05
Reasoning over a tree of intermediate thoughts with explicit look-ahead, backtracking, and self-evaluation, beyond linear CoT.
Process Reward Models
OpenAI · 2023-05
Reward models trained to score each intermediate reasoning step rather than only the final answer, enabling superior reasoning policy learning.
Test-Time Compute Scaling
Google DeepMind · 2024-08
Allocating more compute at inference (longer reasoning chains, multiple samples + verifier) can outperform scaling parameters — the basis for o1-style reasoning models.
retrieval · 2 techniques
Dense Passage Retrieval (DPR)
Meta AI · 2020-04
Learned dual-encoder retrieval that outperforms BM25 on open-domain QA by training encoders on question-passage pairs.
Retrieval-Augmented Generation (RAG)
Meta AI · 2020-05
Conditioning generation on retrieved passages from a non-parametric memory, combining parametric and retrieval-based knowledge.
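A sketch of the retrieve-then-generate loop, with `embed` and `generate` as stand-ins for the trained retriever and generator (dot-product similarity, as in dense retrieval):

```python
def rag_answer(question: str, corpus: list, embed, generate, k: int = 2) -> str:
    """Retrieve the k nearest passages by embedding similarity, then
    condition generation on them."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    q = embed(question)
    ranked = sorted(corpus, key=lambda p: dot(embed(p), q), reverse=True)
    context = "\n".join(ranked[:k])
    return generate(f"Context:\n{context}\n\nQuestion: {question}")
```

The original paper trains retriever and generator jointly and marginalizes over passages; modern "RAG" deployments mostly use this simpler retrieve-then-prompt form.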
training · 7 techniques
LoRA (Low-Rank Adaptation)
Microsoft · 2021-06
Parameter-efficient fine-tuning that injects low-rank decomposition matrices into attention weights, training <1% of parameters.
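The forward pass with a LoRA adapter is just an extra low-rank branch. A sketch with plain nested lists standing in for tensors (names follow the paper's A, B, alpha, r):

```python
def lora_forward(x: list, W: list, A: list, B: list,
                 alpha: float = 16, r: int = 2) -> list:
    """y = xW + (alpha/r) * xAB, where W (d x k) is frozen and only
    A (d x r) and B (r x k) are trained. B starts at zero, so the
    adapter is a no-op at initialization."""
    def matmul(u, M):
        return [sum(u[i] * M[i][j] for i in range(len(u))) for j in range(len(M[0]))]
    base = matmul(x, W)                   # frozen pretrained path
    delta = matmul(matmul(x, A), B)       # trained low-rank path
    return [b + (alpha / r) * d for b, d in zip(base, delta)]
```

With r much smaller than d and k, the trained parameter count drops by orders of magnitude, and the learned A·B can be merged back into W after training at zero inference cost.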
Instruction Tuning (FLAN)
Google · 2021-09
Fine-tuning a pretrained LM on a mixture of tasks phrased as natural-language instructions, enabling strong zero-shot generalization.
Chinchilla Scaling Laws
Google DeepMind · 2022-03
Scaling law showing compute-optimal models use ~20 training tokens per parameter — correcting prior over-parameterization in GPT-3-era models.
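The rule of thumb is plain arithmetic. A sketch, paired with the standard C ≈ 6ND training-compute estimate (the latter is the usual community approximation, not unique to the Chinchilla paper):

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20) -> float:
    """Compute-optimal training-token budget under the ~20 tokens/param rule."""
    return n_params * tokens_per_param

def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard C ~= 6 * N * D estimate of total training compute."""
    return 6 * n_params * n_tokens

# e.g. a 70B-parameter model wants roughly 1.4T training tokens
```

By this rule a GPT-3-sized 175B model would have wanted ~3.5T tokens, an order of magnitude more than it was trained on, which is the over-parameterization the paper corrected.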
Self-Instruct
University of Washington · 2022-12
Bootstrapping instruction-tuning data by having an LM generate its own instructions, inputs, and outputs from a small seed set.
QLoRA
University of Washington · 2023-05
LoRA fine-tuning on 4-bit quantized base weights, enabling 65B-model fine-tuning on a single 48GB GPU.
Synthetic Data Distillation (Orca)
Microsoft Research · 2023-06
Training smaller models on GPT-4-generated explanation traces rather than answer-only demonstrations, closing the capability gap.
Rejection Sampling Fine-Tuning
Meta AI · 2023-07
Sampling multiple completions, scoring with a reward model, and fine-tuning on the top samples — a simpler alternative to PPO used in Llama 2.
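The recipe is a short loop. A sketch where `generate` and `reward` stand in for the policy model and the reward model:

```python
def rejection_sample_dataset(prompts: list, generate, reward,
                             n: int = 8, keep: int = 1) -> list:
    """For each prompt, sample n completions, score them with a reward
    model, and keep the top `keep` as supervised fine-tuning data."""
    data = []
    for p in prompts:
        samples = [generate(p) for _ in range(n)]
        samples.sort(key=reward, reverse=True)   # best-scoring first
        data += [(p, s) for s in samples[:keep]]
    return data
```

The resulting (prompt, best completion) pairs feed an ordinary SFT run, trading PPO's machinery for a best-of-n filter.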
Open dataset
Every technique, paper, and deployment is freely available under CC BY 4.0. API endpoint: /api/v1/atlas/techniques. Cite us as: gentic.news Deployment Atlas (2026).