The Deployment Atlas
When AI research reaches production.
For every foundational AI technique of the modern era — transformers, RLHF, FlashAttention, Constitutional AI, speculative decoding, DPO, MoE — we track the origin paper, the first commercial deployment, and the velocity between. Every edge is sourced. Every claim is evidenced. The full dataset is free and open.
Technique × product pairs, each with sourced evidence.
Typical lag from origin paper to first commercial deploy.
Hand-curated, with a single origin paper each. No false 1:1 paper-to-product claims.
Fastest deploy ever
Llama 4 Maverick shipped YaRN RoPE Context Extension in 583 days.
Slowest deploy
Kimi K2.6 shipped Mixture of Experts (Sparse MoE for LLMs) 9 years after the origin paper.
Every canonical technique
Grouped by category. Click any card for origin paper, deployment timeline, and prior art.
agents · 3 techniques
ReAct (Reason + Act)
Princeton / Google · 2022-10
An agent pattern that interleaves reasoning traces with tool-use actions, using each observation to refine the next reasoning step.
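The loop itself is simple to sketch. A minimal illustration in Python, where `mock_model`, the `lookup` tool, and the `Thought:/Action:/Observation:` text format are simplified stand-ins, not the paper's exact setup:

```python
def mock_model(history: str) -> str:
    # Hypothetical stand-in for a real LM: emits a thought plus action, or an answer.
    if "Observation: 42" in history:
        return "Answer: 42"
    return "Thought: I need to look that up.\nAction: lookup[the answer]"

TOOLS = {"lookup": lambda query: "42"}  # hypothetical tool registry

def react_loop(question: str, max_steps: int = 5) -> str:
    history = f"Question: {question}"
    for _ in range(max_steps):
        step = mock_model(history)              # reason and/or pick an action
        history += "\n" + step
        if step.startswith("Answer:"):
            return step[len("Answer:"):].strip()
        if "Action:" in step:
            call = step.split("Action:", 1)[1].strip()   # e.g. "lookup[the answer]"
            name, arg = call.rstrip("]").split("[", 1)
            history += f"\nObservation: {TOOLS[name](arg)}"  # feed result back in
    return ""
```

Each observation is appended to the transcript, so the next model call reasons over everything seen so far.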
Toolformer (Tool Use)
Meta AI · 2023-02
Self-supervised approach where an LM learns when and how to call external APIs by generating and filtering its own tool-use demonstrations.
Reflexion
Northeastern / MIT · 2023-03
Agent framework that converts environment feedback into verbal self-reflection stored in memory, improving performance across trials without weight updates.
alignment · 9 techniques
Deep RL from Human Preferences
OpenAI · 2017-06
Learning reward functions from pairwise human comparisons rather than hand-coded rewards. The direct precursor to RLHF.
Red-Teaming with Preference Models
Google DeepMind · 2022-02
Using an LM to generate adversarial prompts that elicit harmful behavior, scaling safety evaluation far beyond human red-teaming.
Reinforcement Learning from Human Feedback (RLHF)
OpenAI · 2022-03
A three-stage recipe (SFT → reward model from human comparisons → PPO) that aligns LM outputs with human preferences. InstructGPT is the canonical reference.
Constitutional AI
Anthropic · 2022-12
Training harmless assistants using a written constitution of principles and an AI-generated critique/revision loop rather than human labels for every case.
Direct Preference Optimization (DPO)
Stanford · 2023-05
Aligning LMs to preference data by directly optimizing a closed-form likelihood ratio, eliminating the reward model and RL loop of RLHF.
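The per-pair objective is compact enough to sketch directly from summed token log-probabilities. Function and argument names below are illustrative, not the paper's code:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))),
    where each term is a summed token log-prob under the policy / reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Raising the chosen completion's likelihood relative to the reference (and lowering the rejected one's) shrinks the loss, with no reward model or RL loop in sight.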
RLAIF (Reinforcement Learning from AI Feedback)
Google · 2023-09
Using an off-the-shelf LLM to generate preference labels, scaling preference learning without human annotators.
Identity Preference Optimization (IPO)
Google DeepMind · 2023-10
A preference-optimization variant that adds an explicit regularizer to avoid DPO's tendency to overfit the preference data.
Self-Rewarding Language Models
Meta AI · 2024-01
Iterative alignment where the LM judges its own outputs using an LLM-as-a-judge prompt, removing human-labeled preferences from the loop.
KTO (Kahneman-Tversky Optimization)
Contextual AI · 2024-02
Alignment method, inspired by prospect theory, that treats individual completions as binary good/bad signals, so no preference pairs are needed.
architecture · 7 techniques
Mixture of Experts (Sparse MoE for LLMs)
Google · 2017-01
An architecture where a router activates only a subset of expert sub-networks per token, scaling parameter count without proportional compute cost.
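A minimal sketch of top-k routing, the core of sparse MoE. Names and shapes are illustrative; a real router scores batched hidden states and experts are full feed-forward blocks:

```python
import math

def top_k_route(logits: list, k: int = 2) -> list:
    """Pick the k highest-scoring experts and softmax-normalize their gates."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    z = [math.exp(logits[i]) for i in top]
    total = sum(z)
    return [(i, w / total) for i, w in zip(top, z)]

def moe_forward(x: float, experts: list, router_logits: list, k: int = 2) -> float:
    """Run only the routed experts and mix their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))
```

Only k of the experts execute per token, which is exactly how parameter count grows without a matching growth in per-token compute.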
Transformer Self-Attention
Google · 2017-06
A sequence-to-sequence architecture that replaces recurrence with scaled dot-product attention, enabling parallel training and long-range context modeling.
Rotary Position Embedding (RoPE)
Zhuiyi Technology · 2021-04
A relative-position encoding that rotates query/key vectors in complex space, giving transformers better length extrapolation than absolute sinusoidal embeddings.
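The rotation itself is a few lines. A sketch assuming one frequency per (even, odd) dimension pair, as in the paper; the function name is ours:

```python
import math

def rope_rotate(vec: list, pos: int, base: float = 10000.0) -> list:
    """Rotate consecutive (even, odd) pairs of a query/key vector by a
    position-dependent angle, one frequency per pair."""
    out = []
    for i in range(0, len(vec), 2):
        theta = pos * base ** (-i / len(vec))   # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out
```

The useful property: the dot product between a rotated query at position m and a rotated key at position n depends only on the offset m - n, which is what makes the encoding relative.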
Grouped-Query Attention (GQA)
Google · 2023-05
An attention variant in which groups of query heads share a single key/value head, reducing KV-cache memory at minimal quality loss.
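The head-sharing pattern reduces to integer arithmetic. A sketch with hypothetical head counts (32 query heads sharing 8 KV heads):

```python
def kv_head_for(q_head: int, n_q_heads: int = 32, n_kv_heads: int = 8) -> int:
    """Map a query head to its shared KV head: consecutive groups of
    n_q_heads // n_kv_heads query heads read the same key/value head."""
    return q_head // (n_q_heads // n_kv_heads)
```

With these counts, the KV cache is 4x smaller than full multi-head attention, while avoiding the quality drop of collapsing to a single KV head (multi-query attention).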
YaRN RoPE Context Extension
Nous Research · 2023-08
A method to extend RoPE-based models to much longer contexts via frequency-dependent interpolation, with minimal fine-tuning data.
Mamba / Selective State Space Models
CMU / Princeton · 2023-12
A state-space sequence model with input-dependent selection that matches Transformer quality with linear-time inference and no fixed context window.
Mixture of Depths
Google DeepMind · 2024-04
A technique letting tokens skip transformer layers when unnecessary, allocating compute adaptively based on token importance.
inference · 8 techniques
FlashAttention
Stanford · 2022-05
A tiled, IO-aware attention kernel that computes exact attention in linear memory by keeping each tile's intermediates in SRAM instead of materializing the full attention matrix.
Continuous Batching
Seoul National University · 2022-07
A scheduling technique that adds/removes requests at the iteration level rather than the batch level, dramatically increasing throughput for LLM serving.
INT8 Weight Quantization for LLMs
University of Washington · 2022-08
Row-wise and vector-wise INT8 quantization with outlier detection that enables zero-degradation 8-bit inference of LLMs.
GPTQ Quantization
ISTA · 2022-10
Post-training quantization to 3-4 bits using second-order information, enabling 175B-parameter LLMs to run inference on a single GPU.
Speculative Decoding
Google · 2022-11
An inference technique where a small draft model proposes tokens and a large model verifies them in parallel, yielding 2-3x speedup without quality loss.
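A sketch of one draft-and-verify round, using the greedy-acceptance variant for clarity (the paper's rejection-sampling rule preserves the target distribution exactly). `draft_next` and `target_next` are stand-in next-token functions:

```python
def speculative_step(draft_next, target_next, prefix: list, k: int = 4) -> list:
    """One round of speculative decoding: draft proposes, target verifies."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # 2. The target model checks every position (a single parallel pass in practice).
    accepted, ctx = [], list(prefix)
    for t in proposal:
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break
    # 3. On mismatch (or full acceptance), the target supplies one more token,
    #    so every round emits at least one guaranteed-correct token.
    accepted.append(target_next(ctx))
    return accepted
```

When the draft agrees often, each expensive target pass yields several tokens instead of one, which is where the 2-3x speedup comes from.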
AWQ (Activation-Aware Weight Quantization)
MIT · 2023-06
4-bit weight quantization that preserves salient weights based on activation magnitudes, matching GPTQ quality with faster inference.
PagedAttention (vLLM)
UC Berkeley · 2023-09
A memory-management scheme for KV cache modeled on OS paging, eliminating fragmentation and enabling high-throughput serving.
StreamingLLM (Attention Sinks)
MIT · 2023-09
A sliding-window attention pattern with preserved initial tokens ("sinks") that enables indefinite streaming generation without quality collapse.
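Which KV-cache positions survive is easy to sketch. An illustration with hypothetical defaults of 4 sink tokens and an 8-token recent window:

```python
def streaming_cache(tokens: list, n_sink: int = 4, window: int = 8) -> list:
    """Return the cache positions StreamingLLM-style eviction would keep:
    the first n_sink tokens plus a sliding window of the most recent ones."""
    kept = list(range(min(n_sink, len(tokens))))
    start = max(n_sink, len(tokens) - window)
    kept += list(range(start, len(tokens)))
    return kept
```

Keeping the initial "sink" tokens is the whole trick: attention mass concentrates there, and evicting them is what causes the quality collapse of a plain sliding window.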
interpretability · 1 technique
multimodal · 6 techniques
Vision Transformer (ViT)
Google · 2020-10
Applying a standard Transformer directly to sequences of image patches, establishing Transformers as the dominant image-recognition backbone.
CLIP (Contrastive Language-Image Pretraining)
OpenAI · 2021-02
Dual-encoder model trained on 400M image-caption pairs to align image and text embeddings, enabling zero-shot visual classification.
Latent Diffusion
LMU Munich / RunwayML · 2021-12
Diffusion performed in a compressed VAE latent space, making high-resolution image generation tractable on consumer GPUs.
Flamingo (Cross-Attention VLMs)
Google DeepMind · 2022-04
Cross-attention layers interleaved into a frozen LLM that attend to vision features, enabling few-shot visual question answering.
Whisper (Robust Speech Recognition)
OpenAI · 2022-12
Encoder-decoder Transformer trained on 680k hours of weakly-supervised multilingual speech, setting new robustness benchmarks across accents and noise.
LLaVA (Visual Instruction Tuning)
University of Wisconsin · 2023-04
Projecting CLIP features into an LLM's token space via a simple projector + instruction tuning on GPT-4-generated visual conversations.
reasoning · 6 techniques
Chain-of-Thought Prompting
Google · 2022-01
A prompting technique that elicits step-by-step reasoning by showing exemplars that include intermediate reasoning steps.
Self-Consistency
Google · 2022-03
Sample multiple CoT completions and take the majority-vote answer, substantially improving reasoning accuracy.
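The whole method is sampling plus a majority vote. A sketch where `sample_answer` stands in for running one stochastic CoT completion and extracting its final answer:

```python
from collections import Counter

def self_consistency(sample_answer, n: int = 10) -> str:
    """Sample n chain-of-thought answers and return the majority-vote answer."""
    votes = Counter(sample_answer() for _ in range(n))
    return votes.most_common(1)[0][0]
```

Marginalizing over reasoning paths this way filters out the occasional bad chain that a single greedy decode would commit to.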
Zero-Shot Chain-of-Thought
University of Tokyo · 2022-05
Eliciting step-by-step reasoning without few-shot exemplars, simply by appending a phrase like "let's think step by step".
Tree of Thoughts
Princeton / Google DeepMind · 2023-05
Reasoning over a tree of intermediate thoughts with explicit look-ahead, backtracking, and self-evaluation, beyond linear CoT.
Process Reward Models
OpenAI · 2023-05
Reward models trained to score each intermediate reasoning step rather than only the final answer, enabling superior reasoning policy learning.
Test-Time Compute Scaling
Google DeepMind · 2024-08
Allocating more compute at inference (longer reasoning chains, multiple samples + verifier) can outperform scaling parameters — the basis for o1-style reasoning models.
retrieval · 2 techniques
Dense Passage Retrieval (DPR)
Meta AI · 2020-04
Learned dual-encoder retrieval that outperforms BM25 on open-domain QA by training encoders on question-passage pairs.
Retrieval-Augmented Generation (RAG)
Meta AI · 2020-05
Conditioning generation on retrieved passages from a non-parametric memory, combining parametric and retrieval-based knowledge.
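A sketch of the retrieve-then-generate loop, with `embed` and `generate` as stand-ins for the trained retriever and generator (dot-product similarity, as in dense retrieval):

```python
def rag_answer(question: str, corpus: list, embed, generate, k: int = 2) -> str:
    """Retrieve the k nearest passages by embedding similarity, then
    condition generation on them."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    q = embed(question)
    ranked = sorted(corpus, key=lambda p: dot(embed(p), q), reverse=True)
    context = "\n".join(ranked[:k])
    return generate(f"Context:\n{context}\n\nQuestion: {question}")
```

The original paper trains retriever and generator jointly and marginalizes over passages; modern "RAG" deployments mostly use this simpler retrieve-then-prompt form.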
training · 7 techniques
LoRA (Low-Rank Adaptation)
Microsoft · 2021-06
Parameter-efficient fine-tuning that injects low-rank decomposition matrices into attention weights, training <1% of parameters.
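The forward pass with a LoRA adapter is just an extra low-rank branch. A sketch with plain nested lists standing in for tensors (names follow the paper's A, B, alpha, r):

```python
def lora_forward(x: list, W: list, A: list, B: list,
                 alpha: float = 16, r: int = 2) -> list:
    """y = xW + (alpha/r) * xAB, where W (d x k) is frozen and only
    A (d x r) and B (r x k) are trained. B starts at zero, so the
    adapter is a no-op at initialization."""
    def matmul(u, M):
        return [sum(u[i] * M[i][j] for i in range(len(u))) for j in range(len(M[0]))]
    base = matmul(x, W)                   # frozen pretrained path
    delta = matmul(matmul(x, A), B)       # trained low-rank path
    return [b + (alpha / r) * d for b, d in zip(base, delta)]
```

With r much smaller than d and k, the trained parameter count drops by orders of magnitude, and the learned A·B can be merged back into W after training at zero inference cost.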
Instruction Tuning (FLAN)
Google · 2021-09
Fine-tuning a pretrained LM on a mixture of tasks phrased as natural-language instructions, enabling strong zero-shot generalization.
Chinchilla Scaling Laws
Google DeepMind · 2022-03
Scaling law showing compute-optimal models use ~20 training tokens per parameter — correcting prior over-parameterization in GPT-3-era models.
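The rule of thumb is plain arithmetic. A sketch, paired with the standard C ≈ 6ND training-compute estimate (the latter is the usual community approximation, not unique to the Chinchilla paper):

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20) -> float:
    """Compute-optimal training-token budget under the ~20 tokens/param rule."""
    return n_params * tokens_per_param

def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard C ~= 6 * N * D estimate of total training compute."""
    return 6 * n_params * n_tokens

# e.g. a 70B-parameter model wants roughly 1.4T training tokens
```

By this rule a GPT-3-sized 175B model would have wanted ~3.5T tokens, an order of magnitude more than it was trained on, which is the over-parameterization the paper corrected.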
Self-Instruct
University of Washington · 2022-12
Bootstrapping instruction-tuning data by having an LM generate its own instructions, inputs, and outputs from a small seed set.
QLoRA
University of Washington · 2023-05
LoRA fine-tuning on 4-bit quantized base weights, enabling 65B-model fine-tuning on a single 48GB GPU.
Synthetic Data Distillation (Orca)
Microsoft Research · 2023-06
Training smaller models on GPT-4-generated explanation traces rather than answer-only demonstrations, closing the capability gap.
Rejection Sampling Fine-Tuning
Meta AI · 2023-07
Sampling multiple completions, scoring with a reward model, and fine-tuning on the top samples — a simpler alternative to PPO used in Llama 2.
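The recipe is a short loop. A sketch where `generate` and `reward` stand in for the policy model and the reward model:

```python
def rejection_sample_dataset(prompts: list, generate, reward,
                             n: int = 8, keep: int = 1) -> list:
    """For each prompt, sample n completions, score them with a reward
    model, and keep the top `keep` as supervised fine-tuning data."""
    data = []
    for p in prompts:
        samples = [generate(p) for _ in range(n)]
        samples.sort(key=reward, reverse=True)   # best-scoring first
        data += [(p, s) for s in samples[:keep]]
    return data
```

The resulting (prompt, best completion) pairs feed an ordinary SFT run, trading PPO's machinery for a best-of-n filter.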
Open dataset
Every technique, paper, and deployment is freely available under CC BY 4.0. API endpoint: /api/v1/atlas/techniques. Cite us as: gentic.news Deployment Atlas (2026).