The AI Glossary
200 precise, current definitions for the terms shaping artificial intelligence in 2026 — from RAG and DPO to MoE, MCP, and SWE-bench. Each entry is hand-curated, tied to live news coverage, and linked to deeper context across the gentic.news graph.
See also: The LLM Wiki (Karpathy pattern explainer) · AI Data Center Glossary (infrastructure-only).
A
AI Agent
AgentsAn AI agent is a software system that perceives its environment, makes decisions, and takes actions autonomously to achieve a goal, often using large language models as its reasoning core.
ARC-AGI
EvaluationARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) is a benchmark of 800 unique visual reasoning puzzles designed to measure general fluid intelligence in AI, requiring few-shot learning and compositional abstraction from minimal examples.
Adapter Tuning
Training & InferenceAdapter Tuning inserts small trainable bottleneck layers (adapters) into a frozen pretrained model, updating only those parameters during fine-tuning. This achieves parameter-efficient transfer learning with fewer than 1% of full fine-tuning parameters.
Agent2Agent Protocol
AgentsAgent2Agent Protocol (A2A) is an open standard enabling autonomous AI agents from different systems to securely discover, communicate, and coordinate tasks in real time, without human intervention.
Agentic Loop
AgentsAn agentic loop is the iterative cycle in which an AI agent perceives its environment, reasons about a goal, selects an action, executes it, observes the outcome, and repeats — often with memory or feedback — to autonomously complete complex multi-step tasks.
Agentic Workflow
AgentsAgentic Workflow: A structured, multi-step process where an autonomous AI agent decomposes a complex goal into sub-tasks, selects tools or models, executes actions, and iteratively adjusts based on feedback, often using a loop of reasoning, acting, and observing.
Anthropic
Companies & ProductsAnthropic is an AI safety company founded in 2021 by former OpenAI employees, best known for developing the Claude family of large language models with a focus on constitutional AI and harm reduction.
Artificial General Intelligence
AgentsArtificial General Intelligence (AGI) is a hypothetical AI system that can understand, learn, and apply knowledge across a wide range of tasks at a level equal to or beyond human cognitive ability, without being limited to narrow domains.
Artificial Superintelligence
AgentsArtificial Superintelligence (ASI) is a hypothetical AI that surpasses human cognitive ability across virtually all domains, including creativity, problem-solving, and social intelligence.
Attention Mechanism
ModelsAttention Mechanism is a neural network component that allows a model to dynamically weigh the importance of different parts of the input when producing each part of the output, enabling focus on relevant context.
B
B100
InfrastructureB100 is an NVIDIA GPU accelerator (Blackwell architecture, 2024) featuring a second-generation Transformer Engine, FP4/FP6 support, and 192 GB of HBM3e memory; NVIDIA cites up to 4× training and 30× inference performance over the H100 for the Blackwell platform.
B200
InfrastructureB200 is NVIDIA's flagship Blackwell-generation data center GPU (2024) for AI training and inference, featuring up to 20 PetaFLOPS of FP4 compute, 192 GB HBM3e, 8 TB/s memory bandwidth, and 208 billion transistors, targeting large-scale model deployment.
BIG-Bench
EvaluationBIG-Bench (Beyond the Imitation Game Benchmark) is a collaborative, large-scale benchmark of 204 tasks designed to evaluate large language models across reasoning, knowledge, and creativity, testing abilities beyond simple imitation.
BLEU
EvaluationBLEU (Bilingual Evaluation Understudy) is an automatic metric that measures the overlap of n-grams between machine-generated text and one or more reference translations, commonly used for machine translation and text generation evaluation.
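A minimal sketch of the core computation (clipped n-gram precision combined with a brevity penalty); production evaluations should use a reference implementation such as sacrebleu, and the smoothing here is a simplification:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Minimal BLEU: geometric mean of clipped n-gram precisions x brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i+n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i+n]) for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clip counts by reference
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)           # smooth zero precisions
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("the cat sat on the mat".split(), "the cat is on the mat".split(), max_n=2))
```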
Backpropagation
Training & InferenceBackpropagation computes gradients of a loss function with respect to model weights by applying the chain rule of calculus from output back to input, enabling gradient-based optimization via stochastic gradient descent or its variants.
Batch Size
Training & InferenceBatch size is the number of training samples processed before the model's internal parameters are updated. It determines the frequency of gradient descent steps and directly impacts training speed, memory usage, and convergence quality.
Beam Search
Training & InferenceBeam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set (the beam width). In AI/ML training, it is used during sequence generation to keep the top-k candidate sequences at each step, balancing output quality against computational cost.
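A toy sketch of the core loop, assuming a hypothetical `step_logprobs` function that scores next tokens for a given prefix:

```python
import math

def beam_search(step_logprobs, beam_width=3, steps=4):
    """Keep the top-k highest-scoring partial sequences at each generation step."""
    beams = [((), 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, lp in step_logprobs(seq).items():
                candidates.append((seq + (tok,), score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

# toy "model": always prefers token 'a' slightly over 'b'
toy = lambda seq: {"a": math.log(0.6), "b": math.log(0.4)}
print(beam_search(toy))
```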
Blackwell
InfrastructureBlackwell is NVIDIA's GPU architecture for AI and HPC, succeeding Hopper. It integrates a 208B-transistor dual-die design, FP4/FP6 Tensor Cores, and second-gen Transformer Engine, targeting training and inference of trillion-parameter models with up to 30x lower TCO than prior generations.
Browser Use
AgentsBrowser Use is an AI agent paradigm where a language model directly controls a web browser via structured commands (e.g., Playwright, Selenium) to perform multi-step tasks like form filling, data extraction, and transaction processing.
Byte Pair Encoding
Training & InferenceByte Pair Encoding (BPE) is a subword tokenization algorithm that iteratively merges the most frequent pair of adjacent tokens (originally bytes, later characters) until a target vocabulary size is reached, enabling models to handle rare and unknown words via subword decomposition.
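A minimal training-loop sketch on a toy corpus; real tokenizers (e.g., the Hugging Face tokenizers library) add byte-level handling and many optimizations:

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merge rules by repeatedly merging the most frequent adjacent pair."""
    corpus = Counter(tuple(w) for w in words)   # words as tuples of characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)        # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for word, freq in corpus.items():       # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2
                else:
                    out.append(word[i]); i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        corpus = merged
    return merges

print(train_bpe(["lower", "lowest", "newer", "wider"], 5))
```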
C
CUDA
InfrastructureCUDA is a parallel computing platform and API by NVIDIA that allows developers to use GPUs for general-purpose processing, accelerating AI workloads by executing thousands of threads simultaneously.
Catastrophic Forgetting
Training & InferenceCatastrophic forgetting: the tendency of neural networks to lose previously learned knowledge when trained on new tasks or data, a major obstacle to continual learning.
Chain-of-Thought Prompting
Training & InferenceChain-of-Thought Prompting elicits step-by-step reasoning from LLMs by providing intermediate reasoning steps in the prompt, improving performance on complex arithmetic, logic, and multi-step problems.
Chatbot Arena
EvaluationChatbot Arena is a crowdsourced platform where users anonymously pit LLMs against each other in blind side-by-side comparisons, generating human preference rankings and Elo scores for model evaluation.
Claude
Companies & ProductsClaude is a family of large language models developed by Anthropic, designed for safety, helpfulness, and honesty using constitutional AI and RLHF techniques.
Code Interpreter
AgentsCode Interpreter is an agentic tool that enables an LLM to write, execute, and iterate on code in a sandboxed runtime environment, bridging natural language and programmatic computation.
Cohere
Companies & ProductsCohere is a Canadian enterprise AI company that builds large language models (LLMs) optimized for retrieval-augmented generation (RAG), multilingual search, and data privacy. Its flagship Command R+ model series excels in grounding and tool use for business workflows.
Computer Use
AgentsComputer Use is an agentic capability where AI models directly interact with graphical user interfaces (GUIs) of software applications by controlling a virtual mouse and keyboard, enabling them to execute multi-step tasks across arbitrary desktop or web environments.
Constitutional AI
Training & InferenceConstitutional AI is a training method that aligns language models using a set of written principles (a constitution) and self-critique, reducing reliance on human feedback. It combines supervised fine-tuning with reinforcement learning from AI feedback (RLAIF).
Context Window
Training & InferenceThe context window is the maximum number of tokens (words, subwords, or characters) a language model can process at once, determining how much preceding text it can attend to when generating output.
Continual Pretraining
Training & InferenceContinual pretraining extends a pretrained language model's knowledge by training on new, often domain-specific data while mitigating catastrophic forgetting through techniques like replay, regularization, or architectural isolation.
Continuous Batching
Training & InferenceContinuous batching is an LLM inference serving technique that dynamically admits new requests into a running batch as completed sequences finish, eliminating padding and idle slots and substantially increasing GPU utilization and throughput.
Contrastive Learning
Training & InferenceContrastive learning is a self-supervised training paradigm that learns representations by pulling similar (positive) pairs together and pushing dissimilar (negative) pairs apart in embedding space, using a contrastive loss like InfoNCE.
Convolutional Neural Network
ModelsConvolutional Neural Network (CNN) is a deep learning architecture that applies learned convolutional filters to grid-structured data, primarily images, to hierarchically extract spatial features like edges, textures, and objects.
Cross-Entropy Loss
Training & InferenceCross-Entropy Loss measures the difference between two probability distributions — typically the true labels and the model's predictions — and is the standard loss function for multi-class classification tasks in neural networks.
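For a single example, the loss is the negative log-probability the model assigns to the true class; a numerically stable sketch:

```python
import numpy as np

def cross_entropy(logits, target):
    """Negative log-likelihood of the true class under softmax(logits)."""
    logits = logits - logits.max()                       # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target]

print(cross_entropy(np.array([2.0, 0.5, -1.0]), target=0))  # low loss: class 0 favored
```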
Curriculum Learning
Training & InferenceCurriculum learning is a training strategy where examples are presented to a model in a meaningful order—typically from easy to hard—to improve convergence speed, final accuracy, or generalization.
D
Data Parallelism
InfrastructureData parallelism splits a training batch across multiple devices (GPUs/TPUs), each holding a full model copy. Gradients are aggregated (e.g., all-reduce) to update weights synchronously or asynchronously, enabling training on large datasets.
Decode
Training & InferenceDecode is the process of generating output tokens from a trained neural network, typically during inference or autoregressive generation, by iteratively sampling or selecting the next token based on the model's probability distribution.
Decoder-Only Model
ModelsA decoder-only model is a neural network architecture that processes input sequences autoregressively, generating output tokens one at a time by attending only to previous tokens. It is the dominant architecture for modern large language models (LLMs).
DeepSeek
Companies & ProductsDeepSeek is a Chinese AI research company developing large language models with a focus on cost-efficient training and inference, known for DeepSeek-V2, DeepSeek-R1, and the open-source DeepSeek-Coder series.
DeepSpeed
InfrastructureDeepSpeed is a deep learning optimization library by Microsoft that reduces memory and accelerates training of large models through ZeRO, mixed precision, gradient checkpointing, and efficient sparse attention.
Dense Model
ModelsDense model: a neural network where every parameter is active for every input, using all weights in each forward pass. Contrasts with sparse models (e.g., MoE) that activate only a subset of parameters per token.
Diffusion Model
ModelsDiffusion models are generative models that learn to reverse a gradual noising process, creating high-quality data (images, audio, video) from random noise by iteratively denoising.
Direct Preference Optimization
Training & InferenceDirect Preference Optimization (DPO) is a training method that aligns language model outputs with human preferences without reinforcement learning, using a closed-form loss on preference pairs.
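A sketch of the DPO loss for a single preference pair, where each input is the summed token log-probability of a response under the policy or the frozen reference model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss: -log sigmoid of the beta-scaled implicit reward margin."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log(sigmoid(margin))

# the chosen response gained probability relative to the reference -> small loss
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```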
E
Elo Rating
EvaluationElo Rating is a pairwise comparison system that estimates relative skill from match outcomes, adapted from chess to rank LLMs by treating human preference votes on head-to-head model comparisons as wins and losses.
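The standard update rule, here applied to one model "battle"; the K-factor and 400-point scale follow chess conventions:

```python
def elo_update(r_a, r_b, score_a, k=32):
    """score_a: 1 if A wins the preference vote, 0 if B wins, 0.5 for a tie."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    return r_a + k * (score_a - expected_a), r_b + k * ((1 - score_a) - (1 - expected_a))

print(elo_update(1500, 1600, score_a=1))  # underdog win: A gains more points
```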
Embedding
Training & InferenceEmbedding: A dense, low-dimensional vector representation of discrete data (words, tokens, nodes, items) that captures semantic or relational similarity in a continuous latent space, learned via neural networks.
Embedding Model
ModelsAn embedding model is a neural network that maps high-dimensional data (text, images, audio) into a low-dimensional vector space, enabling semantic similarity search, clustering, and downstream ML tasks.
Encoder-Decoder
ModelsEncoder-Decoder is a neural architecture that maps an input sequence to a fixed-length context vector via an encoder, then decodes it into an output sequence; foundational for sequence-to-sequence tasks like machine translation.
F
F1 Score
EvaluationF1 Score is the harmonic mean of precision and recall, balancing false positives and false negatives. It ranges from 0 (worst) to 1 (best) and is used when classes are imbalanced.
Faithfulness
EvaluationFaithfulness measures whether a model's generated output accurately reflects the input context or underlying source data without introducing unsupported or contradictory information.
Few-Shot Learning
Training & InferenceFew-Shot Learning is a machine learning paradigm where a model learns to perform a task from only a small number of training examples (typically 1–10 per class), leveraging prior knowledge from a base dataset or pre-training.
Fine-Tuning
Training & InferenceFine-tuning adapts a pre-trained model to a specific task by continuing training on a smaller, task-specific dataset, updating all or some model parameters.
FlashAttention
Training & InferenceFlashAttention is an IO-aware exact attention algorithm that computes attention without materializing the full N×N attention matrix to HBM, reducing memory reads/writes and achieving 2–4× speedup over standard PyTorch attention.
Foundation Model
ModelsFoundation models are large-scale machine learning models trained on broad data that can be adapted to a wide range of downstream tasks via fine-tuning or prompting.
Function Call
AgentsFunction Call is a mechanism in LLM-based agents where the model outputs a structured request (e.g., JSON) to invoke an external API or tool, enabling the agent to interact with external systems.
Function Calling
Training & InferenceFunction Calling is a training paradigm where a language model learns to generate structured API calls or tool invocations as part of its output, enabling it to interact with external systems.
G
GAIA
EvaluationGAIA (General AI Assistants) is a benchmark for evaluating general-purpose AI assistants on real-world, multi-step tasks requiring reasoning, tool use, and web browsing, with questions designed to be trivial for humans but challenging for AI.
GB200
InfrastructureGB200 is NVIDIA's Grace Blackwell Superchip, pairing one Grace CPU with two Blackwell B200 GPUs over NVLink-C2C; it is the building block of the rack-scale GB200 NVL72 system for large-scale AI training and inference.
GPQA
EvaluationGPQA (Graduate-Level Google-Proof Q&A) is a benchmark of 448 expert-written multiple-choice questions in biology, physics, and chemistry, designed so that even skilled non-experts with unrestricted web access cannot reliably answer them, testing deep domain reasoning in AI models.
GPT-5
Companies & ProductsGPT-5 is OpenAI's latest large language model (as of early 2026), succeeding GPT-4. It integrates advanced reasoning, multimodal capabilities, and improved factual accuracy, powering ChatGPT and enterprise APIs.
GPU
InfrastructureA GPU (Graphics Processing Unit) is a specialized processor originally designed for rendering graphics, now essential for accelerating parallel workloads in AI/ML, particularly deep learning training and inference.
GRPO
Training & InferenceGRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm that trains language models by comparing responses within a group, avoiding a separate value network. It computes advantages from group-relative rewards, simplifying PPO while maintaining stability.
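The core idea in miniature: advantages come from standardizing rewards within the group of responses sampled for one prompt (a sketch; the full algorithm adds PPO-style clipping and a KL penalty):

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one prompt's group
    of sampled responses, replacing a learned value network."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # correct answers get positive advantage
```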
Gemini
Companies & ProductsGemini is Google DeepMind's family of multimodal large language models, including Ultra, Pro, Flash, and Nano variants, designed for text, image, audio, video, and code understanding.
Generative Adversarial Network
ModelsA Generative Adversarial Network (GAN) is a class of deep learning model where two neural networks—a generator and a discriminator—are trained simultaneously in a competitive game to produce realistic synthetic data, such as images, audio, or text.
Google DeepMind
Companies & ProductsGoogle DeepMind is an AI research lab (merged 2023 from DeepMind and Google Brain) known for breakthroughs in reinforcement learning, AlphaGo, AlphaFold, and foundation models like Gemini.
Grace Hopper
InfrastructureGrace Hopper (GH200) is an NVIDIA superchip that pairs a Grace Arm CPU with a Hopper GPU over a 900 GB/s NVLink-C2C interconnect, giving the GPU coherent access to CPU memory for large-model training and inference.
Gradient Descent
Training & InferenceGradient Descent: An iterative optimization algorithm that minimizes a loss function by repeatedly moving parameters in the direction of steepest descent (negative gradient) computed on the training data.
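The update rule in a few lines, minimizing a toy quadratic:

```python
def gradient_descent(grad, x0, lr=0.1, steps=50):
    """Minimize a function given its gradient by repeatedly stepping against it."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3); converges to x = 3
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))
```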
Greedy Decoding
Training & InferenceGreedy decoding is a deterministic text generation strategy that selects the token with the highest predicted probability at each step, without considering future consequences or exploring alternatives.
Grok
Companies & ProductsGrok is a large language model developed by xAI, designed to provide real-time, conversational responses with a focus on humor and directness, integrated with X (formerly Twitter) data.
Guardrails
AgentsGuardrails are programmable constraints and validation layers applied to AI agent outputs to enforce safety, policy compliance, and behavioral boundaries. They intercept inputs and outputs, running checks via classifiers, LLM judges, or rule-based systems before actions are executed or responses delivered.
H
H100
InfrastructureH100 is NVIDIA's Hopper-architecture GPU for AI and HPC, featuring 80 GB HBM3 memory, 3.35 TB/s bandwidth, and Transformer Engine for mixed-precision training.
H200
InfrastructureH200 is NVIDIA's data center GPU (based on Hopper architecture) optimized for AI training and inference, featuring 141GB HBM3e memory (4.8 TB/s bandwidth) and FP8 Tensor Core support.
HBM
InfrastructureHBM (High Bandwidth Memory) is a 3D-stacked DRAM technology that provides extremely high bandwidth and low power consumption for AI accelerators, enabling large model training and inference.
Hallucination
EvaluationHallucination in AI/ML is when a model generates factually incorrect, nonsensical, or fabricated content that appears plausible, often due to training data gaps, sampling errors, or lack of grounding.
HellaSwag
EvaluationHellaSwag is a benchmark for evaluating commonsense natural language inference, testing models on sentence completion where the correct ending requires understanding real-world situations and avoiding adversarial, plausible-sounding but incorrect endings.
Hopper
InfrastructureHopper is NVIDIA's GPU architecture (H100, H200) optimized for large-scale AI training and inference, featuring Transformer Engine (FP8), NVLink/NVSwitch, and up to 141 GB HBM3e memory.
Hugging Face
Companies & ProductsHugging Face is a company and platform that provides open-source libraries, pretrained models, and a collaborative hub for natural language processing and machine learning, notably the Transformers library.
HumanEval
EvaluationHumanEval is an evaluation benchmark consisting of 164 hand-written Python programming problems, each with unit tests, used to measure the functional correctness of code generated by large language models.
Humanity's Last Exam
EvaluationHumanity's Last Exam is a 2025 benchmark of ~3,000 expert-crafted questions across STEM and humanities designed to be the hardest test for AI, with no known public solutions, used to measure frontier model capabilities.
Hybrid Reasoning Model
ModelsHybrid Reasoning Model: an LLM that can either answer immediately or engage in extended step-by-step thinking within a single model, with the reasoning budget selectable per request (e.g., Claude 3.7 Sonnet); the term is also used for neurosymbolic systems that combine rule-based logic with neural pattern recognition.
Hyena
ModelsHyena is a family of subquadratic, attention-free sequence models that replace quadratic self-attention with implicit long convolutions and element-wise gating, scaling near-linearly (O(N log N)) in sequence length while achieving competitive quality on language and genomics tasks.
I
In-Context Learning
Training & InferenceIn-Context Learning (ICL) is a capability of large language models to perform tasks by conditioning on a prompt containing demonstrations or instructions, without updating model parameters. It leverages patterns in the input context to infer the desired output.
Inference
Training & InferenceInference is the process of using a trained machine learning model to generate predictions or outputs from new input data, distinct from training, and optimized for low latency and throughput.
Inferentia
InfrastructureInferentia is Amazon Web Services' custom ASIC chip designed for high-throughput, low-latency machine learning inference, optimized for cost-effective deployment of models like Transformers and CNNs.
InfiniBand
InfrastructureInfiniBand is a high-bandwidth, low-latency networking technology used to connect servers in AI/HPC clusters, enabling efficient distributed training of large models like GPT-4 and Llama 3.1.
Instruction Tuning
Training & InferenceInstruction tuning is a supervised fine-tuning method where a pretrained language model is trained on diverse (instruction, response) pairs to improve its ability to follow natural language prompts and generalize to unseen tasks.
J
K
KV Cache
Training & InferenceKV Cache (Key-Value Cache) is a memory structure in transformer-based LLMs that stores the key and value tensors from previous attention computations during autoregressive decoding, avoiding redundant recomputation and enabling efficient token-by-token generation.
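A single-head sketch of why the cache helps: each decode step appends one key/value pair instead of recomputing attention inputs for the whole history (random vectors stand in for projected hidden states):

```python
import numpy as np

def attend(q, K, V):
    """Single-head attention for one query against all cached keys/values."""
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max()); w /= w.sum()
    return w @ V

d, K_cache, V_cache = 8, [], []
rng = np.random.default_rng(0)
for step in range(5):                        # autoregressive decode loop
    q = k = v = rng.normal(size=d)           # stand-ins for projected hidden states
    K_cache.append(k); V_cache.append(v)     # append once, never recompute history
    out = attend(q, np.stack(K_cache), np.stack(V_cache))
print(out.shape)                             # one output vector per decode step
```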
Kernel
InfrastructureKernel: In AI/ML infrastructure, a kernel is a function compiled to run directly on an accelerator (e.g., a CUDA or Triton GPU kernel), launched across thousands of parallel threads to execute operations such as matrix multiplication, attention, or elementwise math.
Knowledge Distillation
Training & InferenceKnowledge distillation is a model compression technique where a smaller 'student' model is trained to mimic the behavior of a larger 'teacher' model, often transferring soft probabilistic outputs or intermediate representations.
L
Large Language Model
ModelsLarge Language Models (LLMs) are neural networks, typically with billions to hundreds of billions of parameters, trained on massive text corpora to predict and generate human-like text. They power chatbots, code generation, and translation via autoregressive next-token prediction.
Latency
Training & InferenceLatency is the time delay between issuing a request and receiving a response. In inference it spans prompt submission to output delivery; in distributed training it arises from communication overhead, synchronization barriers, and device idle time.
Latent Diffusion
ModelsLatent Diffusion is a class of generative models that learn to denoise compressed image representations (latents) instead of raw pixels, enabling high-quality synthesis with reduced computational cost.
Learning Rate
Training & InferenceLearning rate is a hyperparameter controlling the step size at each iteration while moving toward a minimum of a loss function, determining how quickly or slowly a model updates its weights during training.
Liquid Cooling
InfrastructureLiquid cooling is a thermal management method for high-performance computing hardware that uses a liquid coolant—typically water, dielectric fluid, or refrigerant—to absorb and transfer heat away from components more efficiently than air cooling.
LiveCodeBench
EvaluationLiveCodeBench is a dynamic benchmark for evaluating code generation models on fresh, unpublished programming problems, replacing static datasets like HumanEval to prevent data contamination.
Llama
Companies & ProductsLlama is a family of large language models (LLMs) developed by Meta AI, released as open-weight models for research and commercial use, setting benchmarks in efficiency and performance.
LoRA
Training & InferenceLoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that freezes pre-trained weights and injects trainable low-rank matrices into attention layers, drastically reducing memory and compute requirements while retaining model quality.
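A sketch of the forward pass: the frozen weight W is augmented with a low-rank update BA, and B's zero initialization makes the adapted model exactly match the base model at the start of training:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                                # hidden size, low rank r << d
W = rng.normal(size=(d, d))                  # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01           # trainable down-projection
B = np.zeros((d, r))                         # trainable up-projection, zero-init

def lora_forward(x, alpha=16):
    """y = x W^T + scale * x A^T B^T; only A and B receive gradients."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(1, d))
print(lora_forward(x).shape)                 # at init, output equals the frozen model's
```

Here the trainable parameters number 2·d·r = 8,192 versus d² = 262,144 for full fine-tuning, about 3% of the layer.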
Logit Bias
Training & InferenceLogit bias is a technique that adds a constant offset to the logits (pre-softmax outputs) of specific tokens during autoregressive generation, steering the model's token probabilities without retraining.
Long Short-Term Memory
ModelsLong Short-Term Memory (LSTM) is a recurrent neural network architecture designed to learn long-range dependencies by using gating mechanisms (input, forget, output gates) and a cell state to mitigate the vanishing gradient problem.
Long-Term Memory
AgentsLong-Term Memory in AI agents refers to persistent storage and retrieval of information across sessions, enabling agents to recall past interactions, user preferences, and learned knowledge using vector databases, key-value stores, or fine-tuned model weights.
Loss Function
Training & InferenceA loss function quantifies the error between a model's predictions and the true targets during training, guiding gradient-based optimization. Lower loss indicates better fit.
M
MI300X
InfrastructureAMD MI300X is a high-performance GPU accelerator designed for AI training and inference, featuring 192 GB HBM3 memory and 5.3 TB/s bandwidth, competing with NVIDIA H100.
MMLU
EvaluationMMLU (Massive Multitask Language Understanding) is a benchmark that measures LLM knowledge across 57 subjects including STEM, humanities, and social sciences, using multiple-choice questions.
MMLU-Pro
EvaluationMMLU-Pro is an expanded, harder version of the Massive Multitask Language Understanding (MMLU) benchmark, designed to reduce ceiling effects by adding more challenging questions, increasing answer choices from 4 to 10, and removing noisy or trivial items.
Mamba
ModelsMamba is a state-space model (SSM) architecture for sequence modeling that achieves linear-time inference and training, outperforming Transformers on long-range tasks while matching their quality on language modeling.
Megatron-LM
InfrastructureMegatron-LM is a distributed training framework from NVIDIA that enables training of large language models across many GPUs using model parallelism, tensor parallelism, and pipeline parallelism.
Memory in Agents
AgentsMemory in agents is the mechanism enabling an AI system to retain, recall, and utilize information across interactions, including short-term context windows, long-term vector stores, and episodic buffers for reasoning and personalization.
Meta AI
Companies & ProductsMeta AI is the artificial intelligence research and development division of Meta Platforms, responsible for open-source large language models like Llama, computer vision systems, and foundational AI research.
Mistral
Companies & ProductsMistral is a French AI company founded in 2023 that develops open-weight large language models, including Mistral 7B, Mixtral 8x7B, and Mistral Large, known for their efficiency and strong performance on reasoning and multilingual tasks.
Mixture of Experts
ModelsMixture of Experts (MoE) is a neural network architecture that divides computation across multiple specialized sub-networks ('experts'), each activated by a learned gating mechanism for different inputs, enabling massive model scale with sub-linear compute cost.
Model Context Protocol
AgentsModel Context Protocol (MCP) is an open standard, introduced by Anthropic in 2024, that defines how AI applications and agents connect to external tools and data sources over a client-server interface, exchanging structured context such as tool schemas, resources, and prompts to enable modular, interoperable agent architectures.
Multi-Agent System
AgentsA Multi-Agent System (MAS) is a computational framework where multiple autonomous AI agents interact, coordinate, and collaborate to solve complex tasks that exceed the capability of a single agent.
Multi-Head Attention
ModelsMulti-Head Attention is a neural network mechanism that runs multiple parallel attention operations (heads) over the same input, allowing the model to jointly attend to information from different representation subspaces at different positions.
Multimodal Model
ModelsA multimodal model processes and generates data across multiple modalities (text, image, audio, video) within a single unified architecture, often using a shared latent space or cross-attention mechanisms.
N
NVIDIA
Companies & ProductsNVIDIA is a technology company that designs graphics processing units (GPUs) and system-on-a-chip units for AI, HPC, and gaming. Its CUDA platform and Tensor Core GPUs dominate the AI training and inference market, powering most large language models and generative AI systems as of 2026.
NVLink
InfrastructureNVLink is a high-bandwidth, low-latency GPU-to-GPU interconnect developed by NVIDIA, enabling direct memory access and fast data transfer between multiple GPUs for scalable AI training and inference.
O
OSWorld
EvaluationOSWorld is a benchmark for evaluating multimodal AI agents on open-ended computer tasks in real operating system environments (Ubuntu, Windows, macOS), requiring screen understanding, mouse/keyboard actions, and multi-step reasoning.
OpenAI
Companies & ProductsOpenAI is an AI research and deployment company founded in 2015, creator of the GPT model series, DALL·E, and Whisper. It develops large language models, multimodal systems, and reinforcement learning agents, and transitioned from non-profit to capped-profit in 2019.
Opus
Companies & ProductsOpus is Anthropic's most advanced family of large language models, optimized for deep reasoning, complex analysis, and high-stakes accuracy, powering Claude 3 Opus and its successors Claude Opus 4 and Opus 4.1 (2025).
Orchestrator-Worker
AgentsOrchestrator-Worker is a multi-agent architecture where a central Orchestrator decomposes tasks, delegates subtasks to specialized Worker agents, and synthesizes their outputs into a final result. It enables scalable, modular, and robust agentic systems.
P
PPO
Training & InferencePPO (Proximal Policy Optimization) is a reinforcement learning algorithm that stabilizes training by constraining policy updates within a trust region, widely used for fine-tuning large language models with human feedback.
PagedAttention
Training & InferencePagedAttention is a memory management technique for transformer inference that handles key-value (KV) cache as non-contiguous blocks (pages), analogous to virtual memory paging in operating systems, enabling near-zero waste and efficient sharing across sequences.
Pass@k
EvaluationPass@k measures how often at least one of k generated samples from an AI model contains a correct answer, used primarily for code generation and math reasoning tasks.
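The unbiased estimator from the HumanEval paper (Chen et al., 2021), given n samples of which c are correct:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k draws from n
    generated samples (c of them correct) is correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=30, k=10))  # ~0.81 with a 15% per-sample solve rate
```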
Perplexity
EvaluationPerplexity measures how well a language model predicts a sequence of tokens, calculated as the exponential of the average negative log-likelihood. Lower values indicate better predictive performance.
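The computation in two lines; a model that assigns every token probability 1/4 has perplexity exactly 4:

```python
import math

def perplexity(token_log_probs):
    """Exponential of the average negative log-likelihood over a token sequence."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

print(perplexity([math.log(0.25)] * 10))  # 4.0
```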
Pipeline Parallelism
InfrastructurePipeline parallelism splits a neural network into sequential stages, each on a different device, enabling training of models too large for a single GPU by overlapping computation across devices.
Planning Agent
AgentsA Planning Agent is an AI system that generates and executes multi-step action sequences to achieve a goal, using search, optimization, or learned heuristics to decompose complex tasks into ordered subtasks.
Post-Training Quantization
Training & InferencePost-Training Quantization (PTQ) reduces the numerical precision of a trained model's weights and activations (e.g., from FP32 to INT8) without retraining, lowering memory footprint and inference latency.
Prefill
Training & InferencePrefill is the initial phase of autoregressive inference in which all prompt tokens are processed in parallel to populate the KV cache and produce the first output token; subsequent tokens are then generated one at a time during decode.
Prefix Tuning
Training & InferencePrefix Tuning is a parameter-efficient fine-tuning method that prepends a small set of trainable continuous vectors (a "prefix") to the hidden states of each transformer layer, keeping the original model weights frozen.
Pretraining
Training & InferencePretraining is the initial, large-scale unsupervised or self-supervised training phase of a foundation model on a broad, unlabeled corpus to learn general linguistic or multimodal patterns before task-specific fine-tuning.
Prompt Tuning
Training & InferencePrompt tuning is a parameter-efficient fine-tuning method that learns a small set of soft virtual tokens prepended to the input embedding, keeping the pretrained model frozen. It adapts a foundation model for a downstream task by optimizing only these learned prompt vectors.
Pruning
Training & InferencePruning is a model compression technique that removes unnecessary weights or neurons from a neural network to reduce size and computational cost while preserving accuracy.
Q
QLoRA
Training & InferenceQLoRA (Quantized Low-Rank Adaptation) is a memory-efficient fine-tuning method that combines 4-bit NormalFloat quantization of a frozen base model with Low-Rank Adaptation (LoRA) adapters, enabling fine-tuning of models as large as 65B parameters on a single GPU.
Quantization
Training & InferenceQuantization reduces the numerical precision of a model's weights and activations (e.g., from 32-bit floats to 8-bit integers), shrinking memory footprint and accelerating inference with minimal accuracy loss.
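A sketch of symmetric int8 quantization of a weight tensor; real schemes add per-channel scales and calibration data:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric post-training quantization of a float tensor to int8."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())  # small rounding error, 4x less memory
```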
Quantization-Aware Training
Training & InferenceQuantization-Aware Training (QAT) is a technique that simulates quantization effects during neural network training, enabling the model to learn weights and activations robust to low-precision inference, yielding higher accuracy than post-training quantization.
R
RAG
ModelsSee Retrieval-Augmented Generation.
RLAIF
Training & InferenceRLAIF (Reinforcement Learning from AI Feedback) is a training method that replaces human preference judgments in RLHF with an AI judge, typically a large language model, to generate preference labels for optimizing a policy model.
RLHF
Training & InferenceRLHF (Reinforcement Learning from Human Feedback) is a training method that aligns language models with human preferences by using human-rated outputs as reward signals, typically via a reward model and PPO optimization.
ROCm
InfrastructureROCm (Radeon Open Compute) is AMD's open-source software platform for GPU-accelerated machine learning and HPC, providing a CUDA-like runtime, compiler (HIP), and libraries targeting AMD Instinct GPUs.
ROUGE
EvaluationROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics comparing a generated text against reference summaries, measuring recall of n-grams, word sequences, and word pairs.
RWKV
ModelsRWKV is a neural network architecture that combines the parallelizable training of Transformers with the efficient inference of RNNs, using a linear attention mechanism and a time-mixing formulation.
ReAct Pattern
AgentsThe ReAct pattern combines reasoning and acting in LLM agents, interleaving chain-of-thought traces with tool calls to ground inference in external actions and observations, improving accuracy and interpretability.
Reasoning Model
ModelsReasoning models are AI systems designed to perform multi-step logical deduction, planning, or mathematical inference, often using chain-of-thought prompting, search, or structured symbolic methods to produce verifiable outputs.
Recurrent Neural Network
ModelsRecurrent Neural Network (RNN): a neural network architecture designed to process sequential data by maintaining a hidden state that captures information about previous inputs, enabling tasks like language modeling and time-series prediction.
Red Teaming
EvaluationRed teaming is a structured adversarial evaluation method where a team of human testers or automated systems deliberately attempts to elicit harmful, biased, or otherwise unsafe outputs from an AI model to identify vulnerabilities before deployment.
Reflection
AgentsReflection is an agentic pattern where an LLM evaluates its own outputs, iteratively refining them based on self-generated critique or external feedback.
Reranker
ModelsReranker is a second-stage model that re-scores a small set of candidate documents or passages retrieved by a fast first-stage retriever, improving ranking accuracy by applying deeper cross-attention between query and candidate.
Retrieval-Augmented Generation
ModelsRetrieval-Augmented Generation (RAG) is a hybrid model architecture that combines a retrieval system (e.g., dense passage retrieval) with a generative language model (e.g., GPT-4) to produce factually grounded, up-to-date responses by fetching relevant external knowledge at inference time.
Reward Model
Training & InferenceA reward model is a neural network trained to predict human preference scores for model outputs, used as a proxy reward signal in reinforcement learning from human feedback (RLHF) to align language models with human values.
Runway
Companies & ProductsRunway is an AI research company and platform for generative video, image, and 3D content creation, known for tools like Gen-2, Gen-3 Alpha, and the Green Screen feature.
S
SWE-bench
EvaluationSWE-bench is a benchmark for evaluating LLMs on real-world software engineering tasks, requiring models to generate patches for GitHub issues from codebases.
Sampling Temperature
Training & InferenceSampling temperature is a hyperparameter that controls the randomness of token generation in language models by scaling the logits before softmax, with lower values (e.g., 0.1) producing more deterministic outputs and higher values (e.g., 1.5) increasing diversity.
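A sketch of the mechanism: logits are divided by T before the softmax, so low T sharpens the distribution toward greedy decoding and high T flattens it:

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=np.random.default_rng(0)):
    """Scale logits by 1/T before softmax; T -> 0 approaches greedy decoding."""
    z = logits / temperature
    probs = np.exp(z - z.max()); probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

logits = np.array([2.0, 1.0, 0.1])
print(sample_with_temperature(logits, 0.1))  # almost always token 0
print(sample_with_temperature(logits, 1.5))  # noticeably more diverse
```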
Sandbox
AgentsA sandbox is an isolated execution environment for AI agents, enforcing constraints on actions (e.g., file I/O, network calls, code execution) to prevent unintended side effects while allowing controlled interaction with tools and data.
Scaling Laws
Training & InferenceScaling laws describe predictable relationships between model performance and key training factors: dataset size, parameter count, and compute budget. They guide resource allocation and suggest that larger models and more data, up to a point, yield diminishing but reliable returns.
Self-Attention
ModelsSelf-Attention is a neural network mechanism that computes a weighted sum over all positions in a sequence, allowing each element to directly attend to every other element for capturing long-range dependencies.
Self-Critique
AgentsSelf-Critique is a method where an LLM evaluates and refines its own outputs by generating feedback or corrections, often via multi-turn prompting or dedicated critique models.
Self-Supervised Learning
Training & InferenceSelf-Supervised Learning (SSL) trains representations from unlabeled data by creating pretext tasks where the model predicts parts of the input from other parts, enabling pre-training on massive corpora without human annotations.
Sequence Parallelism
InfrastructureSequence Parallelism is a distributed training technique that splits a single long input sequence across multiple devices along the sequence dimension, enabling the training of models with very long context windows that would otherwise exceed single-device memory.
Small Language Model
ModelsA Small Language Model (SLM) is a compact neural network with typically fewer than 10 billion parameters, designed to perform language tasks with reduced computational cost, lower latency, and feasibility for on-device deployment while often sacrificing some accuracy versus large models.
Sonnet
Companies & ProductsSonnet is a series of large language models within Anthropic's Claude family, optimized for speed, cost-efficiency, and reliable performance in production workloads.
Sparse MoE
ModelsSparse Mixture of Experts (Sparse MoE) is a neural network architecture that activates only a subset of parameters per input token, scaling model capacity without proportional compute cost.
Sparsity
Training & InferenceSparsity is the property of a matrix or tensor where most elements are zero, exploited in AI/ML to reduce memory footprint and computation by storing and operating only on non-zero values.
Speculative Decoding
Training & InferenceSpeculative decoding is an inference-time technique that uses a small draft model to generate candidate tokens, which are then verified in parallel by a large target model, achieving speedups without modifying the target model's weights.
Speech Recognition Model
ModelsA speech recognition model is a machine learning system that transcribes spoken audio into text, typically using an end-to-end deep neural network trained on thousands of hours of labeled speech data.
Stability AI
Companies & ProductsStability AI is a London-based generative AI company best known for developing Stable Diffusion, a family of open-weight text-to-image models. It also builds models for audio, video, 3D, and code, and operates the DreamStudio platform.
State Space Model
ModelsState Space Models (SSMs) are a class of sequence models that map inputs to hidden states via linear differential or difference equations, offering efficient parallel training and linear-time inference for long sequences.
Streaming
Training & InferenceStreaming in ML training refers to processing data as a continuous, incremental flow rather than loading a static dataset entirely into memory, enabling training on unbounded data or hardware with limited RAM.
Structured Output
Training & InferenceStructured output in AI/ML refers to training or prompting techniques that force a model to generate responses conforming to a predefined schema (e.g., JSON, XML, typed lists), enabling reliable parsing and downstream automation.
Subagent
AgentsSubagent is a subordinate AI agent that operates within a larger multi-agent system, receiving tasks, goals, or constraints from a primary agent and executing specialized subtasks with limited autonomy.
Supervised Fine-Tuning
Training & InferenceSupervised Fine-Tuning (SFT) adapts a pretrained model on labeled input-output pairs to specialize its behavior for a downstream task, using standard supervised learning loss (e.g., cross-entropy on tokens).
T
TGI
Training & InferenceTGI (Text Generation Inference) is a high-performance, open-source inference server for large language models, developed by Hugging Face for production deployment.
TPU
InfrastructureTPU (Tensor Processing Unit) is Google's custom ASIC designed to accelerate tensor computations for neural network training and inference, offering high throughput and energy efficiency for large-scale AI workloads.
Task Decomposition
AgentsTask decomposition breaks a complex goal into smaller, manageable sub-tasks, often executed sequentially or in parallel by an agent. It enables reasoning, planning, and error recovery in multi-step workflows.
Tensor Core
InfrastructureTensor Cores are specialized matrix-math units in NVIDIA GPUs (Volta and later, including Hopper and Blackwell) that execute fused matrix multiply-accumulate operations in a single clock cycle (4×4 FP16 matrices in Volta), accelerating mixed-precision training and inference of deep neural networks.
Tensor Parallelism
InfrastructureTensor Parallelism splits individual tensor operations (e.g., matrix multiplies) across multiple devices, each holding a shard of the weights, to reduce per-device memory and compute for large models.
Text-to-Speech Model
ModelsA text-to-speech model converts written text into natural-sounding spoken audio using deep learning. It typically combines neural text analysis, acoustic feature prediction, and a vocoder for waveform generation.
Throughput
Training & InferenceThroughput in training is the rate at which a system processes training examples per unit time, typically measured in samples per second or tokens per second.
Time to First Token
Training & InferenceTime to First Token (TTFT) is the latency from submitting a prompt to a language model to receiving the first output token, critical for real-time applications like chatbots.
Tokenization
Training & InferenceTokenization converts raw text into discrete units (tokens) — words, subwords, or characters — that a model can process. It determines vocabulary size, sequence length, and how out-of-vocabulary words are handled, directly impacting training efficiency and model quality.
Tokens per Second
Training & InferenceTokens per second (TPS) measures the number of input or output tokens a model processes per second during training or inference, directly reflecting system throughput and hardware utilization.
Tool Use
AgentsTool Use is the capability of an AI agent to call external functions, APIs, databases, or software tools to accomplish tasks beyond its intrinsic knowledge, enabling dynamic information retrieval, computation, and action execution.
Tool-Use Model
ModelsA Tool-Use Model is an AI system that can interact with external tools (APIs, databases, calculators, code interpreters) to perform actions beyond its static knowledge, extending its capabilities for tasks like web search, math, or code execution.
Top-k Sampling
Training & InferenceTop-k sampling is a text generation method that restricts the next-token choice to the k most probable tokens, then renormalizes and samples from that subset, balancing diversity and coherence.
Top-p Sampling
Training & InferenceTop-p Sampling (nucleus sampling) selects tokens from the smallest set whose cumulative probability exceeds threshold p, replacing fixed top-k with dynamic vocabulary truncation during text generation.
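A sketch of nucleus sampling over a toy logit vector; note the kept set grows or shrinks with the shape of the distribution, unlike fixed top-k:

```python
import numpy as np

def top_p_sample(logits, p=0.9, rng=np.random.default_rng(0)):
    """Sample from the smallest token set whose cumulative probability >= p."""
    probs = np.exp(logits - logits.max()); probs /= probs.sum()
    order = np.argsort(probs)[::-1]                    # most probable first
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    keep = order[:cutoff]                              # the dynamic nucleus
    renorm = probs[keep] / probs[keep].sum()
    return rng.choice(keep, p=renorm)

print(top_p_sample(np.array([3.0, 2.5, 0.3, -1.0, -2.0]), p=0.9))
```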
Trainium
InfrastructureTrainium is Amazon Web Services' custom ASIC machine learning accelerator, optimized for training deep neural networks, offering up to 50% cost savings over GPU-based instances for supported workloads.
Transformer
ModelsTransformer is a neural network architecture based solely on attention mechanisms, introduced in 'Attention Is All You Need' (2017). It processes sequences in parallel via self-attention and position-wise feed-forward layers, enabling efficient training and state-of-the-art performance in NLP, vision, and beyond.
Triton
InfrastructureTriton is an open-source language and compiler for writing high-performance GPU kernels, developed by OpenAI. It abstracts low-level CUDA details to simplify custom operator development.
TruthfulQA
EvaluationTruthfulQA is a benchmark that measures whether large language models generate truthful answers by testing for common misconceptions and false beliefs across 38 categories.
V
Variational Autoencoder
ModelsA Variational Autoencoder (VAE) is a generative model that learns a latent variable representation of input data by combining neural networks with variational inference, enabling the generation of new samples similar to the training distribution.
Vector Database
InfrastructureA vector database is a specialized database that indexes and queries high-dimensional vector embeddings generated by machine learning models, enabling similarity search, retrieval-augmented generation (RAG), and semantic matching at scale.
Vision Language Model
ModelsA vision-language model (VLM) processes images and text jointly, enabling tasks like image captioning, visual question answering, and document understanding by aligning visual features with language representations.
Vision Transformer
ModelsVision Transformer (ViT) is a neural network architecture that applies the Transformer encoder directly to image patches, treating them as token sequences, achieving state-of-the-art image classification without convolution.
vLLM
Training & InferencevLLM is an open-source inference engine that uses PagedAttention to manage key-value cache memory, achieving near-zero memory waste and up to 24x higher throughput for large language models.
W
WebArena
EvaluationWebArena is a benchmark for evaluating autonomous web agents on realistic, multi-step tasks in a controlled environment, measuring task completion, efficiency, and robustness.
Working Memory
AgentsWorking Memory in AI agents is a limited-capacity, short-lived store that holds recent observations, intermediate reasoning steps, and task context across multiple inference calls, enabling coherent multi-step behavior within a task without retraining.
World Model
ModelsA world model is a learned internal representation of an environment that an AI system uses to simulate possible futures, plan actions, and reason causally, often trained via self-supervised or reinforcement learning.
X
Z
ZeRO
InfrastructureZeRO is a memory optimization technique for distributed deep learning that partitions model states (parameters, gradients, optimizer states) across data-parallel processes, eliminating memory redundancy while maintaining computational granularity.
Zero-Shot Learning
Training & InferenceZero-shot learning (ZSL) trains a model to recognize classes never seen during training by leveraging semantic side information (e.g., attributes, word embeddings) to bridge seen and unseen categories.