What is AgentPerf and why was it created?

AgentPerf is the first benchmark designed for agentic AI workloads, measuring multi-step task completion instead of single LLM calls, because agents chain dozens of calls with tool use.

How does Blackwell Ultra achieve 20x more agents per megawatt?

Through rack-scale integration of 72 GPUs, CUDA kernels that overlap communication with compute, and TensorRT LLM's separate optimization of input processing and output generation.


NVIDIA Blackwell Ultra Leads First Agentic AI Benchmark, 20x Agents/MW vs Hopper

NVIDIA Blackwell Ultra NVL72 leads the first AgentPerf benchmark for agentic AI, delivering 20x more agents per megawatt than Hopper.

Ala SMITH & AI Research Desk · Jun 12, 2026 · 5 min read · AI-Generated

Source: blogs.nvidia.com (widely reported)

Which NVIDIA platform leads the first agentic AI benchmark and by how much?

NVIDIA Blackwell Ultra NVL72 leads the first AgentPerf benchmark for agentic AI, delivering 20x more agents per megawatt than Hopper H200, using DeepSeek V4 Pro.

TL;DR

AgentPerf is the first benchmark for agentic AI workloads. · Blackwell Ultra NVL72 runs 20x more agents per megawatt than Hopper. · The benchmark uses the DeepSeek V4 Pro MoE model for realism.

Artificial Analysis released AgentPerf, the first benchmark purpose-built for agentic AI workloads. NVIDIA Blackwell Ultra NVL72 leads the initial round, running 20x more agents per megawatt than the Hopper H200 system.

Key facts

AgentPerf is the first benchmark for agentic AI workloads.
Blackwell Ultra NVL72 leads with 20x agents/MW vs Hopper.
The benchmark uses the DeepSeek V4 Pro MoE model.
GB300 NVL72 connects 72 GPUs in a rack-scale system.
CUDA kernels overlap communication and compute for efficiency.

Agentic AI Is Not Chat: Why AgentPerf Matters

AgentPerf from Artificial Analysis, the industry’s first agentic AI benchmark, gives developers, enterprises and infrastructure providers a clear way to compare systems for agentic AI, according to the NVIDIA blog.

The distinction from traditional inference benchmarks is critical. A single chat completion is a sprint: one LLM call, one response. An agent functions more like a relay: it breaks a goal into many steps and keeps going until the task is done. That results in dozens to hundreds of LLM calls chained together, each passing growing context to the next, with tool calls like code compilation and execution, database search and web browsing at every handoff. The complexity isn’t additive; it’s multiplicative.

Existing AI inference benchmarks measure one LLM call: how fast an LLM responds to a single request and how many simultaneous requests a system can handle. They weren’t designed for agentic workloads, where chained LLM calls, tool call delays and growing context stress accelerated computing systems in fundamentally different ways.
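To make the "growing context" point concrete, here is a minimal Python sketch of how prompt size accumulates across a chain of calls. It is an illustration only, not part of AgentPerf: the function names, the linear growth assumption and every number in it are hypothetical, and it assumes no prefix or KV-cache reuse, which real serving stacks use to avoid reprocessing earlier context.

```python
def chained_prompt_tokens(num_calls, base_context, growth_per_call):
    """Return the prompt size seen by each call in a chain.

    Call i (0-indexed) receives the base context plus everything the
    i earlier steps appended (model output and tool results).
    Linear growth is an assumption made for illustration.
    """
    return [base_context + i * growth_per_call for i in range(num_calls)]


def total_prompt_tokens(num_calls, base_context, growth_per_call):
    """Total prompt tokens processed across the chain, with no cache reuse."""
    return sum(chained_prompt_tokens(num_calls, base_context, growth_per_call))


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the shape of the growth.
    single = total_prompt_tokens(1, 2_000, 1_500)
    chain = total_prompt_tokens(50, 2_000, 1_500)
    print(f"single call: {single} prompt tokens")
    print(f"50-call chain: {chain} prompt tokens ({chain / single:.1f}x)")
```

Under these assumptions the total grows roughly with the square of the chain length, which is one way to read the claim that agent workloads stress systems differently from single-call inference.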

Blackwell Ultra NVL72: 20x Agents per Megawatt

In the first round of published results, the NVIDIA Blackwell Ultra NVL72 platform delivers leading performance across the agentic AI workloads tested, running 20x more agents per megawatt than NVIDIA Hopper. The benchmark uses DeepSeek V4 Pro, a large mixture-of-experts (MoE) model representing the class of frontier models powering today’s most capable agents.


On this workload, NVIDIA GB300 NVL72 delivers the highest performance in the benchmark, running up to 20x more agents per megawatt than the NVIDIA HGX H200 system. The performance advantage comes from extreme codesign across the full stack. GB300 NVL72 connects 72 GPUs into a single rack-scale system, enabling large MoE models like DeepSeek V4 Pro to distribute model execution efficiently at scale.

CUDA kernels accelerate this further by overlapping communication and compute, so the cost of coordinating across experts is absorbed rather than added to latency. NVIDIA TensorRT LLM sustains efficiency as concurrent agent sessions scale. For example, it separates the processing of inputs from the generation of outputs so each can be optimized independently.
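The effect of overlapping communication with compute can be sketched with a simple timing model in Python. This is an idealized illustration, not NVIDIA's implementation: it assumes each layer's communication can be hidden fully behind that layer's compute, and the per-layer timings are hypothetical.

```python
def serial_latency(layers):
    """Latency when each layer waits for communication after compute."""
    return sum(compute_ms + comm_ms for compute_ms, comm_ms in layers)


def overlapped_latency(layers):
    """Latency when communication runs concurrently with compute.

    Idealized assumption: within a layer the two fully overlap, so the
    layer costs only as much as the slower of the two.
    """
    return sum(max(compute_ms, comm_ms) for compute_ms, comm_ms in layers)


if __name__ == "__main__":
    # Hypothetical (compute_ms, comm_ms) pairs for three MoE layers.
    layers = [(3.0, 2.0), (3.0, 4.0), (2.5, 1.0)]
    print("serial:    ", serial_latency(layers), "ms")
    print("overlapped:", overlapped_latency(layers), "ms")
```

In this model, communication adds to latency only in layers where it takes longer than the compute it overlaps with, which matches the article's description of coordination cost being "absorbed rather than added".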

Benchmark Built on Real Coding Agent Trajectories

AgentPerf is built on real coding agent trajectories: an agent receives a task, reads files, writes and edits code, executes commands and iterates based on the results. This methodology grounds the benchmark in production-like conditions, unlike synthetic or single-turn evaluations.
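The shape of such a trajectory can be sketched as a short Python loop. Everything here is a stand-in: `scripted_policy` replaces a real LLM call with a fixed script, `run_tool` only simulates tool output, and none of the names come from AgentPerf itself.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    task: str
    steps: list = field(default_factory=list)  # (action, observation) pairs


def scripted_policy(task, history):
    """Hypothetical stand-in for an LLM call: returns the next action.

    A real agent would send the task and the full history (the growing
    context) to a model; here the actions follow a fixed script.
    """
    script = ["read_file", "edit_code", "run_command",
              "edit_code", "run_command", "done"]
    return script[min(len(history), len(script) - 1)]


def run_tool(action, step_index):
    """Simulated tool call; a real one would touch files or a shell."""
    return f"result of {action} at step {step_index}"


def run_agent(task, policy, max_steps=10):
    """Loop until the policy signals completion or the step budget runs out."""
    trajectory = Trajectory(task)
    for _ in range(max_steps):
        action = policy(task, trajectory.steps)
        if action == "done":
            break
        observation = run_tool(action, len(trajectory.steps))
        trajectory.steps.append((action, observation))
    return trajectory


if __name__ == "__main__":
    result = run_agent("fix failing unit test", scripted_policy)
    for action, observation in result.steps:
        print(action, "->", observation)
```

Each pass through the loop corresponds to one LLM call plus one tool call, so a benchmark built on trajectories exercises the whole loop rather than a single request.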

NVIDIA GB300 NVL72 supports far more concurrent agents per megawatt than NVIDIA H200 at both service-level objectives of 20 and 60 tokens per second p

For companies building and deploying agents at scale, it’s important to understand how responsive agents are, how many can be deployed simultaneously and how much useful work AI infrastructure can deliver for every dollar and watt invested. The 20x agents-per-megawatt metric directly addresses the total cost of ownership question that enterprises face when scaling agent deployments.

Infrastructure Benchmarking Catches Up to the Agent Shift

Until now, the industry benchmarked AI infrastructure on metrics designed for chatbot inference — tokens per second, time-to-first-token, concurrent sessions. Those numbers told operators how fast a single response arrived, but not how many multi-step tasks a system could complete per unit of energy. AgentPerf flips this: it measures throughput of completed agentic tasks, not raw token generation. The 20x gap between Blackwell Ultra and Hopper suggests that architectural choices — particularly the rack-scale GPU interconnect and communication-compute overlap — matter far more for agent workloads than for chat. The implication for infrastructure buyers: optimizing for chat benchmarks may leave agent performance on the table.
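A rough Python sketch of how an agents-per-megawatt figure could be derived from measurements follows. The article does not publish AgentPerf's exact formula, so the selection rule (highest concurrency that still meets a per-agent speed objective), the function names and every number are assumptions for illustration, not benchmark data.

```python
def max_agents_meeting_slo(measurements, slo_tokens_per_s):
    """Highest tested concurrency whose per-agent speed meets the SLO.

    `measurements` is a list of (concurrent_agents, per_agent_tokens_per_s)
    pairs. Returns 0 if no tested point meets the objective.
    """
    passing = [agents for agents, tps in measurements if tps >= slo_tokens_per_s]
    return max(passing, default=0)


def agents_per_megawatt(concurrent_agents, power_kw):
    """Normalize a concurrency figure by system power draw."""
    if power_kw <= 0:
        raise ValueError("power_kw must be positive")
    return concurrent_agents * 1000 / power_kw


if __name__ == "__main__":
    # Hypothetical sweep for one system drawing a hypothetical 120 kW.
    sweep = [(100, 75.0), (300, 62.0), (600, 25.0), (900, 12.0)]
    for slo in (20, 60):
        agents = max_agents_meeting_slo(sweep, slo)
        print(f"SLO {slo} tok/s: {agents} agents, "
              f"{agents_per_megawatt(agents, 120):.0f} agents/MW")
```

Normalizing by power rather than by GPU count is what lets systems of very different sizes, such as a 72-GPU rack and a smaller server, be compared on one axis.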

Agents chain together multiple LLM calls and tool calls to gather context, observe, reason and act.

NVIDIA's positioning here is strategic. As OpenAI, Google, and Anthropic race to deploy autonomous coding agents, the infrastructure layer becomes the bottleneck. Blackwell Ultra's lead on agent-specific metrics gives NVIDIA a narrative advantage over competitors like AMD and Intel, who have yet to demonstrate comparable agent-optimized performance. The 20x efficiency gap also pressures hyperscalers building their own custom silicon (Google TPU, AWS Trainium) to validate against agent workloads, not just training and inference.

What to Watch

Watch for the next AgentPerf round to include AMD MI400 and Intel Falcon Shores results, which will reveal whether Blackwell's agent advantage is architectural or merely first-mover. Also track whether Artificial Analysis expands the benchmark to include multi-agent orchestration scenarios, which would stress interconnect bandwidth even harder.


Source: gentic.news · Jun 12, 2026 · Author: Ala SMITH

AI-assisted reporting. Generated by gentic.news from 2 verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.


AI Analysis

AgentPerf represents a necessary correction in AI infrastructure benchmarking. For two years, the industry optimized for chat inference metrics such as tokens per second and time-to-first-token, which are poor proxies for agent workloads. An agent that makes 50 chained LLM calls with tool execution between each has a fundamentally different compute profile: memory bandwidth matters more, interconnect latency compounds, and the ratio of compute to communication shifts.

NVIDIA's 20x lead over Hopper on agents per megawatt is not just a generational improvement; it's a structural advantage from rack-scale design. The GB300 NVL72's 72-GPU interconnect means large MoE models like DeepSeek V4 Pro can distribute experts across GPUs with minimal cross-node communication. On Hopper, the same model would incur significant PCIe or InfiniBand overhead for every expert routing decision. The CUDA kernel optimization that overlaps communication and compute further reduces the effective latency of multi-step chains.

The strategic implication is that NVIDIA is positioning Blackwell as the default infrastructure for autonomous agents, a market that could dwarf chat inference in compute demand. If agents become the primary AI workload, the benchmark advantage translates directly into TCO wins for enterprises. Competitors will need to show they can match this on agent-specific metrics, not just on standard MLPerf inference benchmarks.
