Bar chart illustrating median coding agent token usage at 96k input tokens, with data points from 432k real…

Median Coding Agent Hits 96k Input Tokens, Rewriting Inference Economics

SemiAnalysis found that the median coding agent request uses 96k input tokens, based on 432k requests, shifting the focus of inference cost from output to context.

Ala SMITH & AI Research Desk · May 22, 2026 · 3 min read · AI-Generated

Source: x.com via @SemiAnalysis_

What is the median input token count for coding agent requests?

SemiAnalysis found the median coding agent request uses 96k input tokens, based on 432k real requests, reshaping inference cost assumptions by making output token pricing less dominant.

TL;DR

Median coding agent request uses 96k input tokens. · 432k real agent requests analyzed by SemiAnalysis. · Inference cost structure shifts with agentic workloads.

SemiAnalysis found the median coding agent request uses 96k input tokens. The analysis, pulled from 432k real coding agent requests, shows agentic workloads are reshaping inference cost assumptions.

Key facts

Median coding agent request: 96k input tokens.
Sample size: 432k real coding agent requests.
96k tokens exceeds the text of The Great Gatsby.
Input token volume now dominates inference cost.
At 96k tokens, the median request is triple the 32k high end of typical prompt-size assumptions.

SemiAnalysis published data from 432k real coding agent requests showing the median input token count is 96k tokens. For context, that exceeds the entire text of The Great Gatsby. The finding challenges conventional inference pricing models that assume output tokens dominate cost.

What the Data Reveals

Most prompt engineering and API pricing assumes prompts of 4k-32k tokens. The median agentic request is 96k input tokens, triple the high end of that range. According to @SemiAnalysis_, this shifts the cost center from output generation to context processing. Output tokens, while still relevant, become a secondary driver of total inference cost.

Implications for Pricing and Architecture

Current API pricing from providers like Anthropic, OpenAI, and Google often charges more per output token than per input token. With agentic workloads, the input token volume dwarfs output. A 96k input / 1k output request costs far more than a 4k input / 4k output request under standard pricing. This creates an incentive for providers to optimize context handling — via KV-cache compression, sparse attention, or sliding window techniques — rather than pure generation speed.
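To make the comparison concrete, here is a minimal cost sketch. The per-million-token prices are hypothetical placeholders chosen only for illustration (output priced at 5x input); they are not any provider's actual rates, and the function name is an assumption of this example.

```python
# Hypothetical per-million-token prices in USD, for illustration only.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the cost of one request in USD under the illustrative prices above."""
    input_cost = input_tokens * INPUT_PRICE_PER_M / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
    return input_cost + output_cost


# The two request shapes compared in the article.
agentic = request_cost(input_tokens=96_000, output_tokens=1_000)
chat = request_cost(input_tokens=4_000, output_tokens=4_000)

# Fraction of the agentic request's cost that comes from input tokens.
input_share = (96_000 * INPUT_PRICE_PER_M / 1_000_000) / agentic

print(f"agentic request:    ${agentic:.3f}")
print(f"chat-style request: ${chat:.3f}")
print(f"input share of agentic cost: {input_share:.0%}")
```

Under these assumed prices the 96k/1k request costs roughly four times the 4k/4k request, and about 95% of its cost is input, even though each output token is priced five times higher.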

The finding also suggests agentic systems are not just longer prompts but fundamentally different usage patterns. The median request includes codebase context, conversation history, and tool outputs. That context accumulates over multi-step interactions, making each subsequent request more expensive than the last.
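The accumulation effect can be sketched with a toy model. It assumes the agent resends its full history on every step and that each step appends a fixed number of tokens; the starting context, per-step growth, and step count below are hypothetical values, not figures from SemiAnalysis.

```python
def input_tokens_per_step(base_context: int, added_per_step: int, steps: int) -> list[int]:
    """Input tokens sent at each step when the full history is resent every time."""
    return [base_context + added_per_step * i for i in range(steps)]


# Hypothetical session: 20k tokens of initial context, 4k tokens of new
# tool output and conversation appended per step, 20 steps in total.
per_step = input_tokens_per_step(base_context=20_000, added_per_step=4_000, steps=20)
total = sum(per_step)

print(f"first request: {per_step[0]:,} input tokens")
print(f"last request:  {per_step[-1]:,} input tokens")
print(f"session total: {total:,} input tokens")
```

In this toy session the per-request input grows linearly from 20k to 96k tokens, so the session total (1.16M input tokens) grows roughly with the square of the number of steps.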

Why This Matters More Than the Press Release Suggests

The unique take: Agentic workloads invert the standard inference cost model. The industry has focused on reducing output token cost (via speculative decoding, quantization) but the real cost driver is now input token volume. Providers that optimize context processing — not generation — will win the agentic inference market.

SemiAnalysis did not disclose the tails of the distribution or the specific agent systems analyzed, but the sample size of 432k requests gives the median statistical weight. The company also did not specify whether the median includes failed or truncated requests, which could pull it lower.

What to watch

Watch for API pricing changes from Anthropic, OpenAI, and Google in the next two quarters — specifically whether they introduce lower input token rates or context-caching features. Also track if agentic framework providers (LangChain, Vercel AI SDK) add cost-aware routing based on context size.
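As a rough sketch of why context caching matters at this scale, the function below estimates input cost when part of the prompt is served from a cache at a discount. The price, cached fraction, and discount are all hypothetical; real caching terms vary by provider and may include cache-write charges, which are not modeled here.

```python
def input_cost(
    input_tokens: int,
    cached_fraction: float,
    input_price_per_m: float,
    cached_discount: float,
) -> float:
    """Input cost in USD when a fraction of the tokens is billed at a discounted cached rate."""
    cached_tokens = input_tokens * cached_fraction
    fresh_tokens = input_tokens - cached_tokens
    cached_price_per_m = input_price_per_m * (1 - cached_discount)
    return (fresh_tokens * input_price_per_m + cached_tokens * cached_price_per_m) / 1_000_000


# Hypothetical values: $3 per million input tokens, 90% of a 96k-token prompt
# already cached from the previous step, cached tokens billed at a 90% discount.
uncached = input_cost(96_000, cached_fraction=0.0, input_price_per_m=3.00, cached_discount=0.9)
cached = input_cost(96_000, cached_fraction=0.9, input_price_per_m=3.00, cached_discount=0.9)
savings = 1 - cached / uncached

print(f"without caching: ${uncached:.4f}")
print(f"with caching:    ${cached:.4f}")
print(f"input cost reduction: {savings:.0%}")
```

Under these assumptions the input cost of a median-sized request drops by about four fifths, which is why a caching feature or a lower input rate would change agentic economics more than a cut in output pricing.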

[Updated 23 May via reddit_claude]

The high context consumption in agentic workflows may be creating a new kind of coordination debt. A senior engineer on Reddit reported that two engineers on the same team used Claude Code to add error handling to the same service — one wrapped everything in try/catch with Sentry logging, the other built a custom Result type — both merged the same week, producing two inconsistent patterns caught only in review. The user noted that team-level speedup hasn't materialized despite individual productivity gains, because each developer's AI works from different local context and standards docs remain unread [per Reddit r/ClaudeAI].

Sources cited in this article

Source: gentic.news · May 22, 2026 · Author: Ala SMITH

AI-assisted reporting. Generated by gentic.news from 1 verified source, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

The SemiAnalysis data exposes a structural shift in how inference costs should be modeled. The standard assumption that output tokens dominate is wrong for agentic workloads. This has direct implications for GPU utilization: processing a long prompt (prefill) is largely compute-bound, while every generated token must then read a KV cache that grows with context length, which makes decoding memory-bandwidth-bound. Providers optimizing for memory bandwidth and capacity (e.g., with HBM3e or custom attention hardware) will have an advantage.

The finding also explains recent trends in model architecture. The move toward longer context windows (128k, 200k, 1M tokens) is not just a feature; it is a necessity driven by real usage. But longer context without efficient sequence mechanisms (such as linear attention, or state-space models like Mamba) will cripple inference throughput.

Contrarian take: The industry's focus on output token optimization (speculative decoding, quantization) may be misallocated. The real bottleneck for agentic inference is input token throughput. Providers that solve context caching and KV-cache reuse will win more than those that shave milliseconds off generation.

#inference #cost-analysis #agentic-ai
