AI inference costs are decreasing by 5-10x annually, with occasional 10-100x jumps for certain capabilities, according to @kimmonismus on X. This trend challenges the narrative in a recent Forbes article that compute costs far exceed employee costs for AI deployment.
Key facts
- Inference costs dropping 5-10x annually per @kimmonismus.
- Occasional 10-100x cost jumps for specific capabilities.
- GPT-3-class inference dropped from $0.02 to under $0.002 per 1K tokens (2022-2026).
- Mixture-of-experts models (e.g., Mixtral 8x7B) enabled ~6x cost reduction in 2024.
- Quantization (FP8 vs FP16) can yield 2x cost improvement.
In a pointed critique of a Forbes article, @kimmonismus argues that the publication overlooks the rapid deflation in AI inference costs. According to @kimmonismus, inference is becoming 5-10x cheaper each year, with occasional jumps of 10-100x for specific capabilities. The implication is that while compute may currently exceed employee costs for some deployments, that imbalance is unlikely to persist for many years.
The Deflation Trajectory
This observation aligns with broader industry trends. For instance, the cost of running GPT-3-class models has fallen from roughly $0.02 per 1K tokens in 2022 to under $0.002 per 1K tokens for similar-quality outputs in 2026, per public pricing data from major API providers. The 5-10x annual improvement rate compounds: a deployment costing $100,000 in inference today would cost roughly $20,000 after one year at the conservative 5x rate, and $4,000 or less after two years (5x compounded twice is 25x).
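The compounding arithmetic can be made concrete with a short sketch. The $100,000 deployment and the two-year horizon are the illustrative figures from the paragraph above; the helper function itself is hypothetical:

```python
# Sketch of how annual inference-cost deflation compounds.
# The $100,000 starting cost and 5-10x annual factors are the
# illustrative figures discussed in the text, not measured data.

def projected_cost(initial_cost: float, annual_factor: float, years: int) -> float:
    """Cost after `years` if inference gets `annual_factor`x cheaper each year."""
    return initial_cost / (annual_factor ** years)

initial = 100_000.0
conservative = projected_cost(initial, annual_factor=5, years=2)   # 25x cheaper
aggressive = projected_cost(initial, annual_factor=10, years=2)    # 100x cheaper

print(f"After 2 years: ${aggressive:,.0f} to ${conservative:,.0f}")
# After 2 years: $1,000 to $4,000
```

Even at the conservative end of the claimed range, the two-year cost falls well below the often-quoted "order of magnitude" intuition, which is why static cost assumptions age quickly.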
Occasional Jumps
The reference to 10-100x jumps for certain capabilities likely points to architectural breakthroughs or hardware optimizations. For example, the shift from dense to mixture-of-experts models (e.g., Mixtral 8x7B) enabled a roughly 6x cost reduction at equivalent quality in 2024. Quantization (running inference in FP8 rather than FP16) can yield roughly 2x cost improvements, while specialized inference chips (e.g., Groq LPUs) have demonstrated 10x latency improvements on specific workloads.
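As a rough sketch, stacking the factors cited above multiplicatively shows how step-change jumps can emerge. The assumption that independent optimizations compose multiplicatively is mine, not the source's; real-world gains often overlap and stack sub-multiplicatively:

```python
# Illustrative cost-reduction factors from the examples in the text.
# Treating them as fully independent (an assumption) and multiplying
# shows how combined optimizations can approach a 10x+ jump.

optimizations = {
    "dense -> mixture-of-experts (e.g., Mixtral 8x7B)": 6.0,  # ~6x, 2024
    "FP16 -> FP8 quantization": 2.0,                          # ~2x
}

combined = 1.0
for name, factor in optimizations.items():
    combined *= factor
    print(f"{name}: {factor:.0f}x")

print(f"Combined (if fully independent): {combined:.0f}x")
# Combined (if fully independent): 12x
```

This is why "occasional 10-100x jumps" need not require a single breakthrough: two or three stacked improvements landing in the same model generation can produce one.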
The Unique Take
Forbes' framing assumes static compute costs, but the reality is a rapid deflation curve. The unique angle here is that the cost structure of AI deployment is not a fixed barrier but a rapidly declining one, making the 'compute vs. labor' calculus a moving target that favors compute over time. This has direct implications for enterprise adoption: the total cost of ownership (TCO) of AI agents will shrink faster than most business planners model.
Caveats
@kimmonismus does not provide specific benchmarks, model names, or timeframes for these cost drops. The claim is a general observation, not a formal analysis. The rate of improvement may vary by model family, hardware generation, and workload type. Inference cost reductions are not uniform across all capabilities; reasoning-heavy tasks like chain-of-thought or code generation may see slower gains than simple generation.
Key Takeaways
- @kimmonismus claims AI inference costs drop 5-10x yearly, challenging Forbes' static compute cost narrative.
- This deflation rate implies rapid TCO reduction for enterprise deployments.
What to watch
Watch for public pricing updates from major inference providers (Anthropic, OpenAI, Google) in Q3 2026 to validate the 5-10x annual deflation claim. Also monitor Groq's LPU pricing for evidence of 10-100x jumps in specific capabilities like real-time transcription.