gentic.news — AI News Intelligence Platform

token optimization

30 articles about token optimization in AI news

arXiv Survey Maps KV Cache Optimization Landscape: 5 Strategies for Million-Token LLM Inference

A comprehensive arXiv review categorizes five principal KV cache optimization techniques—eviction, compression, hybrid memory, novel attention, and combinations—to address the linear memory scaling bottleneck in long-context LLM inference. The analysis finds no single dominant solution, with optimal strategy depending on context length, hardware, and workload.

95% relevant
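
The linear memory scaling mentioned above can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a standard decoder-only Transformer that caches one key and one value vector per layer, per KV head, per token. The function name and the model dimensions are hypothetical illustrations and are not taken from the survey.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Estimate KV cache size in bytes for an uncompressed cache.

    Assumes two cached tensors (K and V) per layer and fp16/bf16
    storage by default (2 bytes per element). Illustrative only.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, fp16.
per_token = kv_cache_bytes(1, n_layers=32, n_kv_heads=8, head_dim=128)
million = kv_cache_bytes(1_000_000, n_layers=32, n_kv_heads=8, head_dim=128)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{million / 2**30:.1f} GiB for a 1M-token context")
```

Under these assumed dimensions the cache grows by a fixed amount per token, so doubling the context doubles the memory. Eviction and compression, two of the strategy families named in the survey summary, target this growth.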

Meta's REFRAG: The Optimization Breakthrough That Could Revolutionize RAG Systems

Meta's REFRAG introduces a novel optimization layer for RAG architectures that dramatically reduces computational overhead by selectively expanding compressed embeddings instead of tokenizing all retrieved chunks. This approach could make large-scale RAG deployments significantly more efficient and cost-effective.

85% relevant

vLLM Optimizations Cut Voice AI Latency by 40% on 6-GPU Cluster

vLLM optimizations on a 6-GPU cluster reduced voice AI latency by 40% for a Qwen-based system, enabling 500 concurrent sessions per node without hardware upgrades.

82% relevant

B200 PD Disaggregation Boosts Token Throughput 7x, Slashes Cost

B200 clusters running prefill/decode (PD) disaggregation over RoCEv2 Ethernet achieve 7x higher token throughput, cutting cost per million tokens by the same factor.

85% relevant

Doby Cuts Claude Code Navigation Tokens by 95% with Spec-First Workflow

Doby is a spec-first fix workflow that cuts Claude Code navigation tokens by 95% and enforces plan docs as the source of truth before any code change.

100% relevant

TACO Framework Cuts Agent Token Overhead 10% via Self-Evolving Compression

Researchers introduced TACO, a framework that enables terminal agents to automatically discover and refine context compression rules from their own interaction trajectories. This approach cuts token overhead by approximately 10% on benchmarks like TerminalBench and SWE-Bench Lite while preserving task accuracy.

87% relevant

Install token-ninja: The MCP Server That Saves Tokens on Common Shell Commands

A new MCP server, token-ninja, automatically runs simple shell commands locally instead of sending them to Claude, cutting token usage and speeding up your workflow.

100% relevant

Nvidia: Cost Per Token Is the Only AI Infrastructure Metric That Matters

Nvidia asserts that total cost of ownership for AI infrastructure must be measured in cost per delivered token, not raw compute metrics. This shift is critical for scaling profitable agentic AI applications.

80% relevant

Google's Gemma 4B Model Runs on Nintendo Switch at 1.5 Tokens/Second

A developer successfully ran Google's 4-billion parameter Gemma language model on a Nintendo Switch, achieving 1.5 tokens/second inference. This demonstrates the increasing feasibility of running small LLMs on consumer-grade edge hardware.

89% relevant

Code-Review-Graph Cuts Claude Token Usage 8.2x with Local Knowledge Graph

A developer released 'code-review-graph,' an open-source tool that uses Tree-sitter to build a persistent structural map of a codebase. This allows Claude to read only relevant files, cutting average token usage by 8.2x across six real repositories.

95% relevant

Gemma 4 26B A4B Hits 45.7 tokens/sec Decode Speed on MacBook Air via MLX Community

A community benchmark shows the Gemma 4 26B A4B model running at 45.7 tokens/sec decode speed on a MacBook Air using the MLX framework. This highlights rapid progress in efficient local deployment of mid-size language models on consumer Apple Silicon.

93% relevant

CLAUDE.md Promises 63% Reduction in Claude Output Tokens with Drop-in Prompt File

A new prompt engineering file called CLAUDE.md claims to reduce Claude's output token usage by 63% without code changes. The drop-in file aims to make Claude's code generation more efficient by structuring its responses.

87% relevant

DACT: A New Framework for Drift-Aware Continual Tokenization in Generative Recommender Systems

Researchers propose DACT, a framework to adapt generative recommender systems to evolving user behavior and new items without costly full retraining. It identifies 'drifting' items and selectively updates token sequences, balancing stability with plasticity. This addresses a core operational challenge for real-world, dynamic recommendation engines.

86% relevant

Fireworks AI Launches 'Fire Pass' with Kimi K2.5 Turbo at 250 Tokens/Second

Fireworks AI has launched a new 'Fire Pass' subscription offering access to Kimi K2.5 Turbo at speeds up to 250 tokens/second. The service includes a free trial followed by a $7 weekly subscription.

85% relevant

ReDiPrune: Training-Free Token Pruning Before Projection Boosts MLLM Efficiency 6x, Gains 2% Accuracy

Researchers propose ReDiPrune, a plug-and-play method that prunes visual tokens before the vision-language projector in multimodal LLMs. On EgoSchema with LLaVA-NeXT-Video-7B, it achieves a +2.0% accuracy gain while reducing computation by over 6× in TFLOPs.

79% relevant

Tamp Compression Proxy Cuts Claude Code Token Usage 52% — Zero Code Changes

Run a local proxy that automatically compresses Claude Code's API calls, cutting token usage in half without modifying your workflow.

87% relevant

Stop Claude Code's Web Fetches from Burning 700K Tokens on HTML Junk

A new MCP server, token-enhancer, strips scripts, nav bars, and ads from web pages before they hit Claude's context, cutting token waste by 90%+.

84% relevant
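
The general idea of stripping non-content markup before a page reaches the model's context can be sketched with Python's standard-library `html.parser`. This is a minimal illustration of the technique, not token-enhancer's actual implementation. The `SKIP_TAGS` set, the `TextExtractor` class, and the `strip_page` helper are all hypothetical names. Ads are not detectable by tag name alone, so this sketch only drops scripts, styles, and common navigation containers.

```python
from html.parser import HTMLParser

# Hypothetical choice of tags whose contents are treated as non-content.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "noscript"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def strip_page(html):
    """Return only the text outside skipped elements."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><script>var x = 1;</script></head><body>"
    "<nav><a href='/'>Home</a></nav><p>Hello world</p>"
    "<footer>Site footer</footer></body></html>"
)
print(strip_page(page))
```

A production tool would likely need a more tolerant parser and site-specific rules; the point here is only that removing markup and boilerplate elements shrinks the text that gets tokenized.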

Stop Wasting Tokens in Your CLAUDE.md: The Layered Configuration System

Separate global, project, and file-type rules into different CLAUDE.md files to cut token waste and make Claude Code more effective.

95% relevant

Graph Tokenization: A New Method to Apply Transformers to Graph Data

Researchers propose a framework that converts graph-structured data into sequences using reversible serialization and BPE tokenization. This enables standard Transformers like BERT to achieve state-of-the-art results on graph benchmarks, outperforming specialized graph models.

70% relevant

Decoding the First Token Fixation: How LLMs Develop Structural Attention Biases

New research reveals how large language models develop 'attention sinks'—disproportionate focus on the first input token—through a simple circuit mechanism that emerges early in training. This structural bias has significant implications for model interpretability and performance.

75% relevant

HyperTokens Break the Forgetting Cycle: A New Architecture for Continual Multimodal AI Learning

Researchers introduce HyperTokens, a transformer-based system that generates task-specific tokens on demand for continual video-language learning. This approach dramatically reduces catastrophic forgetting while maintaining fixed memory costs, enabling AI models to learn sequentially without losing previous knowledge.

75% relevant

CompACT AI Tokenizer Revolutionizes Robotic Planning with 8-Token Compression

Researchers have developed CompACT, a novel AI tokenizer that compresses visual observations into just 8 tokens for robotic planning systems. This breakthrough enables 40x faster planning while maintaining competitive accuracy, potentially transforming real-time robotic control applications.

85% relevant

Headroom AI: The Open-Source Context Optimization Layer That Could Revolutionize Agent Efficiency

Headroom AI introduces a zero-code context optimization layer that compresses LLM inputs by 60-90% while preserving critical information. This open-source proxy solution could dramatically reduce costs and improve performance for AI agents.

95% relevant

OpenAI's GPT-5.4: The Million-Token Context Window That Changes Everything

OpenAI's upcoming GPT-5.4 will feature a groundbreaking 1 million token context window, matching competitors like Gemini and Claude. The model introduces an 'Extreme reasoning mode' for complex tasks and represents a shift toward monthly updates.

95% relevant

Support Tokens: The Hidden Mathematical Structure Making LLMs More Robust

Researchers have discovered a surprising mathematical constraint in transformer attention mechanisms that reveals a 'support token' structure similar to support vector machines. This insight enables a simple but powerful training modification that improves LLM robustness without sacrificing performance.

75% relevant

Diffusion Architecture Breaks Speed Barrier: Inception's Mercury 2 Hits 1,000 Tokens/Second

Inception's Mercury 2 generates text at 1,000 tokens per second using a diffusion architecture borrowed from image generation. This represents a 10x speed advantage over leading models like Claude 4.5 Haiku and GPT-5 Mini without requiring custom hardware.

95% relevant

Beyond the Token Limit: How Claude Opus 4.6's Architectural Breakthrough Enables True Long-Context Reasoning

Anthropic's Claude Opus 4.6 represents a fundamental shift in large language model architecture, moving beyond simple token expansion to create genuinely autonomous reasoning systems. The breakthrough enables practical use of million-token contexts through novel memory management and hierarchical processing.

70% relevant

DeepSeek v4 Pricing Cuts 75%: $0.43/M Tokens In

DeepSeek v4 API pricing permanently cut 75% to $0.43/M input, $0.87/M output, enabled by 27% compute and 10% cache vs v3.2.

100% relevant

GR4AD: Kuaishou's Production-Ready Generative Recommender for Ads Delivers 4.2% Revenue Lift

Researchers from Kuaishou present GR4AD, a generative recommendation system designed for high-throughput ad serving. It introduces innovations in tokenization (UA-SID), decoding (LazyAR), and optimization (RSPO) to balance performance with cost. Online A/B tests on 400M users show a 4.2% ad revenue improvement.

95% relevant

Meta-Harness Framework Automates AI Agent Engineering, Achieves 6x Performance Gap on Same Model

A new framework called Meta-Harness automates the optimization of AI agent harnesses—the system prompts, tools, and logic that wrap a model. By analyzing raw failure logs at scale, it improved text classification by 7.7 points while using 4x fewer tokens, demonstrating that harness engineering is a major leverage point as model capabilities converge.

91% relevant