gentic.news — AI News Intelligence Platform
token efficiency

30 articles about token efficiency in AI news

OpenAI Engineer Processed 210B Tokens, Sparking AI Efficiency Debate

An OpenAI engineer processed 210 billion tokens in one week, equivalent to 33 Wikipedia-sized datasets. This extreme usage spotlights a growing trend where high AI consumption by engineers leads to a 10x cost increase and a high volume of discarded code.

85% relevant

ReDiPrune: Training-Free Token Pruning Before Projection Boosts MLLM Efficiency 6x, Gains 2% Accuracy

Researchers propose ReDiPrune, a plug-and-play method that prunes visual tokens before the vision-language projector in multimodal LLMs. On EgoSchema with LLaVA-NeXT-Video-7B, it achieves a +2.0% accuracy gain while reducing computation by over 6× in TFLOPs.

79% relevant

How to Cut Claude Code's Token Costs 32% by Fixing Its Navigation Problem

Claude Code agents waste tokens on grep-style navigation. A new open-source tool gives them IDE-like navigation, cutting costs 32% and doubling efficiency.

92% relevant

NVIDIA's Nemotron 3 Super: The Efficiency-First AI Model Redefining Performance Benchmarks

NVIDIA unveils Nemotron 3 Super, a 120B parameter model with only 12B active parameters using hybrid Mamba-Transformer MoE architecture. It achieves 1M token context, beats GPT-OSS-120B on intelligence metrics, and offers configurable reasoning modes for optimal compute efficiency.

95% relevant

MCP vs CLI: When to Skip MCP Servers and Save 37% on Tokens

Benchmarks show MCP servers can add 37% more input tokens vs. direct CLI commands. Learn when to use CLI for efficiency and when MCP's structure is worth the cost.

95% relevant

Glean benchmark: Off-the-shelf MCP costs 30% more tokens than indexed context

In Glean's benchmark, off-the-shelf MCP in Claude Cowork lost 2.5x more tasks and used 30% more tokens than indexed context.

88% relevant

How Andrej Karpathy's CLAUDE.md Guidelines Save Millions of Tokens — and

Andrej Karpathy's CLAUDE.md patterns cut token waste by 40%+. Copy his exact config to slash costs and speed up Claude Code.

86% relevant

Apple Releases DFNDR-12M Dataset, Claims 5x CLIP Training Efficiency

Apple has open-sourced DFNDR-12M, a multimodal dataset of 12.8 million image-text pairs with synthetic captions and pre-computed embeddings. The company claims it enables up to 5x training efficiency over standard CLIP datasets.

85% relevant

TACO Framework Cuts Agent Token Overhead 10% via Self-Evolving Compression

Researchers introduced TACO, a framework that enables terminal agents to automatically discover and refine context compression rules from their own interaction trajectories. This approach cuts token overhead by approximately 10% on benchmarks like TerminalBench and SWE-Bench Lite while preserving task accuracy.

87% relevant

Install token-ninja: The MCP Server That Saves Tokens on Common Shell Commands

A new MCP server, token-ninja, automatically runs simple shell commands locally instead of sending them to Claude, cutting token usage and speeding up your workflow.

100% relevant

Anthropic's Adaptive Thinking: A Compute-Constrained Efficiency Play

Analysis suggests Anthropic's new 'adaptive thinking' feature is a direct response to compute constraints and competitive pressure from OpenAI, aiming to optimize token usage for enterprise clients at the potential cost of consumer experience.

87% relevant

Meta Employee Builds 'Claudeonomics' Dashboard for Internal AI Token Competition

A Meta employee built an internal dashboard called 'Claudeonomics' that ranks coworkers by their usage of company AI tokens, creating a gamified competition and providing a novel view into internal AI tool adoption patterns.

75% relevant

Nvidia: Cost Per Token Is the Only AI Infrastructure Metric That Matters

Nvidia asserts that total cost of ownership for AI infrastructure must be measured in cost per delivered token, not raw compute metrics. This shift is critical for scaling profitable agentic AI applications.

80% relevant
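
The cost-per-delivered-token framing in the item above reduces to simple arithmetic, sketched here. The function name, the `utilization` parameter, and all the numbers are illustrative assumptions, not figures from Nvidia or the article.

```python
def cost_per_million_tokens(
    hourly_cost_usd: float,
    tokens_per_second: float,
    utilization: float = 1.0,
) -> float:
    """Cost in USD to deliver one million tokens.

    hourly_cost_usd: all-in hourly cost of the serving hardware (assumed input).
    tokens_per_second: sustained output throughput at full load.
    utilization: fraction of each hour spent serving traffic (0 < u <= 1).
    """
    if tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("throughput must be positive and utilization in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical comparison: a pricier system can still win on cost per token
# if its delivered throughput is proportionally higher.
cheap = cost_per_million_tokens(hourly_cost_usd=2.0, tokens_per_second=500)
fast = cost_per_million_tokens(hourly_cost_usd=4.0, tokens_per_second=2000)
print(f"cheap: ${cheap:.2f}/M tokens, fast: ${fast:.2f}/M tokens")
```

Under these made-up inputs the system with twice the hourly cost delivers tokens at half the price, which is the kind of comparison raw compute metrics would hide.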

Claude-Mem Plugin Adds Persistent Memory to Claude Code, Cuts Token Use 10x

Developer Akshay Pachaar released Claude-Mem, a free plugin that adds persistent memory across Claude Code sessions. It captures tool usage and implements a 3-layer retrieval system, reducing token use by up to 10x.

85% relevant

IAT: Instance-As-Token Compression for Historical User Sequence Modeling

Researchers propose Instance-As-Token (IAT), which compresses all features of each historical interaction into a unified embedding token, then applies standard sequence modeling. This approach outperforms state-of-the-art methods and has been deployed in e-commerce advertising, shopping mall marketing, and live-streaming e-commerce with substantial business metric improvements.

93% relevant

Google's Gemma 4B Model Runs on Nintendo Switch at 1.5 Tokens/Second

A developer successfully ran Google's 4-billion parameter Gemma language model on a Nintendo Switch, achieving 1.5 tokens/second inference. This demonstrates the increasing feasibility of running small LLMs on consumer-grade edge hardware.

89% relevant

Code-Review-Graph Cuts Claude Token Usage 8.2x with Local Knowledge Graph

A developer released 'code-review-graph,' an open-source tool that uses Tree-sitter to build a persistent structural map of a codebase. This allows Claude to read only relevant files, cutting average token usage by 8.2x across six real repositories.

95% relevant

Survey Paper 'The Latent Space' Maps Evolution from Token Generation to Latent Computation in Language Models

Researchers have published a comprehensive survey charting the evolution of language model architectures from token-level autoregression to methods that perform computation in continuous latent spaces. This work provides a unified framework for understanding recent advances in reasoning, planning, and long-context modeling.

85% relevant

Late Interaction Retrieval Models Show Length Bias, MaxSim Operator Efficiency Confirmed in New Study

New arXiv research analyzes two dynamics in Late Interaction retrieval models: a documented length bias in scoring and the efficiency of the MaxSim operator. Findings validate theoretical concerns and confirm the pooling method's effectiveness, with implications for high-precision search systems.

72% relevant
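
For readers unfamiliar with the MaxSim operator mentioned above, here is a minimal sketch of the usual ColBERT-style definition: each query token takes its best match among the document tokens, and those maxima are summed. The summary does not give the study's exact formulation, so the dot-product similarity and the toy vectors are assumptions for illustration.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best similarity to any document token."""
    if not doc_tokens:
        return 0.0
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy 2-D embeddings (hypothetical values).
query = [(1.0, 0.0), (0.0, 1.0)]
short_doc = [(1.0, 0.0)]
long_doc = [(1.0, 0.0), (0.0, 1.0)]

print(maxsim_score(query, short_doc))  # 1.0
print(maxsim_score(query, long_doc))   # 2.0
```

Appending tokens to a document can only keep or raise each per-query maximum, never lower it. That is one plausible intuition for a length bias, though the study may characterize the effect differently.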

Stop Claude Code's Web Fetches from Burning 700K Tokens on HTML Junk

A new MCP server, token-enhancer, strips scripts, nav bars, and ads from web pages before they hit Claude's context, cutting token waste by 90%+.

84% relevant
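
The general idea behind the item above, dropping scripts, navigation, and similar page chrome before the text reaches a model's context, can be sketched with Python's standard-library HTML parser. This is not token-enhancer's implementation; the tag list and the function names are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as non-content chrome (an assumed list).
SKIPPED_TAGS = {"script", "style", "noscript", "nav", "aside", "footer"}


class TextExtractor(HTMLParser):
    """Collects visible text while ignoring anything inside SKIPPED_TAGS."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)


def strip_html(html: str) -> str:
    """Return the page's visible text with skipped elements removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><script>var x = 1;</script></head><body>"
    "<nav><a href='/'>Home</a></nav><p>Hello world</p>"
    "<footer>Ads</footer></body></html>"
)
print(strip_html(page))  # Hello world
```

A production tool would likely need more than a fixed tag list, for example heuristics for ad containers identified by class names, but the sketch shows why the savings can be large: most of a page's bytes are markup and chrome rather than readable text.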

Fractal Emphasizes LLM Inference Efficiency as Generative AI Moves to Production

AI consultancy Fractal highlights the critical shift from generative AI experimentation to production deployment, where inference efficiency—cost, latency, and scalability—becomes the primary business constraint. This marks a maturation phase where operational metrics trump model novelty.

76% relevant

Memory Sparse Attention (MSA) Enables 100M Token Context Windows with Minimal Performance Loss

Memory Sparse Attention (MSA) is a proposed architecture that allows AI models to store and reason over massive long-term memory directly within their attention mechanism, eliminating the need for external retrieval systems. The approach reportedly enables context windows of up to 100 million tokens with minimal performance degradation.

85% relevant

Stop Wasting Tokens in Your CLAUDE.md: The Layered Configuration System

Separate global, project, and file-type rules into different CLAUDE.md files to cut token waste and make Claude Code more effective.

95% relevant

Claude Code's Secret Weapon: How the /btw Command Saves Tokens and Keeps You in Flow

Use the /btw command to ask quick, contextual questions without resetting your main task's conversation, saving tokens and preventing workflow interruptions.

95% relevant

Motif CLI: Track Your Claude Code Efficiency with Real-Time AIPM Dashboard

Install Motif CLI to analyze your Claude Code chat history, track AI tokens per minute, and generate personal coding assessments—all locally.

86% relevant

How Adding 'Skills' to MCP Tools Cuts Agent Token Usage by 87%

Adding structured 'skills' descriptions to MCP tools dramatically reduces token consumption in custom agents—here's how to implement it in your Claude Code workflows.

95% relevant

Kimi's Selective Layer Communication Improves Training Efficiency by ~25% with Minimal Inference Overhead

Kimi has developed a method that replaces uniform residual connections with selective information routing between layers in deep AI models. This improves training stability and achieves ~25% better compute efficiency with negligible inference slowdown.

87% relevant

Anthropic's Pricing Revolution: Million-Token Context Now Standard for Claude AI

Anthropic has eliminated the 5x surcharge for million-token contexts in Claude 3 Opus and Claude 3.5 Sonnet, making long-context AI dramatically more affordable. This pricing overhaul removes barriers for developers analyzing large documents, codebases, and datasets.

95% relevant

Three Research Frontiers in Recommender Systems: From Agent-Driven Reports to Machine Unlearning and Token-Level Personalization

Three arXiv papers advance recommender systems: RecPilot proposes agent-generated research reports instead of item lists; ERASE establishes a practical benchmark for machine unlearning; PerContrast improves LLM personalization via token-level weighting. These address core UX, compliance, and personalization challenges.

92% relevant

CompACT AI Tokenizer Revolutionizes Robotic Planning with 8-Token Compression

Researchers have developed CompACT, a novel AI tokenizer that compresses visual observations into just 8 tokens for robotic planning systems. This breakthrough enables 40x faster planning while maintaining competitive accuracy, potentially transforming real-time robotic control applications.

85% relevant