model routing
30 articles about model routing in AI news
CostRouter Emerges as Smart AI Gateway, Cutting API Expenses by 60% Through Intelligent Model Routing
A new API gateway called CostRouter analyzes request complexity and automatically routes queries to the cheapest capable AI model, saving developers up to 60% on API costs while maintaining quality thresholds.
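The cost-routing idea behind gateways like CostRouter can be sketched as a tiered dispatch: score each request's complexity, then pick the cheapest model whose capability ceiling covers it. The model names, prices, and complexity heuristic below are illustrative assumptions, not CostRouter's actual implementation:

```python
# Minimal sketch of complexity-based cost routing (hypothetical model
# names, prices, and heuristic; not CostRouter's actual logic).
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    cost_per_1k_tokens: float  # USD, illustrative
    max_complexity: int        # highest complexity score this tier handles

# Tiers ordered cheapest-first; the router returns the first capable one.
TIERS = [
    ModelTier("small-fast", 0.0005, max_complexity=3),
    ModelTier("mid-range", 0.003, max_complexity=7),
    ModelTier("frontier", 0.015, max_complexity=10),
]

def score_complexity(prompt: str) -> int:
    """Crude heuristic: longer prompts and reasoning keywords raise the score."""
    score = min(len(prompt) // 200, 5)
    for kw in ("prove", "refactor", "multi-step", "analyze"):
        if kw in prompt.lower():
            score += 2
    return min(score, 10)

def route(prompt: str) -> ModelTier:
    """Return the cheapest tier whose ceiling covers the prompt's complexity."""
    c = score_complexity(prompt)
    for tier in TIERS:
        if c <= tier.max_complexity:
            return tier
    return TIERS[-1]  # fall back to the most capable tier

print(route("What is the capital of France?").name)  # -> small-fast
print(route("Prove this theorem and analyze the multi-step plan. " * 20).name)  # -> frontier
```

Real gateways would replace the keyword heuristic with a learned classifier and add quality-threshold checks, but the cheapest-capable-first loop is the core of the claimed savings.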
Claude Code Reverse-Engineered: 98.4% of Codebase is Operational Harness
A reverse-engineering analysis of Claude Code reveals only 1.6% of its codebase is AI decision logic, with the rest being operational infrastructure. This challenges current agent design paradigms by prioritizing a robust deterministic harness over complex model routing.
oh-my-claudecode: Open-Source Multi-Agent Orchestration Layer for Claude Code Boosts Speed 3-5x
Developer hasantoxr released oh-my-claudecode, an open-source orchestration layer that adds five execution modes and 32 specialized agents to Claude Code, reportedly delivering 3-5x faster output with automated model routing between Haiku and Opus.
OpenDev Paper Formalizes the Architecture for Next-Generation Terminal AI Coding Agents
A comprehensive 81-page research paper introduces OpenDev, a systematic framework for building terminal-based AI coding agents. The work details specialized model routing, dual-agent architectures, and safety controls that address reliability challenges in autonomous coding systems.
R³AG: A New Routing Framework That Matches Queries to Retrievers
R³AG is a novel routing framework that dynamically selects the optimal retriever for each query in RAG systems, considering not just relevance but also how well the retrieved document helps the generator produce correct answers. It uses contrastive learning to model query-specific preferences, consistently outperforming existing methods on knowledge-intensive tasks.
Agno v2: An Open-Source Framework for Intelligent Multi-LLM Routing
Agno v2 is an open-source framework that enables developers to build a production-ready chat application with intelligent routing. It automatically selects the cheapest LLM capable of handling each user query, optimizing cost and performance.
98× Faster LLM Routing Without a Dedicated GPU: Technical Breakthrough for vLLM Semantic Router
New research presents a three-stage optimization pipeline for the vLLM Semantic Router, achieving 98× speedup and enabling long-context classification on shared GPUs. This solves critical memory and latency bottlenecks for system-level LLM routing.
Beyond Euclidean Distances: How Asymmetric Routing AI Can Optimize Luxury Logistics and Last-Mile Delivery
RADAR introduces a neural framework that solves real-world asymmetric vehicle routing problems, crucial for optimizing luxury goods delivery, store replenishment, and client appointment scheduling in complex urban environments.
Plano AI Proxy Promises 50% Cost Reduction by Intelligently Routing LLM Queries
Plano, an open-source AI proxy powered by the 1.5B parameter Arch-Router model, automatically directs prompts to optimal LLMs based on complexity, potentially halving inference costs while adding orchestration and safety layers.
New MoE Framework Tames User Interest Shifts in Long-Sequence Recommendations
Researchers propose MoS, a model-agnostic MoE approach that handles long user sequences by detecting session hopping, where user interests shift across sessions. A theme-aware routing mechanism filters irrelevant sessions, while multi-scale fusion captures both global and local patterns. Results show state-of-the-art performance on benchmarks with fewer FLOPs than alternatives.
Kimi's Selective Layer Communication Improves Training Efficiency by ~25% with Minimal Inference Overhead
Kimi has developed a method that replaces uniform residual connections with selective information routing between layers in deep AI models. This improves training stability and achieves ~25% better compute efficiency with negligible inference slowdown.
New AI Research: Cluster-Aware Attention-Based Deep RL for Pickup and Delivery Problems
Researchers propose CAADRL, a deep reinforcement learning framework that explicitly models clustered spatial layouts to solve complex pickup and delivery routing problems more efficiently. It matches state-of-the-art performance with significantly lower inference latency.
Beyond Homogenization: How Expert Divergence Learning Unlocks MoE's True Potential
Researchers have developed Expert Divergence Learning, a novel pre-training strategy that combats expert homogenization in Mixture-of-Experts language models. By encouraging functional specialization through domain-aware routing, the method improves performance across benchmarks with minimal computational overhead.
How Claude Code's Upstream Proxy Solves Corporate Network Headaches
Claude Code's CCR feature transparently routes subprocess HTTP traffic through a secure WebSocket tunnel, handling corporate MITM certificates and complex network routing automatically.
The Claude OAuth Workaround Is Dead. Here's How to Cut Your Claude Code API Bill Today
Anthropic killed the OAuth token exploit. Use TeamoRouter's 50% discount and multi-provider routing to slash Claude Code costs without crypto.
How to Configure Claude Code's Sub-Agent Orchestration for Parallel, Sequential, and Background Work
Add routing rules to your CLAUDE.md to make your central AI delegate tasks intelligently—parallel for independent domains, sequential for dependencies, background for research.
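A minimal sketch of what such routing rules could look like in a CLAUDE.md file; the section heading, wording, and agent roles below are illustrative assumptions, not an official Anthropic schema:

```markdown
## Sub-agent routing rules (illustrative)

- **Parallel:** when tasks touch independent domains (e.g. frontend styling
  and backend API changes), spawn separate sub-agents and run them concurrently.
- **Sequential:** when one task's output feeds the next (e.g. a schema change
  before the migration that depends on it), chain sub-agents in order.
- **Background:** for open-ended research or long-running lookups, delegate to
  a background sub-agent and continue the main task without blocking.
```

Because CLAUDE.md is plain instructions read by the orchestrating model, the exact phrasing matters more than any fixed syntax.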
vLLM Semantic Router: A New Approach to LLM Orchestration Beyond Simple Benchmarks
The article critiques current LLM routing benchmarks as solving only the easy part, introducing vLLM Semantic Router as a comprehensive solution for production-grade LLM orchestration with semantic understanding.
Switchcraft Router Cuts Agentic AI Inference Cost 84%, Matches Top Model
Switchcraft, a DistilBERT-based model router for agentic tool calling, achieves 82.9% accuracy while cutting inference cost by 84%, saving over $3,600 per million queries.
3 Ways to Switch Claude Code Models Instantly: /model, --flag, and ENV Variables
Anthropic's official guide reveals three methods to switch Claude Code models: /model command, --model flag, and ANTHROPIC_MODEL env variable. Choose the right model for each task.
Free-Claude-Code Proxy Routes Anthropic API to Free NVIDIA NIM Models
A developer released free-claude-code, a proxy that intercepts Claude Code's API calls and routes them to free NVIDIA NIM endpoints, unlocking free access to models like Kimi K2 and GLM 4.7. This bypasses Anthropic's subscription fees and adds remote execution via a Telegram bot.
Alibaba Makes Qwen 3.6 Plus API-Only, Shifts Frontier Model to Paid Access
Alibaba has moved its most capable Qwen 3.6 Plus model to API-only access, while keeping the smaller Qwen 3.6 free. This aligns the company's strategy with OpenAI, Anthropic, and Google's paid frontier model approach.
MiniMax M2.7 Tops Open LLM Leaderboard with 230B Parameter Sparse Model
MiniMax announced its M2.7 model has taken the top spot on the Hugging Face Open LLM Leaderboard. The model uses a sparse mixture-of-experts architecture with 230B total parameters but only activates 10B per token.
How Downgrading to Claude Code 2.1.106 Fixes Model Reasoning Issues
Developers report improved model reasoning after downgrading to Claude Code 2.1.106 and disabling the Claude Agent feature in global settings.
AI Models Dumber as Compute Shifts to Enterprise, Users Report
Users report noticeable performance degradation in major AI models this month. Analysts suggest providers are shifting computational resources to prioritize enterprise clients over general subscribers.
Zuckerberg: Most Businesses Will Run Custom AI Layers, Not Frontier Models
Mark Zuckerberg predicts most businesses will not own frontier AI models but will build customized operational layers on top of shared models to handle support, sales, and operations. This vision positions foundation models as infrastructure, with value captured in the business-specific layer.
OpenAI Voice Mode Uses Older, Weaker Model, Not GPT-4o
OpenAI's conversational voice mode runs not on the latest GPT-4o but on a much older, weaker model, creating a disconnect between user perception and technical reality.
Mistral AI Teases 'New Model Tomorrow' in Cryptic Tweet
Mistral AI co-founder Arthur Mensch tweeted 'new model tomorrow!?!', signaling an imminent release. This follows their pattern of rapid, often surprise, model deployments.
DeepSeek-V4 Rumored as 'Whale' Returns, Signaling Major Model Release
DeepSeek's cryptic 'whale' codename has reappeared, strongly hinting at the impending launch of DeepSeek-V4. This follows the company's pattern of using the whale symbol before major model releases.
DeepSeek's R1 Model Triggers Major AI Market Valuation Shifts
Chinese AI startup DeepSeek has released its new large language model R1, causing significant market disruption. The launch reportedly reduced tech giant valuations by approximately one trillion dollars as the model demonstrates competitive capabilities at lower costs.
Gamma 31B Model Reportedly Outperforms Qwen 3.5 397B, Highlighting Efficiency Leap
A developer's social media post claims the Gamma 31B model outperforms the much larger Qwen 3.5 397B. If verified, this would represent a dramatic efficiency gain in large language model scaling.