Switchcraft, a DistilBERT-based model router, cuts agentic AI inference cost by 84% while matching the accuracy of the best individual model, reaching 82.9%. The arXiv preprint, submitted May 8, 2026, targets an overlooked problem: the cost of tool calling in agentic systems.
Key facts
- Switchcraft reduces inference cost by 84%.
- Saves over $3,600 per million queries.
- Matches or exceeds the best individual model, reaching 82.9% accuracy across five function-calling benchmarks.
- DistilBERT-based classifier deployed under a latency budget.
- Larger models don't consistently outperform smaller ones on tool tasks.
Agentic AI systems that invoke external tools are powerful but expensive, which leads developers to default to large models and overspend their inference budgets. Existing model routers are designed for chat completion, not tool use, and Switchcraft is built to fill that gap.
Switchcraft operates inline, selecting for each request the lowest-cost model expected to handle it correctly. The authors built an evaluation framework on five function-calling benchmarks and trained a DistilBERT-based classifier, deployed under a latency budget. The router achieves 82.9% accuracy, matching or exceeding the best individual model, while reducing inference cost by 84% and saving over $3,600 per million queries [per the arXiv preprint].
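The preprint's exact decision rule is not described here, so the following is only a minimal sketch of "cheapest model subject to correctness," assuming the classifier emits a per-model probability that the model will produce a correct tool call and that each model has a known expected cost per query. The names `Candidate`, `route`, and `threshold`, as well as the model names and all numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str              # hypothetical model identifier
    cost_per_query: float  # hypothetical expected cost in dollars
    p_correct: float       # hypothetical classifier estimate of a correct tool call


def route(candidates: list[Candidate], threshold: float = 0.5) -> Candidate:
    """Pick the cheapest model whose predicted correctness meets the threshold.

    If no model clears the threshold, fall back to the one the classifier
    considers most likely to be correct.
    """
    if not candidates:
        raise ValueError("need at least one candidate model")
    eligible = [c for c in candidates if c.p_correct >= threshold]
    if eligible:
        return min(eligible, key=lambda c: c.cost_per_query)
    return max(candidates, key=lambda c: c.p_correct)


# Hypothetical per-query scores for one incoming tool-calling request.
pool = [
    Candidate("large-model", cost_per_query=0.0040, p_correct=0.90),
    Candidate("mid-model", cost_per_query=0.0012, p_correct=0.78),
    Candidate("small-model", cost_per_query=0.0003, p_correct=0.41),
]
print(route(pool).name)  # mid-model
```

Raising `threshold` trades savings for safety: at 0.95 no model qualifies and the sketch falls back to `large-model`, while at 0.3 every model qualifies and it picks `small-model`.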
Unique take: The paper finds that larger models do not consistently outperform smaller ones on tool-use tasks, and that nominally cheaper models can incur higher total cost due to token-intensive reasoning. This contradicts the prevailing assumption that bigger models are always better for agentic tasks, and suggests that model selection for tool calling requires a dedicated router rather than a general-purpose one.
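To make the total-cost point concrete, here is a worked example with invented prices and token counts (not figures from the paper). Per-call cost is input tokens times the input price plus output tokens times the output price, so a model charging a quarter of the per-token price can still cost more per call if it emits far more reasoning tokens. The function `query_cost` and its parameters are hypothetical.

```python
def query_cost(input_tokens: int, output_tokens: int,
               in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one call, given prices per million tokens."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000


# Hypothetical "cheap" model: low per-token prices, but long reasoning traces.
cheap = query_cost(input_tokens=2_000, output_tokens=8_000,
                   in_price_per_m=0.50, out_price_per_m=2.00)

# Hypothetical "pricey" model: 4x the per-token prices, but terse output.
pricey = query_cost(input_tokens=2_000, output_tokens=500,
                    in_price_per_m=2.00, out_price_per_m=8.00)

print(f"cheap: ${cheap:.4f}, pricey: ${pricey:.4f}")  # cheap: $0.0170, pricey: $0.0080
```

In a multi-step agent loop this gap compounds with every tool call, which is why comparing list prices alone can mislead.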
Switchcraft builds on prior work in model routing for chat completion but is, to the authors' knowledge, the first router optimized for agentic tool calling. Because the DistilBERT classifier adds minimal latency, it is suitable for inline deployment in production agentic pipelines.
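One way an inline router could honor a latency budget, not necessarily how Switchcraft does it, is to bound the classifier call with a timeout and fall back to a default model when the budget is missed. This sketch assumes the classifier is a plain Python callable returning a model name; `route_with_budget`, `budget_s`, `fallback`, and the toy classifiers are hypothetical.

```python
import concurrent.futures
import time
from typing import Callable

# Shared worker pool so each request does not pay thread start-up cost.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def route_with_budget(classify: Callable[[str], str], query: str,
                      budget_s: float, fallback: str) -> str:
    """Return the classifier's model choice, or `fallback` if it misses the budget.

    Note: a timed-out classifier call keeps running in its worker thread;
    only the caller stops waiting for it.
    """
    future = _executor.submit(classify, query)
    try:
        return future.result(timeout=budget_s)
    except concurrent.futures.TimeoutError:
        return fallback


def fast_classifier(query: str) -> str:
    return "small-model"


def slow_classifier(query: str) -> str:
    time.sleep(0.5)  # simulate a classifier that blows the budget
    return "small-model"


print(route_with_budget(fast_classifier, "get weather", 1.0, "large-model"))  # small-model
print(route_with_budget(slow_classifier, "get weather", 0.05, "large-model"))  # large-model
```

Falling back to the strongest model keeps accuracy when the router is slow, at the price of losing that request's savings; a small, fast classifier such as DistilBERT makes that fallback rare.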
Broader context: This paper joins a wave of research focused on cost-efficient agentic AI. Earlier this month, SAE-based probes for predicting agent tool failures were posted to arXiv [2026-05-07]. Switchcraft complements that work by addressing the cost side of the equation rather than failure prediction.
What to watch
Watch for an open-source release of Switchcraft's DistilBERT classifier and evaluation framework, which would enable community replication and extension to more tool-calling benchmarks. Also watch for adoption in agent frameworks and platforms such as LangChain or AutoGPT.