speculative decoding
30 articles about speculative decoding in AI news
NVIDIA NeMo RL Speculative Decoding: 1.8× Rollout Speed at 8B
NVIDIA's NeMo RL speculative decoding achieves 1.8× rollout speedup at 8B and projects 2.5× at 235B, cutting RL training time by over half.
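How a rollout speedup translates into end-to-end training time depends on how much of each RL step is spent in rollouts. The sketch below applies Amdahl's law to the 1.8× and 2.5× figures above. The rollout fractions are hypothetical values chosen for illustration, not numbers from NVIDIA, and the function names are made up for this example.

```python
def overall_speedup(rollout_fraction: float, rollout_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only the rollout phase is accelerated."""
    if not 0.0 <= rollout_fraction <= 1.0:
        raise ValueError("rollout_fraction must be in [0, 1]")
    if rollout_speedup <= 0:
        raise ValueError("rollout_speedup must be positive")
    return 1.0 / ((1.0 - rollout_fraction) + rollout_fraction / rollout_speedup)


def time_saved(speedup: float) -> float:
    """Fraction of wall-clock time removed by a given overall speedup."""
    return 1.0 - 1.0 / speedup


# Rollout speedups quoted in the summary; rollout fractions are assumptions.
for s in (1.8, 2.5):
    for f in (0.5, 0.7, 1.0):
        total = overall_speedup(f, s)
        print(f"rollout {s}x, fraction {f:.0%}: "
              f"overall {total:.2f}x, time saved {time_saved(total):.0%}")
```

Under these assumptions, training time falls by more than half only when the overall speedup exceeds 2×, which requires both the higher rollout speedup and a rollout-dominated step.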
PayPal Cuts LLM Inference Cost 50% with EAGLE3 Speculative Decoding on H100
PayPal engineers applied EAGLE3 speculative decoding to their fine-tuned 8B-parameter commerce agent, achieving up to 49% higher throughput and 33% lower latency. This allowed a single H100 GPU to match the performance of two H100s running NVIDIA NIM, cutting inference hardware cost by 50%.
DFlash Brings Speculative Decoding to Apple Silicon via MLX
DFlash, a new open-source project, implements speculative decoding for large language models on Apple Silicon using the MLX framework, reportedly delivering up to 2.5x speedup on an M5 Max.
NVIDIA's Kimi-K2.5 Eagle Head: Supercharging Moonshot's Reasoning with Speculative Decoding
NVIDIA has released the Kimi-K2.5 Eagle head on Hugging Face, implementing Eagle-3 speculative decoding to accelerate inference for Moonshot's reasoning models. The release aims to speed up generation while maintaining accuracy.
mlx-vlm v0.5.0 Adds Continuous Batching, Distributed Inference for Apple Silicon
mlx-vlm v0.5.0 adds continuous batching, speculative decoding, and distributed inference for Apple Silicon. The release adds support for Qwen3.5, Kimi K2.5, Gemma 4 video, and other new models, with work from 21 contributors.
Paper Details Full-Stack MFM Acceleration: Quant, Spec Decode, HW Co-Design
A research paper details a full-stack approach for accelerating multimodal foundation models, combining hierarchy-aware mixed-precision quantization, structural pruning, speculative decoding, model cascading, and a specialized hardware accelerator. Demonstrated on medical and code generation tasks.
Claude Code's Secret Efficiency Hack
Claude Code leverages speculative decoding to reduce LLM energy use by 100x. Learn how this built-in optimization makes your coding faster and cheaper.
Nebius AI's LK Losses: A Breakthrough in Making Large Language Models Faster and More Efficient
Nebius AI has introduced LK Losses, a novel training objective that directly optimizes acceptance rates in speculative decoding. This approach achieves 8-10% efficiency gains over traditional methods, potentially revolutionizing how large language models are deployed.
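Why acceptance rate is worth optimizing can be seen from the standard expected-length formula for speculative decoding. The sketch assumes each drafted token is accepted independently with probability `alpha` and that `gamma` tokens are drafted per step. Both names and the sample values are illustrative and do not come from the Nebius work.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumes i.i.d. acceptance with probability alpha and gamma drafted tokens;
    the target always contributes one token, so the minimum is 1.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)  # limit of the geometric sum
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Illustrative comparison: a modest gain in acceptance rate at fixed draft length.
for a in (0.6, 0.7, 0.8, 0.9):
    print(f"alpha={a:.1f}, gamma=4 -> {expected_tokens_per_step(a, 4):.2f} tokens/step")
```

The formula ignores the cost of running the draft model, so it gives an upper bound on the per-step gain rather than a wall-clock speedup.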
Prefill-as-a-Service Paper Claims to Decouple LLM Inference Bottleneck
A research paper proposes a 'Prefill-as-a-Service' architecture to separate the heavy prefill computation from the lighter decoding phase in LLM inference. This could enable new deployment models where resource-constrained devices handle only the decoding step.
DFlash with Continuous Batch Inference Teased for Draft Models
A developer teased the upcoming release of DFlash with continuous batch inference, targeting the current text-only draft models used in speculative decoding to speed up LLM inference.
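For readers new to draft models, the following is a minimal sketch of greedy speculative decoding with toy stand-in models. It is not DFlash's implementation: `target`, `draft`, and `speculative_greedy` are hypothetical names, and a real system would verify all drafted tokens in one batched forward pass rather than one call per token.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a context to its greedy next token


def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Draft k tokens cheaply, keep the prefix the target agrees with."""
    if k < 1:
        raise ValueError("k must be at least 1")
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks each position; stop at the first disagreement.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                out.extend(proposal[:i])
                out.append(expected)  # the target's own token replaces the miss
                break
        else:
            out.extend(proposal)  # every drafted token was accepted
    return out


def plain_greedy(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Reference decoder: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


# Toy models: the draft agrees with the target except after tokens where t % 3 == 2.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] % 3 == 2 else (ctx[-1] + 1) % 10


print(speculative_greedy(toy_target, toy_draft, [0], max_new=6))
```

With greedy verification the output matches ordinary greedy decoding from the target model; the draft only affects how many target passes are needed.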
MiniMax M3: Sparse Attention, 1M Context, Multimodal via Together
MiniMax M3 uses sparse attention for 1M context and multimodality, with Together AI serving fast inference.
Median Coding Agent Hits 96k Input Tokens, Rewriting Inference Economics
Analyzing 432k requests, SemiAnalysis found that the median coding agent uses 96k input tokens, shifting the focus of inference cost from output to context.
Compute Shortage to Split AI Market: Rich Get Agents, Poor Get Chatbots
Mollick warns that the compute shortage makes agents expensive while chatbots get cheaper, splitting the AI market by company resources.
Qwen 3.6 27B Hits 34 tok/s on M5 Max MacBook Pro
Qwen 3.6 27B hits 34 tok/s on M5 Max MacBook Pro with 90% acceptance rate, per @rohanpaul_ai. Shows viable local LLM inference on Apple Silicon.
LASAR Cuts Latent Reasoning Steps in Half for GenRec at 20x Speedup Over CoT
LASAR nearly halves latent reasoning steps and achieves 20x speedup over explicit CoT in generative recommendation, outperforming baselines on three datasets.
Google Gemma 4: 3x Faster Inference with MTP Drafters
Google's Gemma 4 claims up to 3x faster inference via MTP drafters, but released no benchmark numbers or architectural details.
RedParrot: Semantic Caching Speeds Up NL-to-DSL for Business Analytics
Xiaohongshu researchers propose RedParrot, a framework that caches normalized structural patterns of natural language queries to bypass expensive LLM pipelines, achieving 3.6x speedup and 8.26% accuracy improvement on enterprise datasets.
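The general idea of caching a normalized query pattern can be sketched as follows. This is not RedParrot's algorithm: the normalization rule (replacing numbers and single-quoted literals with placeholders), the `PatternCache` class, and the `fake_llm_translate` stand-in are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List, Tuple

# Assumed rule: numbers and single-quoted strings are the variable parts of a query.
_LITERAL = re.compile(r"'[^']*'|\d+(?:\.\d+)?")


def normalize(query: str) -> Tuple[str, List[str]]:
    """Return (structural pattern, extracted literal values)."""
    slots: List[str] = []

    def _swap(match: "re.Match[str]") -> str:
        slots.append(match.group(0))
        return f"<v{len(slots) - 1}>"

    pattern = _LITERAL.sub(_swap, query.strip())
    pattern = re.sub(r"\s+", " ", pattern).lower()
    return pattern, slots


class PatternCache:
    """Caches one DSL template per structural pattern."""

    def __init__(self, slow_translate: Callable[[str], str]) -> None:
        self._slow = slow_translate
        self._templates: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def translate(self, query: str) -> str:
        pattern, slots = normalize(query)
        template = self._templates.get(pattern)
        if template is None:
            self.misses += 1
            template = self._slow(pattern)  # expensive path, taken once per pattern
            self._templates[pattern] = template
        else:
            self.hits += 1
        for i, value in enumerate(slots):
            template = template.replace(f"<v{i}>", value)
        return template


def fake_llm_translate(pattern: str) -> str:
    """Hypothetical stand-in for the expensive LLM pipeline."""
    return f"QUERY({pattern})"


cache = PatternCache(fake_llm_translate)
print(cache.translate("Sales in 2023 for 'EU'"))
print(cache.translate("sales in 2024 for 'US'"))  # same pattern, served from cache
print(cache.hits, cache.misses)
```

Exact-match lookup on the normalized pattern is the simplest possible design; a production system would likely need fuzzier matching and validation of cached templates.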
Sam Altman: AI inference costs dropped 1000x from o1 to GPT-5.4
Sam Altman stated AI inference costs for solving a fixed hard problem dropped ~1000x from o1 to GPT-5.4 in ~16 months, crediting cross-layer engineering optimizations, not a single breakthrough.
TACO Framework Cuts Agent Token Overhead 10% via Self-Evolving Compression
Researchers introduced TACO, a framework that enables terminal agents to automatically discover and refine context compression rules from their own interaction trajectories. This approach cuts token overhead by approximately 10% on benchmarks like TerminalBench and SWE-Bench Lite while preserving task accuracy.
SemiAnalysis: NVIDIA's Customer Data Drives Disaggregated Inference, LPU Surpasses GPU
SemiAnalysis states NVIDIA's direct customer feedback is leading the industry toward disaggregated inference architectures. In this model, specialized LPUs can outperform GPUs for specific pipeline tasks.
Principal Engineer: Claude Code Rushes, Codex Deliberate; Guardrails Are Key
A senior engineer with 100 hours in Claude Code and 20 in Codex reports Claude often rushes to patch, while Codex is more deliberate. The real product is the guardrail system—docs and review loops—not the AI itself.
MLX-VLM Adds Continuous Batching, OpenAI API, and Vision Cache for Apple Silicon
The next release of MLX-VLM will introduce continuous batching, an OpenAI-compatible API, and vision feature caching for multimodal models running locally on Apple Silicon. These optimizations promise up to 228x speedups on cache hits for models like Gemma 4.
Compute Constraints Create Double Bind for AI Growth: Ethan Mollick
Ethan Mollick highlights a critical industry bottleneck: compute scarcity forces a trade-off between raising prices/rationing current models and limiting future model training, creating a growth double bind.
AI Models Dumber as Compute Shifts to Enterprise, Users Report
Users report noticeable performance degradation in major AI models this month. Analysts suggest providers are shifting computational resources to prioritize enterprise clients over general subscribers.
OpenAI Forecasts $121B in AI Hardware Costs for 2028
OpenAI is forecasting its own AI research hardware costs will reach $121 billion in 2028, according to a WSJ report. This figure highlights the extreme capital intensity required to compete at the frontier of AI.
InCoder-32B-Thinking Hits 81.3% on LiveCodeBench, Trained on Chip & Kernel Traces
InCoder-32B-Thinking, a 32B-parameter model trained on execution traces from chip design, GPU kernels, and embedded systems, scores 81.3% on LiveCodeBench V5 and achieves an 84% compile pass rate on CAD-Coder.
MARS Method Boosts LLM Throughput 1.7x With No Architecture Changes
Researchers introduced MARS, a training-free method that allows autoregressive LLMs to generate multiple tokens per forward pass, boosting throughput by 1.5-1.7x without architectural modifications or accuracy loss.
OpenAI President Teases 'Spud' Model, Two Years of Research
OpenAI President Greg Brockman briefly mentioned an upcoming model codenamed 'Spud', stating it represents 'two years worth of research that is coming to fruition.' No technical details or release timeline were provided.
Anthropic's Claude Mythos Compute Needs Delay Release, 'Spud' Likely First
Anthropic's leaked internal note reveals its next flagship model, Claude Mythos, is too computationally expensive for general release. The company states it needs to become 'much more efficient,' likely delaying Mythos and prioritizing the 'Spud' model.
Google's TurboQuant Compresses LLM KV Cache 6x with Zero Accuracy Loss, Cutting GPU Memory by 80%
Google researchers introduced TurboQuant, a method that compresses LLM KV cache from 32-bit to 3-bit precision without accuracy degradation. This reduces GPU memory consumption by over 80% and speeds up inference 8x on H100 GPUs.
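The relationship between bit width, compression ratio, and memory saved is simple arithmetic, sketched below. The calculation ignores any per-block metadata such as scales or codebooks, and the 16-bit baseline is an assumption added for comparison, not a figure from the summary.

```python
def compression_ratio(src_bits: float, dst_bits: float) -> float:
    """Ideal ratio when every value shrinks from src_bits to dst_bits."""
    if src_bits <= 0 or dst_bits <= 0:
        raise ValueError("bit widths must be positive")
    return src_bits / dst_bits


def memory_saved(ratio: float) -> float:
    """Fraction of memory removed at a given compression ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 - 1.0 / ratio


# 32-bit and 3-bit come from the summary; 16-bit is an assumed alternative baseline.
for src in (32, 16):
    r = compression_ratio(src, 3)
    print(f"{src}-bit -> 3-bit: {r:.2f}x, {memory_saved(r):.1%} less memory")

# The headline's 6x ratio on its own terms.
print(f"6x: {memory_saved(6.0):.1%} less memory")
```

A 6x ratio corresponds to roughly 83% less memory, consistent with the "over 80%" figure, whereas an ideal 32-bit to 3-bit conversion would be closer to 10.7x, so the headline and summary may be using different baselines.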