speculative decoding
24 articles about speculative decoding in AI news
PayPal Cuts LLM Inference Cost 50% with EAGLE3 Speculative Decoding on H100
PayPal engineers applied EAGLE3 speculative decoding to their fine-tuned 8B-parameter commerce agent, achieving up to 49% higher throughput and 33% lower latency. This allowed a single H100 GPU to match the performance of two H100s running NVIDIA NIM, cutting inference hardware cost by 50%.
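The mechanism behind EAGLE-style gains can be sketched generically: a cheap draft model proposes several tokens and the large target model verifies them in one pass, so multiple tokens are committed per expensive target call. The toy Python below is illustrative only; `target` and `draft` are made-up deterministic stand-ins, not PayPal's models or the actual EAGLE3 algorithm.

```python
def speculative_decode(target, draft, prompt, k=4, max_new=8):
    """Greedy speculative decoding sketch: the draft proposes k tokens,
    the target checks them, and the longest agreeing prefix is accepted."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. Cheap draft model proposes k tokens autoregressively.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target verifies the proposals (a single batched pass in practice).
        for t in proposal:
            if target(out) != t:   # first disagreement stops acceptance
                break
            out.append(t)
        # 3. The target always contributes one token of its own.
        out.append(target(out))
    return out

# Toy deterministic "model": next token = (last + 1) mod 10.
def target(ctx):
    return (ctx[-1] + 1) % 10

draft = target  # a perfectly aligned draft -> every proposal accepted
```

With a well-aligned draft, each loop iteration commits up to k+1 tokens for one target call, which is where the throughput gain comes from.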
DFlash Brings Speculative Decoding to Apple Silicon via MLX
DFlash, a new open-source project, implements speculative decoding for large language models on Apple Silicon using the MLX framework, reportedly delivering up to 2.5x speedup on an M5 Max.
NVIDIA's Kimi-K2.5 Eagle Head: Supercharging Moonshot's Reasoning with Speculative Decoding
NVIDIA has released the Kimi-K2.5 Eagle head on Hugging Face, implementing EAGLE3 speculative decoding to accelerate inference for Moonshot's reasoning models. The release targets substantially faster decoding while preserving output quality.
Claude Code's Secret Efficiency Hack
Claude Code uses speculative decoding to cut LLM energy use roughly 100-fold. The article explains how this built-in optimization makes coding sessions faster and cheaper.
Nebius AI's LK Losses: A Breakthrough in Making Large Language Models Faster and More Efficient
Nebius AI has introduced LK Losses, a novel training objective that directly optimizes acceptance rates in speculative decoding. The approach achieves 8-10% efficiency gains over traditional methods, a meaningful improvement for large-scale LLM deployment.
Prefill-as-a-Service Paper Claims to Decouple LLM Inference Bottleneck
A research paper proposes a 'Prefill-as-a-Service' architecture to separate the heavy prefill computation from the lighter decoding phase in LLM inference. This could enable new deployment models where resource-constrained devices handle only the decoding step.
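The split can be sketched in a few lines (hypothetical names; the paper's actual interface is not specified in this summary): the heavy prompt-processing phase produces a cache-like state once, and a lightweight client then extends it one token at a time.

```python
def prefill(prompt_tokens):
    """Heavy, parallelizable phase: process the full prompt in one pass.
    In a real system this returns attention KV caches; here the running
    token context stands in for that state."""
    return list(prompt_tokens)

def decode_step(state):
    """Light, strictly sequential phase: emit one token per call against
    the cached state (toy rule: next = last + 1 mod 10)."""
    nxt = (state[-1] + 1) % 10
    state.append(nxt)
    return nxt

# A server could run prefill(); a constrained device runs only this loop:
state = prefill([3, 4, 5])
generated = [decode_step(state) for _ in range(4)]
```

The point of the separation is that `prefill` is compute-bound and batches well on big accelerators, while `decode_step` is memory-bound and cheap enough for weaker hardware.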
DFlash with Continuous Batch Inference Teased for Draft Models
A developer teased an upcoming release of DFlash with continuous batch inference, targeting current text-only draft models used in speculative decoding to speed up LLM inference.
Sam Altman: AI inference costs dropped 1000x from o1 to GPT-5.4
Sam Altman stated AI inference costs for solving a fixed hard problem dropped ~1000x from o1 to GPT-5.4 in ~16 months, crediting cross-layer engineering optimizations, not a single breakthrough.
TACO Framework Cuts Agent Token Overhead 10% via Self-Evolving Compression
Researchers introduced TACO, a framework that enables terminal agents to automatically discover and refine context compression rules from their own interaction trajectories. This approach cuts token overhead by approximately 10% on benchmarks like TerminalBench and SWE-Bench Lite while preserving task accuracy.
SemiAnalysis: NVIDIA's Customer Data Drives Disaggregated Inference, LPU Surpasses GPU
SemiAnalysis states NVIDIA's direct customer feedback is leading the industry toward disaggregated inference architectures. In this model, specialized LPUs can outperform GPUs for specific pipeline tasks.
Principal Engineer: Claude Code Rushes, Codex Deliberate; Guardrails Are Key
A senior engineer with 100 hours in Claude Code and 20 in Codex reports Claude often rushes to patch, while Codex is more deliberate. The real product is the guardrail system—docs and review loops—not the AI itself.
MLX-VLM Adds Continuous Batching, OpenAI API, and Vision Cache for Apple Silicon
The next release of MLX-VLM will introduce continuous batching, an OpenAI-compatible API, and vision feature caching for multimodal models running locally on Apple Silicon. These optimizations promise up to 228x speedups on cache hits for models like Gemma 4.
Compute Constraints Create Double Bind for AI Growth: Ethan Mollick
Ethan Mollick highlights a critical industry bottleneck: compute scarcity forces a trade-off between raising prices/rationing current models and limiting future model training, creating a growth double bind.
AI Models Dumber as Compute Shifts to Enterprise, Users Report
Users report noticeable performance degradation in major AI models this month. Analysts suggest providers are shifting computational resources to prioritize enterprise clients over general subscribers.
OpenAI Forecasts $121B in AI Hardware Costs for 2028
OpenAI is forecasting its own AI research hardware costs will reach $121 billion in 2028, according to a WSJ report. This figure highlights the extreme capital intensity required to compete at the frontier of AI.
InCoder-32B-Thinking Hits 81.3% on LiveCodeBench, Trained on Chip & Kernel Traces
InCoder-32B-Thinking, a 32B parameter model trained on execution traces from chip design, GPU kernels, and embedded systems, scores 81.3% on LiveCodeBench V5 and an 84% compile pass rate on CAD-Coder.
MARS Method Boosts LLM Throughput 1.7x With No Architecture Changes
Researchers introduced MARS, a training-free method that allows autoregressive LLMs to generate multiple tokens per forward pass, boosting throughput by 1.5-1.7x without architectural modifications or accuracy loss.
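The blurb doesn't describe MARS's exact mechanism. A well-known training-free relative is prompt-lookup drafting, where the tokens that previously followed the current n-gram are proposed as a draft and then verified by the model in a single pass. The sketch below shows that related idea, not MARS itself.

```python
def prompt_lookup_draft(tokens, n=2, k=3):
    """Training-free drafting sketch: find the most recent earlier
    occurrence of the last n-gram and propose the k tokens that followed
    it. The proposals would then be verified by the model in one pass."""
    tail = tokens[-n:]
    # Scan backwards, excluding the trailing n-gram itself.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + k]
    return []  # no repeated n-gram -> nothing to propose
```

Because drafting is a pure lookup over the existing sequence, no extra model or retraining is needed, which is what "training-free" means in this setting.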
OpenAI President Teases 'Spud' Model, Two Years of Research
OpenAI President Greg Brockman briefly mentioned an upcoming model codenamed 'Spud', stating it represents 'two years worth of research that is coming to fruition.' No technical details or release timeline were provided.
Anthropic's Claude Mythos Compute Needs Delay Release, 'Spud' Likely First
Anthropic's leaked internal note reveals its next flagship model, Claude Mythos, is too computationally expensive for general release. The company states it needs to become 'much more efficient,' likely delaying Mythos and prioritizing the 'Spud' model.
Google's TurboQuant Compresses LLM KV Cache 6x with Zero Accuracy Loss, Cutting GPU Memory by 80%
Google researchers introduced TurboQuant, a method that compresses LLM KV cache from 32-bit to 3-bit precision without accuracy degradation. This reduces GPU memory consumption by over 80% and speeds up inference 8x on H100 GPUs.
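TurboQuant's actual algorithm isn't described in this summary, but the arithmetic of low-bit KV compression can be illustrated with a simple per-row absmax scheme (a generic sketch, not Google's method): 3-bit signed codes plus one scale per row cut storage to roughly a tenth of fp32.

```python
import numpy as np

def quantize_3bit(x):
    """Per-row absmax quantization to 3-bit signed codes in [-4, 3],
    storing one fp32 scale per row."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 3.0
    scale = np.maximum(scale, 1e-8)          # guard all-zero rows
    q = np.clip(np.round(x / scale), -4, 3).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((4, 64)).astype(np.float32)  # toy KV-cache slab
q, s = quantize_3bit(kv)
max_err = np.abs(dequantize(q, s) - kv).max()  # bounded by scale / 2
```

Naive rounding like this does lose accuracy at 3 bits; a production method would need smarter codebooks or error compensation to approach the lossless behavior the article claims.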
Anthropic Rumored to Develop 'Mythos' and 'Capybara' Models, With Mythos Positioned as Premium Tier Above Claude 3.5 Opus
Anthropic is reportedly preparing new AI models codenamed 'Mythos' and 'Capybara,' with Mythos positioned as a premium tier above Claude 3.5 Opus. The rumored model is described as extremely expensive to run, suggesting a larger, more computationally intensive system.
TurboQuant Ported to Apple MLX, Claims 75% Memory Reduction with Minimal Performance Loss
Developer Prince Canuma has successfully ported the TurboQuant quantization method to Apple's MLX framework, reporting a 75% reduction in memory usage with nearly no performance degradation for on-device AI models.
Text-to-Video Model Achieves Sub-100ms Prompt-to-Output Latency
An AI researcher reports a text-to-video model generating outputs in under 100 milliseconds. This represents a 300x speed improvement over current models that typically take 30+ seconds.
Google's Gemma 4 Emerges: The Next Generation of Open AI Models
Google has announced the upcoming release of Gemma 4, the next iteration of its open-source AI model family. This development signals Google's continued commitment to accessible AI technology and intensified competition in the open model space.