speculative decoding

24 articles about speculative decoding in AI news

PayPal Cuts LLM Inference Cost 50% with EAGLE3 Speculative Decoding on H100

PayPal engineers applied EAGLE3 speculative decoding to their fine-tuned 8B-parameter commerce agent, achieving up to 49% higher throughput and 33% lower latency. This allowed a single H100 GPU to match the performance of two H100s running NVIDIA NIM, cutting inference hardware cost by 50%.

Apr 23, 202678% relevant

DFlash Brings Speculative Decoding to Apple Silicon via MLX

DFlash, a new open-source project, implements speculative decoding for large language models on Apple Silicon using the MLX framework, reportedly delivering up to 2.5x speedup on an M5 Max.

Apr 13, 202685% relevant

NVIDIA's Kimi-K2.5 Eagle Head: Supercharging Moonshot's Reasoning with Speculative Decoding

NVIDIA has released the Kimi-K2.5 Eagle head on Hugging Face, implementing Eagle-3 speculative decoding to dramatically accelerate inference for Moonshot's reasoning models. This breakthrough promises blazing-fast performance while maintaining accuracy.

Mar 12, 202689% relevant

Claude Code's Secret Efficiency Hack

Claude Code leverages speculative decoding to reduce LLM energy use by 100x. Learn how this built-in optimization makes your coding faster and cheaper.

Apr 23, 202677% relevant

Nebius AI's LK Losses: A Breakthrough in Making Large Language Models Faster and More Efficient

Nebius AI has introduced LK Losses, a novel training objective that directly optimizes acceptance rates in speculative decoding. This approach achieves 8-10% efficiency gains over traditional methods, potentially revolutionizing how large language models are deployed.

Mar 3, 202685% relevant

Prefill-as-a-Service Paper Claims to Decouple LLM Inference Bottleneck

A research paper proposes a 'Prefill-as-a-Service' architecture to separate the heavy prefill computation from the lighter decoding phase in LLM inference. This could enable new deployment models where resource-constrained devices handle only the decoding step.

Apr 20, 202685% relevant

Dflash with Continuous Batch Inference Teased for Draft Models

A developer teased the upcoming release of 'Dflash' with continuous batch inference, targeting current text-only draft models used in speculative execution to speed up LLM inference.

Apr 16, 202685% relevant

Sam Altman: AI inference costs dropped 1000x from o1 to GPT-5.4

Sam Altman stated AI inference costs for solving a fixed hard problem dropped ~1000x from o1 to GPT-5.4 in ~16 months, crediting cross-layer engineering optimizations, not a single breakthrough.

Apr 22, 202685% relevant

TACO Framework Cuts Agent Token Overhead 10% via Self-Evolving Compression

Researchers introduced TACO, a framework that enables terminal agents to automatically discover and refine context compression rules from their own interaction trajectories. This approach cuts token overhead by approximately 10% on benchmarks like TerminalBench and SWE-Bench Lite while preserving task accuracy.

Apr 22, 202685% relevant

SemiAnalysis: NVIDIA's Customer Data Drives Disaggregated Inference, LPU Surpasses GPU

SemiAnalysis states NVIDIA's direct customer feedback is leading the industry toward disaggregated inference architectures. In this model, specialized LPUs can outperform GPUs for specific pipeline tasks.

Apr 22, 202685% relevant

Principal Engineer: Claude Code Rushes, Codex Deliberate; Guardrails Are Key

A senior engineer with 100 hours in Claude Code and 20 in Codex reports Claude often rushes to patch, while Codex is more deliberate. The real product is the guardrail system—docs and review loops—not the AI itself.

Apr 17, 202685% relevant

MLX-VLM Adds Continuous Batching, OpenAI API, and Vision Cache for Apple Silicon

The next release of MLX-VLM will introduce continuous batching, an OpenAI-compatible API, and vision feature caching for multimodal models running locally on Apple Silicon. These optimizations promise up to 228x speedups on cache hits for models like Gemma4.

Apr 16, 202695% relevant

Compute Constraints Create Double Bind for AI Growth: Ethan Mollick

Ethan Mollick highlights a critical industry bottleneck: compute scarcity forces a trade-off between raising prices/rationing current models and limiting future model training, creating a growth double bind.

Apr 15, 202685% relevant

AI Models Dumber as Compute Shifts to Enterprise, Users Report

Users report noticeable performance degradation in major AI models this month. Analysts suggest providers are shifting computational resources to prioritize enterprise clients over general subscribers.

Apr 13, 202685% relevant

OpenAI Forecasts $121B in AI Hardware Costs for 2028

OpenAI is forecasting its own AI research hardware costs will reach $121 billion in 2028, according to a WSJ report. This figure highlights the extreme capital intensity required to compete at the frontier of AI.

Apr 12, 202685% relevant

InCoder-32B-Thinking Hits 81.3% on LiveCodeBench, Trained on Chip & Kernel Traces

InCoder-32B-Thinking, a 32B parameter model trained on execution traces from chip design, GPU kernels, and embedded systems, scores 81.3% on LiveCodeBench V5 and an 84% compile pass rate on CAD-Coder.

Apr 11, 202692% relevant

MARS Method Boosts LLM Throughput 1.7x With No Architecture Changes

Researchers introduced MARS, a training-free method that allows autoregressive LLMs to generate multiple tokens per forward pass, boosting throughput by 1.5-1.7x without architectural modifications or accuracy loss.

Apr 9, 202685% relevant

OpenAI President Teases 'Spud' Model, Two Years of Research

OpenAI President Greg Brockman briefly mentioned an upcoming model codenamed 'Spud', stating it represents 'two years worth of research that is coming to fruition.' No technical details or release timeline were provided.

Apr 5, 202687% relevant

Anthropic's Claude Mythos Compute Needs Delay Release, 'Spud' Likely First

Anthropic's leaked internal note reveals its next flagship model, Claude Mythos, is too computationally expensive for general release. The company states it needs to become 'much more efficient,' likely delaying Mythos and prioritizing the 'Spud' model.

Apr 5, 202685% relevant

Google's TurboQuant Compresses LLM KV Cache 6x with Zero Accuracy Loss, Cutting GPU Memory by 80%

Google researchers introduced TurboQuant, a method that compresses LLM KV cache from 32-bit to 3-bit precision without accuracy degradation. This reduces GPU memory consumption by over 80% and speeds up inference 8x on H100 GPUs.

Mar 28, 202697% relevant

Anthropic Rumored to Develop 'Mythos' and 'Capybara' Models, With Mythos Positioned as Premium Tier Above Claude 3.5 Opus

Anthropic is reportedly preparing new AI models codenamed 'Mythos' and 'Capybara,' with Mythos positioned as a premium tier above Claude 3.5 Opus. The rumored model is described as extremely expensive to run, suggesting a larger, more computationally intensive system.

Mar 27, 202695% relevant

TurboQuant Ported to Apple MLX, Claims 75% Memory Reduction with Minimal Performance Loss

Developer Prince Canuma has successfully ported the TurboQuant quantization method to Apple's MLX framework, reporting a 75% reduction in memory usage with nearly no performance degradation for on-device AI models.

Mar 26, 202685% relevant

Text-to-Video Model Achieves Sub-100ms Prompt-to-Output Latency

An AI researcher reports a text-to-video model generating outputs in under 100 milliseconds. This represents a 300x speed improvement over current models that typically take 30+ seconds.

Mar 19, 202685% relevant

Google's Gemma 4 Emerges: The Next Generation of Open AI Models

Google has announced the upcoming release of Gemma 4, the next iteration of its open-source AI model family. This development signals Google's continued commitment to accessible AI technology and intensified competition in the open model space.

Mar 9, 202685% relevant

Explore More

AI Agents Large Language Models Claude Code OpenAI RAG MCP Fine-tuning Benchmarks Open Source AI AI Safety