Tuowei Wang et al. published llada.cpp on arXiv, the first NPU-aware inference framework for diffusion LLMs on smartphones. It reduces LLaDA-8B generation latency by 17x-42x over a CPU baseline while preserving generation quality.
Key facts
- 17x-42x speedup on OnePlus Ace5 Pro with SM8750 SoC
- First NPU-aware dLLM inference framework (llada.cpp)
- Three techniques: speculative decoding, dual-path revision, swap runtime
- Evaluated on LLaDA-8B for 128-token outputs
- Published on arXiv 2026-06-11
Diffusion large language models (dLLMs) denoise multiple tokens in parallel, promising faster generation than autoregressive models—but repeated denoising is computationally heavy for smartphones. Mobile NPUs offer high-throughput dense matrix computation, but three problems block efficient dLLM deployment: token commitment shrinks per-block workloads, token revision complicates KV cache reuse, and limited NPU-visible address space forces costly remapping and data transfers.
According to the arXiv preprint, llada.cpp solves these with three techniques:
1. Multi-Block Speculative Decoding — In late-stage current-block decoding, the workload shrinks because most tokens are already committed. llada.cpp fills that gap by speculatively decoding future-block tokens, keeping the NPU fully utilized.
2. Dual-Path Progressive Revision — Tokens committed early might still need revision. The framework keeps them revisable until stable, and refreshes unstable tokens via a CPU-side path that doesn't stall dense NPU execution.
3. Swap-Optimized Memory Runtime — It compacts NPU-visible address layouts and overlaps data staging with NPU computation, slashing remapping and transfer overhead.
The authors evaluated llada.cpp on the OnePlus Ace5 Pro with Qualcomm's SM8750 SoC, achieving end-to-end speedups of 17x-42x for 128-token outputs compared to a CPU baseline with prefix KV cache reuse. Generation quality is preserved; the paper reports no significant degradation in perplexity or downstream task scores.
Why this matters for on-device AI
This work directly addresses a structural bottleneck in mobile inference: current NPU use is limited to the prefill phase (prompt ingestion, first-token generation), as noted in Reddit discussions. llada.cpp extends NPU acceleration to the entire decode loop, including the challenging revision steps that previously forced fallback to CPU or GPU. The 17-42x range means a model that took seconds can now run in hundreds of milliseconds—a threshold that makes real-time on-device generation viable.
The paper also highlights a broader trend: as dLLMs gain traction (e.g., LLaDA, MDLM), inference frameworks must evolve to match their unique compute patterns. Autoregressive optimizations (speculative decoding, KV cache quantization) don't directly transfer; llada.cpp's multi-block speculative decoding is a novel adaptation.
Limitations
The evaluation is limited to one SoC (SM8750) and one model (LLaDA-8B). Generalization to other NPU architectures (Apple Neural Engine, MediaTek APU) and larger dLLMs remains unproven. The paper does not report power consumption figures, which are critical for mobile deployment. The code is not yet publicly released, though the authors plan to open-source it.
What to watch
Watch for open-source release of llada.cpp code and for follow-up evaluations on Apple Neural Engine and MediaTek APU. Also watch whether LLaDA or other dLLMs gain adoption on mobile—if they do, llada.cpp's approach could become standard for on-device inference.

Source: arxiv.org








