dMoE cuts uniquely activated experts from 69.5 to 14.6 while retaining 99.11% of baseline performance. The block-level routing method for diffusion LLMs also reduces memory by up to 80% and delivers up to 1.66× speedup.
Key facts
- Uniquely activated experts reduced from 69.5 to 14.6.
- 99.11% of baseline performance retained.
- Memory cut by up to 80%.
- Speedup of up to 1.66×.
- Method: block-level routing for diffusion LLMs.
A new routing method called dMoE targets the inefficiency of diffusion-based large language models that use mixture-of-experts (MoE) architectures. With standard per-token routing, each token picks its own experts, so the tokens processed together end up activating many distinct experts (69.5 unique experts on average in the reported baseline), which wastes compute and memory. dMoE introduces block-level routing: instead of routing each token independently, it groups tokens by block and assigns a shared set of experts, sharply reducing the number of unique experts activated per forward pass.
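To make the contrast concrete, here is a minimal sketch of per-token routing versus block-level routing on random router scores. It is an illustration under stated assumptions, not the paper's algorithm: the post does not say how dMoE chooses the shared expert set, so this sketch simply takes the experts with the highest score summed over the block. The function names and all sizes (`NUM_EXPERTS`, `BLOCK_LEN`, `TOP_K`, `SHARED`) are hypothetical.

```python
import random


def top_k(scores, k, allowed=None):
    """Return the indices of the k highest scores, optionally limited to `allowed`."""
    candidates = range(len(scores)) if allowed is None else allowed
    return sorted(candidates, key=lambda e: scores[e], reverse=True)[:k]


def token_level_routing(block_scores, k):
    """Standard MoE routing: every token independently picks its own top-k experts."""
    return [top_k(scores, k) for scores in block_scores]


def block_level_routing(block_scores, k, shared_size):
    """Block-level routing: all tokens in a block choose from one shared expert set.

    ASSUMPTION: the shared set is the `shared_size` experts with the highest
    router score summed over the block. The source does not describe the
    actual selection rule used by dMoE.
    """
    num_experts = len(block_scores[0])
    totals = [sum(scores[e] for scores in block_scores) for e in range(num_experts)]
    shared = top_k(totals, shared_size)
    return [top_k(scores, k, allowed=shared) for scores in block_scores]


def unique_experts(assignments):
    """Count the distinct experts touched by any token in the block."""
    return len({e for experts in assignments for e in experts})


random.seed(0)
NUM_EXPERTS, BLOCK_LEN, TOP_K, SHARED = 64, 32, 8, 16  # hypothetical sizes

# One row of router scores per token in the block.
block_scores = [[random.random() for _ in range(NUM_EXPERTS)] for _ in range(BLOCK_LEN)]

per_token = token_level_routing(block_scores, TOP_K)
per_block = block_level_routing(block_scores, TOP_K, SHARED)

print("unique experts, token-level:", unique_experts(per_token))
print("unique experts, block-level:", unique_experts(per_block))
```

In this sketch the block-level count can never exceed `SHARED`, because every token is restricted to the same candidate set. That cap is the property that would let an inference engine load fewer expert weights per block.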
The results, shared on X by @HuggingPapers, show dMoE retains 99.11% of the original model's performance while cutting uniquely activated experts by 79% — from 69.5 down to 14.6. Memory consumption drops up to 80%, and inference speed increases by up to 1.66×. The paper's arXiv link was included in the post, though full training and evaluation details (e.g., model size, dataset, baseline architecture) were not disclosed in the tweet.
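The 79% figure follows directly from the two expert counts quoted in the post. A quick check, using only those two numbers:

```python
baseline_experts = 69.5  # uniquely activated experts, baseline (from the post)
dmoe_experts = 14.6      # uniquely activated experts with dMoE (from the post)

# Relative reduction and the share of unique experts that remains active.
reduction = (baseline_experts - dmoe_experts) / baseline_experts
remaining = dmoe_experts / baseline_experts

print(f"reduction: {reduction:.1%}")  # reduction: 79.0%
print(f"remaining: {remaining:.1%}")  # remaining: 21.0%
```

Roughly one in five of the previously activated experts remains in use. That is in the same range as the reported memory saving of up to 80%, although the post does not state how the memory figure was measured.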
Why this matters for deployment

MoE models are popular for scaling LLMs without proportional compute increases, but they suffer from memory overhead because all expert weights must be stored. dMoE's block-level routing reduces the number of experts that need to be loaded into memory per forward pass, which directly lowers memory bandwidth requirements, a key bottleneck for inference on GPUs. The reported speedup of up to 1.66× suggests dMoE could make diffusion LLMs more practical for latency-sensitive applications.
The technique is reminiscent of expert pruning and conditional computation methods from the transformer literature (e.g., Shazeer et al. 2017, Fedus et al. 2022), but it is applied specifically to diffusion LLMs, which have different token-level dynamics than autoregressive models. Whether dMoE generalizes across model scales and tasks cannot be verified from the tweet alone.
What to watch
Check the linked arXiv paper for the details the tweet omits, in particular ablation studies across model scales (e.g., 7B, 13B, 70B) and results on task benchmarks (e.g., MMLU, GSM8K). Also track whether dMoE is adopted by teams working on diffusion LLMs for code generation or reasoning.









