ELDR (Expert-Locality-Aware Decode Routing) optimizes disaggregated MoE serving. It uses expert signatures from prefill to route decode requests, cutting median TPOT by 5.9–13.9% in vLLM at scale, according to @HuggingPapers.
Key facts
- ELDR cuts median TPOT by 5.9–13.9% in vLLM.
- Uses expert signatures from prefill for decode routing.
- No model size, expert count, or hardware disclosed.
- No ablation studies or comparison to other routing methods.
- Targets disaggregated MoE serving systems.
Mixture-of-Experts models suffer a well-known pain point in production: prefill and decode phases have asymmetric compute and memory demands, which makes disaggregated serving attractive but introduces routing inefficiency. ELDR attacks that problem by exploiting a signal that prior work largely ignored: the expert-activation patterns produced during prefill can serve as a signature for routing each decode request toward the same set of experts, reducing cross-node communication and cache misses.
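The source does not describe how ELDR builds or matches signatures, so the following is only a minimal sketch of the general idea, not ELDR's algorithm. It assumes a signature is the set of most frequently activated experts during prefill, that each decode instance advertises a set of "hot" experts, and that routing trades Jaccard overlap against a simple load penalty. The names `expert_signature`, `route_decode`, `instance_hot_experts`, and `load_penalty` are hypothetical and are not part of vLLM.

```python
from collections import Counter


def expert_signature(prefill_expert_ids, top_n=4):
    """Summarize prefill as the set of its top_n most-activated expert ids.

    Assumption: a frequency-based set is used as the signature; the real
    ELDR signature format is not disclosed in the source.
    """
    counts = Counter(prefill_expert_ids)
    # Sort by descending count, then by expert id for deterministic ties.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return frozenset(expert for expert, _ in ranked[:top_n])


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def route_decode(signature, instance_hot_experts, instance_load, load_penalty=0.1):
    """Pick the decode instance with the best locality-minus-load score.

    instance_hot_experts: dict of instance name -> set of expert ids (hypothetical).
    instance_load: dict of instance name -> number of queued requests (hypothetical).
    """
    best_name, best_score = None, float("-inf")
    for name in sorted(instance_hot_experts):  # sorted for deterministic ties
        locality = jaccard(signature, instance_hot_experts[name])
        score = locality - load_penalty * instance_load.get(name, 0)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Illustrative values only; they do not come from the paper.
sig = expert_signature([3, 7, 3, 12, 7, 3, 21, 12], top_n=3)
hot = {"decode-0": {1, 2, 3, 4}, "decode-1": {3, 7, 12, 30}}
print(route_decode(sig, hot, {"decode-0": 0, "decode-1": 1}))
```

The load penalty is there because pure locality routing can pile requests onto one instance. Whether ELDR balances load this way is not stated in the source.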
The paper reports a median time-per-output-token (TPOT) reduction of 5.9% to 13.9% when integrated with vLLM at scale. The width of that range is informative: it suggests the gain depends on workload characteristics, batch size, and the degree of expert locality in the model. But the source [@HuggingPapers] does not disclose the number of experts, model size, or hardware configuration used in the experiments, making it difficult to assess generalizability. No ablation studies are provided, and there is no comparison to alternative routing strategies such as top-k gating or hash-based affinity routing.
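For readers who want to reproduce this kind of figure on their own traces, here is a small sketch of how a median TPOT reduction is typically computed. It assumes the common definition of TPOT as decode time divided by the number of tokens after the first. The source does not state which definition the paper uses, and the latency values below are invented for illustration, not taken from the paper.

```python
from statistics import median


def tpot(total_latency_s, ttft_s, output_tokens):
    """Time per output token, excluding the first token (assumed definition)."""
    if output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (total_latency_s - ttft_s) / (output_tokens - 1)


def median_tpot_reduction_pct(baseline_tpots, candidate_tpots):
    """Relative reduction of the median TPOT, in percent."""
    base = median(baseline_tpots)
    cand = median(candidate_tpots)
    return 100.0 * (base - cand) / base


# Made-up per-request TPOT samples in seconds (illustrative only).
baseline = [0.050, 0.040, 0.060]
with_routing = [0.045, 0.036, 0.054]
print(f"{median_tpot_reduction_pct(baseline, with_routing):.1f}% lower median TPOT")
```

Because the metric is a median, it says nothing about tail latency. A p95 or p99 comparison would need the full per-request distribution.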
The unique take here is that ELDR treats prefill-to-decode locality as a first-class signal rather than a byproduct. Most disaggregated serving systems (e.g., Splitwise, DistServe) focus on balancing load across prefill and decode instances, but they route decode tokens independently of which experts were active during prefill. ELDR closes that feedback loop. The 5.9–13.9% TPOT improvement is modest, not a breakthrough, but it addresses a real operational cost for any team running MoE models at scale. The open question is whether the compute and memory overhead of computing and storing expert signatures during prefill erodes that latency gain.
What to watch

Watch for a full paper release on arXiv with model scale, expert count, hardware config, and ablation studies. The community needs to see whether ELDR's gain holds at 8×220B MoE scale and whether the signature-computation overhead is sub-linear in batch size.








