Microsoft Research's Mirage stores 3D scenes directly as latent tokens, skipping the costly RGB render-and-reencode loop. The reported gains are up to 10.57x faster video generation and 55x lower memory use.
Key facts
- Up to 10.57x faster video generation vs. baselines
- 55x lower memory consumption
- State-of-the-art consistency on WorldScore
- Bypasses RGB render-and-reencode loop
- Introduced by Microsoft Research
Microsoft Research has introduced Mirage, a latent spatial memory that stores 3D scenes directly as latent tokens, bypassing the traditional RGB rendering pipeline. According to @HuggingPapers, this eliminates the most computationally expensive step in 3D-aware video synthesis: rendering full-resolution RGB frames and then re-encoding them into latent space for downstream diffusion models.
How it works
Mirage's latent spatial memory is a differentiable data structure that holds a compressed 3D representation of the scene, updated incrementally as the camera moves. Instead of producing RGB images and then encoding them into latent space (the typical approach used by models like MVSplat and 3DGS-Enhancer), Mirage directly outputs latent tokens that feed into a video diffusion model.
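Since neither code nor a preprint is available, the following is only a toy sketch of what "a spatial memory that stores latent tokens and is updated incrementally" could look like. Everything in it is an assumption made for illustration: the class name `ToyLatentMemory`, the voxel-grid keying, the running-mean update, and the zero-vector fallback are hypothetical and are not taken from Mirage. The sketch is also not differentiable, unlike the structure described above.

```python
import numpy as np


class ToyLatentMemory:
    """Hypothetical voxel-keyed store of latent tokens (NOT Mirage's actual design)."""

    def __init__(self, voxel_size: float, latent_dim: int):
        self.voxel_size = voxel_size
        self.latent_dim = latent_dim
        self.tokens = {}  # voxel index (i, j, k) -> running-mean latent vector
        self.counts = {}  # voxel index (i, j, k) -> number of merged observations

    def _key(self, point):
        # Quantize a 3D position to the integer index of the voxel containing it.
        idx = np.floor(np.asarray(point, dtype=float) / self.voxel_size)
        return tuple(int(v) for v in idx)

    def update(self, points, latents):
        # Incrementally merge new latent observations into the memory.
        for point, latent in zip(points, latents):
            key = self._key(point)
            n = self.counts.get(key, 0)
            prev = self.tokens.get(key, np.zeros(self.latent_dim))
            self.tokens[key] = (prev * n + np.asarray(latent, dtype=float)) / (n + 1)
            self.counts[key] = n + 1

    def query(self, points):
        # Return one latent token per query point; unseen voxels yield zeros.
        empty = np.zeros(self.latent_dim)
        return np.stack([self.tokens.get(self._key(p), empty) for p in points])


memory = ToyLatentMemory(voxel_size=0.5, latent_dim=4)
memory.update([[0.1, 0.2, 0.3]], [np.ones(4)])
memory.update([[0.4, 0.4, 0.4]], [3 * np.ones(4)])  # same voxel, so the two are averaged
tokens = memory.query([[0.2, 0.2, 0.2], [5.0, 5.0, 5.0]])
print(tokens.shape)  # (2, 4)
```

The point of the sketch is the data flow: the tokens returned by `query` are already in latent space, so a downstream diffusion model could consume them without an intermediate RGB image ever being rendered or encoded.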
The key innovation is avoiding the RGB render-and-reencode loop, which accounts for a significant fraction of total inference cost in prior systems. The tweet reports up to 10.57x faster generation and 55x lower memory consumption compared to baselines.
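To put the two headline factors in more familiar terms, a short calculation converts an "N times" claim into a fractional reduction. The function name `reduction_from_factor` is ours, and the reading in the final comment is an idealized assumption (the whole speedup comes from removing one stage while everything else is unchanged), not something the tweet states.

```python
def reduction_from_factor(factor: float) -> float:
    """Fractional reduction implied by an 'N times faster/smaller' claim: 1 - 1/N."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 - 1.0 / factor


time_saved = reduction_from_factor(10.57)
memory_saved = reduction_from_factor(55.0)

print(f"{time_saved:.1%}")    # 90.5%
print(f"{memory_saved:.1%}")  # 98.2%

# Idealized reading (an assumption): if removing one stage alone produced a
# 10.57x speedup, that stage must have taken about 90.5% of the baseline runtime.
```

Because these are "up to" figures against unspecified baselines, the percentages are upper bounds on the claimed savings rather than typical values.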
Performance claims
Mirage reportedly achieves state-of-the-art consistency scores on the WorldScore benchmark, which measures multi-view coherence of generated videos. The tweet did not disclose the margin over prior methods or provide full benchmark tables. The primary claimed advance is the system's ability to maintain 3D consistency over long camera trajectories without hallucinating geometry.
Context and open questions
The work arrives amid growing interest in latent-space rendering for 3D-aware generation. Prior approaches like Neural Radiance Fields (NeRF, Mildenhall et al. 2020) and 3D Gaussian Splatting (Kerbl et al. 2023) rely on explicit RGB rendering, which becomes a bottleneck at high resolutions or long sequences. Mirage's latent approach could be complementary to these methods, but the paper, which the tweet does not yet link, would need to clarify whether the latent spatial memory is learned per scene or generalizes across scenes.
Microsoft has not released code or a full preprint. The claims of 10.57x speedup and 55x memory reduction are stated as aggregate results against unspecified baselines; the exact hardware, model architectures, and evaluation protocols are not detailed in the tweet.
What to watch
Look for the full arXiv preprint and code release from Microsoft Research. Two key questions remain: does Mirage generalize to unseen scenes without per-scene optimization, and how does it compare against 3D Gaussian Splatting on standard novel-view synthesis benchmarks like LLFF and Mip-NeRF 360?