Xiaomi's Embodied Intelligence Team has released OneVL, a novel vision-language model that introduces latent Chain-of-Thought (CoT) reasoning for autonomous driving tasks. According to the team's announcement, this is the first latent CoT method to surpass explicit CoT performance while maintaining answer-only latency—a critical breakthrough for real-time driving applications.
The model achieves state-of-the-art (SOTA) results on four key autonomous driving benchmarks: NAVSIM, ROADWork, Impromptu, and APR1. Unlike traditional CoT approaches that generate explicit reasoning steps (increasing latency), OneVL performs reasoning in a latent space, delivering both superior accuracy and faster inference.
Key Takeaways
- Xiaomi's Embodied Intelligence Team released OneVL, a vision-language model using latent Chain-of-Thought reasoning.
- It achieves state-of-the-art results on four autonomous driving benchmarks without the latency penalty of explicit reasoning steps.
What Xiaomi Built: Dual-Decoder Architecture for Latent Reasoning
OneVL employs a dual visual and language decoder architecture specifically designed for embodied AI tasks. The core innovation lies in its latent Chain-of-Thought mechanism. Instead of generating textual reasoning steps like "First I see a pedestrian, then I check the traffic light...", the model performs equivalent reasoning operations within its internal representations.
This approach addresses the fundamental latency problem in autonomous driving: explicit CoT can add hundreds of milliseconds to inference time as the model generates sequential reasoning text. OneVL's latent reasoning happens in parallel within the model's forward pass, maintaining the speed of simple answer-only models while achieving the reasoning benefits of CoT.
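The latency argument comes down to token counts: autoregressive decoding pays a per-token cost, so explicit CoT latency grows with the length of the generated reasoning text, while latent reasoning leaves only the answer tokens to decode. A back-of-envelope sketch with illustrative numbers (the per-token cost and token counts below are assumptions, not figures from the announcement):

```python
# Back-of-envelope decode-latency comparison. All numbers are illustrative
# assumptions; real per-token costs depend on hardware and model size.

PER_TOKEN_MS = 10.0      # assumed cost per generated token
ANSWER_TOKENS = 5        # e.g. a short action like "begin slowing down"
REASONING_TOKENS = 60    # explicit step-by-step reasoning text

def decode_latency_ms(generated_tokens: int,
                      per_token_ms: float = PER_TOKEN_MS) -> float:
    """Latency of autoregressive generation, ignoring the prefill pass."""
    return generated_tokens * per_token_ms

explicit_cot = decode_latency_ms(REASONING_TOKENS + ANSWER_TOKENS)  # 650.0 ms
answer_only = decode_latency_ms(ANSWER_TOKENS)                      # 50.0 ms
print(explicit_cot, answer_only)
```

Under these assumptions, explicit reasoning text dominates total latency; latent CoT's claim is that the reasoning cost moves into the forward pass, keeping decode time at the answer-only figure.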
Key Results: SOTA Across Four Benchmarks
The team reports OneVL achieves state-of-the-art performance on:
| Benchmark | Focus | Result | Score |
|---|---|---|---|
| NAVSIM | Navigation simulation | SOTA | Not specified |
| ROADWork | Road scene understanding | SOTA | Not specified |
| Impromptu | Unplanned driving scenarios | SOTA | Not specified |
| APR1 | Action prediction & reasoning | SOTA | Not specified |

Note: the announcement didn't provide specific numerical scores, only stating SOTA status.
Crucially, the team emphasizes that these results come at answer-only latency—meaning the model responds as quickly as models that simply output answers without any reasoning process. This latency advantage makes OneVL particularly suitable for real-time autonomous driving applications where milliseconds matter.
How Latent CoT Works: Reasoning Without Text Generation
Traditional Chain-of-Thought prompting works by having language models generate explicit reasoning steps before producing a final answer. For example:
Q: Should the car slow down?
A: Let's think step by step. I see a pedestrian crossing 50m ahead. The traffic light is green but turning yellow. The speed limit is 60km/h. Therefore, the car should begin slowing down.
OneVL's latent CoT performs similar reasoning but entirely within the model's internal representations. The dual-decoder architecture allows visual and linguistic information to interact in a structured way that mimics step-by-step reasoning without generating intermediate text.
The visual decoder processes scene information while the language decoder handles task instructions and produces final decisions. Between them, a latent reasoning module performs what the researchers describe as "implicit step-by-step reasoning"—essentially running through logical steps in compressed form.
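The pipeline described above can be sketched in miniature. Everything in this toy is an illustrative guess, not Xiaomi's implementation: the function names, the hand-written update rule, and the threshold are stand-ins for what would be learned transformer components in the real model.

```python
# Pure-Python toy of the dual-decoder + latent-reasoning layout the article
# describes. Hypothetical structure only: real decoders are transformer
# stacks and the latent module is learned, not hand-written.

def visual_decoder(pixels):
    # Stand-in for a vision tower: reduce the scene to a feature vector.
    return [sum(pixels) / len(pixels), max(pixels), min(pixels)]

def latent_reasoning(vis_feat, instr_feat, steps=4):
    # "Implicit step-by-step reasoning": refine a hidden state over a fixed
    # number of internal steps, emitting no intermediate text.
    state = [v + t for v, t in zip(vis_feat, instr_feat)]
    for _ in range(steps):
        state = [0.5 * s + 0.5 * v for s, v in zip(state, vis_feat)]
    return state

def language_decoder(state):
    # Stand-in decision head: outputs only the answer, no reasoning tokens.
    return "slow_down" if state[0] > 0.5 else "maintain_speed"

scene = [0.9, 0.8, 0.7, 0.95]   # toy frame: high values ~ "hazard ahead"
instruction = [0.2, 0.1, 0.0]   # toy embedded task instruction
action = language_decoder(latent_reasoning(visual_decoder(scene), instruction))
print(action)
```

The key property the sketch preserves is that the `steps` loop runs inside the forward pass: more internal reasoning iterations add compute but generate zero extra output tokens.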
Why This Matters for Autonomous Driving
Autonomous driving systems face a fundamental tension between reasoning capability and latency. Complex reasoning improves safety and decision-making but typically slows down response times. OneVL attempts to break this trade-off by making reasoning faster through latent representations.
For practical deployment, this means:
- Faster reaction times in critical situations
- More sophisticated reasoning without sacrificing real-time performance
- Reduced computational overhead compared to explicit CoT approaches
The four benchmarks where OneVL achieves SOTA cover important aspects of autonomous driving:
- NAVSIM: Navigation in simulated environments
- ROADWork: Understanding complex road scenes and infrastructure
- Impromptu: Handling unexpected scenarios
- APR1: Action prediction and reasoning
Technical Implementation & Availability

Based on the announcement, OneVL appears to be a research release rather than an immediately deployable product. The team has shared their findings via a paper (linked in the announcement) but hasn't specified plans for integration into Xiaomi's automotive products.
The dual-decoder architecture suggests the model was trained on paired vision-language data specifically focused on driving scenarios. The latent CoT mechanism likely required specialized training techniques to encourage the model to develop structured internal reasoning without explicit supervision.
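Since the announcement doesn't describe the training objective, here is one speculative shape such a recipe could take: supervise only the final answer while distilling the hidden state of an explicit-CoT teacher, so the student learns structured internal reasoning without step-level labels. The loss composition and all constants below are hypothetical.

```python
# Speculative training-objective sketch (not from the announcement):
# answer supervision plus a latent-consistency term that pulls the
# student's internal state toward an explicit-CoT teacher's state.

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def training_loss(answer_logit, answer_label,
                  student_state, teacher_state, alpha=0.1):
    # Answer term: squared error stands in for cross-entropy on the decision.
    answer_loss = (answer_logit - answer_label) ** 2
    # Latent term: match the state the teacher reaches after generating
    # explicit reasoning text, distilling the reasoning into latent space.
    latent_loss = mse(student_state, teacher_state)
    return answer_loss + alpha * latent_loss

loss = training_loss(0.8, 1.0, [0.2, 0.4], [0.1, 0.5], alpha=0.1)
print(round(loss, 4))
```

At inference time the teacher and its reasoning text are discarded entirely, which is what would preserve answer-only latency.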
Limitations and Open Questions
The announcement leaves several important questions unanswered:
- Specific benchmark numbers: Without numerical scores, it's difficult to gauge the magnitude of improvement over previous methods.
- Ablation studies: How much does latent CoT contribute versus other architectural choices?
- Generalization: Does the latent reasoning approach work beyond autonomous driving tasks?
- Training details: What datasets and compute resources were required?
Additionally, while latent reasoning reduces latency, it may make the model's decision process less interpretable—a significant concern for safety-critical applications like autonomous driving.
gentic.news Analysis
Xiaomi's entry into advanced autonomous driving AI represents a significant escalation in the automotive AI race. While companies like Tesla, Waymo, and Cruise have dominated autonomous driving research, consumer electronics and smartphone manufacturers like Xiaomi are now bringing substantial AI research capabilities to the space. This follows Xiaomi's strategic pivot into electric vehicles, with their first car model launching in 2025.
The latent CoT approach is particularly noteworthy because it addresses one of the most practical barriers to deploying sophisticated reasoning in real-time systems. Most CoT research has focused on improving accuracy with little attention to latency implications. OneVL represents one of the first attempts to optimize the efficiency of reasoning, not just its effectiveness.
This development aligns with a broader trend we've covered at gentic.news: the migration of advanced AI techniques from pure research to real-time embedded systems. In November 2025, we reported on NVIDIA's Drive Co-Pilot using similar compressed reasoning techniques for in-vehicle AI. OneVL appears to take this further by making the reasoning entirely latent rather than just compressed.
For practitioners, the key insight is that reasoning efficiency is becoming a first-class research problem. As AI systems move from chatbots to robots and vehicles, the computational cost of sophisticated reasoning can't be an afterthought. OneVL's architecture suggests that specialized models with domain-specific reasoning modules may outperform general-purpose LLMs on latency-critical tasks.
The dual-decoder approach also hints at a possible future direction: task-specific reasoning architectures rather than one-size-fits-all models. If latent CoT proves broadly effective, we might see similar architectures for robotics, industrial automation, and other real-time AI applications.
Frequently Asked Questions
What is latent Chain-of-Thought?
Latent Chain-of-Thought is a reasoning approach where AI models perform step-by-step logical reasoning within their internal representations rather than generating explicit textual reasoning steps. This reduces latency while maintaining the benefits of structured reasoning.
How does OneVL compare to Tesla's FSD?
While both systems target autonomous driving, they use different approaches. Tesla's Full Self-Driving relies primarily on computer vision and neural networks trained on massive real-world driving data. OneVL is a vision-language model that incorporates explicit reasoning capabilities through its latent CoT mechanism. The systems aren't directly comparable as OneVL is a research model while FSD is a deployed system.
Can latent reasoning be verified for safety?
This is a significant challenge. Because latent reasoning happens internally without explicit steps, it's harder to audit and verify than explicit CoT. Researchers are developing techniques to interpret latent reasoning, but this remains an active area of investigation, especially for safety-critical applications.
Will OneVL be used in Xiaomi's electric vehicles?
The announcement doesn't specify deployment plans. Typically, research models like OneVL serve as proofs-of-concept that may influence future product development. Elements of the architecture or training approach might eventually appear in Xiaomi's automotive AI systems, but direct integration would require substantial engineering for production readiness.