Nvidia unveiled Cosmos 3, a single model that unifies understanding, simulation, and action across physical AI tasks. The model treats action as a fundamental token, collapsing what previously required separate vision-language models, physics simulators, and control policies into one architecture.
Key facts
- Cosmos 3 unifies understanding, simulation, and action in one model.
- Action is treated as a fundamental token in the architecture.
- Targets robotics, autonomous vehicles, and industrial automation.
- No model size, training compute, or benchmark scores disclosed.
- Competes with Google RT-2 and Physical Intelligence π0.
Nvidia's Cosmos 3, announced via X, represents a structural shift in physical AI: one model that can perceive a scene, simulate possible futures, and execute actions — all within a single transformer. According to @rohanpaul_ai, the model "treats action as a fundamental token," a design choice that collapses the traditional pipeline of separate vision, simulation, and control modules.
Why Action-as-Token Matters
Most physical AI systems chain a vision-language model (VLM) for perception, a physics simulator for prediction, and a separate policy network for control. Cosmos 3 consolidates these into one autoregressive model that predicts action tokens directly from visual and textual inputs. This mirrors recent trends in embodied AI — notably Google's RT-2 (2023) and Physical Intelligence's π0 (2024) — but Nvidia claims Cosmos 3 is the first to unify understanding, simulation, and action in a single training run.
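To make the action-as-token idea concrete, here is a minimal sketch of one common way to turn continuous robot actions into discrete tokens that an autoregressive model can predict alongside text. It uses uniform binning, similar in spirit to what RT-2 describes. Nvidia has not disclosed how Cosmos 3 tokenizes actions, so the bin count, the normalized action range, the vocabulary offset, and the function names below are all assumptions made for illustration.

```python
from typing import List

# All constants are illustrative assumptions, not disclosed Cosmos 3 details.
NUM_BINS = 256                # assumed number of bins per action dimension
ACTION_LOW = -1.0             # assumed lower bound of a normalized action
ACTION_HIGH = 1.0             # assumed upper bound of a normalized action
ACTION_TOKEN_OFFSET = 32000   # hypothetical first token ID reserved for actions


def encode_action(action: List[float]) -> List[int]:
    """Map each continuous action dimension to one discrete token ID."""
    tokens = []
    for value in action:
        # Clip to the supported range so out-of-range commands stay valid.
        clipped = min(max(value, ACTION_LOW), ACTION_HIGH)
        fraction = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
        # The upper bound would land in bin NUM_BINS, so clamp to the last bin.
        bin_index = min(int(fraction * NUM_BINS), NUM_BINS - 1)
        tokens.append(ACTION_TOKEN_OFFSET + bin_index)
    return tokens


def decode_action(tokens: List[int]) -> List[float]:
    """Map action token IDs back to the center value of each bin."""
    bin_width = (ACTION_HIGH - ACTION_LOW) / NUM_BINS
    return [
        ACTION_LOW + (token - ACTION_TOKEN_OFFSET + 0.5) * bin_width
        for token in tokens
    ]


if __name__ == "__main__":
    # Hypothetical 3-dimensional command, e.g. a normalized end-effector delta.
    command = [-1.0, 0.0, 1.0]
    token_ids = encode_action(command)
    print(token_ids)
    print(decode_action(token_ids))
```

Under this scheme, a model that emits these token IDs in sequence is emitting a motor command, which is why a single next-token predictor can in principle cover both language and control. The round trip is lossy: each decoded value is off from the original by at most half a bin width.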
What We Don't Know
Nvidia did not disclose model size, training compute, or benchmark scores. The announcement lacks quantitative comparisons to existing baselines, a notable omission for a company that typically publishes detailed technical reports. Per the announcement, Cosmos 3 targets robotics, autonomous vehicles, and industrial automation, but no specific deployment timelines or partner integrations were named.
Competitive Landscape
The move puts Nvidia in direct competition with Google DeepMind's Gemini Robotics and Tesla's Optimus control stack, both of which are pursuing similar unification. Nvidia's advantage is its existing hardware ecosystem — Cosmos 3 likely runs on Blackwell GPUs and leverages the Omniverse simulation platform for training data generation.
What to watch
Watch for Nvidia's GTC 2026 keynote (March), where a technical paper detailing Cosmos 3's architecture, training data, and benchmark results is expected. Also track whether the model integrates with the Isaac robotics platform, and whether any early-adopter deployments emerge in autonomous trucking or warehouse automation.