NVIDIA Open-Sources Motion Diffusion Model for Humanoid Robots

NVIDIA open-sourced Kimono, a motion diffusion model for humanoid robots, trained on 700 hours of motion capture data. It generates 3D human and robot motions from text prompts, supports keyframe and end-effector control, and runs on Unitree G1.

Ala SMITH & AI Research Desk · Apr 23, 2026 · 4 min read · AI-Generated

Source: x.com via @_vmlops (single source)

TL;DR

NVIDIA released Kimono, a motion diffusion model trained on 700 hours of mocap data, generating 3D motions from text for humanoid robots.

Key Takeaways

NVIDIA open-sourced Kimono, a motion diffusion model for humanoid robots, trained on 700 hours of motion capture data.
It generates 3D human and robot motions from text prompts, supports keyframe and end-effector control, and runs on Unitree G1.

What Happened

NVIDIA released Kimono, a motion diffusion model for humanoid robots, trained on 700 hours of motion capture data. The model generates high-quality 3D human and robot motions from text prompts, with control via full-body pose keyframes, end-effector positions/rotations, and 2D paths/waypoints.

The model works on human skeletons and the Unitree G1 robot. Outputs can be plugged directly into MuJoCo or retargeted to other robots using GMR. A web-based interactive demo with a timeline editor is available, and inference runs locally with ~17GB VRAM. The model is open-sourced under Apache 2.0.

Technical Details

Training data: 700 hours of motion capture data
Input: Text prompts, keyframes, end-effector positions, 2D paths
Output: 3D human/robot motions
Hardware requirement: ~17GB VRAM for inference
License: Apache 2.0
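
Of the inputs listed above, the 2D path is the easiest to illustrate in isolation. The following is a minimal sketch, assuming a waypoint path must be turned into one ground-plane target per frame before it can be used as a constraint. The function name `resample_path` and the constant-speed assumption are hypothetical and chosen for illustration; this is not Kimono's actual API or preprocessing.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def resample_path(waypoints: List[Point], num_frames: int) -> List[Point]:
    """Return num_frames points spaced evenly by arc length along a polyline.

    Hypothetical helper: assumes the character moves at constant speed
    along straight segments between consecutive waypoints.
    """
    if num_frames < 2 or len(waypoints) < 2:
        raise ValueError("need at least two waypoints and two frames")

    # Cumulative arc length at each waypoint.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]

    out: List[Point] = []
    seg = 0
    for i in range(num_frames):
        s = total * i / (num_frames - 1)  # target arc length for frame i
        # Advance to the segment that contains arc length s.
        while seg < len(waypoints) - 2 and cum[seg + 1] < s:
            seg += 1
        seg_len = cum[seg + 1] - cum[seg]
        t = 0.0 if seg_len == 0 else (s - cum[seg]) / seg_len
        (x0, y0), (x1, y1) = waypoints[seg], waypoints[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


if __name__ == "__main__":
    # An L-shaped path sampled at five frames.
    print(resample_path([(0.0, 0.0), (2.0, 0.0), (2.0, 2.0)], 5))
```

In practice the frame count would follow from the clip duration and frame rate, neither of which the source specifies.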

How It Compares

Kimono is notable for its focus on humanoid robots, specifically the Unitree G1, and for its permissive Apache 2.0 release. Motion generation models from other labs (e.g., Google DeepMind, Meta) often target human animation or specific tasks. Kimono stands out for combining text-to-motion, keyframe control, and robot retargeting in a single openly licensed model.

What to Watch

Real-world deployment: How well does Kimono generalize to unseen robots and environments?
Limitations: Inference needs about 17GB of VRAM, which limits accessibility. Real-time performance on embedded hardware is unaddressed.
Adoption: Will the robotics community standardize on Kimono for motion generation?

Frequently Asked Questions

What is Kimono?

Kimono is a motion diffusion model from NVIDIA that generates 3D human and robot motions from text prompts, trained on 700 hours of motion capture data.

How do I control the motion output?

You can control motion via full-body pose keyframes, end-effector positions/rotations, and 2D paths/waypoints.

What robots does Kimono support?

It works on human skeletons and the Unitree G1 robot, with output retargetable to other robots using GMR.

Is Kimono open source?

Yes, it's released under Apache 2.0. A web demo is available, and inference runs locally with ~17GB VRAM.

gentic.news Analysis

NVIDIA's Kimono is a strategic move to commoditize motion generation for humanoid robots, a space where proprietary solutions have dominated. By open-sourcing under Apache 2.0, NVIDIA aims to set a default standard, similar to its strategy with Isaac Sim and Omniverse. The choice of Unitree G1 as a reference robot is telling — Unitree is a rising competitor to Boston Dynamics and Figure, and NVIDIA's support could accelerate G1 adoption.

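The 700-hour mocap dataset is substantial but not unprecedented in scale for robot learning. Google DeepMind's RT-2, for instance, was co-trained on web-scale vision-language data alongside robot trajectories, although it targets vision-language-action control rather than motion synthesis. What stands out here is the diffusion-based approach to motion generation, which allows fine-grained control via keyframes and end-effector constraints, a step beyond text-only models.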

However, the roughly 17GB VRAM requirement limits deployment to GPUs with that much memory to spare (e.g., RTX 4090, A6000, H100). Real-world robot controllers often run on embedded systems with far less memory, so NVIDIA will need to provide quantization or distillation recipes for edge deployment. The MuJoCo integration is practical but doesn't address the sim-to-real gap.
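
The memory constraint can be made concrete with a back-of-the-envelope check. This is a rough sketch, assuming the ~17GB figure quoted above is the whole inference footprint and adding an arbitrary 10% safety margin. The device capacities are nominal spec-sheet values, and the `fits` helper is hypothetical.

```python
REQUIRED_GB = 17.0  # approximate inference footprint quoted in the article

# Nominal memory capacities in GB (spec-sheet values, not measured free memory).
DEVICES_GB = {
    "RTX 4090": 24,
    "RTX A6000": 48,
    "H100 80GB": 80,
    "RTX 4080": 16,
    "Jetson Orin Nano 8GB": 8,
}


def fits(capacity_gb: float, required_gb: float = REQUIRED_GB,
         headroom: float = 0.1) -> bool:
    """True if the required memory plus a fractional safety margin fits.

    The 10% default headroom is an assumption for illustration only.
    """
    return required_gb * (1.0 + headroom) <= capacity_gb


if __name__ == "__main__":
    for name, capacity in DEVICES_GB.items():
        verdict = "fits" if fits(capacity) else "does not fit"
        print(f"{name} ({capacity} GB): {verdict}")
```

Under these assumptions the 24GB-and-up cards clear the bar, while 16GB and 8GB devices do not, which is the gap that quantization or distillation would need to close.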

This follows NVIDIA's broader push into robotics: the Isaac platform, Jetson hardware, and partnerships with Amazon Robotics and BMW. Kimono fills a specific gap — motion generation — that complements their existing perception and manipulation tools. The timing is noteworthy: humanoid robot startups raised over $1.5B in 2025, and the field is hungry for reusable software components.

Source: gentic.news · Apr 23, 2026 · Ala SMITH

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

Kimono's key technical contribution is using diffusion models for robot motion generation, which allows stochastic sampling of diverse, physically plausible motions from text. This contrasts with prior work using GANs or VAEs, which often produce less varied outputs. The diffusion approach also enables conditioning on multiple input types (text, keyframes, end-effector positions) in a unified framework.

For practitioners, the most useful aspect is the retargeting to other robots via GMR: a single model trained on human mocap data can transfer to different robot morphologies. The 700-hour dataset is a strong foundation, but note that it's likely human mocap, not robot-specific data. The gap between human motion and robot execution (dynamics, torque limits, actuator constraints) remains unaddressed.

A limitation is the lack of temporal consistency guarantees: diffusion models can produce jittery outputs across frames without explicit smoothing. The web demo's timeline editor suggests some manual refinement, but production deployment would require post-processing or recurrent architectures.

The Apache 2.0 license is permissive, allowing commercial use, which should accelerate adoption in robotics labs.
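
To illustrate the kind of post-processing the jitter concern points to, here is a minimal sketch of a centered moving average applied to a joint trajectory. It assumes the motion is stored as a list of frames, each a list of scalar joint values. Both function names are hypothetical, and nothing here reflects how Kimono itself handles temporal consistency.

```python
from typing import List

Frames = List[List[float]]


def smooth_trajectory(frames: Frames, window: int = 5) -> Frames:
    """Centered moving average over time, applied per joint channel.

    Assumption: channels are plain scalars (e.g. small joint angles).
    Rotations that wrap around, or quaternions, need a rotation-aware filter.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(frames)
    out: Frames = []
    for t in range(n):
        # The window is truncated at the start and end of the clip.
        lo, hi = max(0, t - half), min(n, t + half + 1)
        span = frames[lo:hi]
        out.append([sum(channel) / len(span) for channel in zip(*span)])
    return out


def total_variation(frames: Frames) -> float:
    """Sum of absolute frame-to-frame changes, a crude jitter measure."""
    return sum(
        abs(b - a)
        for prev, cur in zip(frames, frames[1:])
        for a, b in zip(prev, cur)
    )


if __name__ == "__main__":
    noisy = [[0.0], [1.0], [0.0], [1.0], [0.0]]
    smoothed = smooth_trajectory(noisy, window=3)
    print(total_variation(noisy), total_variation(smoothed))
```

A wider window removes more jitter but also blurs fast, intentional movements and can pull the motion away from keyframe constraints, so the window size is a trade-off rather than a fixed setting.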
