ByteDance's Seed team replaces noisy 6DoF hand poses with relative wrist translation for robot learning. The result is an action space shared by humans and bi-manual robots, and the method scales with cheap human video data.
Key facts
- ByteDance Seed replaces 6DoF poses with relative wrist translation
- Shared action space for humans and bi-manual robots
- Method scales with cheap human video data
- Outperforms full-pose baselines, according to the team
ByteDance's Seed team turns cheap human videos into robot skills. The team replaces noisy 6DoF hand poses with relative wrist translation, an action space shared by humans and bi-manual robots. According to @HuggingPapers, the method scales with cheap data and outperforms full-pose baselines.
Why Relative Wrist Translation Works
Full 6DoF hand poses are notoriously noisy when extracted from monocular video, introducing jitter and ambiguity that degrade policy learning. ByteDance Seed's key insight: by mapping both human and robot end-effector motion to a common relative wrist translation space, the model learns to ignore irrelevant finger articulation and focus on task-relevant trajectory. This shared representation enables direct transfer from low-cost human demonstration videos to bi-manual robot control without expensive motion capture or teleoperation setups.
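To make the shared action space concrete, here is a minimal Python sketch. The source does not give an exact definition, so this assumes that "relative wrist translation" means the frame-to-frame displacement of each wrist position, with orientation and finger articulation dropped. The function name, the `(T, 2, 3)` array layout, and the toy numbers are hypothetical and are not taken from ByteDance Seed's method.

```python
import numpy as np


def relative_wrist_translation(wrist_xyz: np.ndarray) -> np.ndarray:
    """Convert absolute wrist positions into per-step translation deltas.

    wrist_xyz: array of shape (T, 2, 3) holding T timesteps,
               two wrists (left, right), and xyz coordinates.
    Returns an array of shape (T - 1, 2, 3).

    Assumption: "relative" is read here as frame-to-frame displacement.
    """
    wrist_xyz = np.asarray(wrist_xyz, dtype=float)
    if wrist_xyz.ndim != 3 or wrist_xyz.shape[1:] != (2, 3):
        raise ValueError("expected an array of shape (T, 2, 3)")
    # Differencing along time removes any constant origin offset.
    return np.diff(wrist_xyz, axis=0)


# Toy human track: the left wrist moves 0.1 along x each step,
# the right wrist stays still.
human_track = np.array([
    [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]],
    [[0.1, 0.0, 0.0], [0.5, 0.0, 0.0]],
    [[0.2, 0.0, 0.0], [0.5, 0.0, 0.0]],
])

# Hypothetical robot track: the same motion expressed in a frame
# whose origin is shifted by a constant offset.
frame_offset = np.array([1.0, -0.5, 0.2])
robot_track = human_track + frame_offset

human_actions = relative_wrist_translation(human_track)
robot_actions = relative_wrist_translation(robot_track)
```

In this sketch the human and robot tracks yield the same actions because a constant offset between coordinate origins cancels when positions are differenced. A rotation between the two frames would not cancel, and aligning frames is outside the scope of this example.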
Scaling with Cheap Data
The approach relies on readily available human video data, with no specialized recording rigs and no robot-specific demos. The team reports that scaling the training dataset with more human videos directly improves policy performance, a property that full-pose baselines lack due to noise accumulation. Specific benchmark numbers are not disclosed in the tweet. The claim of outperforming full-pose baselines would imply a gap in success rate or task completion, but its size cannot be judged until results are published.
Implications for Robot Learning

This work aligns with a broader trend in robotics: replacing expensive, high-dimensional supervision with cheaper, lower-dimensional representations that preserve task-relevant information. ByteDance Seed's result echoes findings from recent imitation learning research showing that action space reduction often beats complex pose estimation pipelines. For bi-manual tasks where coordination matters, a shared wrist translation space may simplify the credit assignment problem across two arms.
What to watch
Watch for a full paper release from ByteDance Seed with benchmark numbers on standard robot manipulation tasks (e.g., RoboTurk, Meta-World), and for evidence on whether the approach generalizes to novel objects or scenes not seen in training.









