A feed-forward model decomposes 3D scenes into instance-structured token groups from unposed images, without 3D annotations. Native object identity unlocks reconstruction, segmentation, and manipulation in one forward pass.
Key facts
- Feed-forward model decomposes 3D scenes into instance-structured token groups.
- No 3D annotations required for training.
- Enables reconstruction, segmentation, and manipulation in one forward pass.
- Input is unposed 2D images, i.e., images with no camera pose information.
- Source is a social media post; no quantitative results disclosed.
A new feed-forward model, described in a paper shared by @HuggingPapers, treats 3D scenes as collections of objects rather than geometric primitives. The model takes unposed 2D images (images without camera pose information) and directly outputs instance-structured token groups, each corresponding to a distinct object in the scene.
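The post does not say how the token groups are represented. As a purely illustrative sketch, assume the model emits a flat list of token vectors plus one instance id per token. Both of those are assumptions, as are the names `TokenGroup` and `group_tokens`. Under them, grouping by instance could look like this in Python:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TokenGroup:
    """Hypothetical container: all tokens assigned to one object instance."""
    instance_id: int
    tokens: List[List[float]]  # one feature vector per token


def group_tokens(tokens: List[List[float]], instance_ids: List[int]) -> List[TokenGroup]:
    """Partition a flat token list into per-instance groups, ordered by instance id."""
    if len(tokens) != len(instance_ids):
        raise ValueError("need exactly one instance id per token")
    buckets: Dict[int, List[List[float]]] = {}
    for token, instance_id in zip(tokens, instance_ids):
        buckets.setdefault(instance_id, []).append(token)
    return [TokenGroup(iid, toks) for iid, toks in sorted(buckets.items())]
```

The point of the sketch is only that object identity is carried by the grouping itself, so no separate segmentation pass is needed to find which tokens belong together.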
Unlike prior methods that rely on 3D bounding box annotations or multi-view supervision, this approach requires no 3D annotations during training. The model learns to identify object instances purely from 2D images and produces its output in a single forward pass.
The output token groups enable three downstream tasks: 3D reconstruction of each object, segmentation of the scene, and object-level manipulation (e.g., removing or repositioning objects). The paper claims that native object identity, meaning the model's ability to recognize objects as discrete entities, is the key innovation, allowing the system to generalize across scenes without explicit 3D supervision.
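To make object-level manipulation concrete, here is a minimal Python sketch. It assumes each token group has already been decoded into a list of 3D points keyed by instance id. That decoding step, the `Scene` type, and both helper functions are hypothetical and not taken from the source.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
Scene = Dict[int, List[Point]]  # hypothetical: instance id -> decoded 3D points


def remove_object(scene: Scene, instance_id: int) -> Scene:
    """Return a new scene without the given instance."""
    return {iid: pts for iid, pts in scene.items() if iid != instance_id}


def translate_object(scene: Scene, instance_id: int, offset: Point) -> Scene:
    """Return a new scene with one instance shifted by `offset`."""
    dx, dy, dz = offset
    moved = dict(scene)  # shallow copy; untouched instances are shared
    moved[instance_id] = [(x + dx, y + dy, z + dz) for x, y, z in scene[instance_id]]
    return moved
```

Because each object is its own entry, an edit touches only that entry. In a pipeline that reconstructs the whole scene first, the same edit would require segmenting the object out of the full reconstruction beforehand.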
Unique Take: This work inverts the conventional 3D vision pipeline. Many systems (e.g., NeRF variants or DUSt3R) first reconstruct the full scene as a continuous volume or point cloud, then segment objects post-hoc. This model skips the full-scene reconstruction step, instead outputting object-level representations directly. The likely trade-off is some loss of fine-grained geometric detail in exchange for computational efficiency and object-level abstraction, though the source offers no evidence either way. It does not disclose inference speed, model size, or benchmark scores against methods like SAM-3D or OpenMask3D, making it hard to gauge practical performance.
Limitations: The source is a brief social media post, not a full paper. No quantitative results, ablation studies, or comparisons to prior art are provided. The model's ability to handle occluded objects, varying numbers of instances, or real-world cluttered scenes remains unknown.
What to watch
Watch for the full paper release (likely on arXiv) with quantitative benchmarks on ScanNet or Replica, and comparisons to DUSt3R and SAM-3D. If the model achieves competitive segmentation accuracy (for example, above 85% mIoU) while running far faster (for example, 10x) than NeRF-based methods, it could reshape 3D scene understanding workflows. Both thresholds are illustrative; the source reports no numbers.
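For readers unfamiliar with the metric, mIoU (mean intersection over union) averages, across classes, the overlap between predicted and ground-truth labels divided by their union. A minimal pure-Python sketch over flat label lists follows. The function name and the choice to skip classes absent from both prediction and ground truth are assumptions, and benchmark suites may handle that case differently.

```python
from typing import List


def mean_iou(pred: List[int], target: List[int], num_classes: int) -> float:
    """Mean of per-class IoU over flat label lists of equal length."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither list
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0
```

With `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 scores 1/2 and class 1 scores 2/3, so the mean is 7/12, or about 58%.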









