🧩 #ICML2026 💥 How can a model discover the 3D objects in a scene—their shape, color, and position—without any labels? Introducing 3D-DLP, a self-supervised object-centric model that decomposes colored 3D scenes (RGB-D and voxels) into a set of 3D latent particles.

Self-supervised object-centric models break scenes into entities with no labels, but progress has stayed in 2D—which can't recover occlusions or precise geometry. We extend Deep Latent Particles (DLP), a fully unsupervised VAE over latent particles, directly into 3D.

Finally, we ask whether 3D particles help downstream control. We feed 3D-DLP tokens into an entity-centric diffusion policy (EC-Diffuser) and evaluate on 12 MimicGen and 10 language-conditioned RLBench tasks.

3D-DLP is a first practical bridge from self-supervised 3D scene decomposition to downstream control. Open challenges remain: scaling to dynamic, cluttered, in-the-wild scenes, and extending to dynamics and world modeling in 3D particle space.

We introduce three variants for three sensing modalities: 3D-DLP-D for RGB-D, 3D-DLP-V for occupancy voxels, and 3D-DLP-VC for colored RGB voxels—the most general and most challenging of the three.

Each particle carries explicit, disentangled 3D attributes: a keypoint position, a bounding-box scale, a presence value, and appearance features. Unlike 2D DLP, occlusion is handled directly by the 3D rendering instead of an explicit latent variable.
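
To make the attribute list concrete, here is a minimal sketch of one particle as a plain Python record. The class name, field names, and the `translate`/`resize` helpers are hypothetical and are not taken from the 3D-DLP codebase; the real model stores these attributes as learned latent distributions, not fixed tuples.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Particle3D:
    """Hypothetical container for one latent particle (illustrative names)."""
    position: Tuple[float, float, float]  # 3D keypoint location
    scale: Tuple[float, float, float]     # bounding-box extent per axis
    presence: float                       # in [0, 1]; near 0 means "off"
    features: Tuple[float, ...]           # appearance latent


def translate(p: Particle3D, delta: Tuple[float, float, float]) -> Particle3D:
    """Return a copy of the particle with its keypoint shifted by delta."""
    return replace(p, position=tuple(a + b for a, b in zip(p.position, delta)))


def resize(p: Particle3D, factor: float) -> Particle3D:
    """Return a copy of the particle with its box scaled uniformly."""
    return replace(p, scale=tuple(s * factor for s in p.scale))


particle = Particle3D(position=(0.0, 0.5, -0.5), scale=(0.25, 0.25, 0.5),
                      presence=0.9, features=(0.5, -1.0))
moved = translate(particle, (0.25, 0.0, 0.0))
bigger = resize(particle, 2.0)
```

Because the attributes are disentangled, each edit touches exactly one field and leaves the others (including the appearance features) unchanged.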

Plain MSE reconstruction has a failure mode: it can match brightness with gray and wash out color, which we call “gray collapse”. A chroma loss penalizes color error on occupied voxels, recovering faithful hue and saturation.
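
The thread does not give the exact form of the chroma loss, so the following is only one plausible sketch. It assumes chroma is the RGB residual left after subtracting per-voxel brightness (the channel mean) and that the loss is a mean squared error restricted to occupied voxels. The function names are hypothetical.

```python
def chroma(rgb):
    """Color residual after removing per-voxel brightness (the channel mean)."""
    mean = sum(rgb) / 3.0
    return tuple(c - mean for c in rgb)


def brightness_error(pred, target):
    """Mean squared error on brightness alone, for comparison."""
    errs = [(sum(p) / 3.0 - sum(t) / 3.0) ** 2 for p, t in zip(pred, target)]
    return sum(errs) / len(errs)


def chroma_loss(pred, target, occupied):
    """Mean squared chroma error over occupied voxels only (assumed form).

    pred, target: lists of (r, g, b) in [0, 1]; occupied: list of bools.
    """
    total, count = 0.0, 0
    for p, t, occ in zip(pred, target, occupied):
        if not occ:
            continue  # empty voxels carry no color signal
        total += sum((a - b) ** 2 for a, b in zip(chroma(p), chroma(t)))
        count += 1
    return total / count if count else 0.0


target = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]            # saturated red and blue
gray = [(1 / 3, 1 / 3, 1 / 3), (1 / 3, 1 / 3, 1 / 3)]  # same brightness, no hue
occupied = [True, True]

gray_brightness_err = brightness_error(gray, target)
gray_chroma_err = chroma_loss(gray, target, occupied)
```

In this toy case the gray prediction has essentially zero brightness error but a large chroma error, which is the gap a chroma term is meant to close.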

The learned latents are interpretable and controllable. Move a particle's 3D keypoint and the object translates; change its scale and it resizes, confirming that particles encode genuinely editable 3D object properties.

Porting 2D DLP to voxels doesn't work out of the box. We identify two components that make it possible: an appearance-aware K-means keypoint prior, and a chroma reconstruction loss. We validate both through ablations.

The spatial-softmax prior from 2D DLP collapses on sparse voxels, so instead we cluster occupied voxels in a joint color (CIELAB) and 3D-position space, weighted by lightness. This places keypoints right on object surfaces and color boundaries.
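
A minimal sketch of the clustering idea, under stated assumptions: each occupied voxel becomes a 6D feature (position plus scaled CIELAB color), and "weighted by lightness" is read as using L* as a per-voxel sample weight in the centroid update. The `color_weight` parameter, the fixed initialization, and the function names are hypothetical; the Lab values are approximate ones for saturated red and blue.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def joint_features(xyz, lab, color_weight=0.01):
    """Concatenate position with scaled Lab color (color_weight is an assumption)."""
    return [tuple(p) + tuple(color_weight * c for c in col)
            for p, col in zip(xyz, lab)]


def weighted_kmeans(features, weights, init_centers, n_iters=10):
    """Lloyd's algorithm with per-sample weights in the centroid update."""
    centers = [list(c) for c in init_centers]
    assign = [0] * len(features)
    for _ in range(n_iters):
        # Assignment step: nearest center in the joint space.
        for i, f in enumerate(features):
            assign[i] = min(range(len(centers)),
                            key=lambda k: sq_dist(f, centers[k]))
        # Update step: weighted mean of each cluster's members.
        for k in range(len(centers)):
            members = [i for i, a in enumerate(assign) if a == k]
            w_sum = sum(weights[i] for i in members)
            if w_sum > 0:
                centers[k] = [
                    sum(weights[i] * features[i][d] for i in members) / w_sum
                    for d in range(len(centers[k]))
                ]
    return centers, assign


# Two small objects: reddish voxels near x=0, bluish voxels near x=1.
xyz = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
       (1.0, 0.0, 0.0), (0.9, 0.0, 0.0), (1.0, 0.1, 0.0)]
lab = [(53.0, 80.0, 67.0)] * 3 + [(32.0, 79.0, -108.0)] * 3

feats = joint_features(xyz, lab)
weights = [col[0] for col in lab]  # lightness L* as sample weight (assumption)
centers, assign = weighted_kmeans(feats, weights, [feats[0], feats[3]])
keypoints = [tuple(c[:3]) for c in centers]  # spatial part of each center
```

The spatial part of each cluster center then serves as a candidate keypoint. A real implementation would need a proper initialization scheme and an RGB to CIELAB conversion, both omitted here.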

The decoder maps each particle to a canonical cubic RGBA patch, places it into the global grid with a 3D spatial transformer, and volumetrically composites it with the background. Everything is trained end-to-end as a VAE over the ELBO.
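
As a rough illustration of the placement and compositing step, the sketch below pastes an RGBA patch into a voxel grid at an integer offset and alpha-blends it over the background. This is a simplification under stated assumptions: a real 3D spatial transformer also scales the patch and resamples it differentiably, and the model's actual compositing rule may differ. The function name and the dict-based grid are hypothetical.

```python
def composite(background, patches):
    """Alpha-blend RGBA patches over an RGB background grid.

    background: dict mapping (x, y, z) -> (r, g, b)
    patches: list of (offset, patch), where patch maps (i, j, k) -> (r, g, b, a)
    """
    out = dict(background)
    for (ox, oy, oz), patch in patches:
        for (i, j, k), (r, g, b, a) in patch.items():
            key = (i + ox, j + oy, k + oz)
            if key not in out:
                continue  # patch voxel falls outside the global grid
            br, bg, bb = out[key]
            out[key] = (a * r + (1 - a) * br,
                        a * g + (1 - a) * bg,
                        a * b + (1 - a) * bb)
    return out


# 2x2x2 black background and a single half-transparent red patch voxel.
grid = {(x, y, z): (0.0, 0.0, 0.0)
        for x in range(2) for y in range(2) for z in range(2)}
red_patch = {(0, 0, 0): (1.0, 0.0, 0.0, 0.5)}
scene = composite(grid, [((1, 1, 1), red_patch)])
```

Voxels not covered by any patch keep the background value, and a patch with alpha 0 leaves the scene unchanged.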

3D particles do help downstream control, and the 3D lift matters: 48.1% mean success on MimicGen vs 30.8/34.1% for 2D-DLP and 47.3% for a dense raw voxel policy. On RLBench, 3D-DLP wins 9 of 10 matched-compute tasks.

Put together, 3D-DLP discovers semantic keypoints, boxes, and per-object masks with no supervision, and reconstructs scenes far more faithfully than non-object-centric AE/VAE baselines (24.4 vs 11.4 masked PSNR on MimicGen).
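
For readers unfamiliar with the metric, here is a minimal sketch of a masked PSNR. It assumes the mask selects occupied voxels and that colors lie in [0, 1]; the thread does not spell out the exact masking or peak value used, and the function name is hypothetical.

```python
import math


def masked_psnr(pred, target, mask, max_val=1.0):
    """PSNR in dB computed only over voxels where mask is True (assumed form)."""
    sq_err, n = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:
            sq_err += sum((a - b) ** 2 for a, b in zip(p, t))
            n += len(p)
    if n == 0:
        raise ValueError("mask selects no voxels")
    mse = sq_err / n
    return math.inf if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)


pred = [(0.6, 0.6, 0.6), (0.0, 0.0, 0.0)]
target = [(0.5, 0.5, 0.5), (1.0, 1.0, 1.0)]
mask = [True, False]  # the badly wrong second voxel is excluded
psnr = masked_psnr(pred, target, mask)

# A dB gap converts to an MSE ratio via 10 ** (gap / 10).
mse_ratio = 10 ** ((24.4 - 11.4) / 10)
```

If both reported numbers use the same peak value and mask, the 13 dB gap corresponds to roughly 20 times lower masked mean squared error.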

Huge thanks to my collaborators: @madhiyen, Amir Zadeh and Chuan Li (@LambdaAPI), @davheld, @pathak2206, and @TalDaniel8 🙏 🌐 Website: https://eubooks3003.github.io/3d-dlp/ 📄 Paper: https://arxiv.org/abs/2606.19451 💻 Code: https://github.com/Eubooks3003/3d-dlp