DUNE and AM-RADIO Encoders Outperform DINOv3 in RGB Robot Navigation

Posts from X

Christian Wolf (🦋🦋🦋)@chriswolfvision

This is work at @naverlabseurope by Steeven Janny (@JannySteeven), Leonid Antsfeld, and Christian Wolf.

http://arxiv.org/abs/2606.21216

Accepted to ECCV 2026. It builds on our real-world navigation agent introduced at CVPR 2024 and CVPR 2025.

2/8

Christian Wolf (🦋🦋🦋)@chriswolfvision

The number of read-out token values (feature values per patch) can be reduced to a single scalar per patch. We would like to stress again that in this case the "attention map" (see Figure) is the ONLY information about the scene that the policy receives ("hard" interpretability?).

6/8
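
To make the "one scalar per patch" idea concrete, here is a minimal sketch under one plausible reading of the post: a single query scores every patch, a softmax turns the scores into weights, and the weights are laid out on the patch grid as the map handed to the policy. The function name `attention_grid`, the toy features, and the use of a plain dot-product softmax are assumptions for illustration, not the paper's implementation.

```python
import math


def attention_grid(query, patches, height, width):
    """Return a height x width grid holding one softmax weight per patch."""
    if len(patches) != height * width:
        raise ValueError("patch count must equal height * width")
    scale = 1.0 / math.sqrt(len(query))
    # One dot-product score per patch (assumed scoring rule).
    scores = [scale * sum(q * p for q, p in zip(query, patch)) for patch in patches]
    # Numerically stable softmax over all patches.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    flat = [e / total for e in exps]
    # Reshape the flat weights back onto the patch grid, row by row.
    return [flat[r * width:(r + 1) * width] for r in range(height)]


# Toy 2x2 patch grid with 2-dim features (hypothetical values).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
grid = attention_grid([2.0, 2.0], patches, 2, 2)
```

In this toy case the patch `[1.0, 1.0]` aligns best with the query, so the largest weight sits at `grid[1][0]`; the policy would see only these four numbers.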

Christian Wolf (🦋🦋🦋)@chriswolfvision

Restricting the features provided to the policy also decreases the sim2real gap significantly.

7/8

Christian Wolf (🦋🦋🦋)@chriswolfvision

The policies take RGB input, no Lidar. Encoders distilled from multiple heterogeneous teachers (DUNE, AM-RADIO) perform better.

DINOv3 surprisingly fails, and we have confirmed this in numerous experiments exploring a large set of hyperparameters for its integration.

3/8

Christian Wolf (🦋🦋🦋)@chriswolfvision

One of the key features is the projection mechanism (querying the representation) from patch features to the policy input. In our experiments, the best and most robust solution is learnable read-out tokens, similar to what is called the "Perceiver Resampler" in VLMs.

5/8
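
For readers unfamiliar with read-out tokens, the following is a rough sketch of the general mechanism: a small set of query vectors cross-attends to all patch features and returns a fixed number of pooled vectors for the policy. It is a simplification under stated assumptions: the queries are random here (they would be learned parameters in a real model), there are no key/value projections or multiple heads, and the name `readout` and all sizes are hypothetical rather than taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def readout(queries, patches):
    """Cross-attend each read-out query to all patch features.

    queries: K vectors (learnable in a real model, random in this sketch)
    patches: N patch feature vectors, e.g. from a ViT encoder
    Returns (outputs, weights): K pooled vectors and K x N attention weights.
    """
    dim = len(patches[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(qi * pi for qi, pi in zip(q, p)) for p in patches]
        w = softmax(scores)
        # Weighted sum of patch features, one output dimension at a time.
        pooled = [sum(w[n] * patches[n][d] for n in range(len(patches)))
                  for d in range(dim)]
        outputs.append(pooled)
        weights.append(w)
    return outputs, weights


random.seed(0)
N, K, D = 16, 4, 8  # hypothetical: 4x4 patch grid, 4 read-out tokens, 8-dim features
patches = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
queries = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
outputs, weights = readout(queries, patches)
```

The policy input size depends only on `K` and `D`, not on the number of patches, which is the main appeal of this kind of projection.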

Christian Wolf (🦋🦋🦋)@chriswolfvision

When training these agents on RGB input, it is beneficial to finetune agents that were first trained with privileged information (Lidar input during RL pre-training). The pre-trained and then finetuned policy seems to learn to extract better visual features from the ViTs.

4/8

Christian Wolf (🦋🦋🦋)@chriswolfvision

Projected onto a map, the attention maps correlate with scene structure and with affordances related to it. These maps have been computed on real nav data in a real environment (like ALL results in this paper).

More results: http://arxiv.org/abs/2606.21216 (ECCV 2026)

8/8