
This is work at @naverlabseurope by Steeven Janny (@JannySteeven), Leonid Antsfeld, and Christian Wolf.
http://arxiv.org/abs/2606.21216
Accepted to ECCV 2026. It builds on our real nav agent introduced at CVPR 2024 and CVPR 2025.
2/8

The number of read-out tokens (feature values per patch) can be decreased to 1 scalar value per patch. We would like to stress again that in this case the "attention map" (see Figure) is the ONLY information on the scene the policy receives ("hard" interpretability?).
6/8

Restricting the features provided to the policy also decreases the sim2real gap significantly.
7/8

The policies take RGB input, no Lidar. Encoders distilled from multiple heterogeneous teachers (DUNE, AMRADIO) perform better.
Dino-v3 surprisingly fails, and we have confirmed this in numerous experiments exploring a large set of hyper-parameters for its integration.
3/8

One of the key features is the projection mechanism (querying the representation) from patch features to the policy input. In our experiments the best and most robust solution is learnable read-out tokens, similar to the "Perceiver Resampler", as this is called for VLMs.
5/8
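
For readers unfamiliar with read-out tokens, here is a minimal sketch of the general idea: a small set of query vectors cross-attends over all ViT patch features, and each query yields one pooled vector plus one attention map over the patches. This is not the paper's implementation. The function name `read_out`, the sizes `D`, `N`, `K`, and the random values are all assumptions for illustration; in a real model the queries would be trained parameters, and key/value projections and multiple heads (omitted here) would normally be used.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def read_out(queries, patches):
    """Cross-attend each read-out query over all patch features.

    queries: K vectors of dimension D (learnable in a real model; hypothetical here)
    patches: N patch feature vectors of dimension D
    Returns (outputs, attn): K pooled vectors of dimension D,
    and K attention maps with one weight per patch.
    """
    dim = len(patches[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, attn = [], []
    for q in queries:
        # Scaled dot-product score between this query and every patch.
        scores = [scale * sum(qi * pi for qi, pi in zip(q, p)) for p in patches]
        weights = softmax(scores)
        # Attention-weighted average of the patch features.
        pooled = [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)]
        outputs.append(pooled)
        attn.append(weights)
    return outputs, attn


# Hypothetical sizes: 16 patches with 8 features each, 2 read-out tokens.
rng = random.Random(0)
D, N, K = 8, 16, 2
patches = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)]
queries = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]

outputs, attn = read_out(queries, patches)
print(len(outputs), len(outputs[0]), len(attn[0]))
```

Each entry of `attn` sums to 1 over the patches, so it can be reshaped to the patch grid and viewed as an attention map. How the policy consumes these quantities in the actual agent is described in the paper, not in this sketch.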

When training these agents with RGB input, it is beneficial to start from agents that were first trained with privileged information (Lidar input during RL pre-training) and then finetune them. The pre-trained and then finetuned policy seems to learn to extract better visual features from the ViTs.
4/8

Projected onto a map, the attention maps correlate with scene structure and with affordances related to it. These maps have been computed on real nav data in a real environment (like ALL results in this paper).
More results: http://arxiv.org/abs/2606.21216 (ECCV 2026)
8/8