Distillation Transfers Transformer Compression Into Recurrent Student Memory

Posts from X

Christian Wolf (🦋🦋🦋)@chriswolfvision

A qualitative example for one scene and trajectory.

7/8

Christian Wolf (🦋🦋🦋)@chriswolfvision

This is work at @naverlabseurope by

- Philippe Weinzaepfel (@WeinzaepfelP)
- Mert Bulent Sariyildiz (@mbsariyildiz)
- Yours truly
- Guillaume Bono (@_WGW101)
- Gianluca Monaci

http://arxiv.org/abs/2606.21562

7/8

Christian Wolf (🦋🦋🦋)@chriswolfvision

We apply this to the robotics task "Mem-RPE" / "Mem cond. relative pose est.". Like map-free localization, the poses of query images need to be determined, but for a MOVING coordinate frame centered on an agent. We beat SOTA rec models and are comparable to transformers.
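One way to read "a MOVING coordinate frame centered on an agent" is that each query pose is expressed relative to the agent's current pose, so the same query gets a different answer as the agent moves. The sketch below illustrates this in 2D. The planar (x, y, heading) simplification, the function name `relative_pose`, and the assumption that world-frame poses are available are all illustrative choices, not details from the thread or the paper.

```python
import math

def relative_pose(agent, query):
    """Express `query` = (x, y, theta) in the frame of `agent` = (x, y, theta).

    Hypothetical helper: poses are planar and given in a shared world frame.
    """
    ax, ay, ath = agent
    qx, qy, qth = query
    dx, dy = qx - ax, qy - ay
    c, s = math.cos(ath), math.sin(ath)
    # Rotate the world-frame offset by -theta_agent (transpose of the rotation).
    rx = c * dx + s * dy
    ry = -s * dx + c * dy
    # Wrap the heading difference into [-pi, pi].
    rth = math.atan2(math.sin(qth - ath), math.cos(qth - ath))
    return rx, ry, rth

# The same query, seen from two agent positions along a trajectory.
query = (1.0, 3.0, math.pi / 2)
print(relative_pose((1.0, 1.0, math.pi / 2), query))
print(relative_pose((1.0, 2.0, math.pi / 2), query))
```

With the agent facing along +y, the query lies straight ahead in both cases, and its forward distance shrinks as the agent advances. That is the sense in which the target of the estimation moves with the agent.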

5/8

Christian Wolf (🦋🦋🦋)@chriswolfvision

Let's learn how to COMPRESS data: recurrent models learn how to retain/throw away information at each time step. A wrong decision is forever. We train a specific bottleneck transformer teacher with access to the obs history, which compresses it into a fixed size repr.

2/8

Christian Wolf (🦋🦋🦋)@chriswolfvision

Both teacher and student learn how to compress:

- The teacher has access to the full obs history (privileged information!) and compresses it into a fixed size repr.

- The student is recurrent and needs to perform this compression on the fly, w/o access to past obs.
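The contrast between the two compressors can be sketched in a few lines of NumPy. This is only a toy illustration under stated assumptions: the thread does not describe the teacher's architecture beyond "bottleneck transformer", so a single cross-attention read with `K` latent queries stands in for it, and a plain tanh recurrence stands in for the student. All names, sizes, and random weights here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 8, 4  # feature dim and number of memory slots (both hypothetical)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Stand-ins for learned parameters (random here, trained in practice).
latents = rng.normal(size=(K, D))       # teacher bottleneck queries
init_memory = rng.normal(size=(K, D))   # student initial memory
W_h = rng.normal(size=(D, D)) * 0.1
W_x = rng.normal(size=(D, D)) * 0.1

def attention_weights(history):
    """history: (T, D) -> (K, T) weights of each latent over all time steps."""
    return softmax(latents @ history.T / np.sqrt(D), axis=-1)

def teacher_compress(history):
    """Privileged view: reads the whole history at once, outputs (K, D)."""
    return attention_weights(history) @ history

def student_step(memory, obs):
    """Online view: updates (K, D) memory from the current observation only."""
    return np.tanh(memory @ W_h + obs @ W_x)

def student_compress(history):
    memory = init_memory
    for obs in history:  # past observations are never revisited
        memory = student_step(memory, obs)
    return memory

for T in (5, 50):
    history = rng.normal(size=(T, D))
    print(T, teacher_compress(history).shape, student_compress(history).shape)
```

Both functions return a `(K, D)` array whatever the history length. The difference is that the teacher can weigh every past observation when deciding what to keep, while the student must commit at each step.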

3/8

Christian Wolf (🦋🦋🦋)@chriswolfvision

We visualize pose estimation accuracy for different sequence lengths (rows) and recentness/age of the query image (evaluated queries are not actual observations but close viewpoints). Distillation provides a big boost compared to recurrent models without distillation.

6/8

Christian Wolf (🦋🦋🦋)@chriswolfvision

We distill the fixed-size teacher representation into the recurrent memory. This distills one COMPRESSION mechanism into another one.

Distillation losses against teacher memories at different time steps are backpropagated over length-limited subsequences ("segments").
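Backpropagating over length-limited segments resembles truncated backpropagation through time: the recurrent state is carried across segment boundaries, but gradients stop there. The NumPy sketch below shows that pattern with hand-written gradients. It rests on several assumptions that are not from the thread: a flat vector memory, a tanh recurrence, a squared-error distillation loss, plain SGD, and teacher memories that are precomputed and held fixed. All function names are hypothetical.

```python
import numpy as np

def segment_loss_and_grads(W, U, h0, xs, targets):
    """Mean squared distillation loss on one segment; gradients stop at h0."""
    hs = [h0]
    for x in xs:  # forward pass of the student through the segment
        hs.append(np.tanh(W @ hs[-1] + U @ x))
    diffs = [h - m for h, m in zip(hs[1:], targets)]
    n = len(xs)
    loss = 0.5 * sum(float(d @ d) for d in diffs) / n
    gW, gU = np.zeros_like(W), np.zeros_like(U)
    dh_next = np.zeros_like(h0)
    for t in reversed(range(n)):  # backward pass, confined to the segment
        dh = diffs[t] / n + dh_next        # loss at step t plus later steps
        dz = dh * (1.0 - hs[t + 1] ** 2)   # through tanh
        gW += np.outer(dz, hs[t])
        gU += np.outer(dz, xs[t])
        dh_next = W.T @ dz
    return loss, gW, gU, hs[-1]

def distill(W, U, xs, targets, seg_len, lr=0.05):
    """One pass over a sequence; updates W and U in place after each segment."""
    h = np.zeros(W.shape[0])
    total = 0.0
    for start in range(0, len(xs), seg_len):
        sl = slice(start, start + seg_len)
        loss, gW, gU, h = segment_loss_and_grads(W, U, h, xs[sl], targets[sl])
        W -= lr * gW
        U -= lr * gU
        total += loss * len(xs[sl])
        # h is carried into the next segment as a constant (no gradient).
    return total / len(xs)

rng = np.random.default_rng(0)
D, X, T = 3, 2, 10
W = rng.normal(size=(D, D)) * 0.5
U = rng.normal(size=(D, X)) * 0.5
xs = rng.normal(size=(T, X))
teacher_memories = np.tanh(rng.normal(size=(T, D)))  # placeholder targets
print(distill(W, U, xs, teacher_memories, seg_len=4))
```

In an autodiff framework the same effect is usually obtained by detaching the carried state at each segment boundary instead of writing the backward pass by hand.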

4/8
