LILA Trains Visual Representations Using In-Context Learning On Videos

Original post

In-context learning suggests that a model has learned versatile representations. What if we use in-context learning itself as a training task for visual representations?

📣 Introducing 𝗟𝗜𝗟𝗔: 𝗟𝗶𝗻𝗲𝗮𝗿 𝗜𝗻-𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 ✨ @CVPR 2026 Oral ✨

𝗟𝗜𝗟𝗔 trains on videos without manual annotation. Key idea: An optimal linear mapping that predicts dense cues (e.g. depth, flow), estimated on one video frame, should also predict the corresponding cues of another frame from the same video.

This yields compelling results on dense vision tasks: video object segmentation, (zero-shot) semantic segmentation and surface normal estimation.

Paper, code, models and demo: https://lila-pixels.github.io

Joint work with @ma_sundermeyer, Hidenobu Matsuki, David Joseph Tan and @fedassa (and special thanks to David and Federico for hosting my research visit at Google).

#cvpr2026 @Google @MunichCenterML @tumcvg @TU_Muenchen

2:29 AM · May 28, 2026
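
A rough sketch of the key idea in the post, not the authors' implementation: fit a linear map from per-pixel features to dense cues on one frame, then measure how well that same map predicts the cues of a second frame. Everything here is an assumption made for illustration: the function names `fit_linear_map` and `cross_frame_loss`, the ridge-regularized closed-form solve, the mean squared error, and the random arrays standing in for encoder features and depth/flow cues.

```python
import numpy as np


def fit_linear_map(features, cues, ridge=1e-3):
    """Closed-form ridge regression from features (N, D) to cues (N, C).

    The ridge term is an assumption added for numerical stability.
    """
    feat_dim = features.shape[1]
    gram = features.T @ features + ridge * np.eye(feat_dim)
    return np.linalg.solve(gram, features.T @ cues)  # shape (D, C)


def cross_frame_loss(feat_a, cues_a, feat_b, cues_b, ridge=1e-3):
    """Estimate the map on frame A, then score its predictions on frame B."""
    linear_map = fit_linear_map(feat_a, cues_a, ridge)
    pred_b = feat_b @ linear_map
    return float(np.mean((pred_b - cues_b) ** 2))


# Synthetic stand-ins: in practice the features would come from the encoder
# being trained and the cues from off-the-shelf depth or flow estimates.
rng = np.random.default_rng(0)
n_pixels, feat_dim, cue_dim = 500, 16, 3
shared_map = rng.normal(size=(feat_dim, cue_dim))
other_map = rng.normal(size=(feat_dim, cue_dim))

feat_a = rng.normal(size=(n_pixels, feat_dim))
feat_b = rng.normal(size=(n_pixels, feat_dim))
cues_a = feat_a @ shared_map

# Both frames share one feature-to-cue relation: the transferred map fits well.
loss_shared = cross_frame_loss(feat_a, cues_a, feat_b, feat_b @ shared_map)

# Frame B follows a different relation: the transferred map fits poorly.
loss_mismatched = cross_frame_loss(feat_a, cues_a, feat_b, feat_b @ other_map)

print(f"shared: {loss_shared:.2e}, mismatched: {loss_mismatched:.2e}")
```

In a training setup along these lines, the cross-frame error would presumably be backpropagated into the encoder, so that features from different frames of the same video become linearly predictive of the same cues. The post does not spell out those details, so this part is a guess.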
