In-context learning suggests that a model has learned versatile representations. What if we use in-context learning itself as a training task for visual representations?
📣 Introducing LILA: Linear In-Context Learning ✨ @CVPR 2026 Oral ✨
LILA trains on videos without manual annotation.
Key idea: An optimal linear mapping that predicts dense cues (e.g. depth, flow), estimated on one video frame, should also predict the corresponding cues of another frame from the same video.
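To make the key idea concrete, here is a minimal NumPy sketch of the forward computation. It is not the paper's implementation: the ridge-regularized least-squares fit, the mean-squared-error loss, the function names and the toy shapes are all assumptions chosen for illustration. In actual training, the features would presumably come from the encoder being trained, with gradients flowing through the linear solve.

```python
import numpy as np

def fit_linear_map(features, cues, ridge=1e-3):
    """Closed-form ridge regression: W = (F^T F + ridge * I)^-1 F^T Y.

    features: (num_pixels, feat_dim) per-pixel features of one frame.
    cues:     (num_pixels, cue_dim) dense cues (e.g. depth, flow) of that frame.
    The ridge term is an assumption that keeps the solve well conditioned.
    """
    feat_dim = features.shape[1]
    gram = features.T @ features + ridge * np.eye(feat_dim)
    return np.linalg.solve(gram, features.T @ cues)

def cross_frame_loss(feat_a, cues_a, feat_b, cues_b, ridge=1e-3):
    """Fit the linear map on frame A, then score its predictions on frame B."""
    w = fit_linear_map(feat_a, cues_a, ridge)
    pred_b = feat_b @ w
    return float(np.mean((pred_b - cues_b) ** 2))

# Toy data (hypothetical): both frames share one underlying linear relation,
# so a map fitted on frame A should transfer to frame B almost perfectly.
rng = np.random.default_rng(0)
w_true = rng.normal(size=(16, 3))
feat_a = rng.normal(size=(200, 16))
feat_b = rng.normal(size=(200, 16))
cues_a = feat_a @ w_true
cues_b = feat_b @ w_true

loss = cross_frame_loss(feat_a, cues_a, feat_b, cues_b, ridge=1e-8)
print(f"cross-frame loss: {loss:.3e}")
```

In this toy setup the loss is close to zero because the same linear relation holds in both frames. With learned features, a low cross-frame loss would indicate that the representation supports a linear readout that transfers across frames of a video.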
This yields compelling results on dense vision tasks: video object segmentation, (zero-shot) semantic segmentation and surface normal estimation.
Paper, code, models and demo: https://lila-pixels.github.io
Joint work with @ma_sundermeyer, Hidenobu Matsuki, David Joseph Tan and @fedassa (and special thanks to David and Federico for hosting my research visit at Google).
#cvpr2026 @Google @MunichCenterML @tumcvg @TU_Muenchen
