
LILA Trains Visual Representations Using In-Context Learning On Videos

Original post

In-context learning suggests that a model has learned versatile representations. What if we use in-context learning itself as a training task for visual representations?

📣 Introducing LILA: Linear In-Context Learning ✨ @CVPR 2026 Oral ✨

LILA trains on videos without manual annotation. Key idea: An optimal linear mapping that predicts dense cues (e.g. depth, flow), estimated on one video frame, should also predict the corresponding cues of another frame from the same video.

This yields compelling results on dense vision tasks: video object segmentation, (zero-shot) semantic segmentation and surface normal estimation.

Paper, code, models and demo: https://lila-pixels.github.io

Joint work with @ma_sundermeyer, Hidenobu Matsuki, David Joseph Tan and @fedassa (and special thanks to David and Federico for hosting my research visit at Google).

#cvpr2026 @Google @MunichCenterML @tumcvg @TU_Muenchen
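
To make the key idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes per-pixel features and dense cues are already flattened into matrices, and it uses closed-form ridge regression as one possible way to obtain the "optimal linear mapping". The function names, the `ridge` parameter, and the synthetic data are all hypothetical. In real training the cross-frame loss would presumably be backpropagated into the feature encoder, which this sketch does not do.

```python
import numpy as np


def fit_linear_map(features, cues, ridge=1e-3):
    """Closed-form ridge regression (an assumed choice of solver).

    features: (num_pixels, feat_dim) per-pixel features of one frame.
    cues:     (num_pixels, cue_dim) dense cues (e.g. depth, flow) of that frame.
    Returns W of shape (feat_dim, cue_dim) minimizing
    ||features @ W - cues||^2 + ridge * ||W||^2.
    """
    feat_dim = features.shape[1]
    gram = features.T @ features + ridge * np.eye(feat_dim)
    return np.linalg.solve(gram, features.T @ cues)


def cross_frame_loss(feats_a, cues_a, feats_b, cues_b, ridge=1e-3):
    """Fit the linear map on frame A, then score it on frame B (mean squared error)."""
    w = fit_linear_map(feats_a, cues_a, ridge)
    pred_b = feats_b @ w
    return float(np.mean((pred_b - cues_b) ** 2))


# Synthetic stand-in for two frames of one video: both frames share the same
# underlying feature-to-cue relation (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
true_map = rng.normal(size=(16, 3))
feats_a = rng.normal(size=(200, 16))
feats_b = rng.normal(size=(200, 16))
cues_a = feats_a @ true_map
cues_b = feats_b @ true_map

# Low when the map estimated on frame A transfers to frame B.
loss_shared = cross_frame_loss(feats_a, cues_a, feats_b, cues_b)

# High when frame B's cues are unrelated to its features.
unrelated_cues = rng.normal(size=(200, 3))
loss_unrelated = cross_frame_loss(feats_a, cues_a, feats_b, unrelated_cues)
```

In this toy setup the loss is small only when the features of both frames encode the cues in the same linearly decodable way, which is the property the post describes as the training signal.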

2:29 AM · May 28, 2026