Oh man! I love this preprint and also the website Rohit made to demo it.
The gaze of a VLM is mediated by a small subset of its attention heads, as if "conscious" attention were only a fraction of all the attention heads. His demo lets you steer those heads in real time.
Vision-language AI models have a gaze. And you can steer it! 👀
Redirect just 9% of a model’s attention heads to any region in an image, and the VLM will start describing that region mid-generation. We call them Gaze Heads!
Try the demo: https://gaze.baulab.info/#demo 🧵👇
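
For intuition, here is a toy sketch of what "redirecting" a few heads toward a region could look like. It is not the preprint's method: the function `redirect_heads`, the `strength` parameter, the 4x4 patch grid, and the choice of which 9 heads to steer are all made up for illustration, and a real VLM would need this done inside its attention layers during generation.

```python
# Toy illustration only: hypothetical names, not the preprint's implementation.

def redirect_heads(attn, gaze_heads, region, strength=0.9):
    """Shift attention mass of selected heads toward a set of image patches.

    attn       -- list of rows, one per head; each row sums to 1 over patches
    gaze_heads -- indices of the heads to steer (assumed known in advance)
    region     -- set of patch indices to look at
    strength   -- fraction of each steered head's mass moved onto the region
    """
    if not region:
        raise ValueError("region must contain at least one patch")
    per_patch = strength / len(region)
    steered = []
    for h, row in enumerate(attn):
        if h in gaze_heads:
            # Blend the original row with a uniform distribution over the region.
            row = [(1 - strength) * w + (per_patch if i in region else 0.0)
                   for i, w in enumerate(row)]
        steered.append(list(row))
    return steered


num_heads, num_patches = 100, 16           # 4x4 patch grid, made-up sizes
attn = [[1 / num_patches] * num_patches for _ in range(num_heads)]
gaze_heads = set(range(9))                 # 9 of 100 heads, chosen arbitrarily here
region = {10, 11, 14, 15}                  # bottom-right 2x2 block of patches

steered = redirect_heads(attn, gaze_heads, region)
print(round(sum(steered[0][i] for i in region), 3))   # steered head
print(round(sum(steered[50][i] for i in region), 3))  # untouched head
```

With these made-up numbers, a steered head ends up with most of its attention mass on the target region, while the other 91 heads are left exactly as they were.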



