Rohit Gandikota finds that steering 9% of attention heads ("Gaze Heads") in vision-language models controls text output in real time


Try steering the gaze yourself, live in your browser (WebGPU, no install).

Your cursor becomes the model's gaze.

Project + demo: https://gaze.baulab.info Paper: https://arxiv.org/abs/2606.14703 Code: https://github.com/rohitgandikota/gaze-heads

With @davidbau at @Northeastern


Rohit Gandikota@rohitgandikota

What do we mean by "gaze"? Turns out, the model looks at the specific image regions it’s currently describing.

So naturally, we thought: let’s force ALL attention heads to look at another region. The model speaks gibberish! 🗑️

The gaze lives inside specific model heads. Where?


Rohit Gandikota@rohitgandikota

Vision-language AI models have a gaze. And you can steer it! 👀

Redirect just 9% of a model’s attention heads to any region in an image, and the VLM will start describing that region mid-generation. We call them Gaze Heads!

Try the demo: https://gaze.baulab.info/#demo 🧵👇


Rohit Gandikota@rohitgandikota

Is this one model's quirk? We ran the same pipeline on 10 VLMs.

Gaze heads appear in Qwen3-VL (2B-32B), Qwen2-VL, Ovis, InternVL3.5: 60-83% steering. LLaVA and Bunny show none.

The pattern? The models without gaze heads keep their vision encoders frozen. Bunny freezes the very backbone Ovis fine-tunes: 8.3% vs 68.7%.


Rohit Gandikota@rohitgandikota

Even on natural photos, gaze-head attention settles on each object exactly while the model describes it. Steer the heads to a region and the model describes that region only.

On COCO, gaze steering reaches 76.5% vs 25.9% for matched non-gaze heads.


Rohit Gandikota@rohitgandikota

We studied this with comic strips.

Panels lay the story out left to right, so we always know where the model should be looking at every point in its narration. Ground truth that natural photos can't give.

The question: which heads move their attention along with the words?


Rohit Gandikota@rohitgandikota

@NanoBanana First clue: prompt the model to "read the comic in reverse" and the activation difference gives a direction that flips its reading order. It only works in layers 20-28.

We tried all 720 panel orderings: only "reverse" has a vector.

So general reading order must live elsewhere🔍
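A minimal sketch of the activation-difference idea, assuming the common difference-of-means recipe (whether the authors compute their direction exactly this way is an assumption). The helper names `mean`, `direction` and `add_direction` are hypothetical, and the toy activations are made up. The last line only checks the count in the post: six panels give 6! = 720 orderings.

```python
from itertools import permutations
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def direction(with_instruction: List[Vector], without_instruction: List[Vector]) -> Vector:
    """Difference of mean activations: prompts with the instruction minus prompts without."""
    a, b = mean(with_instruction), mean(without_instruction)
    return [x - y for x, y in zip(a, b)]


def add_direction(hidden: Vector, d: Vector, alpha: float) -> Vector:
    """Steer a hidden state by adding the direction, scaled by alpha."""
    return [h + alpha * x for h, x in zip(hidden, d)]


# Toy activations (hypothetical values, 2-dimensional for readability).
reverse_prompts = [[1.0, 2.0], [3.0, 4.0]]
plain_prompts = [[0.0, 0.0], [2.0, 2.0]]
d = direction(reverse_prompts, plain_prompts)
steered = add_direction([0.0, 0.0], d, alpha=2.0)

# Six panels can be ordered in 6! ways.
orderings = list(permutations(range(6)))
```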


Rohit Gandikota@rohitgandikota

Do gaze heads only respond to questions? No.

During free narration, with no panel mentioned, their attention forms a staircase: panel 1 while describing panel 1, jumping to panel 2 as soon as the text moves on.

Ask it to narrate in reverse: the staircase mirrors.
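One way to check for the staircase, sketched under assumptions: each generated token has a row of attention mass per panel (the layout and the names `gaze_trace` and `is_staircase` are hypothetical), and the "gaze" at a token is the panel with the most mass. The toy numbers are invented for illustration.

```python
from typing import List


def gaze_trace(attn_per_token: List[List[float]]) -> List[int]:
    """For each generated token, the index of the panel receiving the most attention."""
    return [max(range(len(row)), key=row.__getitem__) for row in attn_per_token]


def is_staircase(trace: List[int], reverse: bool = False) -> bool:
    """True if the gaze only moves forward (or only backward when reverse=True)."""
    steps = list(zip(trace, trace[1:]))
    if reverse:
        return all(b <= a for a, b in steps)
    return all(a <= b for a, b in steps)


# Toy narration over three panels: four tokens, attention mass per panel.
forward = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
trace = gaze_trace(forward)
mirrored = gaze_trace(list(reversed(forward)))
```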


Rohit Gandikota@rohitgandikota

But tracking is just correlation. The causal test: force these 9% of heads to attend to a panel of our choice.

The model describes that panel instead!

Same image, same question, six different answers depending on where we point the gaze.
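A rough sketch of what "forcing a head to attend to a panel" could look like, assuming the intervention masks pre-softmax scores for image tokens outside the chosen region and leaves text tokens untouched. The function name `spotlight`, the token layout and the scores are hypothetical; the released code may implement the intervention differently.

```python
import math
from typing import Dict, List, Set


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax; masked (-inf) entries get weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spotlight(scores: List[float], image_idx: Set[int], target_idx: Set[int]) -> List[float]:
    """Mask image tokens outside the target region, then renormalize.

    Assumes at least one token stays unmasked.
    """
    masked = [
        -math.inf if (i in image_idx and i not in target_idx) else s
        for i, s in enumerate(scores)
    ]
    return softmax(masked)


# Toy layout: tokens 0-1 are panel A, tokens 2-3 are panel B, token 4 is text.
image_idx = {0, 1, 2, 3}
panel_b = {2, 3}
scores: Dict[int, List[float]] = {
    0: [1.0, 2.0, 0.5, 0.5, 3.0],  # head 0: treated as a gaze head
    1: [1.0, 2.0, 0.5, 0.5, 3.0],  # head 1: left alone
}
gaze_heads = {0}

# Only the selected heads are redirected; every other head keeps its own attention.
weights = {
    h: spotlight(s, image_idx, panel_b) if h in gaze_heads else softmax(s)
    for h, s in scores.items()
}
```

Redirecting only a chosen subset mirrors the thread's point that applying the same intervention to every head breaks generation.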


Rohit Gandikota@rohitgandikota

Redirecting the top-100 gaze heads steers the answer 83.1% of the time (chance: 16.7%).

The same intervention on random heads: fails (14.6%).

On all 1,152 heads: generation collapses to junk (0.9%).

The lever is specific to these heads.
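A quick arithmetic check of how these figures relate, assuming six panels per comic (suggested by the "six different answers" elsewhere in the thread): 100 of 1,152 heads is about 8.7%, which the thread rounds to 9%, and picking one of six panels at random succeeds 16.7% of the time.

```python
total_heads = 1152
steered_heads = 100
panels = 6  # assumption: six panels per comic

share = steered_heads / total_heads  # fraction of heads redirected
chance = 1 / panels                  # success rate of guessing a panel at random

print(f"{share:.1%} of heads, chance {chance:.1%}")
# prints "8.7% of heads, chance 16.7%"
```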


Rohit Gandikota@rohitgandikota

Can we move the gaze inside VLMs mid-sentence?

We abruptly switch the target attention region for 9% of heads inside a VLM.

The model wraps up and starts describing the new region. All while stitching the narrative smoothly.

Try this yourself: https://gaze.baulab.info/#demo


Rohit Gandikota@rohitgandikota

How many heads does it take to steer a VLM?

5 heads: 36% control. 100 heads: 83.1%, the peak. More: accuracy falls.

You start trampling heads the model needs just to write fluently.

The gaze mechanism has a size.


David Bau@davidbau

Mousing masks a few "gaze" attention heads to a spotlight and instantly shows the causal effects.

Guide the model to the striped canopies, and then shift its attention to the autumn leaves, the fields of grain...

Then check out the preprint! http://gaze.baulab.info


Rohit Gandikota@rohitgandikota

We scored all 1,152 heads: ask about panel k, measure attention on panel k. Repeat for every k.

Heads that track the question light up the diagonal of a 6x6 matrix. We call them "Gaze Heads".

The top heads all sit in layers 20-28. No training needed, just forward passes.
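The scoring idea can be sketched roughly as follows. This is an illustration, not the authors' code: it assumes each head comes with a 6x6 matrix whose entry [k][j] is the share of attention landing on panel j when the question asks about panel k, and it scores a head by its mean diagonal mass. The names `diagonal_score` and `rank_heads`, the (layer, head) keys and the toy values are hypothetical, and the paper may define the score differently.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]  # rows: panel asked about, columns: panel attended to


def diagonal_score(m: Matrix) -> float:
    """Mean attention mass on the asked-about panel (the matrix diagonal)."""
    n = len(m)
    return sum(m[k][k] for k in range(n)) / n


def rank_heads(attn: Dict[Tuple[int, int], Matrix], top_k: int) -> List[Tuple[int, int]]:
    """Return the top_k (layer, head) keys by diagonal score, highest first."""
    return sorted(attn, key=lambda h: diagonal_score(attn[h]), reverse=True)[:top_k]


# Toy data: one head that tracks the question, one that spreads attention evenly.
n = 6
tracking = [[0.9 if j == k else 0.02 for j in range(n)] for k in range(n)]
uniform = [[1.0 / n for _ in range(n)] for _ in range(n)]
toy = {(24, 3): tracking, (5, 7): uniform}

top = rank_heads(toy, top_k=1)
```

With real data, `top_k=100` over all 1,152 heads would correspond to the selection size the thread reports as most effective.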


David Bau@davidbau

You can read the preprint at https://gaze.baulab.info/#demo

But first, on that page, scroll to the demo on a WebGPU-capable machine (most modern laptops and browsers) and click "start demo". It's worth the wait while it downloads the 2B VLM.

David Bau@davidbau

Oh man! I love this preprint and also the website Rohit made to demo it.

The gaze of a VLM is mediated by a much smaller set of attention heads than the full set, as if "conscious" attention is a small subset of "all attention heads". His demo lets you steer these in realtime.


Gaurav Bilolikar@gbilolikar

@rohitgandikota Cool stuff!


God’s son - e/acc@nocturnalnether

@rohitgandikota I need this for OCR, any model to try to use in ocr?


Arnas Uselis@a_uselis

@rohitgandikota How is gazing affected by linear attention layers in qwen 3.5? Iirc there only 1/4 of the layers are full attention
