Random thoughts about hallucination and world models. VLMs often answer image questions correctly with no image attached. This is the "mirage" effect (Asadi et al.), which inflates multimodal benchmark scores. In our recent paper, “Mirage Probes”, we asked why it happens.
We found two reasons. Reason 1: textual bias. The question text alone points to one confident answer, so the model never touches its visual representations at all. Why look at the image?
Reason 2: spurious images. The text isn't enough to answer directly, but it evokes visual priors. So the model builds a fake image in latent space and answers from that, as if it were grounded.
Both are visible inside the model. Mirage behavior is linearly decodable from internal activations even when the image IS present, and a text-only baseline can't recover the signal. We also found that you can separate the two kinds of mirage using the activations.
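A toy sketch of what "linearly decodable" means, on synthetic data only. This is not the pipeline from the paper: the activation width, the size of the shift, and the noise-only "text-only" features are all made-up assumptions for illustration. A logistic-regression probe is trained on fake activations that carry a label-dependent offset, then compared against the same probe trained on features with no label signal.

```python
import math
import random

DIM = 8        # hypothetical activation width (assumption)
SHIFT = 2.0    # hypothetical offset that mirage cases add to every dimension

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))

def make_dataset(rng, n, shift):
    # Labels alternate: 1 = mirage, 0 = grounded. Features are Gaussian noise
    # plus label * shift, so shift=0.0 yields features with no label signal.
    ys = [i % 2 for i in range(n)]
    xs = [[rng.gauss(0.0, 1.0) + shift * y for _ in range(DIM)] for y in ys]
    return xs, ys

def train_linear_probe(xs, ys, epochs=50, lr=0.05):
    # Logistic regression fitted with plain stochastic gradient descent.
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            g = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def accuracy(w, b, xs, ys):
    hits = 0
    for x, y in zip(xs, ys):
        z = b + sum(wi * xi for wi, xi in zip(w, x))
        hits += int((z > 0) == (y == 1))
    return hits / len(xs)

rng = random.Random(0)

# Probe on synthetic "activations" that do carry the mirage signal.
train_x, train_y = make_dataset(rng, 200, SHIFT)
test_x, test_y = make_dataset(rng, 200, SHIFT)
w, b = train_linear_probe(train_x, train_y)
probe_acc = accuracy(w, b, test_x, test_y)

# Stand-in for a text-only baseline: features unrelated to the label.
base_train_x, base_train_y = make_dataset(rng, 200, 0.0)
base_test_x, base_test_y = make_dataset(rng, 200, 0.0)
bw, bb = train_linear_probe(base_train_x, base_train_y)
baseline_acc = accuracy(bw, bb, base_test_x, base_test_y)

print(f"probe accuracy: {probe_acc:.2f}, baseline accuracy: {baseline_acc:.2f}")
```

On this toy data the probe should land well above chance and the baseline near chance. The gap between the two numbers is the thing to look at, not either number on its own.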
Cleaning benchmark text can fix reason 1 but not reason 2. Spurious images live in the model's visual representations, so faithful grounding needs interventions at that level. But is reason 2 even a bug?

So the line isn't "models shouldn't imagine." We want models that complete scenes, predict, simulate. The line is that the model needs to know, and tell us, which parts came from input and which parts it filled in.

Ask a model what's next to the oven in a kitchen it's never seen. The right answer requires hallucinating: probably a fridge, a counter, and some cabinets. That's not a failure. That's what a world model is for.

None of this is specific to vision. A fabricated citation is the same move: a world model of the literature completing a gap with plausible authors and a plausible year. Ignoring retrieved context when priors are strong is another example.

So a spurious image is a world model doing its job at the wrong moment: Gap-filling is imagination when you're planning, and a mirage when you're supposed to report what you see.

Humans run on this machinery too. Perception is "controlled hallucination": the brain predicts the scene and uses input to correct errors. Your blind spot gets filled in from priors every waking second. What you have that models don't is source monitoring.

Why do models gap-fill instead of saying "I can't tell"? Because we pay them to. Benchmarks grade correctness, not groundedness, so guessing from priors has positive expected value and abstaining has zero. Mirage is the optimal policy under our evals.
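The expected-value argument can be spelled out in a few lines. This is a minimal sketch: the function names and the wrong-answer penalty are hypothetical, introduced only to show how the incentive flips, and do not describe any particular benchmark.

```python
def expected_guess_score(p_correct, reward_correct=1.0, penalty_wrong=0.0):
    # Expected score of answering from priors when the guess is right
    # with probability p_correct.
    return p_correct * reward_correct - (1.0 - p_correct) * penalty_wrong

def best_policy(p_correct, reward_correct=1.0, penalty_wrong=0.0,
                reward_abstain=0.0):
    # Pick whichever action has the higher expected score.
    guess = expected_guess_score(p_correct, reward_correct, penalty_wrong)
    return "guess" if guess > reward_abstain else "abstain"

def break_even(reward_correct=1.0, penalty_wrong=0.0, reward_abstain=0.0):
    # Solve p * r - (1 - p) * q = a for p: the confidence above which
    # guessing beats abstaining.
    return (reward_abstain + penalty_wrong) / (reward_correct + penalty_wrong)

# Correctness-only grading: wrong answers cost nothing, so even a weak
# prior makes guessing the better policy.
print(best_policy(0.3))

# Hypothetical grading that charges 1 point per wrong answer: the same
# weak prior now favors abstaining.
print(best_policy(0.3, penalty_wrong=1.0))
```

With no penalty the break-even confidence is 0, so any nonzero chance of being right favors guessing. With a penalty equal to the reward it rises to 0.5.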

Everyone wants models with "good world models," but it's unclear what that means beyond describing the world well. A good world model isn't just perception. It predicts what's probably there when you don't have the input.

Our results are mildly optimistic here: if "I'm running on imagined content" is linearly decodable, it's flaggable in principle. Grounding without imagination is a camera. Imagination without grounding is a mirage. The interesting problem is the seam.