
UC Berkeley's Joseph E. Gonzalez introduces Stateful Visual Encoders to improve comparative reasoning in vision-language models

The cross-attention architecture bolts onto existing frontier models.


Original post

Joey Gonzalez@profjoeyg

Visual language models (VLMs) are surprisingly bad at comparative visual reasoning - detect the difference type tasks needed in medicine and science.

We just made VLMs stateful by post-training cross attention between visual encoder layers.

Our approach can be bolted on existing frontier models.

2:06 PM · Jun 4, 2026 · 16.5K Views
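
As a rough illustration of what cross-attention between two image encodings means, the sketch below lets each patch token of one image attend over the patch tokens of another and adds the result back through a gate. It is not the authors' implementation: the function names, the identity projections (a trained layer would use learned query, key, and value matrices), and the scalar gate are all assumptions made to keep the example short.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values, gate=1.0):
    """Let each token of image A (queries) attend over the tokens of image B.

    Hypothetical helper: identity projections stand in for learned Q/K/V
    matrices, and `gate` scales the residual update.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        # Scaled dot-product score of this A token against every B token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        # Attention-weighted average of the B tokens.
        context = [
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ]
        # Gated residual: the A token keeps its own features and adds B's context.
        out.append([qi + gate * ci for qi, ci in zip(q, context)])
    return out


image_a_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_b_tokens = [[1.0, 0.0], [0.5, 0.5]]
fused = cross_attend(image_a_tokens, image_b_tokens, gate=0.5)
```

With `gate=0.0` the tokens pass through unchanged, which is one common way to attach new layers to a pretrained model without disturbing it at the start of post-training. Whether the announced method does this is not stated in the post.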

Sentiment

Users praised stateful visual encoders for VLMs as practical progress because they address a known weakness in comparison and change-detection tasks.

Pos 100.0% · Neg 0.0% · 2 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 5.5K · Bookmarks: 21 · Likes: 30 · Retweets: 8

trevordarrell@trevordarrell

A new Stateful Visual Encoder proves valuable across a range of domains: check out Colin's post below!

20h · 5.5K views · 30 likes · 21 bookmarks

DeltaSignal@AITrailblazerQ

Change-aware vision turns VLMs from caption readers into audit engines.

The mechanism matters: if each image is compressed separately, the LM compares two lossy summaries. Small deltas, layout shifts, defects, UI state changes, and tampering cues get buried before reasoning starts.

Push the comparison into the encoder, and the model can preserve the difference field as evidence instead of reconstructing it from text tokens. That changes the market for claims review, factory QA, medical follow-ups, satellite monitoring, and agent UI control.

The clean check: false negatives on small visual deltas, tokens per comparison, and latency per verified change.
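
The first of those checks, false negatives on small visual deltas, could be computed along the following lines. This is a minimal sketch that assumes each image pair carries a boolean ground-truth label (a change is present) and a boolean model prediction; the function name and the data layout are hypothetical, not taken from any published benchmark.

```python
def false_negative_rate(labels, predictions):
    """Fraction of pairs with a real change that the model failed to flag."""
    # Keep only the pairs where a change is actually present.
    positives = [(y, p) for y, p in zip(labels, predictions) if y]
    if not positives:
        raise ValueError("no positive examples to evaluate")
    misses = sum(1 for _, p in positives if not p)
    return misses / len(positives)


# Four image pairs: three contain a real change, and the model catches one.
rate = false_negative_rate(
    [True, True, False, True],
    [True, False, False, False],
)
```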

1d

Rami Sufian@Rami_Bball_Fan

@profjoeyg This is the kind of practical AI work I want to see more of. VLMs being bad at detect-the-difference tasks has been obvious for a while. Nice to see a concrete fix instead of more AI hype.

1d

EB1A Experts@eb1aexperts

@profjoeyg Fascinating direction for improving VLM reasoning.

1d

Bards@ViswapriyaM

@profjoeyg CAN THIS WORK FOR VISUAL DOCUMENT UNDERSTANDING AS WELL?

1d