
UC Berkeley's Joseph E. Gonzalez introduces Stateful Visual Encoders to improve comparative reasoning in vision-language models

The cross-attention architecture bolts onto existing frontier models.

Original post
Joey Gonzalez@profjoeyg

Visual language models (VLMs) are surprisingly bad at comparative visual reasoning: the detect-the-difference tasks needed in medicine and science.

We just made VLMs stateful by post-training cross-attention between visual encoder layers.

Our approach can be bolted onto existing frontier models.
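
For readers who want a concrete picture, here is a minimal sketch of what cross-attention between two images' patch features could look like. The post gives no implementation details, so everything in it is an assumption: the name `cross_attend`, the single head, the identity projections (a real layer would have learned query/key/value weights), and the residual add. It is plain Python for clarity, not speed.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Hypothetical single-head cross-attention with identity projections.

    queries: patch features of the current image (list of d-dim vectors)
    context: patch features of the reference image (list of d-dim vectors)
    Returns each query plus an attention-weighted summary of the context.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in queries:
        # Scaled dot-product score of this patch against every reference patch.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        weights = softmax(scores)
        # Weighted average of the reference patches.
        attended = [sum(w * k[j] for w, k in zip(weights, context)) for j in range(d)]
        # Residual connection keeps the original patch feature.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused

# Toy 2-dim patch features (made-up values).
image_a = [[1.0, 0.0], [0.0, 1.0]]              # reference image, 2 patches
image_b = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # current image, 3 patches
fused = cross_attend(image_b, image_a)
```

Each patch of the current image ends up carrying a weighted summary of the reference image, which is one way an encoder could keep state across images before the language model sees any tokens.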

2:06 PM · Jun 4, 2026 · 16.5K Views
Sentiment

Users praised stateful visual encoders for VLMs as practical progress because they address a known weakness in comparison and change-detection tasks.

Positive: 100.0%
Negative: 0.0%
Based on 2 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 5.5K · Bookmarks 21 · Likes 30 · Retweets 8
trevordarrell@trevordarrell

A new Stateful Visual Encoder proves valuable across a range of domains: check out Colin's post below!

20h · Views 5.5K · Likes 30 · Bookmarks 21
DeltaSignal@AITrailblazerQ

Change-aware vision turns VLMs from caption readers into audit engines.

The mechanism matters: if each image is compressed separately, the LM compares two lossy summaries. Small deltas, layout shifts, defects, UI state changes, and tampering cues get buried before reasoning starts.

Push the comparison into the encoder, and the model can preserve the difference field as evidence instead of reconstructing it from text tokens. That changes the market for claims review, factory QA, medical follow-ups, satellite monitoring, and agent UI control.

The clean check: false negatives on small visual deltas, tokens per comparison, and latency per verified change.
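
The three checks listed above can be computed from a labeled evaluation log. A minimal sketch, assuming a hypothetical `Comparison` record per image pair; the field names and sample numbers are invented for illustration and are not results from the work. "Latency per verified change" is read here as total latency divided by the number of correctly detected changes, which is also an assumption.

```python
from dataclasses import dataclass

@dataclass
class Comparison:
    has_change: bool        # ground truth: the image pair really differs
    predicted_change: bool  # what the model reported
    tokens: int             # tokens consumed for this comparison
    latency_s: float        # wall-clock seconds for this comparison

def summarize(results):
    if not results:
        raise ValueError("need at least one comparison")
    positives = [r for r in results if r.has_change]
    misses = [r for r in positives if not r.predicted_change]
    verified = [r for r in positives if r.predicted_change]
    total_latency = sum(r.latency_s for r in results)
    return {
        # Share of real changes the model failed to flag.
        "false_negative_rate": len(misses) / len(positives) if positives else None,
        # Average token cost over all comparisons.
        "tokens_per_comparison": sum(r.tokens for r in results) / len(results),
        # Total time spent divided by correctly detected changes.
        "latency_per_verified_change": total_latency / len(verified) if verified else None,
    }

# Made-up evaluation log.
results = [
    Comparison(True, True, 400, 0.5),
    Comparison(True, False, 420, 0.5),
    Comparison(False, False, 380, 0.5),
    Comparison(True, True, 400, 0.5),
]
summary = summarize(results)
```

Restricting `results` to pairs with small deltas would give the false-negative figure the post singles out.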

1d · Views 76
Rami Sufian@Rami_Bball_Fan

@profjoeyg This is the kind of practical AI work I want to see more of. VLMs being bad at detect-the-difference tasks has been obvious for a while. Nice to see a concrete fix instead of more AI hype.

1d · Views 57
EB1A Experts@eb1aexperts

@profjoeyg Fascinating direction for improving VLM reasoning.

1d · Views 24
Bards@ViswapriyaM

@profjoeyg Can this work for visual document understanding as well?

1d · Views 9