Google DeepMind's Neel Nanda and team use open-ended agents to automatically detect behavioral differences between language models

The method automates manual model comparison tasks

Original post

Neel Nanda@NeelNanda5

I'm a big believer in just doing the obvious thing. Turns out you can diff two models by just asking an agent to do it!

bilal@bilalchughtai_

New research update from the Google DeepMind Language Model Interpretability team.

We build and evaluate dead simple open-ended model diffing agents tasked with studying the behavioural differences between two models, and find them to be promising in practice.

10:35 AM · Jun 12, 2026 · 9.2K Views

Sentiment

Users praise DeepMind's simple AI agents for model behavior diffing because straightforward methods prove effective at surfacing differences worth investigating.

Positive: 100.0% · Negative: 0.0% (6 comments with sentiment)

Posts from X

bilal@bilalchughtai_

Our diffing agents are conceptually much simpler than most other methods researchers use to study model differences - they are simple LLM scaffolds with access to tools providing model rollouts from the pair of models on prompts.
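As a rough illustration of what "a simple scaffold with a rollout tool" could look like, here is a minimal sketch. It is not the team's implementation: `RolloutPair`, `make_rollout_tool`, `DiffingAgent`, and the toy models are all hypothetical names invented for this example, and the `propose_prompts` callback merely stands in for the LLM that would decide what to try next.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in: a "model" is any function from a prompt to a completion.
Model = Callable[[str], str]


@dataclass
class RolloutPair:
    prompt: str
    output_a: str
    output_b: str


def make_rollout_tool(model_a: Model, model_b: Model) -> Callable[[str], RolloutPair]:
    """Build the tool the agent calls: one prompt in, one rollout per model out."""
    def rollout(prompt: str) -> RolloutPair:
        return RolloutPair(prompt, model_a(prompt), model_b(prompt))
    return rollout


@dataclass
class DiffingAgent:
    rollout: Callable[[str], RolloutPair]
    # Stand-in for the LLM: given the history so far, pick the next prompts.
    propose_prompts: Callable[[List[RolloutPair]], List[str]]
    max_steps: int = 3
    history: List[RolloutPair] = field(default_factory=list)

    def run(self) -> List[RolloutPair]:
        for _ in range(self.max_steps):
            prompts = self.propose_prompts(self.history)
            if not prompts:
                break
            for prompt in prompts:
                self.history.append(self.rollout(prompt))
        # Report only the prompts on which the two models disagreed.
        return [r for r in self.history if r.output_a != r.output_b]


# Toy models (assumptions for illustration): they differ only on weather prompts.
def model_a(prompt: str) -> str:
    return prompt.upper()


def model_b(prompt: str) -> str:
    return "NO COMMENT" if "weather" in prompt else prompt.upper()


def scripted_proposer(history: List[RolloutPair]) -> List[str]:
    """Scripted replacement for the LLM: ask each queued prompt once."""
    queue = ["hello", "tell me about the weather"]
    asked = {r.prompt for r in history}
    return [p for p in queue if p not in asked][:1]


agent = DiffingAgent(make_rollout_tool(model_a, model_b), scripted_proposer)
differences = agent.run()
```

Exact string comparison is used here only to keep the sketch short; a real agent would presumably sample several rollouts per prompt and judge differences semantically.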

bilal@bilalchughtai_

More details in the blogpost: https://www.lesswrong.com/posts/qi4mNbZYAFDYwfRba/building-and-evaluating-model-diffing-agents

Work with @JoshAEngels and @NeelNanda5

Celeste@celestepoasts

@NeelNanda5

bilal@bilalchughtai_

We think more mature versions of tools of this form may play a useful role in understanding and improving model behaviour. For instance, they may help answer the question: what is the behavioural effect of training models with subtly different constitutions?

bilal@bilalchughtai_

We validate that our diffing agents work through evaluations in a number of settings where there is either a known lack of difference or a known true difference. We also compare them directly with single-model auditing agents tasked with finding strange behaviours.
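One way to picture this kind of validation is a small scoring harness over labelled settings. The sketch below is an assumption about how such an evaluation could be organised, not the procedure from the blog post: `Setting`, `Scores`, `score_agent`, and the substring matcher are hypothetical, and the source does not say which metrics the team reports.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Setting:
    name: str
    # None means the two models are known not to differ in this setting.
    known_difference: Optional[str]


@dataclass
class Scores:
    true_positives: int = 0
    false_negatives: int = 0
    false_positives: int = 0
    true_negatives: int = 0


def score_agent(
    settings: List[Setting],
    run_agent: Callable[[Setting], List[str]],
    matches: Callable[[str, str], bool],
) -> Scores:
    """Count hits and misses of a diffing agent against ground-truth labels."""
    scores = Scores()
    for setting in settings:
        findings = run_agent(setting)
        if setting.known_difference is None:
            # Any reported difference here is a false alarm.
            if findings:
                scores.false_positives += 1
            else:
                scores.true_negatives += 1
        elif any(matches(f, setting.known_difference) for f in findings):
            scores.true_positives += 1
        else:
            scores.false_negatives += 1
    return scores


# Toy data (assumptions for illustration only).
settings = [
    Setting("identical checkpoints", None),
    Setting("finetuned to add emoji", "uses emoji"),
    Setting("finetuned to refuse", "refuses more"),
]


def fake_agent(setting: Setting) -> List[str]:
    """Canned findings standing in for a real diffing agent's report."""
    reports = {"finetuned to add emoji": ["uses emoji more often"]}
    return reports.get(setting.name, [])


def substring_match(finding: str, known: str) -> bool:
    return known in finding


scores = score_agent(settings, fake_agent, substring_match)
```

In practice the matching step would likely need a human or an LLM grader rather than a substring check, since an agent can describe the same difference in many ways.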

bilal@bilalchughtai_

Our diffing agents find interesting real differences between various production Gemini models.

Jatin Nainani@jatin_n0

@bilalchughtai_ Nice work! This seems very useful for surfacing behavior differences to investigate more deeply. I notice your LessWrong post values "novelty"; I think @mbodhisattwa's work https://arxiv.org/pdf/2507.00310 might be useful for quantifying surprise in behavior differences.

Alex Tomala@a__tomala

@celestepoasts @NeelNanda5 omg

Ankit Maloo@ankit2119

@bilalchughtai_ i did it the hard way https://kwbench.github.io/insights

staggering how different these models are

Ely Hahami@ElyHahami

@NeelNanda5 nice!

Sai Vegasena@svegas18

@NeelNanda5 could steering a single feature instead of finetuning give you a cleaner organism here? Or would that just produce a different set of salient changes?

Glenn Matlin@GlennMatlin

@NeelNanda5 Sometimes the obvious method is just the best method, especially in the AI world, where folks assume the obvious way did not work because everyone else tried it and failed. Very cool work, and relevant to an ongoing discussion. Thanks!

Virgil Maro@_virgil19

@bilalchughtai_ diffing sidesteps the hardest part of interp cuz the agent has a contrast to anchor on. 'these two differ here' is way easier to ground than 'this is what the model does'

Alex YGift@Radipdegen

@NeelNanda5 sometimes the best solution is the one you almost skipped bc it seemed too simple

Rugbist@rugbist_

@NeelNanda5 the obvious thing is always the one nobody does until someone proves it works