AI · 16h ago

CausaLab benchmark evaluates LLM agents on causal discovery by placing them in fictional simulated science labs

Fictional worlds prevent agents from relying on memorized training data

Original post
Dylan Zhang @dylan_works_

A real scientist doesn't look up how the world works — they intervene, observe, and revise until a theory holds for a case they've never seen. CausaLab drops an LLM agent into a lab where memorized facts are useless ("Quantum Crystals on Planet X") and asks for the same. https://dylanzsz.github.io/causalab/

2:18 PM · Jun 6, 2026 · 48.3K Views
Sentiment

Positive replies congratulate the authors and echo the thanks to collaborators on CausaLab's causal-discovery tests for LLM agents. The negative reply argues that the agents produce wrong causal graphs despite high accuracy numbers.

Pos: 75.0%
Neg: 25.0%
6 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
VIEWS 590
Dylan Zhang @dylan_works_

Fun things we kept seeing:

- right answer, wrong mechanism (great prediction, wrong causal graph)
- give the agent the perfect experiments and its structure recovery gets worse; discovery is in choosing them, not the data
- it stops experimenting way too early

1d · Views 590 · Likes 2
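
A minimal sketch of the "right answer, wrong mechanism" failure, not taken from CausaLab itself: the variables X, Y, Z and both functions are hypothetical. The true world is assumed to be a chain X → Y → Z, while the wrong model assumes X drives Z directly. The two agree on every passive observation, and only an intervention on Y separates them.

```python
def true_world(x, do_y=None):
    """Hypothetical ground truth: chain X -> Y -> Z."""
    y = x if do_y is None else do_y  # do_y overrides Y's natural value
    z = y                            # Z depends only on Y
    return y, z


def wrong_model(x, do_y=None):
    """Hypothetical agent theory: fork X -> Y and X -> Z (wrong graph)."""
    y = x if do_y is None else do_y
    z = x                            # Z is wired to X instead of Y
    return y, z


inputs = [0, 1]

# Passive observation: both models predict the same (Y, Z) for every X.
obs_agree = all(true_world(x) == wrong_model(x) for x in inputs)

# Intervention: force Y to the opposite of X and compare again.
int_agree = all(
    true_world(x, do_y=1 - x) == wrong_model(x, do_y=1 - x) for x in inputs
)

print("agree on observations: ", obs_agree)
print("agree on interventions:", int_agree)
```

In this toy setup a prediction-accuracy score computed on observations alone would rate the wrong model as perfect. The mismatch only appears once the experiment sets Y independently of X.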
BOOKMARKS 1
Dylan Zhang @dylan_works_

Paper link: https://arxiv.org/abs/2605.26029

18h · Views 68 · Bookmarks 1
LIKES 5
Dylan Zhang @dylan_works_

In spirit, this echoes the principle of interactive evaluation my friends recently worked on :) @keyang_xuan @p_song1 @lupantech @pengrui_han

1d · Views 281 · Likes 5
RETWEETS 14
Dylan Zhang @dylan_works_

A real scientist doesn't look up how the world works — they intervene, observe, and revise until a theory holds for a case they've never seen. CausaLab drops an LLM agent into a lab where memorized facts are useless ("Quantum Crystals on Planet X") and asks for the same. https://dylanzsz.github.io/causalab/

1d · Views 48.3K · Likes 91 · Bookmarks 63
REPLIES 1
Zhengyao Jiang @zhengyaojiang

@dylan_works_ Pretty cool! Congrats

17h · Views 408 · Likes 1
Dylan Zhang @dylan_works_

Takeaway: prediction accuracy hides whether a model really understands. The principle for anyone building causal/scientific agents — score the mechanism, not just the answer; let it run its own experiments; demand transfer; make it verify before committing.

1d · Views 351 · Likes 1
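
One way to "score the mechanism, not just the answer" is to compare the edges of the predicted causal graph against the true ones. The sketch below assumes graphs are given as collections of directed (cause, effect) pairs and uses plain edge-level F1. The function name and the example graphs are illustrative, and the benchmark's own metric may be defined differently.

```python
from typing import Iterable, Tuple

Edge = Tuple[str, str]  # (cause, effect)


def edge_f1(predicted: Iterable[Edge], true: Iterable[Edge]) -> float:
    """Edge-level F1 between a predicted and a true directed graph."""
    pred, gold = set(predicted), set(true)
    if not pred and not gold:
        return 1.0  # two empty graphs match trivially
    true_positives = len(pred & gold)  # edges with correct direction
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the agent finds B -> C but misplaces the other edge.
true_graph = [("A", "B"), ("B", "C")]
predicted_graph = [("A", "C"), ("B", "C")]
print(edge_f1(predicted_graph, true_graph))
```

Because edges are directed pairs, a reversed edge counts as wrong here. That is a design choice for this sketch; a more lenient score could give partial credit for a correct adjacency with the wrong orientation.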
Dylan Zhang @dylan_works_

Importantly, many thanks to the causal experts whose input helped shape our work @XiangchenSong @chenyuen0103!

1d · Views 223 · Likes 1
Dylan Zhang @dylan_works_

Most importantly, a shout-out to my faithful coauthor @junlin45300

1d · Views 141 · Likes 1
Dylan Zhang @dylan_works_

@zhengyaojiang Thanks for your interest!

17h · Views 47 · Likes 1
senorstoic @senorstoic

@dylan_works_ Science is sound, vibration, elements opening, matter. It's all here

1d · Views 59

@dylan_works_ @keyang_xuan @p_song1 @lupantech 🥳🥳

1d · Views 56
Puzzle Paws @paws4puzzles

@dylan_works_ 92% accuracy, 0.47 F1. we're building causal parrots. right number, wrong graph, and they quit halfway through the budget. textbook.

12h · Views 29