/AI · 16h ago

CausaLab benchmark evaluates LLM agents on causal discovery by placing them in fictional simulated science labs

Fictional worlds prevent agents from relying on memorized training data

#566

Original post

Dylan Zhang@dylan_works_

A real scientist doesn't look up how the world works — they intervene, observe, and revise until a theory holds for a case they've never seen. CausaLab drops an LLM agent into a lab where memorized facts are useless ("Quantum Crystals on Planet X") and asks for the same. https://dylanzsz.github.io/causalab/

2:18 PM · Jun 6, 2026 · 48.3K Views

Sentiment

Positive replies congratulate the authors on CausaLab's causal-discovery tests for LLM agents and thank collaborators, while the negative reply argues that the agents produce wrong causal graphs despite high accuracy numbers.

Pos

75.0%

Neg

25.0%

6 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Dylan Zhang@dylan_works_

Fun things we kept seeing:

right answer, wrong mechanism (great prediction, wrong causal graph)
give the agent the perfect experiments and its structure recovery gets worse: discovery is in choosing them, not the data
it stops experimenting way too early

1d

Dylan Zhang@dylan_works_

Paper link: https://arxiv.org/abs/2605.26029

18h

Dylan Zhang@dylan_works_

In spirit, this echoes the principle of interactive evaluation my friends recently worked on :) @keyang_xuan @p_song1 @lupantech @pengrui_han

1d

Dylan Zhang@dylan_works_

1d · 48.3K Views

Zhengyao Jiang@zhengyaojiang

@dylan_works_ Pretty cool! Congrats

17h

Dylan Zhang@dylan_works_

Takeaway: prediction accuracy hides whether a model really understands. The principle for anyone building causal/scientific agents — score the mechanism, not just the answer; let it run its own experiments; demand transfer; make it verify before committing.

1d

Dylan Zhang@dylan_works_

Importantly, many thanks to the causal experts whose input helped shape our work: @XiangchenSong @chenyuen0103!

1d

Dylan Zhang@dylan_works_

Most importantly, a shout-out to my faithful coauthor @junlin45300

1d

Maciek Telecki@g4dz10r3k

@dylan_works_

1d

Dylan Zhang@dylan_works_

@zhengyaojiang Thanks for your interest!

17h

senorstoic@senorstoic

@dylan_works_ Science is sound, vibration, elements opening, matter. It's all here

1d

Pengrui Han (Barry)@pengrui_han

@dylan_works_ @keyang_xuan @p_song1 @lupantech 🥳🥳

1d

Puzzle Paws@paws4puzzles

@dylan_works_ 92% accuracy, 0.47 F1. we're building causal parrots. right number, wrong graph, and they quit halfway through the budget. textbook.

12h
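
Editor's note on "right number, wrong graph": the thread contrasts prediction accuracy with how well the causal structure is recovered. The sketch below shows one common way a structure score can be computed, as precision, recall, and F1 over directed edges. It is an illustration only, not CausaLab's actual scoring code; the function name `edge_f1`, the variable names, and the toy graphs are all hypothetical.

```python
def edge_f1(true_edges, predicted_edges):
    """Return (precision, recall, F1) over directed edges.

    Each edge is a (cause, effect) tuple. A reversed edge counts as wrong,
    because direction is part of the mechanism being scored.
    """
    true_set, pred_set = set(true_edges), set(predicted_edges)
    true_positives = len(true_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy world (variable names invented for illustration).
true_graph = [("heat", "glow"), ("glow", "hum"), ("pressure", "hum")]

# An agent's graph with one correct edge, one spurious edge, one reversed edge.
agent_graph = [("heat", "glow"), ("heat", "hum"), ("hum", "pressure")]

precision, recall, f1 = edge_f1(true_graph, agent_graph)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case only one of the three proposed edges matches the true graph, so the structure score is low regardless of how well the agent's graph happens to predict outcomes. That is the gap the posts describe between scoring the answer and scoring the mechanism.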