/Tech, 20h ago

CausaLab launches as an interactive benchmark to evaluate LLM agents on causal discovery in simulated laboratory environments

It scores causal mechanism reconstruction rather than final outputs.

Sentiment

Many users praise and thank contributors for CausaLab's tests of LLM agents on causal discovery, while one criticizes the models as flawed causal parrots that output wrong graphs despite accuracy scores.

Positive: 75.0%

Negative: 25.0%

6 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Dylan Zhang @dylan_works_

Fun things we kept seeing:

- right answer, wrong mechanism (great prediction, wrong causal graph)
- give the agent the perfect experiments and its structure recovery gets worse: discovery is in choosing them, not the data
- it stops experimenting way too early

1d

BOOKMARKS 1

Dylan Zhang @dylan_works_

Paper link: https://arxiv.org/abs/2605.26029

22h

LIKES 5

Dylan Zhang @dylan_works_

In spirit, this echoes the principle of interactive evaluation my friends recently worked on :) @keyang_xuan @p_song1 @lupantech @pengrui_han

1d

REPLIES 1

Zhengyao Jiang @zhengyaojiang

@dylan_works_ Pretty cool! Congrats

21h

Dylan Zhang @dylan_works_

Takeaway: prediction accuracy hides whether a model really understands. The principle for anyone building causal/scientific agents — score the mechanism, not just the answer; let it run its own experiments; demand transfer; make it verify before committing.
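
To make "score the mechanism, not just the answer" concrete, here is a minimal Python sketch that reports answer accuracy alongside a structural Hamming distance (SHD) between a predicted causal graph and the true one. This is an illustration only and not CausaLab's actual scoring code: the function names, the toy graphs, and the choice of SHD as the mechanism score are all assumptions made for the example.

```python
def answer_accuracy(predictions, targets):
    """Fraction of final answers that match the targets."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


def structural_hamming_distance(true_edges, pred_edges):
    """Number of edge additions, deletions, or reversals separating two directed graphs."""
    true_edges, pred_edges = set(true_edges), set(pred_edges)
    missing = true_edges - pred_edges  # edges the agent failed to recover
    extra = pred_edges - true_edges    # edges the agent invented
    # A reversed edge shows up once in `missing` and once in `extra`;
    # count it as a single error rather than two.
    reversed_edges = {(a, b) for (a, b) in missing if (b, a) in extra}
    return len(missing) + len(extra) - len(reversed_edges)


# Hypothetical toy case: the true mechanism is a chain A -> B -> C,
# but the agent proposes a fork A -> B, A -> C.
true_graph = {("A", "B"), ("B", "C")}
pred_graph = {("A", "B"), ("A", "C")}

# Hypothetical final answers (e.g. "does intervening on A change C?").
targets = [True, True, False]
predictions = [True, True, False]

print("answer accuracy:", answer_accuracy(predictions, targets))
print("mechanism error (SHD):", structural_hamming_distance(true_graph, pred_graph))
```

In this toy case every final answer is correct while the recovered graph is still two edge edits away from the true one, which is the "right answer, wrong mechanism" pattern the thread describes.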

1d

Dylan Zhang @dylan_works_

Importantly, many thanks for the input from causal experts, which helped shape our work @XiangchenSong @chenyuen0103!

1d

Dylan Zhang @dylan_works_

Most importantly, a shout-out to my faithful coauthor @junlin45300

1d