A real scientist doesn't look up how the world works — they intervene, observe, and revise until a theory holds for a case they've never seen. CausaLab drops an LLM agent into a lab where memorized facts are useless ("Quantum Crystals on Planet X") and asks for the same. https://dylanzsz.github.io/causalab/
CausaLab benchmark evaluates LLM agents on causal discovery by placing them in fictional simulated science labs
Fictional worlds prevent agents from relying on memorized training data
Most replies congratulate the authors and thank collaborators; one critical reply argues that the agents post high accuracy while producing the wrong causal graphs.

Fun things we kept seeing:
- right answer, wrong mechanism (great prediction, wrong causal graph)
- give the agent the perfect experiments and its structure recovery gets worse: discovery is in choosing them, not the data
- it stops experimenting way too early

Paper link: https://arxiv.org/abs/2605.26029

In spirit, this echoes the principle of interactive evaluation my friends recently worked on :) @keyang_xuan @p_song1 @lupantech @pengrui_han
@dylan_works_ Pretty cool! Congrats

Takeaway: prediction accuracy hides whether a model really understands. The principle for anyone building causal/scientific agents — score the mechanism, not just the answer; let it run its own experiments; demand transfer; make it verify before committing.
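
To make "score the mechanism, not just the answer" concrete, here is a minimal sketch that reports prediction accuracy next to an F1 over directed causal edges. It is an illustration only: the thread does not say how CausaLab computes its scores, so the edge-level F1, the function names `accuracy` and `edge_f1`, and the toy variables are all assumptions.

```python
def accuracy(predicted, actual):
    """Fraction of outcome predictions that match the observed outcomes."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def edge_f1(true_edges, pred_edges):
    """Precision, recall, and F1 over directed edges given as (cause, effect)."""
    true_set, pred_set = set(true_edges), set(pred_edges)
    tp = len(true_set & pred_set)  # edges with the correct direction
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(true_set) if true_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy world (not from the benchmark): heat and pressure cause glow.
true_edges = [("heat", "glow"), ("pressure", "glow")]
# Hypothetical agent graph: one correct edge, one edge reversed.
pred_edges = [("heat", "glow"), ("glow", "pressure")]

# Hypothetical outcome predictions that all happen to be right.
predicted_outcomes = [1, 0, 1, 1]
actual_outcomes = [1, 0, 1, 1]

acc = accuracy(predicted_outcomes, actual_outcomes)
precision, recall, f1 = edge_f1(true_edges, pred_edges)
print(f"accuracy={acc:.2f}  edge_f1={f1:.2f}")
```

In this toy case every outcome is predicted correctly while half of the edges are wrong, which is the "right answer, wrong mechanism" pattern in miniature.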

Importantly, many thanks to the causal experts whose input helped shape our work @XiangchenSong @chenyuen0103!

Most importantly, shout-out to my faithful coauthor @junlin45300

@zhengyaojiang Thanks for your interest!

@dylan_works_ Science is sound, vibration, elements opening, matter. It's all here

@dylan_works_ @keyang_xuan @p_song1 @lupantech 🥳🥳

@dylan_works_ 92% accuracy, 0.47 F1. we're building causal parrots. right number, wrong graph, and they quit halfway through the budget. textbook.