TBench Science / @DillmannSteven, @ryanmart3n, @alexgshaw, @Mike_A_Merril, @AlexGDimakis, @sanmikoyejo, @lschmidt3 (@Stanford) A benchmark for evaluating AI agents on real computational workflows across the natural sciences, with tasks authored and verified by scientific domain experts.
Stanford's Sanmi Koyejo and collaborators release Terminal-Bench-Science to evaluate AI agents on real-world scientific terminal workflows
Domain experts directly authored and verified all benchmark tasks
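Posts congratulated research partner @bradenjhancock and Laude Institute co-founder @ChrisRytting on assembling the Slingshots // THREE batch of open-source projects, which includes Terminal-Bench-Science. The benchmark's team thanked Laude Institute and its contributors for their support.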

Congrats to Research Partner @bradenjhancock and Laude Institute co-founder @ChrisRytting on assembling another exceptional batch. Every project in Slingshots // THREE ships open source. Full announcement: http://laude.org/updates/slingshots-three
Honored to have Terminal-Bench-Science included in Slingshots // THREE, alongside such a strong lineup of researchers and projects. Building a benchmark to evaluate AI agents on computational workflows across the natural sciences — authored and verified by real domain experts. Grateful for the incredible support from @LaudeInstitute & @bradenjhancock, and to all our contributors making this happen. ⚛️🧪
Check out the current progress on our brand-new task submission dashboard: https://stevendillmann.github.io/tb-science-task-dashboard/