Terminal-Bench Science extends the original Terminal-Bench benchmark used by Anthropic, OpenAI, and Google DeepMind into scientific domains and opens for over 100 task contributions by August 17, 2026 · Digg

Posts from X

Thomas Wolf@Thom_Wolf

I'm very excited about this extension of the celebrated Terminal-Bench to science.

If you're a scientist (life, physical, earth, mathematical science, etc) interested in AI, definitely check this out!

Terminal-Bench evaluates how good AI models are at controlling tools on a computer to achieve a goal (using the command line). T-Bench Science now extends that to "AI for Science", and it comes with a call to contribute your own real-world scientific workflow to the benchmark (until August 2026).

The more workflows and the more diverse they are, the better the next generation of AI models will be at helping you in your daily research work.

Note that this is not a training dataset; it is meant to evaluate frontier model performance.

Steven Dillmann@StevenDillmann

📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇

http://tbench.ai/news/tb-science-announcement

@AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows.

1/6🧵

Lisan al Gaib@scaling01

let the hill climbing on scientific tasks begin

new benchmark: TerminalBench Science

Alex Dimakis@AlexGDimakis

Terminal-Bench Science is a direct way to contribute to AI for Science. It's programming agents by task specification. Ask a precise scientific question and watch how AI agents will learn to solve it:

Step 1. Package a scientific task or workflow, something that takes a working scientist a week or a month to do, into an RL environment.
Step 2. Write tests that verify whether the task has been done correctly (this is easy if you have already solved the task manually).
Step 3. Sit back and let progress in AI agents solve it in 6 months.
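As a rough illustration of what "tests that verify if the task has been done correctly" could look like, here is a minimal Python sketch. It is not the Terminal-Bench Science or Harbor task format. The file name `result.json`, the `"value"` key, the function name `verify_result`, and the tolerance are all hypothetical choices made for this example. The idea is simply that a scientist who has already solved the task by hand knows the reference answer and can check the agent's output against it programmatically.

```python
import json
import math
import tempfile
from pathlib import Path


def verify_result(output_path: Path, expected: float, rel_tol: float = 1e-3) -> bool:
    """Return True if the agent's output file holds a value close to the reference.

    Hypothetical convention: the agent writes {"value": <number>} to output_path.
    """
    if not output_path.exists():
        return False  # the agent never produced the required file
    try:
        data = json.loads(output_path.read_text())
        value = float(data["value"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return False  # malformed output counts as a failure
    # Reject NaN/inf, then compare against the known reference answer.
    return math.isfinite(value) and math.isclose(value, expected, rel_tol=rel_tol)


if __name__ == "__main__":
    # Simulate an agent run that wrote its answer to a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "result.json"
        out.write_text(json.dumps({"value": 3.1416}))
        print(verify_result(out, expected=math.pi))
```

A real task would likely check more than one number (file formats, intermediate artifacts, physical constraints), but the pattern of comparing agent output against a manually obtained reference stays the same.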

Sanmi Koyejo@sanmikoyejo

"AI for science" benchmarks today mostly test textbook recall. Terminal-Bench Science is a chance for scientists to practice writing that definition. Contribute a real workflow, and you find out exactly where today's best agents break on it. http://tbench.ai/news/tb-science-announcement

Steven Dillmann@StevenDillmann

Hosted by @Stanford @StanfordAILab @StanfordHAI, @LaudeInstitute and @harborframework.

With @lschmidt3, @sanmikoyejo, @AlexGDimakis, @bradenjhancock, @JJitsev, @ryanmart3n, @alexgshaw, @Mike_A_Merrill, @LinShi592021, @Krauth, @stebo85, @0xrobertzhang, @rishi_desai2, @ekellbuch, @xdotli, @realjustinbauer, @HeckelReinhard, @oq_35, @YuanqiD, @chenru_duan, @hcwww_, @gregd_nlp, @russpoldrack, @RisaWechsler, @SnorkelAI and a growing community of contributors and advisors across the natural sciences.

6/6

Steven Dillmann@StevenDillmann

⏰ Deadline: August 17, 2026 — the earlier you start, the more time we have to help your task land.

📋 Submit a task proposal: https://airtable.com/appzZC5gEHrXSfNNw/pagjgS95lAQ5FVJxt/form
💻 GitHub: https://github.com/harbor-framework/terminal-bench-science
💬 Discord (tb-science): https://discord.gg/ZvcWupVXjz
📅 Weekly Meeting (Mondays, 9am PT): https://meet.google.com/yco-yhwc-sid
📩 Contact: stevendi@stanford.edu

5/6

Steven Dillmann@StevenDillmann

Why contribute:

🎯 Make AI better at your science. Your tasks set what frontier labs optimize for.
🛠️ Gain agentic eval experience. See where today's best AI agents succeed and fail.
📝 Become a co-author. Every merged task earns authorship on the Terminal-Bench Science paper.

3/6

Steven Dillmann@StevenDillmann

Terminal-Bench Science is built by the scientific community to shape AI for science.

Most "AI for Science" benchmarks test textbook knowledge or contrived toy problems. We measure real computational workflows scientists run in practice.

Got a complex scientific workflow you wish an AI agent could handle? We want it.

2/6

Alex Ratner@ajratner

Extremely excited for Terminal-Bench Science, which we're proud to support via our Open Benchmarks Grants @SnorkelAI !

Steven Dillmann@StevenDillmann

What a task looks like — real workflows from scientific domain experts:

- Reconstructing MRI brain maps
- Virtual drug screening
- Reconstructing ice crystal disorder over time

Scientifically grounded, programmatically verifiable, hard for today's best AI agents. Browse the full set on GitHub: https://github.com/harbor-framework/terminal-bench-science/tree/main/tasks

4/6

Steven Dillmann@StevenDillmann

If you know a scientist whose workflow belongs in Terminal-Bench Science — RT this thread or send it their way 🙏

Chenhao Tan@ChenhaoTan

Science is the frontier of AI. Contribute to this initiative if you can!

Lisan al Gaib@scaling01

it's currently still being built and you can submit your verifiable tasks until august 17th 2026

Avijit Ghosh@evijit

@StevenDillmann @BlancheMinerva @AnthropicAI @OpenAI @GoogleDeepMind I literally put in my notes app this morning to look into science agents haha will dig in

Steven Dillmann@StevenDillmann

@evijit @BlancheMinerva @AnthropicAI @OpenAI @GoogleDeepMind let's chat! @evijit

Daniel Lougen, M.S.@DJLougen

@StevenDillmann @AnthropicAI @OpenAI @GoogleDeepMind I posted this under a colleague of yours' post too, but I was curious about the scope of the workflow tasks. I could see something like this being valuable for PsychoPy and jsPsych experiment creation.

arifu.eth🦍@arif_only_

@StevenDillmann @AnthropicAI @OpenAI @GoogleDeepMind This is a great step for scientific research.

AI PlanetX@AI_PlanetX

@StevenDillmann @AnthropicAI @OpenAI @GoogleDeepMind A meaningful bridge to real science benchmarks.

Avijit Ghosh@evijit

@StevenDillmann @BlancheMinerva @AnthropicAI @OpenAI @GoogleDeepMind Let’s DM! 🫡
