
Terminal-Bench Science extends the original Terminal-Bench benchmark, used by Anthropic, OpenAI, and Google DeepMind, into scientific domains and is open for over 100 task contributions until August 17, 2026

Contributors package workflows as RL environments with verification tests.

Original post

📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 http://tbench.ai/news/tb-science-announcement @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows. 1/6🧵

10:00 AM · May 20, 2026

I'm very excited about this extension to the celebrated Terminal-Bench to science.

If you're a scientist (life, physical, earth, mathematical sciences, etc.) interested in AI, definitely check this out!

Terminal-Bench evaluates how good AI models are at controlling tools on a computer to achieve a goal (using the command line). Terminal-Bench Science now extends that to "AI for Science", and it comes with a call to contribute your own (real-world scientific) workflow to the benchmark (until August 2026).

The more workflows and the more diverse they are, the better the next generation of AI models will be at helping you in your daily research work.

Note that this is not a training dataset; it is meant to evaluate frontier model performance.

Steven Dillmann @StevenDillmann

📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 http://tbench.ai/news/tb-science-announcement @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows. 1/6🧵

5:00 PM · May 20, 2026 · 849.1K Views
5:47 PM · May 20, 2026 · 6.7K Views

Terminal-Bench Science is a direct way to contribute to AI for Science. It's programming agents by task specification. Ask a precise scientific question and watch how AI agents learn to solve it:

Step 1. Package a scientific task or workflow, something that takes a working scientist a week or a month to do, into an RL environment.
Step 2. Write tests that verify whether the task has been done correctly (easy if you have already solved the task manually).
Step 3. Sit back and let progress in AI agents solve it in 6 months.
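To make Step 2 concrete, here is a minimal sketch of what a verification test could look like. It is not the Terminal-Bench Science task format, which this page does not describe. The file name `result.json`, the `"value"` key, and the function `verify_result` are all hypothetical, chosen only to illustrate checking an agent's output against an answer the scientist already obtained by hand.

```python
import json
import math
from pathlib import Path


def verify_result(output_path: Path, reference: float, rel_tol: float = 1e-3) -> bool:
    """Return True if the agent's reported value matches the reference.

    Hypothetical convention: the agent writes {"value": <number>} as JSON
    to output_path. A missing or malformed file counts as a failure.
    """
    if not output_path.exists():
        return False
    try:
        reported = json.loads(output_path.read_text())["value"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return False
    # Reject non-numeric answers (bool is a subclass of int, so exclude it).
    if isinstance(reported, bool) or not isinstance(reported, (int, float)):
        return False
    # Compare within a relative tolerance instead of demanding exact equality.
    return math.isclose(reported, reference, rel_tol=rel_tol)


if __name__ == "__main__":
    # Assumed location of the agent's output inside the task environment.
    passed = verify_result(Path("result.json"), reference=2.998e8)
    print("PASS" if passed else "FAIL")
```

A tolerance-based comparison is one reasonable choice for numerical results, since it accepts answers that differ only by rounding. Tasks with non-numeric outputs would need a different check.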

1:04 AM · May 21, 2026 · 3.4K Views

Science is the frontier of AI. Contribute to this initiative if you can!

1:24 AM · May 21, 2026 · 894 Views

let the hill climbing on scientific tasks begin

new benchmark: TerminalBench Science

9:25 PM · May 20, 2026 · 4.2K Views

it's currently still being built and you can submit your verifiable tasks until august 17th 2026

Lisan al Gaib @scaling01

let the hill climbing on scientific tasks begin new benchmark: TerminalBench Science

9:25 PM · May 20, 2026 · 4.2K Views
9:26 PM · May 20, 2026 · 1.1K Views

Extremely excited for Terminal-Bench Science, which we're proud to support via our Open Benchmarks Grants @SnorkelAI !

9:15 PM · May 20, 2026 · 1.9K Views

"AI for science" benchmarks today mostly test textbook recall. Terminal-Bench Science is a chance for scientists to practice writing that definition. Contribute a real workflow, and you find out exactly where today's best agents break on it. http://tbench.ai/news/tb-science-announcement

5:09 AM · May 21, 2026 · 376 Views