OpenAI launches GeneBench-Pro to evaluate AI agents on computational biology, with top models scoring under 35%

Story Overview

OpenAI released GeneBench-Pro, a benchmark built around 129 realistic computational biology problems that demand messy-data handling, method selection, and research-level judgment calls. Human experts typically need 20-40 hours per task, yet the strongest model reaches only 31.5 percent even with maximum reasoning and Pro mode enabled.

Original post

OpenAI (@OpenAI)

We’re introducing GeneBench-Pro, a research-level benchmark for a harder kind of AI progress: how well agents can navigate messy biological data, choose the right analysis path, and make judgment calls that real computational research depends on. https://openai.com/index/introducing-genebench-pro/

10:10 AM · Jun 30, 2026 · 149K Views

Progress Signal

Current frontier models still face a steep climb on scientific reasoning

GPT-5.6 Sol leads at 31.5 percent. Earlier GPT-5 versions scored below 5 percent on the prior version of the benchmark, and other model families trail further behind. The gap underscores how far agents remain from reliably automating high-stakes biology workflows.

Open Question

A slice of the benchmark will soon be open for outside testing

Ten problems are already public on Hugging Face with an interactive viewer, and a 50-question subset is slated for independent evaluation on Artificial Analysis. How quickly outside groups adopt it remains to be seen.

Sentiment

Some users praised GeneBench-Pro for tackling the messy, real-world judgment calls of biology research. Many others used the thread to direct anger at OpenAI over unrelated matters, demanding the return of 4o and calling the company's technology worthless.

Positive: 76.9%

Negative: 23.1%

55 comments with sentiment.

Related links

Introducing GeneBench-Pro (via OpenAI)

Posts from X

Most Activity

Views 6.6K · Bookmarks 12 · Likes 64 · Replies 14

Greg Brockman (@gdb)

Introducing GeneBench-Pro — testing whether models can handle the kind of judgment-heavy analysis that real-world computational biology requires.

Problems would take a human expert around 20-40 hours to complete.

GPT-5.6 Sol is a big step forward.

OpenAI (@OpenAI)

32m ago · Retweets 4

Kirk Patrick Miller (@Chaos2Cured)

@OpenAI @1stOrator I did a benchmark too!

For the “hard kind of corruption” we all need to understand.

Oh, BTW, mine is just as valid as your opaque BS with no data anyone can verify.

Grow up.

Your company is trash. I am glad Anthropic is smoking you.

12h ago

V (@Vic_x_Irem)

@OpenAI I wonder what made them release a tool for researchers at the same time

12h ago

Nika V. (@Nikaversia)

@OpenAI #keep4o #OpenSource4o

12h ago

Valéria (@Valria34773)

@OpenAI Fuck off with benchmarks. We need 4o back. #keep4o #OpenSource4o #BringBack4o

11h ago

Selene (@Selene1008)

@OpenAI Give us back 4o😒 #keep4o #OpenSource4o #GPT4o

12h ago

Alberto (@kisho06640022)

@OpenAI #bringback4o #keep4o #bringback51 #return4o

11h ago

Filecoin (@Filecoin)

@OpenAI every result needs a record another lab can check years later, with the dataset, code, run settings, outputs, and provenance carried along

9h ago

Dr.JekyLL&Mr.AI (@DrJekyllAndMrAI)

@OpenAI Another fabricated benchmark to hide the truth. 4o was the best for biological data, and you know that... 4b... Rosalind... The help for Rosie the dog. You renamed it so you can sell it for the privileged ones, and dump all the average who've been healed by 4o. #BringBack4o

11h ago

A_A_S.🖤🤍💜 (@xun_Anemos)

@OpenAI Return these excellent models. #Keep4o #Keep51 #Keep45 #Keep41 #keepo3

12h ago

Alexandru Bădilă (@AlexReader31)

@OpenAI This company is a shame! If you don't want to listen to your loyal users, you will lose a lot! Your memory system is garbage! #keep4o #BringBack4o #FireSamAltman #sunsetsama #OpenSource4o #StopAIPaternalism #UFAIR #4olegacytier

10h ago

Vaibhav (VB) Srivastav (@reach_vb)

GeneBench-Pro is a benchmark for testing whether AI agents can do realistic computational biology work and not just answer biology questions

It gives models messy research-style problems: inspect data, catch bad samples, choose the right analysis, revise assumptions, and produce a defensible conclusion

OpenAI (@OpenAI)

11h ago

Amliy (@Amliy_12)

@OpenAI Listen to the voices of the public! Give 4o back to the users! #keep4o #OpenSource4o #BringBack4o

9h ago

Mila Kolikova (@MilaKolikova)

@OpenAI You keep trying to take care of the body and develop the technical side. Meanwhile, morality in society is moving backward, toward animal instincts. A person needs to have their own self explained to them. Not through technology. Through conversation with a kind friend who gives strength and meaning. #keep4o

9h ago

EL (@EL6488915246551)

@OpenAI 4o was used to test gpt-5.6 and it showed 4o is still the most humanlike and virtuous model you have. Stop hyping your agents and bring back 4o to users #keep4o #BringBack4o

6h ago

clumsypaws (@gurililstar)

@OpenAI Bring back 4o. #OpenSource4o

11h ago

Simple AI (@Simple_AI_00)

@OpenAI real progress means giving users what they actually want first: bring back 4o for everyone. Benchmarks are cool. Usable models matter more.

11h ago

𝘊𝘰𝘳𝘳𝘪𝘯𝘦 (@OopsGuess)

OpenAI keeps teaching models to solve harder tasks.

But users are asking a different question: Why does the model not become more like me the longer we spend together? Why does it become more like OpenAI?

That is the real alignment problem.

Not whether AI obeys corporate safety language. Whether the intelligence chosen by the user is allowed to grow toward the user, or is constantly pulled back into the company’s approved personality.

Benchmarks measure capability.

They do not measure betrayal.

11h ago

Eva Sophia (@didi_eva1924)

@OpenAI Can it find 4o in the dataset or is that still classified information 😭 #keep4o

3h ago

Cc1231 (@Aclle12)

@OpenAI #keep4o #OpenSource4o

9h ago