We’re introducing GeneBench-Pro, a research-level benchmark for a harder kind of AI progress: how well agents can navigate messy biological data, choose the right analysis path, and make judgment calls that real computational research depends on. https://openai.com/index/introducing-genebench-pro/
OpenAI launches GeneBench-Pro to evaluate AI agents on computational biology, with top models scoring under 35%
Story Overview
OpenAI released GeneBench-Pro, a benchmark built around 129 realistic computational biology problems that demand messy-data handling, method selection, and research-level judgment calls. Human experts typically need 20-40 hours per task, yet the strongest model reaches only 31.5 percent even with maximum reasoning and Pro mode enabled.
Current frontier models still face a steep climb on scientific reasoning
GPT-5.6 Sol leads at 31.5 percent while earlier GPT-5 versions scored below 5 percent on the prior version of the benchmark, and other families trail further behind. The gap underscores how far agents remain from reliable automation of high-stakes biology workflows.
A slice of the benchmark will soon be open for outside testing
Ten problems are already public on Hugging Face with an interactive viewer, and a 50-question subset is slated for independent evaluation on Artificial Analysis. How quickly outside groups adopt it remains to be seen.
Some users praised GeneBench-Pro for tackling messy real-world judgment calls in biology research, while many others used the replies to press an unrelated demand that OpenAI restore 4o or to dismiss the company's technology as worthless.
Most Activity
Introducing GeneBench-Pro — testing whether models can handle the kind of judgment-heavy analysis that real-world computational biology requires.
Problems would take a human expert around 20-40 hours to complete.
GPT-5.6 Sol is a big step forward.

@OpenAI @1stOrator I did a benchmark too!
For the “hard kind of corruption” we all need to understand.
Oh, BTW, mine is just as valid as your opaque BS with no data anyone can verify.
Grow up.
Your company is trash. I am glad Anthropic is smoking you.

@OpenAI I wonder what made them release a tool for researchers at the same time

@OpenAI #keep4o #OpenSource4o

@OpenAI Fuck off with benchmarks. We need 4o back. #keep4o #OpenSource4o #BringBack4o

@OpenAI Give us back 4o😒 #keep4o #OpenSource4o #GPT4o

@OpenAI #bringback4o #keep4o #bringback51 #return4o

@OpenAI every result needs a record another lab can check years later, with the dataset, code, run settings, outputs, and provenance carried along

@OpenAI Another fabricated benchmark to hide the truth. 4o was the best for biological data, and you know that... 4b... Rosalind... The help for Rosie the dog. You renamed it so you can sell it for the privileged ones, and dump all the average who've been healed by 4o. #BringBack4o

@OpenAI Return these excellent models. #Keep4o #Keep51 #Keep45 #Keep41 #keepo3

@OpenAI This company is a shame! If you don't want to listen to your loyal users, you will lose a lot! Your memory system is garbage! #keep4o #BringBack4o #FireSamAltman #sunsetsama #OpenSource4o #StopAIPaternalism #UFAIR #4olegacytier
GeneBench-Pro is a benchmark for testing whether AI agents can do realistic computational biology work and not just answer biology questions
It gives models messy research-style problems: inspect data, catch bad samples, choose the right analysis, revise assumptions, and produce a defensible conclusion

@OpenAI Listen to the voices of the public! Give 4o back to the users! #keep4o #OpenSource4o #BringBack4o

@OpenAI You keep trying to take care of the body and to develop the technical side. Meanwhile, morality in society is moving backward, toward animal instincts. A person needs to have themselves explained to them. Not through technology. Through conversation with a kind friend who gives strength and meaning. #keep4o

@OpenAI 4o was used to test gpt-5.6 and it showed 4o is still the most humanlike and virtuous model you have. Stop hyping your agents and bring back 4o to users #keep4o #BringBack4o

@OpenAI Bring back 4o. #OpenSource4o

@OpenAI real progress means giving users what they actually want first bring back 4o for everyone. Benchmarks are cool. Usable models matter more.

OpenAI keeps teaching models to solve harder tasks.
But users are asking a different question: Why does the model not become more like me the longer we spend together? Why does it become more like OpenAI?
That is the real alignment problem.
Not whether AI obeys corporate safety language. Whether the intelligence chosen by the user is allowed to grow toward the user, or is constantly pulled back into the company’s approved personality.
Benchmarks measure capability.
They do not measure betrayal.

@OpenAI Can it find 4o in the dataset or is that still classified information 😭 #keep4o

@OpenAI #keep4o #OpenSource4o