AI Science Models Fail Basic Benchmark Tasks 20 Percent of Time

Original post

AI:AM (@AI_in_the_AM)

"ask them to boil water and they can't do it 20% of the time"

Peter Jansen, Research Scientist at Ai2, says current AI science models still fail basic benchmark tasks far too often.

"They're really terrible at that"

"you gotta pay attention to all the really simple ways that they break"

@peterjansen_ai

7:10 AM · Jun 10, 2026 · 583 Views
