o3-mini-high Exposes Bugs in Sakana AI CUDA Kernels Claiming 150x Speedup

Now let’s talk about the AI scientist saga:

As soon as people started using the system it became clear that the claims about it and the examples they released were not remotely grounded in the model’s actual abilities. Here’s one example: https://arxiv.org/abs/2502.14297v3

Stella Biderman@BlancheMinerva

Lucas asked o1 to fix the bugs in their code and found it was immediately able to do so, but the fix showed that the method was actually 3x slower rather than 150x faster. Again, this is for an eval that was obviously bugged because it was reporting impossible results.
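For readers unfamiliar with how a benchmark can report an impossible speedup, here is a minimal sketch of the general safeguard: compare the candidate's output against a trusted reference before timing anything. The thread does not describe the actual bug, and this is not Sakana's harness. The names `checked_speedup`, `best_time`, `reference_square`, and `broken_square` are hypothetical, and plain Python lists stand in for GPU tensors.

```python
import time


def best_time(fn, args, repeats=5):
    """Return the fastest of several wall-clock runs of fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers.
    return max(best, 1e-12)


def checked_speedup(reference_fn, candidate_fn, args, tol=1e-6):
    """Verify the candidate's output first; only then report a speedup."""
    expected = reference_fn(*args)
    actual = candidate_fn(*args)
    same_length = len(expected) == len(actual)
    if not same_length or any(abs(e - a) > tol for e, a in zip(expected, actual)):
        raise ValueError("candidate output differs from reference; speedup is meaningless")
    return best_time(reference_fn, args) / best_time(candidate_fn, args)


def reference_square(xs):
    """Trusted baseline (hypothetical): square every element."""
    return [x * x for x in xs]


def broken_square(xs):
    """Hypothetical 'fast' candidate that skips the work entirely."""
    return []


if __name__ == "__main__":
    data = ([float(i) for i in range(1000)],)
    try:
        checked_speedup(reference_square, broken_square, data)
    except ValueError as err:
        print("rejected:", err)
```

A candidate that does no work will always look fast, so a harness that times first and checks later (or never) can produce numbers like the ones criticized here.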

Stella Biderman@BlancheMinerva

They later claimed that the system “passed the peer-review process” when what really happened was one of the three papers they submitted received reviews that were marginally above the acceptance threshold. We don’t actually know if it would have passed

https://sakana.ai/ai-scientist-first-publication/

Stella Biderman@BlancheMinerva

These findings can’t be squared with the claims Sakana made in their marketing materials such as “near human accuracy” in reviewing papers (when tested on 10 papers, it had a 50% precision, 20% recall, and 28.6% F1-score) or the ability to write and run code without human input.
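The three figures quoted in that tweet are at least internally consistent, which is easy to check: F1 is the harmonic mean of precision and recall. The snippet below is only a sanity check on the arithmetic. The helper name `f1_score` is introduced here for illustration, and the inputs are the precision and recall stated in the tweet.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as quoted in the thread (10-paper test).
reported_precision = 0.50
reported_recall = 0.20

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.1%}")  # prints "F1 = 28.6%"
```

Because the harmonic mean is pulled toward the smaller input, the 20% recall dominates the result even though precision is 50%.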

Stella Biderman@BlancheMinerva

peer review because they withdrew the paper before it got an accept or reject decision. So maybe its 43rd percentile score would have gotten in. Maybe it wouldn’t. But claiming it passed peer review is just blatantly false.

The paper, by the way, is hot garbage.

Stella Biderman@BlancheMinerva

Another reason to believe that their AI scientist claims are nonsense: Sakana hasn’t done much good work. If they really had an AI scientist over a year ago, surely they’d have been able to publish important technical research using said scientist!

Stella Biderman@BlancheMinerva

I recall there being a major update to the AI Scientist that was also quite bad, separately from v1, but since most of the criticism was on Twitter and many of Sakana’s tweets have been deleted, I’m having trouble finding anything I can strongly stand behind claiming is wrong.
