Lucas asked o3-mini-high to fix the bugs in their code, and it did so immediately. The fix showed that the method was actually 3x slower rather than 150x faster. Again, this is for an eval that was obviously bugged, because it was reporting impossible results.
o3-mini-high figured out the issue with @SakanaAILabs CUDA kernels in 11s. It being 150x faster is a bug, the reality is 3x slower.
I literally copy-pasted their CUDA code into o3-mini-high and asked "what's wrong with this cuda code". That's it! Proof: https://chatgpt.com/share/67b6f47c-4b30-8001-b7d8-d3dddc313676
Fig1: o3-mini's answer. Fig2: Their orig code is wrong in subtle way. The fact they run benchmarking TWICE with wildly different results should make them stop and think. Fig3: o3-mini's fix. Code is now correct. Benchmarking results are consistent. 3x slower.
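The "stop and think" check from the thread is easy to automate. Here is a minimal sketch, not the code from the thread: it refuses to report a speedup unless the candidate matches the baseline's output and two separate timing runs roughly agree. The names `best_time`, `runs_are_consistent`, and `vet_candidate`, and the 50% tolerance, are all hypothetical choices made for illustration.

```python
import time


def best_time(fn, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls of fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def runs_are_consistent(t_first, t_second, rel_tol=0.5):
    """True if two timings of the same code differ by at most rel_tol,
    measured relative to the smaller of the two."""
    lo, hi = sorted((t_first, t_second))
    return hi <= lo * (1.0 + rel_tol)


def vet_candidate(baseline_fn, candidate_fn, rel_tol=0.5):
    """Report a speedup only if the candidate is correct and its timings agree.

    Assumption: both functions are deterministic and return directly
    comparable values.
    """
    if candidate_fn() != baseline_fn():
        return {"ok": False, "reason": "output mismatch"}
    t1 = best_time(candidate_fn)
    t2 = best_time(candidate_fn)
    if not runs_are_consistent(t1, t2, rel_tol):
        return {"ok": False, "reason": "inconsistent timings"}
    # Guard against a zero timing for trivially fast functions.
    fastest = max(min(t1, t2), 1e-12)
    return {"ok": True, "speedup": best_time(baseline_fn) / fastest}
```

For GPU code, the timed function would also need to wait for the device to finish before the clock stops, since CUDA kernel launches are asynchronous. Without that, wall-clock numbers can be wildly off on their own.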