3h ago

Gemini 3.5 Flash scores 6.7% on independent Game Boy Advance emulator benchmark, outperforming Kimi K2.6 and Gemini 3.1 Pro with code that yields only a flashing screen

Observers contrasted the result with earlier claims that the model had built an entire OS, and noted that the benchmark was run through the model's native CLI harness.

Original post

Still far above Kimi K2.6 and Gemini 3.1 Pro. But yes, this puts the "built an entire OS" boast into doubt.

7:05 PM · May 21, 2026

I think the lazy conclusion to make here is that 3.5 Flash is benchmaxxed and can't generalize. That's probably partially the case, but I think the truth is probably slightly more interesting.

It seems mechanize uses the model's native CLI harness for these evals - but that is different from antigravity. I think it's entirely possible that gdm has tried to squeeze a ton of juice for antigravity and neglected to train on their CLI, causing shockingly poor performance on evals like this.

Almost like the model is lobotomized when you remove it from its home!

4:52 AM · May 22, 2026 · 1.3K Views