Gemini 3.5 Flash scores 6.7% on independent Game Boy Advance emulator benchmark, outperforming Kimi K2.6 and Gemini 3.1 Pro with code that yields only a flashing screen
Observers noted contrast with prior broad system-building claims using native CLI harness.
I think the lazy conclusion to draw here is that 3.5 Flash is benchmaxxed and can't generalize. That's probably partially the case, but I think the truth is slightly more interesting.
It seems Mechanize uses the model's native CLI harness for these evals, but that is different from Antigravity. I think it's entirely possible that GDM has tried to squeeze a ton of juice for Antigravity and neglected to train on their CLI, causing shockingly poor performance on evals like this.
Almost like the model is lobotomized when you remove it from its home!