2d ago

GPT-5.5 leads AI models in Mechanize emulator test


Mechanize tasked frontier AI coding agents with building a complete Game Boy Advance emulator from scratch inside a 24-hour window. The company released side-by-side test results showing gameplay footage from each generated emulator next to a reference implementation. GPT-5.5 produced the strongest working emulator, which ran multiple games successfully. Claude Sonnet 4.6 and Opus 4.7 performed nearly as well, while Gemini 3.1 Pro failed to deliver a functional version.

Original post

We gave frontier AI coding agents 24 hours to write a complete Game Boy Advance emulator from scratch. GPT-5.5's emulator runs games best, with Claude Sonnet 4.6 and Opus 4.7 close behind. Gemini 3.1 Pro failed to produce a working emulator.

10:36 AM · May 14, 2026

These guys will crack your ProgramBench

Quoting Mechanize (@MechanizeWork) · 5:36 PM · May 14, 2026 · 66.7K Views
6:38 PM · May 14, 2026 · 7.8K Views

@teortaxesTex there's this project called ScratchAnywhere that's basically a C implementation of the scratch runtime. i wonder how far proxy rewards or rejection sampling can go for stuff like game engines in an autoresearch-esque context

Quoting Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex) · 6:38 PM · May 14, 2026 · 7.8K Views
10:41 PM · May 14, 2026 · 474 Views

@teortaxesTex i'm thinking like, lowest MSE mismatch drift on video frames, on mel spectrogram audio frames, etc. as a proxy for game logic accuracy might be strangely robust for the general case

Quoting kalomaze (@kalomaze) · 10:41 PM · May 14, 2026 · 474 Views
10:42 PM · May 14, 2026 · 300 Views

@teortaxesTex (assuming frame state atomicity / accuracy is a variable you can control for independently of runtime speed)

Quoting kalomaze (@kalomaze) · 10:42 PM · May 14, 2026 · 300 Views
10:45 PM · May 14, 2026 · 147 Views
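
The frame-comparison idea floated in this reply chain can be sketched roughly as follows. This is only an illustration, not anything Mechanize or the posters published: the function names (frame_mse, mse_drift, proxy_score) and the representation of a frame as a flat list of pixel intensities are assumptions made for the example. It also assumes the two emulators' frames are already aligned one-to-one, independent of runtime speed, and it covers video frames only, not the mel spectrogram audio comparison.

```python
from typing import List, Sequence

# Hypothetical representation: one frame is a flat sequence of pixel intensities.
Frame = Sequence[float]


def frame_mse(reference: Frame, candidate: Frame) -> float:
    """Mean squared error between two frames of identical size."""
    if len(reference) != len(candidate):
        raise ValueError("frames must have the same number of pixels")
    if len(reference) == 0:
        raise ValueError("frames must not be empty")
    total = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    return total / len(reference)


def mse_drift(reference_frames: Sequence[Frame],
              candidate_frames: Sequence[Frame]) -> List[float]:
    """Per-frame MSE, assuming frame i of one run corresponds to frame i of the other."""
    if len(reference_frames) != len(candidate_frames):
        raise ValueError("runs must contain the same number of frames")
    return [frame_mse(r, c) for r, c in zip(reference_frames, candidate_frames)]


def proxy_score(reference_frames: Sequence[Frame],
                candidate_frames: Sequence[Frame]) -> float:
    """Average per-frame MSE over a run; lower means closer to the reference."""
    drift = mse_drift(reference_frames, candidate_frames)
    if not drift:
        raise ValueError("runs must contain at least one frame")
    return sum(drift) / len(drift)
```

Under these assumptions, a candidate emulator whose output matches the reference frame for frame scores 0.0, and the per-frame list from mse_drift shows where in a run the two outputs start to diverge.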

Now emulate Switch 2

Quoting Mechanize (@MechanizeWork) · 5:36 PM · May 14, 2026 · 66.7K Views
7:12 PM · May 14, 2026 · 21.2K Views