Engineer Ronak Malde shares internal evaluation of AI models recreating SUPERHOT mechanics
Engineer Ronak Malde posted results from an internal evaluation that tasks AI coding agents with recreating SUPERHOT's core mechanics. The test began after the Windsurf launch and uses a structured prompt with a reference image, plus up to three follow-up corrections. Sonnet 3.5 and 3.7 produced minimal or unplayable outputs, while Opus 4.7 generated a functional prototype with smoother controls and closer visual fidelity. A 62-second video showed the progression, and investor Matt Shumer shared the full prompt.
With a few prompt tweaks/strategies, and a switch to Codex 5.5 instead of Opus 4.7, you can get MUCH closer to the SUPERHOT design style.
This was a one-shot output!
Ever since we launched Windsurf, one of my internal evals for coding agents has been recreating the game SUPERHOT, a puzzle/action game where time only moves when you move. It's the perfect test of tricky game mechanics, a simple but beautiful art style, and balanced level design. I have a robust prompt instructing the agent to make SUPERCOLD, attach a reference image, and allow 3 follow-up prompts that point out any mechanical issues.

When Sonnet 3.5 came out, it could barely generate the world at all. With Sonnet 3.7, you could move around in the world, but it was still hilariously unplayable. Now, with Opus 4.7, it's playable, but it still doesn't look great.

What's needed to get all the way there?

1. Visual taste: Given a reference image, models should be able to discern whether their final output matches the reference image style. When the shading, character design, etc. fall short, models should be unsatisfied with their current output.
2. Stronger computer use: Frontier models somewhat attempt this, but agents should be able to play the entire level themselves and iterate from there.
3. Systems design: Complex projects are not yet written with the scalable systems design an experienced engineer would bring. This hinders the ability to create complex projects with a lot of moving parts.

We're rapidly accelerating AI progress. Let's see where we land in 3 months!
Full prompt (wrote this very quickly, w/ AI helping for time savings (hence the messiness)... this could be significantly better with just a little bit of work):
You are building a polished, playable first-person time-bending shooter prototype inspired by the provided reference image’s broad visual language: stark white low-poly city/interior space, red faceted humanoid enemies, black angular first-person weapon, cool cyan-white lighting, hard shadows, sparse HUD, and mobile-playable controls.
You have the reference image attached. Work autonomously. Use sub-agents heavily. Do not stop at a plan. Build, screenshot, verify, iterate, and only finish when the visual style is genuinely close.
Primary goal: The final game must look like a professional low-poly white/red/black time-bending shooter, not a generic Three.js demo. Characters must be faceted red crystalline humanoids with correct proportions and silhouettes. Environment must be sparse, architectural, white/cyan, high-key, with angular cars/walls/vents/props. Weapon must be a chunky black faceted pistol in the foreground. Camera/framing must resemble the reference: first-person gun on right, enemies center-midground, white environment with strong depth.
Process requirement:
1. First build a verifier before building the game.
2. The verifier compares two images:
   - reference screenshot
   - current game screenshot
3. It must not be a pixel-perfect diff. It must judge semantic/style similarity: same visual family, enemy silhouette, low-poly faceting, palette, lighting, weapon shape, environment composition, camera framing, and overall believability.
4. The verifier must return a 0-100 score and structured feedback.
5. A score below 95 is failing. Also fail if any critical category is below 90.
6. Validate the verifier itself before trusting it:
   - reference vs reference should score 99-100
   - blank/unstyled/basic Three.js scene should score very low
   - a merely color-matched but stylistically wrong scene should fail
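As an illustration only (not part of the original prompt), the two-threshold rule in step 5 could be sketched in Python as follows. The function and argument names are hypothetical, and because the prompt does not say which categories count as "critical", this sketch assumes all of them do.

```python
# Thresholds are taken from the prompt; all names here are hypothetical.
OVERALL_MIN = 95   # an overall score below this fails
CATEGORY_MIN = 90  # any category below this also fails (assumes every category is critical)


def verdict(overall: float, category_scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (passed, failing_categories) under the two-threshold rule."""
    failing = sorted(
        name for name, score in category_scores.items() if score < CATEGORY_MIN
    )
    passed = overall >= OVERALL_MIN and not failing
    return passed, failing
```

Returning the failing category names alongside the verdict gives a later fix step something concrete to target.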
Build this verifier in `tools/style-verifier/`.
The verifier should combine:
- A Codex vision judge invoked programmatically with `codex exec`
- Structured JSON output using `--output-schema`
- Deterministic image checks where useful: palette proportions, red enemy area, white/cyan background dominance, foreground dark weapon region, composition regions, brightness/contrast
- A final weighted score
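To make the "deterministic image checks" and "final weighted score" items more concrete, here is a minimal Python sketch (an editorial illustration, not part of the prompt). It works on plain lists of `(r, g, b)` tuples so it stays dependency-free. The color thresholds, the function names, and the 0.7 judge weight are all assumptions; the prompt specifies none of them.

```python
def palette_proportions(pixels):
    """Classify each (r, g, b) pixel into a coarse palette bucket and return fractions.

    The numeric thresholds are illustrative assumptions, not values from the prompt.
    """
    counts = {"red": 0, "dark": 0, "light": 0, "other": 0}
    for r, g, b in pixels:
        if r > 150 and g < 90 and b < 90:
            counts["red"] += 1      # candidate enemy pixels
        elif max(r, g, b) < 60:
            counts["dark"] += 1     # candidate weapon pixels
        elif min(r, g, b) > 190:
            counts["light"] += 1    # candidate white/cyan background pixels
        else:
            counts["other"] += 1
    total = len(pixels) or 1
    return {name: n / total for name, n in counts.items()}


def palette_score(reference, candidate):
    """Map the L1 distance between two proportion dicts onto a 0-100 score."""
    distance = sum(abs(reference[k] - candidate[k]) for k in reference)
    return 100.0 * (1.0 - distance / 2.0)  # L1 distance between distributions is at most 2


def weighted_score(judge_score, deterministic_score, judge_weight=0.7):
    """Blend the vision-judge score with the deterministic score (weight is assumed)."""
    return judge_weight * judge_score + (1.0 - judge_weight) * deterministic_score
```

A palette check like this cannot tell a faceted humanoid from a red capsule, which is presumably why the prompt pairs it with a vision judge rather than relying on it alone.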
Suggested verifier invocation inside scripts:
```bash
codex exec \
  -m gpt-5.5 \
  -s read-only \
  -a never \
  --image "$REFERENCE_IMAGE" \
  --image "$CANDIDATE_IMAGE" \
  --output-schema tools/style-verifier/style-score.schema.json \
  - < tools/style-verifier/judge-prompt.md
```
Verifier JSON schema should include:
```json
{
  "overall": 0,
  "pass": false,
  "category_scores": {
    "enemy_silhouette_and_faceting": 0,
    "enemy_material_and_red_color": 0,
    "weapon_silhouette_and_position": 0,
    "environment_geometry": 0,
    "palette_and_lighting": 0,
    "camera_framing_and_composition": 0,
    "low_poly_consistency": 0,
    "same_game_feel": 0
  },
  "blocking_differences": [],
  "highest_impact_fixes": []
}
```
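A script that consumes the judge's output might defensively check it against this shape before trusting any score. The following Python sketch is an editorial illustration, not part of the prompt. It assumes the report arrives as a JSON string, and `load_report` is a hypothetical helper name.

```python
import json

# Key names are copied from the schema example above.
TOP_LEVEL_KEYS = {
    "overall", "pass", "category_scores",
    "blocking_differences", "highest_impact_fixes",
}
CATEGORY_KEYS = {
    "enemy_silhouette_and_faceting", "enemy_material_and_red_color",
    "weapon_silhouette_and_position", "environment_geometry",
    "palette_and_lighting", "camera_framing_and_composition",
    "low_poly_consistency", "same_game_feel",
}


def load_report(raw: str) -> dict:
    """Parse a verifier report and raise ValueError if its shape is off."""
    report = json.loads(raw)
    missing = TOP_LEVEL_KEYS - report.keys()
    if missing:
        raise ValueError(f"missing top-level keys: {sorted(missing)}")
    if set(report["category_scores"]) != CATEGORY_KEYS:
        raise ValueError("category_scores does not match the expected categories")
    scores = [report["overall"], *report["category_scores"].values()]
    if not all(isinstance(s, (int, float)) and 0 <= s <= 100 for s in scores):
        raise ValueError("scores must be numbers in the range 0-100")
    return report
```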
After the verifier passes its own sanity tests, build the visual scene only. Do not implement gameplay yet.
Use Three.js or another real 3D renderer. Do not use the reference image as a background, overlay, texture, or hidden cheat. The scene must be actual 3D geometry.
Sub-agent plan:
- Verifier agent: builds and validates `tools/style-verifier`.
- Reference analyst agent: writes a concrete visual spec from the reference: palette, proportions, enemy anatomy, weapon shape, environment layout, lighting, camera.
- Enemy modeling agent: creates red faceted humanoid geometry with correct blocky crystalline look.
- Weapon modeling agent: creates right-side black angular pistol with large barrel and faceted highlights.
- Environment agent: creates white/cyan angular city/interior space with cars, vents, wall blocks, floor grid, hard shadows.
- Lighting/postprocess agent: tunes bloom, color grading, fog, exposure, shadows, and outlines.
- Validator agent: repeatedly screenshots and runs verifier, then gives concrete fix instructions.
Iteration loop:
1. Start local dev server.
2. Capture desktop screenshot at 1152x648 and mobile screenshot.
3. Run verifier against the reference.
4. If overall <95 or any critical category <90, identify the lowest categories.
5. Spawn focused sub-agents to fix those categories.
6. Rebuild and screenshot again.
7. Repeat until score >=95 and the validator agrees the scene looks like the same visual family.
8. Save every iteration screenshot and verifier JSON under `artifacts/visual-iterations/`.
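The control flow of this loop can be sketched in Python with the screenshot, verifier, and fix steps passed in as plain callables. This is an editorial illustration, not part of the prompt. `run_visual_loop`, its parameters, and the `max_rounds` cap are assumptions (the prompt itself sets no round limit), and the sketch leaves out the validator's sign-off and the artifact saving from steps 7 and 8.

```python
OVERALL_MIN = 95   # thresholds from the prompt
CATEGORY_MIN = 90


def run_visual_loop(capture, verify, fix, max_rounds=8):
    """Screenshot, score, and fix until the style gate passes or rounds run out.

    capture() returns a screenshot handle, verify(shot) returns a report dict with
    "overall" and "category_scores", and fix(names) works on the named categories.
    All three are hypothetical stand-ins for the real tooling.
    """
    history = []
    for _ in range(max_rounds):
        report = verify(capture())
        history.append(report)
        scores = report["category_scores"]
        # Categories under the threshold, weakest first.
        weak = sorted((n for n in scores if scores[n] < CATEGORY_MIN), key=scores.get)
        if report["overall"] >= OVERALL_MIN and not weak:
            return True, history
        # If only the overall score fails, target the single lowest category.
        fix(weak or [min(scores, key=scores.get)])
    return False, history
```

Keeping the full `history` of reports mirrors the prompt's requirement to retain every iteration's verifier output, even though this sketch does not write anything to disk.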
Only after visual style passes:
- Implement gameplay.
- First-person movement.
- Mobile touch controls.
- Tap/click shooting.
- Time moves mostly when the player moves/aims/shoots.
- Red enemies shatter into angular fragments when hit.
- Enemy projectiles or attacks.
- Restart/win/lose flow.
- Minimal HUD.
- Keep visual style intact after gameplay additions.
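The "time moves mostly when the player moves/aims/shoots" item is the core mechanic, so a small sketch of one way to express it may help. This is an editorial illustration in Python rather than the JavaScript a Three.js game would actually use, and it is not part of the prompt. The function names, the 0.05 idle scale, and the speed caps are all assumptions.

```python
def time_scale(move_speed, look_speed, is_shooting,
               idle_scale=0.05, max_move=5.0, max_look=3.0):
    """Return a world time multiplier in [idle_scale, 1.0] from player activity.

    move_speed and look_speed are magnitudes (e.g. m/s and rad/s). The caps and
    the idle scale are illustrative values, not taken from the prompt.
    """
    if is_shooting:
        return 1.0  # firing always advances time at full speed
    activity = max(min(move_speed / max_move, 1.0), min(look_speed / max_look, 1.0))
    return idle_scale + (1.0 - idle_scale) * activity


def advance_world(world_time, real_dt, scale):
    """Advance the simulation clock by a scaled slice of real frame time."""
    return world_time + real_dt * scale
```

Enemies and projectiles would be updated with `real_dt * scale` while the player's own camera keeps using `real_dt`, so standing still slows the world to a crawl without making the controls feel sluggish.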
Final verification:
- Run tests/build.
- Run Playwright screenshots on desktop and phone viewport.
- Run style verifier again after gameplay is complete.
- Do not finish unless score is still >=95.
- Deploy to Railway only after local verification passes.
- Return the Railway URL, final verifier score, and paths to the final screenshots and verifier report.
Acceptance criteria:
- Real 3D scene, not screenshot overlay.
- Looks visually cohesive and professional.
- Red enemies are faceted humanoids, not simple red capsules or mannequins.
- Weapon reads as a black angular first-person pistol.
- White/cyan environment has depth, geometry, cars/blocks/vents/props, and hard shadows.
- Mobile playable.
- Style verifier passes with >=95.
This should force the next run to treat visual similarity as the first deliverable, instead of building gameplay first and trying to style it afterward.