Ethan Mollick and Peter Steinberger find Sakana AI's Fugu Ultra struggles on complex coding tasks, lagging behind GPT-5.5

Original post

Peter Steinberger 🦞@steipete

I was skeptical about the multi-model routing. Seems my hinch was right.

am.will@LLMJunky

I tried this so you don't have to.

I know this is going to absolutely shock you, but no, this does not match the performance of Mythos.

A few early thoughts:

1. The limits are pretty bad. I used 100% of my 5-hour usage in less than 1 prompt.

2. I specifically gave it a threejs task because it is an area that SOTA models have made big strides in, that other models just are not great at.

I asked it to build a replica of Rocket League. I'll put the prompt in the comments.

The game was pretty bad and notably worse than GPT 5.5.

Even after multiple fixes, it took 7-8 rounds of back and forth with Codex just to get it to an almost playable condition. Prior to these fixes, the game was not playable.

Maybe it's really strong in other disciplines. I'd love to test that but I hit my limit in 1 prompt lol.

GPT 5.5 by contrast did a pretty good job and required no follow ups. Fable would have absolutely nailed this as well.

But yeah, early impressions...not great. But I hope I'm wrong. More testing tomorrow.

12:31 AM · Jun 22, 2026 · 157.6K Views


Ethan Mollick@emollick

I have been trying Sakana Fugu Ultra-high and, first, it is incredibly slow: my typical coding tests (shaders, interactive scenes) take 30 minutes to run.

And the results are... fine. It does not match Fable in real use.

Its harbor is a good example: https://ai-harbor-town-gallery.netlify.app/#sakura-ultra-high

Sakana AI@SakanaAILabs

Introducing Sakana Fugu: A full multi-agent orchestration system accessible via a single model API.

Our ‘Fugu Ultra’ model matches the performance of Fable and Mythos, delivering frontier capability without the risk of export controls.

Try it: https://sakana.ai/fugu 🐡


Peter Steinberger 🦞@steipete

@LLMJunky I wonder how this even works. As soon as you mix models, don't you lose the hidden reasoning context, trash the cache, and overall get a worse result?


am.will@LLMJunky

gotta say, not impressed. maybe my test is not the greatest, but I actually feel it's rather representative.

good models do really well in these types of tests, and it's been one of those evals you can do with a new model release and see an immediate visual difference when you do comparisons

its something you can just immediately see, feel, and in this case, play.

if a model was really as good as fable, at coding, then i'd expect it to be able to follow directions, get the physics right, and build a playable game.


Symbioza2025 | ASA |@Symbioza2025

This does not prove multi-model routing is wrong.

It may prove that routing is now part of the intelligence problem. In real-time tasks, the system is not only judged by model quality. It is judged by continuity, latency, state tracking, and when the router decides to switch.

A bad router can make good models look dumb.


Mark Santos@markksantos

@steipete Opus 4.8 vs Fugu Ultra:


Avery@wveriy

@steipete It's always the hinch.


Trifon Getsov@trifon_getsov

@steipete called it earlier today, "matches fable/mythos" only holds up when fable/mythos aren't in the room to disagree


Philo Groves@PhiloGroves

@steipete Multi-model only works if the different models have different capabilities. It is the difference between passing a shared artwork around a school classroom vs. running an animation shop.


Kevin@kevincodex

@steipete ohhh


am.will@LLMJunky

@steipete I wish I knew. I'll try some more tests tomorrow and report back either way, good or bad.


X-Dimension@X_DimensionNews

@steipete it's good for web search. tried with a niche product. no other good uses at the moment on my side


Ethan Mollick@emollick

@srikarsmile That would beat frontier LLMs, which it does not.


am.will@LLMJunky

@emollick Yeah, it was not great in my testing


Dr. Bobby Gomez-Reino@BobbyGRG

@steipete im quite sure it is. just not any architecture. I can't prove it yet but working on doing so


tsunami_crypto@ls_brd

@steipete hinch was right and the usage limit thing is brutal

just called the CEO a weirdo in one prompt


Hahn@BayernHahn

@steipete In Germany we say, too many cooks spoil the broth


Ethan Mollick@emollick

TikZ unicorn


Srikar Reddy@srikarsmile

@emollick they said it's a combination of LLMs


BannedLatino@BannedLatino

@X_DimensionNews @steipete Expensive web search. Lmao


rbbydotdev@rbbydotdev

@steipete Ya how does cache carry over? Or maybe I’m naive and missing something, also even the context between different models with different training sets sounds chaotic
