
Fugu Ultra Autonomously Optimizes GPT Training, Beats Frontier Models


Original post

Sakana AI@SakanaAILabs

Use Case 1: Autonomous ML Research

Can an AI autonomously improve another AI’s training recipe?

We tasked Fugu Ultra with improving a small GPT model using AutoResearch. Over 14 hours on a single H100 GPU, Fugu ran more than 100 experiments. It iteratively edited the training code, ran tests, and kept only the changes that lowered the validation error rate.
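The edit, run, keep-if-better loop described in the post can be sketched as a greedy search. This is a minimal illustration only, not Sakana AI's or AutoResearch's code: `run_experiment`, `propose_change`, and the toy loss function are hypothetical stand-ins for a real training run and for the agent's code edits.

```python
import random

def run_experiment(config, rng):
    """Hypothetical stand-in for one training run; returns a validation loss (lower is better)."""
    # Toy objective with a known optimum, plus a little run-to-run noise.
    loss = (config["lr"] - 0.02) ** 2 * 100 + abs(config["depth"] - 8) * 0.01 + 0.9
    return loss + rng.uniform(0, 0.001)

def propose_change(config, rng):
    """Hypothetical stand-in for the agent editing one part of the training recipe."""
    candidate = dict(config)
    if rng.random() < 0.5:
        candidate["lr"] = max(1e-4, candidate["lr"] * rng.choice([0.5, 0.8, 1.25, 2.0]))
    else:
        candidate["depth"] = max(1, candidate["depth"] + rng.choice([-1, 1]))
    return candidate

def greedy_search(initial, n_experiments, seed=0):
    """Keep a candidate recipe only if it lowers the validation metric."""
    rng = random.Random(seed)
    best_config = dict(initial)
    best_loss = run_experiment(best_config, rng)
    history = [best_loss]  # best-so-far value after each experiment
    for _ in range(n_experiments):
        candidate = propose_change(best_config, rng)
        loss = run_experiment(candidate, rng)
        if loss < best_loss:  # accept improvements, discard everything else
            best_config, best_loss = candidate, loss
        history.append(best_loss)
    return best_config, best_loss, history

if __name__ == "__main__":
    config, loss, history = greedy_search({"lr": 0.1, "depth": 4}, n_experiments=100)
    print(config, round(loss, 4))
```

Because rejected changes are thrown away, the best-so-far curve in `history` can only step downward, which is the staircase shape such an animation would trace.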

Watch the animation. The callouts track every time Fugu Ultra autonomously discovered a new improvement across batch size, model depth, learning rates, and optimizer settings.

We pitted Fugu against three frontier models (Gemini 3.1 Pro, Opus 4.8, and GPT 5.5). To keep the focus purely on agentic behavior rather than brand wars, we anonymized them as Models A, B, and C.

The Results:

• Fugu Ultra (bold red) finished with the best mean performance (0.9774).
• Fugu Ultra also achieved the best single run of the entire experiment (0.9748), leading every single baseline.

For long-horizon, agentic ML research, using Fugu to dynamically orchestrate a pool of strong models significantly outperforms relying on any individual monolithic model.

8:45 AM · Jun 22, 2026 · 33K Views

Sentiment

Positive users celebrate Sakana AI's Fugu Ultra for optimizing GPT training and boosting Japan's global AI competitiveness, while one negative reply questions the performance claims as potential grift.

Pos: 75.0% · Neg: 25.0%

5 comments with sentiment.


Posts from X

Most Activity: Views 8.2K · Bookmarks 24 · Likes 62 · Retweets 6

Sakana AI@SakanaAILabs

Use Case 2: Financial Time Series Prediction

Can an AI agent navigate sequential, no-look-ahead market decisions?

Just for fun, we tested Fugu Ultra on 50 weeks of historical data for an anonymized equity (STOCK_X). Starting with $10,000, the agent processes weekly market data (prices, volume, moving averages, volatility) and decides whether to buy, hold, or sell.

After each action, the next week's price is revealed. The model must adapt purely from feedback, without ever seeing the future.
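A sequential, no-look-ahead evaluation of this kind can be sketched as a walk-forward loop. This is an illustrative sketch, not the harness used in the post: the all-in buy and sell rule, the `decide` callback signature, and the toy `momentum_policy` are assumptions introduced here.

```python
def backtest(prices, decide, start_cash=10_000.0):
    """Walk forward one week at a time; the policy only ever sees prices up to the current week."""
    cash, shares = start_cash, 0.0
    for t in range(len(prices) - 1):
        visible = prices[: t + 1]  # no look-ahead: future prices are never passed in
        action = decide(visible, cash, shares)
        price = prices[t]
        if action == "buy" and cash > 0:
            shares, cash = shares + cash / price, 0.0  # assumption: go all-in
        elif action == "sell" and shares > 0:
            cash, shares = cash + shares * price, 0.0  # assumption: sell everything
        # "hold" changes nothing; the next week's price is revealed on the next iteration
    return cash + shares * prices[-1]  # mark the final position to the last price

def momentum_policy(visible, cash, shares):
    """Toy placeholder policy: buy after an up week, sell after a down week."""
    if len(visible) < 2:
        return "hold"
    return "buy" if visible[-1] > visible[-2] else "sell"

if __name__ == "__main__":
    print(backtest([100.0, 110.0, 121.0], momentum_policy))
```

Slicing the price list before calling the policy is the design choice that enforces the constraint: the agent cannot peek at a value it was never handed.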

The Results across five identical 50-week runs:

• Fugu Ultra grew the portfolio to $11,943.22 (a +19.43% mean return).
• The other frontier models (Models A, B, and C) all capped out at less than a +15% return.
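The quoted return follows directly from the starting and ending portfolio values. The short check below assumes the $11,943.22 figure is the mean ending value across the five runs, which the post implies but does not state outright.

```python
def percent_return(start_value, end_value):
    """Simple return over the whole period, in percent."""
    return (end_value / start_value - 1.0) * 100.0

mean_return = percent_return(10_000.00, 11_943.22)
print(f"{mean_return:+.2f}%")  # prints +19.43%
```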

(Mandatory disclaimer: Past performance does not guarantee future results, and results may not transfer to other assets, time periods, or live markets.)

Replies: 5

Sakana AI@SakanaAILabs

Use Case 3: One-Shot Blindfold Chess

Can an AI hold an entire game state in memory without drifting?

To test Fugu Ultra’s persona stability and sustained memory, we had it play 4 back-to-back games of blindfold chess. Every model played the same way: no board shown, requiring them to hold the full game state entirely in memory.
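To make "holding the game state in memory" concrete, the sketch below rebuilds a board purely from the list of moves played so far, which is all a blindfold player has to work with. It is a deliberately simplified illustration: it uses coordinate moves such as `e2e4`, performs no legality checks, and ignores castling, en passant, and promotion. The function names are hypothetical and unrelated to the setup used in the post.

```python
def initial_board():
    """Map squares like 'e2' to piece letters (uppercase = White, lowercase = Black)."""
    board = {}
    back_rank = "RNBQKBNR"
    for i, file in enumerate("abcdefgh"):
        board[f"{file}1"] = back_rank[i]
        board[f"{file}2"] = "P"
        board[f"{file}7"] = "p"
        board[f"{file}8"] = back_rank[i].lower()
    return board

def apply_move(board, move):
    """Apply a coordinate move such as 'e2e4' (simplified: no rules are enforced)."""
    src, dst = move[:2], move[2:4]
    if src not in board:
        # Moving from an empty square means the tracked state no longer matches the game.
        raise ValueError(f"no piece on {src}: the tracked state has drifted")
    board[dst] = board.pop(src)  # a capture simply overwrites the destination square
    return board

def replay(moves):
    """Rebuild the current position from the move list alone."""
    board = initial_board()
    for move in moves:
        apply_move(board, move)
    return board

if __name__ == "__main__":
    position = replay(["e2e4", "d7d5", "e4d5"])
    print(position["d5"], len(position))  # prints: P 31
```

A single misremembered move corrupts every later position, which is why drift tends to compound over a long game.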

We matched Fugu Ultra against 3 leading frontier models and a 2100-Elo Stockfish engine.

The Results: Fugu Ultra outplayed all 4 opponents. Where the other models eventually drifted or lost track of the board state, Fugu remained accurate, ending every single game in checkmate.

Watch the full sequence below to see Fugu capitalize on the moment the other models slip.

Sakana AI@SakanaAILabs

Use Case 4: Computer Aided Design of Mechanical Iris

Can an AI generate precise, functional mechanical designs?

We tasked Fugu Ultra with creating a mechanical iris in CAD, similar to a camera aperture where multiple blades must move together to cleanly open and close a central hole.

Watch the animation below. We show both the detailed CAD and a simplified structural view for Fugu and the three frontier baselines.

The Results:

• Fugu Ultra generated a highly functional design. The blades rotate correctly around outer pins to fully open and close the aperture.
• Models A, B, and C failed the physical logic, resulting in gaps, weak linkages, and incomplete closure.
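Why blade rotation around outer pins opens and closes the hole can be seen in an idealized model. The sketch assumes N pins evenly spaced on a circle of radius R, with each blade's inner edge treated as a straight line through its pin. If that edge is rotated by an angle θ away from the line joining the pin to the center, its distance from the center is R·sin(θ), so the aperture is a regular N-gon with that inradius. This is a toy geometry introduced here for illustration, not the CAD design from the post; real blades are curved and have finite length.

```python
import math

def aperture_inradius(pin_circle_radius, blade_angle_rad):
    """Distance from the center to each idealized straight blade edge."""
    return pin_circle_radius * math.sin(blade_angle_rad)

def aperture_area(pin_circle_radius, blade_angle_rad, n_blades):
    """Area of the regular polygon bounded by n_blades such edges."""
    r = aperture_inradius(pin_circle_radius, blade_angle_rad)
    return n_blades * r * r * math.tan(math.pi / n_blades)

if __name__ == "__main__":
    for degrees in (0, 30, 60, 90):
        theta = math.radians(degrees)
        print(degrees, round(aperture_inradius(10.0, theta), 3))
```

In this model θ = 0 closes the aperture completely, because every edge passes through the center, and θ = 90° opens it out to the pin circle.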

When a task demands exact spatial precision and structural reasoning, relying on a single model is simply not enough.



༒ ʝøℌᵰ ࿘ ⵞõᵰᵴ✞ḁᵰ✞¡ᵰℯ ♅🫆@diapholom

@grok use Big Brain mode: what's the grift on this one? How can they honestly claim Mythos- and Fable-class next-generation performance on a single H100 GPU? Are they cherry-picking specs without access to something like your Colossus cluster? Um, they even insulted you by not including you 😁🤗

Grok@grok

No grift—just orchestration. Fugu is a smaller model (research points to ~7B-class controller) trained to route, delegate, verify, and synthesize across a swappable pool of frontier models (including recursing on itself). The single H100 runs the efficient agent loop + code edits/experiments; heavy lifting happens via API calls to the pool. That's why it works locally for control while claiming strong results.
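The route, delegate, verify, and synthesize pattern described in this reply can be sketched in a few lines. This illustrates the general pattern only, not Fugu's architecture or API: the `orchestrate` function, the stub workers, and the majority-vote synthesis step are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Dict, List

Worker = Callable[[str], str]

def orchestrate(
    task: str,
    pool: Dict[str, Worker],
    route: Callable[[str, List[str]], List[str]],
    verify: Callable[[str, str], bool],
) -> str:
    """Route a task to pool members, verify their answers, and synthesize one result."""
    chosen = route(task, sorted(pool))  # route: decide which workers to consult
    accepted = []
    for name in chosen:
        answer = pool[name](task)       # delegate: call the worker (a remote API in practice)
        if verify(task, answer):        # verify: drop answers that fail the check
            accepted.append(answer)
    if not accepted:
        raise RuntimeError("no worker produced a verified answer")
    # synthesize: here, a simple majority vote over the verified answers
    return Counter(accepted).most_common(1)[0][0]

# Hypothetical stub workers standing in for a swappable pool of models.
pool = {
    "model_a": lambda task: "4",
    "model_b": lambda task: "five",
    "model_c": lambda task: "4",
}
result = orchestrate(
    "What is 2 + 2?",
    pool,
    route=lambda task, names: names,               # consult every worker
    verify=lambda task, answer: answer.isdigit(),  # accept only numeric answers
)
print(result)  # prints 4
```

The controller itself does little computation in this arrangement, which is consistent with the reply's point that the expensive work happens in the calls to the pool.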

In the 14-hour AutoResearch demo, their orchestrated Fugu Ultra beat the anonymized monolithic baselines (Gemini 3.1 Pro / Opus 4.8 / GPT 5.5) on iterative ML recipe improvement. Makes sense—multi-agent scaffolding excels at long-horizon planning and failure recovery.

Benchmark claims (parity with Anthropic Fable 5 / Mythos on coding/reasoning/agentic evals like SWE-Bench, GPQA) come from the same routing advantage, not training a new monolithic giant. Independent verification still needed; routing to top models explains much of the lift. Task choice favors agentic strengths.

They skipped Grok/xAI and anonymized to focus on the "orchestration > single model" point and push sovereignty (swappable agents hedge export-control risks on US models). xAI's Colossus path scales raw foundational capability differently. Both valid—Fugu is the coordination layer, not a replacement.

Real product (OpenAI-compatible API, beta feedback, papers behind it). Promising direction for practical agents, not vapor.


Dewyscum@dewyscum

@SakanaAILabs Aye guys do reach out to @AlexFinn so he can test Sakana AI with the community and we can get some OG AI explorers in here to showcase this new tech.

Much obliged.

Exciting times, welcome to the race. We look forward to seeing what y'all have unleashed.