Use Case 1: Autonomous ML Research
Can an AI autonomously improve another AI’s training recipe?
We tasked Fugu Ultra with improving a small GPT model using AutoResearch. Over 14 hours on a single H100 GPU, Fugu ran more than 100 experiments. It iteratively edited the training code, ran tests, and kept only the changes that lowered the validation error rate.
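The edit, run, keep-if-better loop described above can be sketched as a simple greedy search. This is a minimal illustration, not the actual Fugu Ultra or AutoResearch implementation: `greedy_search`, `toy_propose`, and `toy_evaluate` are hypothetical names, and the "validation error" here is a synthetic function standing in for a real training run.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def greedy_search(
    baseline: Config,
    propose: Callable[[Config, random.Random], Config],
    evaluate: Callable[[Config], float],
    n_experiments: int,
    seed: int = 0,
) -> Tuple[Config, float, List[Tuple[int, float]]]:
    """Keep a candidate change only if it lowers the validation error."""
    rng = random.Random(seed)
    best_cfg = dict(baseline)          # copy so the caller's baseline is untouched
    best_score = evaluate(best_cfg)
    improvements: List[Tuple[int, float]] = []

    for step in range(1, n_experiments + 1):
        candidate = propose(best_cfg, rng)   # "edit the training recipe"
        score = evaluate(candidate)          # "run the experiment"
        if score < best_score:               # keep only strict improvements
            best_cfg, best_score = candidate, score
            improvements.append((step, score))
        # otherwise the change is discarded and the search continues
        # from the previous best recipe

    return best_cfg, best_score, improvements


def toy_propose(cfg: Config, rng: random.Random) -> Config:
    """Hypothetical edit: rescale one hyperparameter by a fixed factor."""
    new_cfg = dict(cfg)
    key = rng.choice(sorted(new_cfg))
    new_cfg[key] *= rng.choice([0.5, 0.8, 1.25, 2.0])
    return new_cfg


def toy_evaluate(cfg: Config) -> float:
    """Synthetic bowl-shaped 'validation error'; NOT a real training run."""
    lr_term = (math.log10(cfg["learning_rate"]) + 3.0) ** 2
    bs_term = 0.01 * (math.log2(cfg["batch_size"]) - 6.0) ** 2
    return 1.0 + lr_term + bs_term


if __name__ == "__main__":
    start = {"learning_rate": 0.01, "batch_size": 32.0}
    cfg, score, history = greedy_search(start, toy_propose, toy_evaluate, 100)
    print("best config:", cfg)
    print("best synthetic error:", score)
    print("improvements found at steps:", [step for step, _ in history])
```

In a real setup, `propose` would be the agent editing the training code and `evaluate` would be a full training run that returns the validation metric. The `improvements` list plays the role of the callouts in the animation: one entry for each accepted change.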
Watch the animation. The callouts track every time Fugu Ultra autonomously discovered a new improvement across batch size, model depth, learning rates, and optimizer settings.
We pitted Fugu against three frontier models (Gemini 3.1 Pro, Opus 4.8, and GPT 5.5). To keep the focus purely on agentic behavior rather than brand wars, we anonymized them as Models A, B, and C.
The Results:
• Fugu Ultra (bold red) finished with the best mean performance (0.9774; lower is better).
• Fugu Ultra also achieved the best single run of the entire experiment (0.9748), ahead of every baseline.
For long-horizon agentic ML research, using Fugu to dynamically orchestrate a pool of strong models outperformed relying on any individual monolithic model in this experiment.










