Elie Bakouch says Fugu's AI orchestrator lacks true model swappability and underperforms Opus on SWE-Bench Pro

@eliebakouch @basedjensen wait if it’s turn-based, they just bust the cache at every turn??

elie@eliebakouch

to be clear, this is a closed source orchestrator on top of closed source models. if before you didn't control the models, now you don't even control which ones are used or how much. this is not "AI sovereignty"

i've also read the tech report to get an opinion on the technical stuff:

fugu (not the ultra version) is basically a classifier that selects which model at each turn is most likely to answer correctly (in other words a router). this leads to -10 points on SWE Bench pro compared to opus, and gets some gains on other benchmarks but very slight. the argument could be that it reduces cost, but there is no information about this so it's likely the opposite. they also have an autoresearch benchmark where they compare to frontier models "Model A, B and C", and it's really crazy to not be transparent about what models you compare against. let's also say that this probably doesn't support adding new llm out of the box since you need to retrain the classifier
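A minimal sketch of the router idea described above, to make the retraining point concrete. Everything here is hypothetical: the pool names, the two toy features, and the weights are made up for illustration, and the tech report's actual classifier is not described in this thread. The only point is that a classifier head has one output per model, so a model added to the pool has no output to be selected through.

```python
import math

# Hypothetical pool; the thread says the report does not clearly name its models.
MODEL_POOL = ["model_a", "model_b", "model_c"]

# Toy linear classifier head: one weight row per model in the pool.
# Made-up numbers standing in for a trained router.
WEIGHTS = [
    [0.9, -0.2],  # model_a
    [-0.1, 0.8],  # model_b
    [0.3, 0.3],   # model_c
]


def featurize(turn):
    """Toy features: [looks like a coding turn, looks like a math turn]."""
    text = turn.lower()
    is_code = float("def " in text or "bug" in text)
    is_math = float("prove" in text or "integral" in text)
    return [is_code, is_math]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(turn, pool=MODEL_POOL, weights=WEIGHTS):
    """Pick the model the classifier scores highest for this turn."""
    if len(pool) != len(weights):
        # The head has a fixed number of outputs, so a new model cannot be scored.
        raise ValueError("pool size does not match router head; retraining needed")
    features = featurize(turn)
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    probs = softmax(scores)
    return pool[probs.index(max(probs))]


if __name__ == "__main__":
    print(route("fix this bug in def foo()"))
    print(route("prove the integral converges"))
```

Calling `route` with a fourth model appended to the pool raises the `ValueError`, which is the "not supported out of the box" situation in miniature.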

about fugu ultra, this is basically an advanced plan mode and orchestrator: a model that for a query outputs a plan with multiple "workflows". my understanding of workflows is that they say: "spawn model A subagents to achieve this, then use model B to judge it, then summarize this with model C", which is just a test time compute scaling strategy. i think this is an okish way to do it, but it's limited by the fact that they need to predict everything before the agents start working, which is why they limit this to 5 steps. imo you need to predict what to spawn at t+1 with the information you get at t, not with the info you get at t=0. there are also other issues such as fable 5 score on terminal bench being wrong and them being super vague and unclear about which models are in the LLM pool (they only mention closed source api ones)
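To illustrate the t=0 versus t distinction, here is a small sketch under stated assumptions. The step names (`draft`, `judge`, `redraft`, `summarize`) and the toy executor are invented for the example; only the 5-step cap comes from the thread. `run_static` commits to the whole plan before anything runs, while `run_adaptive` picks each step after seeing the results so far.

```python
MAX_STEPS = 5  # the step cap mentioned in the thread


def run_static(query, planner, execute):
    """Plan everything at t=0, then execute the fixed plan."""
    plan = planner(query)[:MAX_STEPS]
    results = []
    for step in plan:
        results.append(execute(step, results))
    return results


def run_adaptive(query, next_step, execute):
    """Choose the step for t+1 from what was observed up to t."""
    results = []
    for _ in range(MAX_STEPS):
        step = next_step(query, results)
        if step is None:
            break
        results.append(execute(step, results))
    return results


# Hypothetical toy environment: the first draft is buggy and the judge notices.
def toy_execute(step, results):
    if step == "draft":
        return "draft:buggy"
    if step == "redraft":
        return "redraft:fixed"
    if step == "judge":
        fixed = bool(results) and results[-1] == "redraft:fixed"
        return "judge:pass" if fixed else "judge:fail"
    return f"{step}:done"


def toy_planner(query):
    # Fixed up front; it cannot know the judge will fail.
    return ["draft", "judge", "summarize"]


def toy_next_step(query, results):
    if not results:
        return "draft"
    kind, _, outcome = results[-1].partition(":")
    if kind in ("draft", "redraft"):
        return "judge"
    if kind == "judge":
        return "summarize" if outcome == "pass" else "redraft"
    return None  # already summarized


if __name__ == "__main__":
    print(run_static("fix the bug", toy_planner, toy_execute))
    print(run_adaptive("fix the bug", toy_next_step, toy_execute))
```

In this toy setup the static plan summarizes a failed draft, while the adaptive loop spends its remaining budget on a redraft and a second judge pass.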

the biggest and most obvious issue is that they are introducing a "test time scaling" method with "best of N" over models, and they literally NEVER REPORT the number of output tokens or cost to achieve a benchmark/task
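A rough sketch of the accounting that would make such a comparison meaningful: with best of N, every sample is paid for, not only the one that gets returned. The model names, prices, and scores below are hypothetical placeholders, not figures from the report or from any provider's price list.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    model: str
    output_tokens: int
    score: float  # hypothetical judge score for this sample


# Hypothetical prices in USD per million output tokens.
PRICE_PER_MTOK = {"model_a": 10.0, "model_b": 30.0}


def best_of_n(samples):
    """Return the winning sample plus the total tokens and cost of all N samples."""
    best = max(samples, key=lambda s: s.score)
    total_tokens = sum(s.output_tokens for s in samples)
    total_cost = sum(
        s.output_tokens * PRICE_PER_MTOK[s.model] / 1_000_000 for s in samples
    )
    return best, total_tokens, total_cost


if __name__ == "__main__":
    samples = [
        Sample("model_a", 1000, 0.6),
        Sample("model_b", 2000, 0.9),
        Sample("model_a", 1000, 0.7),
    ]
    best, tokens, cost = best_of_n(samples)
    print(best.model, tokens, round(cost, 4))
```

Reporting `total_tokens` and `total_cost` next to each benchmark score is what lets a reader compare against a single-model baseline at equal budget.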

the right comparison here is not with opus but with opus with ultracode/workflows enabled, not with kimi but with kimi swarm, etc. very very confusing release

Chris 🇨🇦@llm_wizard

I like any "systems of models" research because I believe "systems of models" is the way to go, even if no one is particularly good at it yet.

The relatively simple argument to make from this, like others of the same kind, is that you can do this with arbitrary systems of models.

The more people talk about this, the sooner we get the good version.

Ahmad@TheAhmadOsman

@eliebakouch thank you for saying that out loud, saw so many celebratory reactions on this release that left me very confused

nothing to celebrate here in regards to sovereign / opensource ai

am.will@LLMJunky

@eliebakouch It's also not as great as advertised imo

elie@eliebakouch

> let's also say that this probably doesn't support adding new llm out of the box since you need to retrain the classifier

realizing that fugu ultra also probably needs to be retrained to make model changes. adding models is not supported, and for removing one you can tell it via a prompt not to select that model id, but again they don't quantify how much that impacts performance. actually they don't quantify the "swappable" aspect at all, which is one of the main claims lol

Aryaman Arora@aryaman2020

@eliebakouch it's technically true that it will never be subject to export controls though!

Samian@ApplyWiseAi

@eliebakouch "sovereignty" without model weights is just renting with better branding. you're trusting the orchestrator's routing logic AND the model choice... that's two layers of black box

Amar Patel@amar_patel

@LLMJunky @eliebakouch Thanks for proving out what you and I already knew ❤️

𝒮@politilols

@ApplyWiseAi @eliebakouch Great AI comment Samian!

Sarah 🇨🇦 🏳️‍⚧️@SarahLacard

@eliebakouch sakana sometimes (often) does this thing where something gets a bit of traction and they almost immediately release or announce their own version inspired by or riffing on it - happens a lot

this one felt inspired by openrouter fusion to me

elie@eliebakouch

@xeophon @basedjensen i think they don't put the full previous turns into the context or smth like that, some kind of compaction or idk. but otherwise yes

Tyler Williams@unmodeledtyler

@eliebakouch hell of a write up - thanks for the diligence!

josepha_mayo@josepha_mayo

@eliebakouch they are so not transparent they said "you'll be billed on the top tier model when using pool of models" one could even potentially run broke 😂
