Martín Casado, a16z general partner, warns open-source AI faces $2-4 billion training costs and distillation bottlenecks, drawing developer pushback
Critics argue frontier model training costs are closer to $1 billion at most.
@martin_casado At the true frontier they don’t - they specialize in per-dollar performance and supporting a diverse set of use cases. Separate but overlapping ecosystems.
Can someone explain to me how open source models can keep up if ...
- pre-training isn't saturated
- it costs $2-4B to train a current gen model
- distillation is increasingly hard as access to the most powerful models gets blocked
.. ?
The debate over whether open or closed models win comes down to whether there is disproportionate value to marginally better intelligence.
The believers in this sit across from the "open models will be good enough" camp.
Closed models will stay slightly smarter. Open models will be cheaper.
It's a timing question. I expect for the next couple of years we'll continue to see only closed source models at the frontier because there are a lot more improvements that come with high costs. As that starts to stabilize, more of a chance for open source to catch up.
(But I still see Anthropic, OAI, other frontier labs positioned to do well even when that happens, the infra + agent functionality bundled with the model is meaningful)
I think we'll see a combination of
(i) algorithmic + computational efficiencies reducing training costs -- won't be cheap by any means but less than expectations
(ii) people willing to fund given the market opportunity -- the fact that open source is still active given how much costs have increased since ~4 years ago is pretty remarkable
(iii) "indirect" distillation probably will stay for a long time
@maithra_raghu It's not clear that, if distillation is cut off, open source can catch up. Let's say at that point it costs $5b to train a competitive model. Who would pay that for something open source?
@martin_casado there are two large, capable, and well-resourced entities with clear strategic interests in ensuring open models keep up: China and Nvidia
preventing distillation and capturing market share are in tension. it'll be hard to distill GPT-7-BioChem, easy to distill Default Claude.
@martin_casado the "killer app" of open models is customization, which is always going to be easier and cheaper than closed model customization.
open models are within N months of the real frontier, and closing the gap in-domain is cheap. big win for anything workflow-shaped at scale.
> - it costs $2-4B to train a current gen model
I'd like to see the mafs on that. As far as I can tell, "current gen models" are at most (90th percentile) ≈6X DeepSeek V4 Pro in M(active) and 10x in D. That's maaaybe $1B. And I mean Mythos, not Opus/5.5, those are 2-3x cheaper.
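For readers following the arithmetic in the reply above, here is a minimal sketch of how the 6x and 10x figures combine. It assumes (the thread does not say this) that training compute scales with active parameters times training tokens, as in the common C ≈ 6·N·D rule of thumb, and that cost is proportional to compute. The function names are hypothetical, and the baseline cost is derived from the "$1B" figure rather than taken from any published number.

```python
# Assumption: compute ~ active parameters x training tokens (C ~ 6*N*D),
# and dollar cost is proportional to compute. Neither is stated in the thread.

def compute_multiplier(param_ratio: float, token_ratio: float) -> float:
    """Relative training compute versus a baseline run."""
    return param_ratio * token_ratio


def implied_baseline_cost(total_cost_usd: float, multiplier: float) -> float:
    """Baseline run cost that would make the scaled-up run cost total_cost_usd."""
    return total_cost_usd / multiplier


# "6X ... in M(active) and 10x in D" from the reply
multiplier = compute_multiplier(param_ratio=6, token_ratio=10)

# "That's maaaybe $1B" from the reply
baseline = implied_baseline_cost(1e9, multiplier)

print(f"compute multiplier: {multiplier:.0f}x")
print(f"implied baseline run cost: ${baseline / 1e6:.1f}M")
```

Under these assumptions the "$1B" estimate corresponds to a baseline run in the tens of millions of dollars; a different scaling rule or hardware price would change the result.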
> But it’s hard to tell with all the very cheap capital flooding in.
with very high inference margins too. It is remarkable that frontier labs pretty much don't compete on cost. Like, they do, but with the shared understanding that >50% margin is sacrosanct. No involution allowed!
@martin_casado @sun_hanchi nobody's doing it for 6 months, the model will be obsolete on release
@martin_casado @sun_hanchi I am saying that the project can take whatever, 1 year, 2 years, but I am not aware even of gossip about training runs >4 months. 2-3 is normal
@teortaxesTex @sun_hanchi Often multiple models are released around a single pre-training.
... also, there is a ton of pricing power on the frontier by being marginally better than everyone else
... also we know now that the frontier labs are focusing on autocatalytic processes of using models to make models (create GPU kernels, data cleaning etc.).
So autocatalysis will improve economies of scale.
@cwolferesearch yeah unfortunately it's not. I know nothing about Arcee Trinity. But I know a lot about the frontier labs and their training costs.
@martin_casado training cost estimate here is way off... total cost to pretrain arcee trinity large (400B) was ~$20 million including compute, salaries, data, etc. https://www.interconnects.ai/p/arcee-ai-goes-all-in-on-open-models
@willccbb Nice answer.
Well no, it was about how open weight labs can keep up with frontier labs, which are in the billions per run. And they are limiting access to the largest models. And they are starting to employ autocatalytic features. And the market is showing preference for marginally better models.
your question was not about frontier labs... it was about open weight labs, which have a drastically different structure / approach. Publicly-shared numbers for training costs at open labs are 100-1000X smaller than the number you cited. The gap in performance has remained relatively consistent, with public estimates recently finding that open labs are ~3-4 months behind the frontier. First-mover cost is way higher; open labs just have to replicate and not fall behind too much.
@grok can you summarize the discussion in comments of this post and the primary positions / discussion points?
1) I’m very interested in how hard in the limit. If I reserve my largest model and only use it directly for services (eg auditing a bank and issuing a report), the chances of distillation go way down. We’re already seeing signs of this. Holding Mythos back. Not showing reasoning traces. Penalizing third party. Subsidizing first party etc.
2) Yeah for sure. Although to date this hasn’t really worked if viewed through market share. And capital requirements are getting increasingly onerous. If MSFT's and Meta's next models are closed (as the rumor goes) it’d seem we’re moving away from this.
@martin_casado 1) distillation hard to stop 2) other economically interested parties to backstop? (chip vendors, governments)
1) So, I go back and forth. History would suggest that when you have a few players, and large capital costs, the industry will converge to an oligopoly rather than have margin eroding competition. We saw this with cloud, telcos, chips, etc. Given how effective distillation is, I suspect the trend we're seeing to less access is real.
2) Yup, very much agree.
1) fair enough! do you think this is a likely outcome in the limit? it seems like competition might force the FMs to keep releasing the best models they can afford
2) this seems like china's game for now until some of the us-funded options (reflection etc) are ready. and nation-state budgets haven't come into the picture yet, agree?
@cwolferesearch Oh for sure. I think the number actually is about 2-5% of the cost to build a "close enough" model. However the question remains whether being an epsilon better results in sufficient pricing leverage to take the market. Thus far it has. Will that continue? I don't know.
open labs will likely stay 3-6 months behind frontier labs (as has been the case for some time), and they will probably do this while spending 100X less money than frontier labs (as has been the case for some time). Arcee is one example, but also see DeepSeek, Olmo, Nemotron, etc. All of these open models publicize their training costs, the info is openly available.
@cwolferesearch Totally. I've just been surprised how little this is actually reflected in the market.
@martin_casado also worth mentioning the reverse question - if open weight labs can replicate $4B worth of research efforts with ~$20M, do frontier labs have a capital efficiency problem?
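The cost figures traded in this exchange can be lined up with a few lines of arithmetic. This is only a sketch using the numbers quoted in the thread ("$2-4B", "~$20 million", "2-5%"); the variable and function names are hypothetical, and none of the dollar figures are independently verified here.

```python
# Figures as quoted in the thread; not independently verified.
frontier_low, frontier_high = 2e9, 4e9  # "$2-4B to train a current gen model"
arcee_cost = 20e6                       # "~$20 million" for Arcee Trinity Large


def cost_ratio(frontier_usd: float, open_lab_usd: float) -> float:
    """How many times larger the frontier figure is than the open-lab figure."""
    return frontier_usd / open_lab_usd


ratio_low = cost_ratio(frontier_low, arcee_cost)
ratio_high = cost_ratio(frontier_high, arcee_cost)

# "about 2-5% of the cost" applied to the same $2-4B range
close_enough_low = 0.02 * frontier_low
close_enough_high = 0.05 * frontier_high

print(f"frontier vs. Arcee: {ratio_low:.0f}x to {ratio_high:.0f}x")
print(f"'close enough' model: ${close_enough_low / 1e6:.0f}M to ${close_enough_high / 1e6:.0f}M")
```

On these quoted numbers, the gap to the Arcee figure sits at the low end of the "100-1000X" range mentioned earlier in the thread, and the "2-5%" estimate brackets the ~$20M figure from above rather than matching it.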
@teortaxesTex General rule of thumb right now is 100k gpus for 6 months. There are a lot of ways you can back into this number. But of course it's all from industry gossip around company raises, GPU procurement, gross burn etc.
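One way to sanity-check the "100k gpus for 6 months" rule of thumb against the "$2-4B" claim is to back out the all-in cost per GPU-hour it would imply. This is a rough sketch: it assumes every GPU is busy for the full period and that the whole budget is spent on GPU time, neither of which the thread states. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def gpu_hours(num_gpus: int, months: float) -> float:
    """Total GPU-hours, assuming all GPUs run for the full period."""
    return num_gpus * (months / 12) * HOURS_PER_YEAR


def implied_rate(total_cost_usd: float, hours: float) -> float:
    """All-in dollars per GPU-hour implied by a total budget."""
    return total_cost_usd / hours


hours = gpu_hours(100_000, 6)
print(f"GPU-hours: {hours:,.0f}")

# "$2-4B" from the original post
for budget in (2e9, 4e9):
    rate = implied_rate(budget, hours)
    print(f"${budget / 1e9:.0f}B implies ${rate:.2f} per GPU-hour")
```

A shorter run, such as the 2-3 months another reply calls normal, would scale the GPU-hours down proportionally and push the implied rate up for the same budget.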
@martin_casado That’s not a frontier game but cost-efficiency plus, very likely, a subagent market. Current debate over tokenomics is very reminiscent of early open source.
I originally thought this S curve meant we should be bullish on OS models, and I still do to some extent
But I think it's underrated how much there may be an increasing frontier of complexity that can give value in many knowledge work domains
And if that's true, then to keep up in a competitive situation (eg ultimately economic competition between countries), you need to stay on that frontier, because everything below that creates differentially less value
That said, it may take some time to get to the point where gains in a competitive market concentrate heavily at the frontier, because the models are improving fast enough that they just unlock a huge bucket of low hanging fruit.
It seems more plausible we see a period of automation in any given profession over eg 5 years, before further efficiency gains and thus competitive advantage become closer to frontier-only, if there is indeed a high frontier of complexity.
@martin_casado yeah I agree
@martin_casado Why would open model labs not deploy $2B to train? Many are well capitalized and generating substantial revenues through APIs and model licensing.