@suchenzang @wandering_mush A lot of working in an organization is organizing work such that you create credit for other people, and incentivize them to do things that are net positive
the reason "mid-training" fails to sustain its own paradigm of research at some of these major labs comes down to individual incentives more than scientific ones (despite how much people claim it to be irrelevant in some fully e2e codesign-with-applications utopia).
to explain the dynamics here, we have to first look at how "post-training" developed as a separate entity: pre-training and scaling these models historically have been more of an infra problem coupled with "rigorous scaling laws" to justify compute-time investment that exponentially increases YoY.
this type of scale and complexity attracted more of the "purists", aka people who were relatively early to buy into the scaling paradigm and early to language modeling in general.
(the modeling domains of vision, robotics, speech, etc historically have not required similar levels of multi-data-center level scale to be useful for practical applications, though they certainly have / continue to have their own individual scaling-up-to-utility curves)
by the time LLMs took off, interest in working in this space also grew exponentially, but these "core problems" of scaling LLMs didn't similarly grow in complexity to absorb this new interest.
furthermore, there was still a very real, very significant gap in bridging the raw LLMs to domains/applications, so "post-training" developed as a way of organizing people alongside the final deployment surface area, in order for an R&D artifact to successfully make its way into actual use-cases that made $$ or hype.
(as an aside, "post-training" is also where you see the most churn, since it involves 10x-100x more people, each either aiming for bits of the jagged frontier or getting locked out of bigger pools of compute for bigger scaling efforts)
the unfortunate bit here is that as the post-train surface and sprawl increase, so does the disconnect with pretraining decisions that are made in a vacuum (relatively speaking). in theory, this can be bridged if coherent metrics and signals are backproppable into pretraining decisions, but that endeavor similarly suffers from practical resource/execution constraints.
so again, in theory, there is ample room for a "mid-training" space to grow to fill this judgement/signal gap. however, if your organization incentivizes individual-hero-promo-maxxers, what this turns out to be is a turf-war of getting squeezed out by both pretraining and post-training teams. in order for you to succeed, you need to lube up two major political organizations, both of whom view you as a threat to their zero-sum land of finite impacc/compute allocations (and more concerningly, a threat to their own saturated success metrics, which have all been hill-climbed into near irrelevance).
so at some point, if you can deliver value in this space, you'll take a hard look at these challenges and wonder whether the upside of success is worth fighting these largely non-technical political battles wrapped over actual technical challenges. either way, with enough time and reorgs, you'll simply get absorbed into pre-/post-train orgs anyway, as a duality of power is much easier to balance at an exec level, compared to settling squabbles amongst a triumvirate.