AI · 3h ago

Samip of Q! unveils q0, a population-based system that scales multi-epoch pretraining to 960 epochs without performance saturation

The approach outperforms single-model and naive ensembling baselines.

Andrew Gordon Wilson#148

Samip@industriaalist

1/ Now that we're running out of data, how do you optimally scale multi-epoch pretraining to hundreds of epochs?

Our first paper from Q!: q0 trains a population of models instead of a single model that saturates fast, reaching a dramatically lower loss at *every* epoch budget.

w/ @bishmdl76 @akshayvegesna @ShmuelBerman

8:43 AM · Jun 4, 2026 · 10.9K Views

Sentiment

Many users praise the q0 paper for enabling multi-epoch pretraining with model populations, saying its population framing challenges the data-wall narrative and introduces promising new training primitives.

Pos: 100.0% · Neg: 0.0%

4 comments with sentiment.

Posts from X

Most Activity

Views: 69 · Likes: 8

Samip@industriaalist

2/ Paper: https://arxiv.org/abs/2606.03938

q0 is built on one intuition, motivated by Solomonoff induction: instead of training one perfect model, train a population of diverse models and aggregate predictions. Everything in the algorithm follows from this one goal of efficiently training a population. It comes down to three core primitives:

Bookmarks: 1

Samip@industriaalist

6/ I'm confident this beats standard pretraining at any budget, even a single epoch, but the biggest limitation is inference cost. An ensemble of K models means K forward passes. It's effectively a way of growing the combined model's parameter count, like scaling depth but without the saturation depth scaling faces.

As with any large model, the fix is distillation into a single model, which tends to work magically well, but we leave that to future work.
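To make the "K models means K forward passes" point concrete, here is a minimal sketch of prediction aggregation over a population. It assumes uniform averaging of next-token probabilities, which is an illustrative choice and not necessarily the aggregation rule q0 uses; `make_toy_model`, `ensemble_predict`, and the three-token vocabulary are hypothetical names and values invented for this example.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(models, context):
    """Average next-token probabilities over K members.

    Each member is called once, so inference costs K forward passes.
    Uniform averaging is an assumption made for this sketch.
    """
    member_probs = [softmax(model(context)) for model in models]
    k = len(member_probs)
    vocab = len(member_probs[0])
    return [sum(p[i] for p in member_probs) / k for i in range(vocab)]

def nll(probs, target):
    """Negative log-likelihood of the target token."""
    return -math.log(probs[target])

def make_toy_model(logits):
    """Hypothetical stand-in for a trained model: fixed logits, context ignored."""
    return lambda context: list(logits)

# A toy population of K = 3 members over a 3-token vocabulary.
models = [make_toy_model(l) for l in ([2.0, 0.0, 0.0],
                                      [0.0, 2.0, 0.0],
                                      [1.0, 1.0, 0.0])]
target = 0

member_losses = [nll(softmax(m(None)), target) for m in models]
mean_member_loss = sum(member_losses) / len(member_losses)
ensemble_loss = nll(ensemble_predict(models, None), target)

print(f"mean member loss: {mean_member_loss:.4f}")
print(f"ensemble loss:    {ensemble_loss:.4f}")
```

Because the negative log is convex, the loss of the averaged prediction can never exceed the average of the member losses (Jensen's inequality). That guarantee holds for any population; how much lower the ensemble loss gets depends on how diverse the members are, which the toy logits here only gesture at.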

Retweets: 16

Samip@industriaalist

1/ Now that we're running out of data, how do you optimally scale multi-epoch pretraining to hundreds of epochs?

Our first paper from Q!: q0 trains a population of models instead of a single model that saturates fast, reaching a dramatically lower loss at *every* epoch budget.

w/ @bishmdl76 @akshayvegesna @ShmuelBerman

Replies: 1

Ward Plunet@StartupYou

@industriaalist @threadreaderapp please #unroll
