Pretraining Data Curation Produces More Concise AI Models


For two years we've made the same case: data is the most underinvested, highest-leverage lever in ML.

This is one more dimension of it: output length isn't a fixed property of a model, it's a property of the data it learned from.

Ari Morcos (@arimorcos)

New @datologyai work: a 4B VLM curated for concision answers correctly for 35× less compute than Qwen3.5-4B, with similar performance.

Same size, same task. The whole gap is how many tokens each model spends. 🧵


Ari Morcos (@arimorcos)

And because it's learned at training time, the saving compounds: pay once, collect on every inference the model ever runs. As inference becomes the dominant cost of AI, that's the whole game.

Paper: https://arxiv.org/abs/2606.25432

Blog: https://www.datologyai.com/blog/brevity-is-the-soul-of-inference-efficiency

Ari Morcos (@arimorcos)

And check out @leavittron's thread here:

Matthew Leavitt (@leavittron)

What if you could induce models to be more concise via pretraining data curation?
