OPUS improves LLM pre-training efficiency by aligning iteration-by-iteration data selection with the optimizer's geometry

Original post

There is now a smarter way to pick data for training LLMs!

Enter OPUS!

This is an ICML Oral paper from SJTU, Alibaba, UW–Madison, UIUC, and Mila - Quebec AI Institute.

The proposed method dynamically and intelligently selects the most impactful data for LLM pre-training in every single training iteration, bringing principled, continuous data optimization to the forefront.

This approach aims to significantly boost training efficiency and yield higher-quality LLMs, outperforming conventional static data selection methods across diverse language tasks.

OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

Paper: https://arxiv.org/pdf/2602.05400

Our report: https://mp.weixin.qq.com/s/xzmjviMMwX20tcjwutNmxQ

📬 #PapersAccepted by Jiqizhixin

2:23 AM · May 24, 2026 · 76.9K Views



Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

very ambitious

Quoting the original post above by 机器之心 JIQIZHIXIN@jiqizhixin.

umumu@umi33563

@teortaxesTex Hm, not sure I like this


Adel Bucetta@adelbucetta

@jiqizhixin the honest answer is that OPUS still leaves humans doing the work to define 'impactful' data


Moon@MoonL88537

@umi33563 @teortaxesTex lol, automated benchmaxxing. I kind of like this anyway though? i think the general idea is legit, possibly gold.

being able to profile trajectories is huge


umumu@umi33563

@MoonL88537 @teortaxesTex ">automated benchmaxxing": yeah, my first thought as well. But after pondering a bit more: this moves a big and very difficult problem (what's the generally optimal dataset for training) to a very difficult problem (what's the generally optimal small proxy dataset), which might be tractable.


Lazarz@Laz4rz

@teortaxesTex Haven’t read the abstract even but graphs look banger


Alexa Web3 (e/acc)@alexabelonix

@jiqizhixin nice build.


Abhijith@Elon_einstein

@jiqizhixin Interesting idea. It's efficient and performs great up to 40B. How can we estimate how it behaves at larger scales, in the trillions for example? (Yet to read the paper)


Taha ⵣ@mlnomadpy

@teortaxesTex Just use knn


umumu@umi33563

@MoonL88537 @teortaxesTex Also, I like the distribution-guided optimization idea without reservations.


metamyth 🧃@not_amyth

@adelbucetta @jiqizhixin oh is it? can’t we define at least a semi-automated heuristic that scales?
