13h ago

OPUS improves LLM pre-training efficiency by aligning iteration-by-iteration data selection with the optimizer's geometry

It reduces the misalignment between data selection and the updates produced by optimizers such as AdamW or Muon.

Original post

#1179@MENHGUINOP

机器之心 JIQIZHIXIN@JIQIZHIXIN

There is now a smarter way to pick data for training LLMs! Enter OPUS!

This is an ICML Oral paper from SJTU, Alibaba, UW–Madison, UIUC, and Mila - Quebec AI Institute.

The proposed method dynamically and intelligently selects the most impactful data for LLM pre-training in every single training iteration, bringing principled, continuous data optimization to the forefront. This approach aims to significantly boost training efficiency and yield higher-quality LLMs, outperforming conventional static data selection methods across diverse language tasks.

OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

Paper: https://arxiv.org/pdf/2602.05400

Our report: https://mp.weixin.qq.com/s/xzmjviMMwX20tcjwutNmxQ

📬 #PapersAccepted by Jiqizhixin

2:23 AM · May 24, 2026

QUOTE POST

#420Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@TEORTAXESTEX

very ambitious

机器之心 JIQIZHIXIN@jiqizhixin

9:23 AM · May 24, 2026 · 44.6K Views

11:37 AM · May 24, 2026 · 34.9K Views
