There is now a smarter way to pick data for training LLMs!
Enter OPUS!
This is an ICML Oral paper from SJTU, Alibaba, UW–Madison, UIUC, and Mila - Quebec AI Institute.
The proposed method selects the most impactful data for LLM pre-training dynamically, at every single training iteration, making data optimization principled and continuous rather than a one-off step.
The approach aims to boost training efficiency and yield higher-quality LLMs, and is reported to outperform conventional static data selection methods across diverse language tasks.
OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration
Paper: https://arxiv.org/pdf/2602.05400
Our report: https://mp.weixin.qq.com/s/xzmjviMMwX20tcjwutNmxQ
📬 #PapersAccepted by Jiqizhixin







