DSGym, a framework for evaluating and training data science agents, is accepted to ICML 2026 and shows that many existing benchmarks can be solved through shortcuts, without using the actual data

The authors also release DSGym-Tasks, a curated task ecosystem that filters shortcut-solvable tasks and expands coverage with new scientific tasks.

Original post

🚀 Excited to share that #DSGym has been accepted to ICML 2026! DSGym is a holistic, unified framework for evaluating and training data science agents with standardized abstractions and a modular architecture for adding tasks, agent scaffolds, and tools.

In this work, we:

🔍 Show that existing data science benchmarks are vulnerable to shortcuts: agents can often solve tasks without using the actual data.
📊 Release DSGym-Tasks, a curated task ecosystem that standardizes and audits representative benchmarks, filters shortcut-solvable tasks, and expands coverage with new scientific tasks.
⚡ Use DSGym for execution-grounded trajectory synthesis: with only 2K samples, we train a 4B model that outperforms GPT-4o on standardized data analysis benchmarks.

📄 Paper: https://arxiv.org/abs/2601.16344
💻 Code: https://github.com/fannie1208/DSGym
🤗 Dataset: https://huggingface.co/DSGym

🧵👇

4:48 PM · May 18, 2026
Reposted by

The best gym for data science💪: #DSGym provides a grounded and realistic environment to train and test data science agents.

Accepted to #ICML2026! Great work by @FanNie1208 @JunlinWang3 @_harperhua @federicobianchy @ykwon_0407 @ZhentingQi @oq_35 @ShangZhu18 @togethercompute

Fan Nie @FanNie1208

11:48 PM · May 18, 2026 · 21.7K Views
2:24 PM · May 19, 2026 · 4.2K Views