OpenThoughts Releases Fully Open Data Pipeline For Agentic Models

Original post

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr#613inTech

OpenThoughts-Agent: Data Recipes for Agentic Models

"a fully open data curation pipeline for training agentic models"

"more than 100 controlled ablation experiments to systematically investigate each stage of the pipeline"

Key findings: • As with reasoning data, the choice of instructions is among the most important factors in our data pipeline.

• The strongest model by benchmark performance does not necessarily make the best teacher.

• Filtering training data to retain the execution traces with more model turns improves the resulting training sets.

• Repeating the top few sources leads to diminishing returns in our largest training runs, and we therefore expand the set of data sources to increase diversity.

"We then assemble a training set of 100K examples from our pipeline and fine-tune Qwen3-32B on this dataset, which yields an average accuracy of 44.8% across seven agentic benchmarks"

9:48 PM · Jun 23, 2026 · 1.9K Views

OpenThoughts

OPENTHOUGHTSVia

#613

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

website: https://www.openthoughts.ai/ abs: https://arxiv.org/abs/2606.24855