OpenThoughts-Agent: Data Recipes for Agentic Models
"a fully open data curation pipeline for training agentic models"
"more than 100 controlled ablation experiments to systematically investigate each stage of the pipeline"
Key findings:
• As with reasoning data, the choice of instructions is among the most important factors in our data pipeline.
• The strongest model by benchmark performance does not necessarily make the best teacher.
• Filtering training data to retain the execution traces with more model turns improves the resulting training sets.
• Repeating the top few sources leads to diminishing returns in our largest training runs, and we therefore expand the set of data sources to increase diversity.
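The turn-count filter from the findings above can be illustrated with a minimal sketch. The source states only that retaining traces with more model turns helps. Everything else here is an assumption made for illustration: the trace format (a list of messages with a "role" field), the helper names `count_model_turns` and `keep_longest_traces`, and the choice to keep a top fraction rather than apply a fixed threshold. It is not the authors' implementation.

```python
from typing import Dict, List

# Hypothetical trace format: a list of {"role": ..., "content": ...} messages.
Trace = List[Dict[str, str]]


def count_model_turns(trace: Trace) -> int:
    """Count the messages produced by the model (assumed role == "assistant")."""
    return sum(1 for message in trace if message.get("role") == "assistant")


def keep_longest_traces(traces: List[Trace], keep_fraction: float = 0.5) -> List[Trace]:
    """Return the keep_fraction of traces with the most model turns.

    keep_fraction is an illustrative knob; the source does not give a value.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not traces:
        return []
    # sorted() is stable, so traces with equal turn counts keep their input order.
    ranked = sorted(traces, key=count_model_turns, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]
```

Ranking by turn count and keeping a top fraction avoids picking an absolute threshold, which would likely need to differ across data sources. A fixed minimum number of turns would be an equally plausible reading of the finding.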
"We then assemble a training set of 100K examples from our pipeline and fine-tune Qwen3-32B on this dataset, which yields an average accuracy of 44.8% across seven agentic benchmarks"
