OpenThoughts: A Scalable Supervised Fine-Tuning (SFT) Data Curation Pipeline for Reasoning Models
Understanding the Target Audience
The target audience for OpenThoughts includes researchers, data scientists, and AI practitioners focused on improving reasoning models. Their pain points often revolve around the challenges of accessing comprehensive methodologies for building reasoning models, the high costs associated with teacher inference and model training, and the limitations of existing data curation methods. Their goals include developing more effective reasoning capabilities, optimizing data sourcing strategies, and enhancing model performance. They are interested in technical specifications, peer-reviewed research, and practical applications of AI in business. Communication preferences lean towards concise, data-driven content that highlights empirical results and case studies.
The Growing Complexity of Reasoning Data Curation
Recent reasoning models, such as DeepSeek-R1 and o3, have demonstrated exceptional performance in mathematical, coding, and scientific domains through techniques like supervised fine-tuning (SFT) and reinforcement learning (RL). However, the methodologies behind these advanced reasoning models remain largely undisclosed, complicating research efforts. While SFT data curation has proven effective for enhancing reasoning capabilities, many existing initiatives focus on limited design choices, such as relying solely on human-written questions or single teacher models. Exploring the broad design space for generating question-answer pairs incurs significant costs related to teacher inference and model training.
OpenThoughts: A Scalable Framework for SFT Dataset Development
A collaborative effort from researchers at Stanford University, the University of Washington, BespokeLabs.ai, Toyota Research Institute, UC Berkeley, and 12 additional organizations has led to the development of OpenThoughts, a new state-of-the-art (SOTA) open reasoning data curation framework. OpenThoughts employs a progressive approach across three iterations:
- OpenThoughts-114K: Scales the Sky-T1 pipeline with automated verification.
- OpenThoughts2-1M: Enhances data scale through augmented question diversity and synthetic generation strategies.
- OpenThoughts3-1.2M: Incorporates findings from over 1,000 ablation experiments to create a simple, scalable, and high-performing data curation pipeline.
The resulting model, OpenThinker3-7B, achieves state-of-the-art performance among open-data models at the 7B scale.
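To make the shape of such a pipeline concrete, the following Python sketch illustrates a generic distillation-style SFT curation loop: draw questions from several sources, have a teacher model answer them, and keep only the examples that pass automated verification. It is a minimal illustration under assumed interfaces; the names (sample question sources, ask_teacher, verify_answer) are hypothetical placeholders and not the OpenThoughts codebase's actual API.

```python
# Minimal sketch of a distillation-style SFT curation loop in the spirit of the
# OpenThoughts iterations described above. Every name here (question_sources,
# ask_teacher, verify_answer) is a hypothetical placeholder, not the project's API.

from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class SFTExample:
    question: str
    reasoning_trace: str  # teacher model's long-form answer used for fine-tuning

def curate_sft_dataset(
    question_sources: Iterable[Callable[[int], List[str]]],
    ask_teacher: Callable[[str], str],
    verify_answer: Callable[[str, str], bool],
    per_source: int = 1_000,
) -> List[SFTExample]:
    """Draw questions from each source, distill answers from a teacher model,
    and keep only examples that pass automated verification."""
    dataset: List[SFTExample] = []
    for source in question_sources:
        for question in source(per_source):
            answer = ask_teacher(question)       # e.g. a completion from a teacher such as QwQ-32B
            if verify_answer(question, answer):  # unit tests, exact-match checks, or an LLM judge
                dataset.append(SFTExample(question, answer))
    return dataset
```

In practice, the question sources, verification rules, and teacher model are exactly the design choices the OpenThoughts ablations vary, which is what the evaluation below examines.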
Evaluation Insights and Benchmark Performance
The evaluation of the OpenThoughts pipeline provides critical insights into question sourcing, mixing, filtering, and teacher models. Key findings include:
- CodeGolf and competitive coding questions yield the highest performance for coding tasks (average scores of 25.3-27.5).
- LLM-generated and human-written questions excel in mathematics (scores of 58.5-58.8).
- Physics StackExchange questions combined with chemistry textbook extractions perform best in science (scores of 43.2-45.3).
Mixing many question sources tends to degrade performance; selecting a small set of high-performing sources yields roughly a 5% accuracy improvement over broader mixing strategies. Among teacher models, QwQ-32B outperforms DeepSeek-R1 for knowledge distillation, improving accuracy by 1.9-2.6%.
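As an illustration of the source-selection finding, the sketch below ranks question sources per domain and keeps only the top performers rather than mixing everything. The labels are shorthand for the sources named above, only the quoted scores are carried over from this summary, and the assignment of each score to a specific source is arbitrary here; treat it as an assumed example, not the paper's data.

```python
# Illustrative sketch of "pick strong sources per domain instead of mixing everything".
# Source labels are shorthand; the score-to-source assignment below is arbitrary.

per_source_scores = {
    "code":    {"codegolf": 27.5, "competitive_coding": 25.3},
    "math":    {"llm_generated": 58.8, "human_written": 58.5},
    "science": {"physics_stackexchange": 45.3, "chemistry_textbooks": 43.2},
}

def top_sources(scores: dict, k: int = 1) -> dict:
    """Keep only the k best-performing question sources per domain, mirroring the
    roughly 5% gain reported over broad mixing strategies."""
    return {
        domain: sorted(sources, key=sources.get, reverse=True)[:k]
        for domain, sources in scores.items()
    }

print(top_sources(per_source_scores))
# {'code': ['codegolf'], 'math': ['llm_generated'], 'science': ['physics_stackexchange']}
```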
Conclusion
The OpenThoughts project illustrates that systematic experimentation can significantly advance SFT data curation for reasoning models. OpenThoughts3-1.2M represents a state-of-the-art open-data reasoning dataset across the science, mathematics, and coding domains, and the OpenThinker3-7B model demonstrates superior performance among open-data reasoning models at its scale. Several directions remain unexplored, including RL approaches, staged fine-tuning, and curriculum learning strategies. Future research should examine cross-domain transfer effects when optimizing for individual domains versus overall performance, and should investigate scaling dynamics as student models approach teacher capabilities.
Further Reading and Resources
For more information, check out the Paper, Project Page, and GitHub Page. All credit for this research goes to the researchers involved in this project.