NVIDIA AI Introduces AceReason-Nemotron for Advancing Math and Code Reasoning through Reinforcement Learning
Introduction
Reasoning capabilities are essential to the advancement of AI systems. The introduction of OpenAI’s o1 sparked significant interest in building reasoning models through large-scale reinforcement learning (RL). While the open-sourcing of DeepSeek-R1 empowered the community to develop state-of-the-art reasoning models, important technical details, such as data curation strategies and specific RL training recipes, were missing from the initial report, hindering replication and leading to fragmented research efforts.
Challenges in Current Approaches
Training language models for reasoning in math and coding domains typically relies on pretraining and supervised fine-tuning. Early RL initiatives that used domain-specific reward models showed limited success due to challenges inherent in mathematical and coding tasks. More recent efforts, following the release of DeepSeek-R1, have explored rule-based verification, where rewards come from programmatically checking final answers or running test cases rather than from a learned reward model. However, these efforts are often confined to a single domain, lack comprehensive benchmark evaluations, and frequently run into training-stability issues.
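To make the idea of rule-based verification concrete, here is a minimal sketch of a binary reward that checks a model's final answer against a reference. It assumes the model is prompted to place its final answer inside \boxed{...}; the extraction and normalization rules are illustrative only, not the exact checker used in this line of work.

```python
import re

def extract_boxed_answer(response: str) -> str | None:
    """Return the last \\boxed{...} span in a response (an assumed output convention)."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1].strip() if matches else None

def normalize(answer: str) -> str:
    """Rough normalization: strip whitespace and a leading '+' sign."""
    return answer.replace(" ", "").lstrip("+")

def math_reward(response: str, reference: str) -> float:
    """Binary rule-based reward: 1.0 only if the extracted answer matches the reference."""
    predicted = extract_boxed_answer(response)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0

# Toy usage
print(math_reward("... so the result is \\boxed{ 42 }", "42"))  # 1.0
print(math_reward("... therefore \\boxed{41}", "42"))           # 0.0
```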
NVIDIA’s Innovative Approach
Researchers from NVIDIA have demonstrated that large-scale RL can significantly enhance the reasoning capabilities of strong small- and mid-sized models. Their approach uses a straightforward sequential training strategy: RL on math-only prompts first, followed by RL on code-only prompts. Notably, math-only RL not only lifts performance on mathematical benchmarks but also improves performance on coding tasks, and extended iterations of code-only RL further boost code performance while causing minimal degradation on math results.
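The sketch below illustrates the shape of that two-stage schedule. The helper functions (load_prompts, verify, rl_update) are hypothetical placeholders standing in for a full RL training stack, so this is an outline of the recipe rather than NVIDIA's implementation.

```python
# Illustrative sketch of the math-first, code-second RL schedule.
# load_prompts, verify, and rl_update are hypothetical placeholders, not
# NVIDIA's released training code; verify stands in for the rule-based
# rewards (answer matching for math, test execution for code).

def load_prompts(domain: str) -> list[str]:
    # Placeholder: in practice these come from the curated math/code datasets.
    return [f"{domain} prompt {i}" for i in range(4)]

def verify(domain: str, response: str) -> float:
    # Placeholder rule-based reward; always 1.0 here for illustration.
    return 1.0

def rl_update(model, prompt: str, response: str, reward: float):
    # Placeholder for one policy-gradient update on a verified rollout.
    return model

def train_sequential(model, steps_per_stage: int = 2):
    # Stage 1: math-only RL; Stage 2: code-only RL from the math-RL checkpoint.
    for domain in ("math", "code"):
        prompts = load_prompts(domain)
        for _ in range(steps_per_stage):
            for prompt in prompts:
                response = f"model answer to: {prompt}"   # stand-in for generation
                reward = verify(domain, response)
                model = rl_update(model, prompt, response, reward)
    return model

trained_model = train_sequential(model=object())
```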
Data Curation Pipeline
A robust data curation pipeline collects challenging prompts paired with high-quality, verifiable answers and test cases, enabling verification-based RL in both the math and coding domains. For math, the pipeline merges the DeepScaler and NuminaMath datasets, covering algebra, combinatorics, number theory, and geometry, and applies strict filtering to exclude content that cannot be reliably verified. For code, problems are curated from competitive programming platforms, together with comprehensive test cases that cover edge cases.
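Below is a toy sketch of the kind of filtering such a pipeline performs: merging sources, deduplicating prompts, keeping only math items with a checkable final answer, and keeping only code problems with enough test cases. The record schema and thresholds are assumptions for illustration, not the paper's exact criteria.

```python
# Toy curation pass: merge math sources, deduplicate, and keep only items
# that can be automatically verified. Field names ("question", "answer",
# "tests") and the filter rules are illustrative assumptions, not the
# paper's exact schema.

def curate_math(*sources: list[dict]) -> list[dict]:
    seen, kept = set(), []
    for source in sources:
        for item in source:
            question = item["question"].strip()
            if question in seen:
                continue                          # drop duplicate prompts
            seen.add(question)
            if item.get("answer"):                # require a checkable final answer
                kept.append({"question": question, "answer": item["answer"]})
    return kept

def curate_code(problems: list[dict], min_tests: int = 2) -> list[dict]:
    # Keep only problems with enough test cases to exercise edge cases.
    return [p for p in problems if len(p.get("tests", [])) >= min_tests]

# Toy usage with stand-in records.
deepscaler = [{"question": "1 + 1?", "answer": "2"}]
numinamath = [{"question": "1 + 1?", "answer": "2"}, {"question": "Prove X.", "answer": ""}]
print(curate_math(deepscaler, numinamath))        # one deduplicated, verifiable item
```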
Performance Outcomes
The AceReason-Nemotron-7B model achieved accuracy improvements of 14.5% and 14.6% on AIME 2024 and 2025, and gains of 14.2% and 8% on LiveCodeBench v5 and v6, over its initial supervised fine-tuned starting point. The 14B variant outperformed larger models such as DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Llama-70B, establishing itself as best-in-class among open RL-based reasoning models. Notably, AceReason-Nemotron-14B surpassed OpenMath-14B/32B by 2.1%/4.4% on the AIME benchmarks and OpenCodeReasoning-14B by 1.7%/0.8% on LiveCodeBench.
Conclusion
In summary, the research indicates that large-scale RL can enhance the reasoning capabilities of strong small- and mid-sized supervised fine-tuned models. The sequential, domain-specific training approach, math first, then code, shows that mathematical reasoning training notably improves performance in both domains. The data curation pipeline makes verification-based RL practical, underscoring the method’s effectiveness in pushing the boundaries of model reasoning and setting new performance benchmarks for open reasoning models.
Further Reading
Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project.