NVIDIA AI Introduces AceReason-Nemotron for Advancing Math and Code Reasoning through Reinforcement Learning
Introduction
Reasoning capabilities are essential to the advancement of AI systems. The introduction of OpenAI’s o1 sparked significant interest in building reasoning models through large-scale reinforcement learning (RL). While the open-sourcing of DeepSeek-R1 empowered the community to develop state-of-the-art reasoning models, important technical details, such as data curation strategies and specific RL training recipes, were missing from the initial report, hindering replication and leading to fragmented research efforts.
Challenges in Current Approaches
Training language models for reasoning in math and coding domains typically relies on pretraining and supervised fine-tuning. Early RL initiatives that used domain-specific reward models showed limited success due to challenges inherent in mathematical and coding tasks. More recent efforts, following the release of DeepSeek-R1, have explored rule-based verification, where rewards come from programmatically checking final answers or running test cases rather than from a learned reward model. However, these efforts are often confined to a single domain, lack comprehensive benchmark evaluations, and frequently run into training-stability issues.
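To make the idea of rule-based verification concrete, here is a minimal sketch of a binary reward that checks a model's final answer against a reference. It assumes the model is prompted to place its final answer inside \boxed{...}; the extraction and normalization rules are illustrative only, not the exact checker used in this line of work.

```python
import re

def extract_boxed_answer(response: str) -> str | None:
    """Return the last \\boxed{...} span in a response (an assumed output convention)."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1].strip() if matches else None

def normalize(answer: str) -> str:
    """Rough normalization: strip whitespace and a leading '+' sign."""
    return answer.replace(" ", "").lstrip("+")

def math_reward(response: str, reference: str) -> float:
    """Binary rule-based reward: 1.0 only if the extracted answer matches the reference."""
    predicted = extract_boxed_answer(response)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0

# Toy usage
print(math_reward("... so the result is \\boxed{ 42 }", "42"))  # 1.0
print(math_reward("... therefore \\boxed{41}", "42"))           # 0.0
```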
NVIDIA’s Innovative Approach
Researchers from NVIDIA have demonstrated that large-scale RL can significantly enhance the reasoning capabilities of strong small- and mid-sized models. Their approach uses a straightforward sequential training strategy: RL on math-only prompts first, followed by RL on code-only prompts. Notably, math-only RL not only lifts performance on mathematical benchmarks but also improves performance on coding tasks, and extended iterations of code-only RL further boost code performance while causing minimal degradation on math results.
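The sketch below illustrates the shape of that two-stage schedule. The helper functions (load_prompts, verify, rl_update) are hypothetical placeholders standing in for a full RL training stack, so this is an outline of the recipe rather than NVIDIA's implementation.

```python
# Illustrative sketch of the math-first, code-second RL schedule.
# load_prompts, verify, and rl_update are hypothetical placeholders, not
# NVIDIA's released training code; verify stands in for the rule-based
# rewards (answer matching for math, test execution for code).

def load_prompts(domain: str) -> list[str]:
    # Placeholder: in practice these come from the curated math/code datasets.
    return [f"{domain} prompt {i}" for i in range(4)]

def verify(domain: str, response: str) -> float:
    # Placeholder rule-based reward; always 1.0 here for illustration.
    return 1.0

def rl_update(model, prompt: str, response: str, reward: float):
    # Placeholder for one policy-gradient update on a verified rollout.
    return model

def train_sequential(model, steps_per_stage: int = 2):
    # Stage 1: math-only RL; Stage 2: code-only RL from the math-RL checkpoint.
    for domain in ("math", "code"):
        prompts = load_prompts(domain)
        for _ in range(steps_per_stage):
            for prompt in prompts:
                response = f"model answer to: {prompt}"   # stand-in for generation
                reward = verify(domain, response)
                model = rl_update(model, prompt, response, reward)
    return model

trained_model = train_sequential(model=object())
```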
Data Curation Pipeline
A robust data curation pipeline collects challenging prompts paired with high-quality, verifiable answers and test cases, enabling verification-based RL in both the math and coding domains. For math, the pipeline merges the DeepScaler and NuminaMath datasets, covering algebra, combinatorics, number theory, and geometry, and applies strict filtering to exclude content that cannot be reliably verified. For code, problems are curated from competitive programming platforms, together with comprehensive test cases that cover edge cases.
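Below is a toy sketch of the kind of filtering such a pipeline performs: merging sources, deduplicating prompts, keeping only math items with a checkable final answer, and keeping only code problems with enough test cases. The record schema and thresholds are assumptions for illustration, not the paper's exact criteria.

```python
# Toy curation pass: merge math sources, deduplicate, and keep only items
# that can be automatically verified. Field names ("question", "answer",
# "tests") and the filter rules are illustrative assumptions, not the
# paper's exact schema.

def curate_math(*sources: list[dict]) -> list[dict]:
    seen, kept = set(), []
    for source in sources:
        for item in source:
            question = item["question"].strip()
            if question in seen:
                continue                          # drop duplicate prompts
            seen.add(question)
            if item.get("answer"):                # require a checkable final answer
                kept.append({"question": question, "answer": item["answer"]})
    return kept

def curate_code(problems: list[dict], min_tests: int = 2) -> list[dict]:
    # Keep only problems with enough test cases to exercise edge cases.
    return [p for p in problems if len(p.get("tests", [])) >= min_tests]

# Toy usage with stand-in records.
deepscaler = [{"question": "1 + 1?", "answer": "2"}]
numinamath = [{"question": "1 + 1?", "answer": "2"}, {"question": "Prove X.", "answer": ""}]
print(curate_math(deepscaler, numinamath))        # one deduplicated, verifiable item
```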
Performance Outcomes
The AceReason-Nemotron-7B model achieved accuracy improvements of 14.5% and 14.6% on AIME 2024 and 2025, and gains of 14.2% and 8% on LiveCodeBench v5 and v6, over its initial supervised fine-tuned starting point. The 14B variant outperformed larger models such as DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Llama-70B, establishing itself as best-in-class among open RL-based reasoning models. Notably, AceReason-Nemotron-14B surpassed OpenMath-14B/32B by 2.1%/4.4% on the AIME benchmarks and OpenCodeReasoning-14B by 1.7%/0.8% on LiveCodeBench.
Conclusion
In summary, the research indicates that large-scale RL can enhance the reasoning capabilities of strong small- and mid-sized supervised fine-tuned models. The sequential, domain-specific training approach, math first, then code, shows that mathematical reasoning training notably improves performance in both domains. The data curation pipeline makes verification-based RL practical, underscoring the method’s effectiveness in pushing the boundaries of model reasoning and setting new performance benchmarks for open reasoning models.
Further Reading
Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project.