Incorrect Answers Improve Math Reasoning? Reinforcement Learning with Verifiable Rewards (RLVR) Surprises with Qwen2.5-Math

In natural language processing (NLP), reinforcement learning (RL) methods such as reinforcement learning from human feedback (RLHF) improve model outputs by optimizing responses against feedback signals. A variant, reinforcement learning with verifiable rewards (RLVR), replaces human judgments with automatically checkable signals, such as mathematical correctness or syntactic features. This enables large-scale tuning of language models and strengthens reasoning abilities without extensive human supervision.
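To make the verifiable-reward idea concrete, here is a minimal Python sketch, assuming the model marks its final answer with a \boxed{...} expression and that an exact string match against the reference answer is acceptable; the function names are illustrative and not taken from the paper.

```python
import re

def extract_boxed(text: str) -> str | None:
    # Return the contents of the last \boxed{...} expression, if any.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def verifiable_reward(response: str, reference_answer: str) -> float:
    # Reward 1.0 when the boxed answer matches the reference exactly, else 0.0.
    predicted = extract_boxed(response)
    return 1.0 if predicted is not None and predicted == reference_answer else 0.0

# Example: a correct final answer earns the full reward.
print(verifiable_reward(r"Adding the terms gives \boxed{42}", "42"))  # 1.0
```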

A significant challenge in machine learning is building models that can reason effectively with minimal or imperfect supervision. In mathematical problem-solving tasks, where correct answers may not be readily available, researchers face difficulties in guiding a model’s learning process. Traditional models learn from ground-truth data, but accurately labeling extensive datasets, especially for complex reasoning tasks, is impractical. Consequently, there is an ongoing debate about whether models can learn to reason when exposed to noisy, misleading, or incorrect training signals. This is crucial because models relying heavily on perfect feedback may fail to generalize effectively in real-world scenarios.

Several techniques aim to improve models’ reasoning capabilities through RL, with RLVR being a key focus. Traditionally, RLVR has utilized “ground truth” labels provided by humans or automated tools to deliver rewards during training. Some methodologies have relaxed this requirement by incorporating majority vote labels or simple format-based heuristics, rewarding answers that adhere to specific output styles. Other approaches have tested random rewards, offering positive signals without considering answer correctness. These explorations seek to determine whether models can learn with minimal guidance, but they often focus on specific models, such as Qwen, raising concerns regarding generalizability across different architectures.
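As a rough illustration of these weaker signals, the sketch below implements hypothetical format, majority-vote, and random reward functions; the \boxed{...} heuristic, the agreement rule, and the probability parameter are assumptions for illustration rather than the exact definitions used in the studies discussed.

```python
import random
import re
from collections import Counter

def _boxed_answer(text: str) -> str | None:
    # Pull out the last \boxed{...} expression, if present.
    found = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return found[-1].strip() if found else None

def format_reward(response: str) -> float:
    # Reward any response that follows the expected \boxed{...} output style,
    # regardless of whether the answer is correct.
    return 1.0 if _boxed_answer(response) is not None else 0.0

def majority_vote_reward(response: str, sampled_answers: list[str]) -> float:
    # Reward agreement with the most common answer among several sampled
    # responses, a stand-in for ground truth when no labels are available.
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return 1.0 if _boxed_answer(response) == majority else 0.0

def random_reward(_response: str, p: float = 0.5) -> float:
    # Return a positive reward with probability p, independent of the response.
    return 1.0 if random.random() < p else 0.0
```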

Researchers from the University of Washington, the Allen Institute for AI, and UC Berkeley investigated this question by testing various reward signals on Qwen2.5-Math, a family of large language models fine-tuned for mathematical reasoning. They assessed ground-truth rewards, majority-vote rewards, format rewards based on boxed expressions, random rewards, and incorrect rewards. Notably, they discovered that even completely spurious signals, such as random rewards and rewards for incorrect answers, could yield substantial performance gains in Qwen models. For instance, training Qwen2.5-Math-7B on MATH-500 with ground-truth rewards resulted in a 28.8% improvement, while using incorrect labels produced a 24.6% gain. Random rewards still achieved a 21.4% boost, and format rewards led to a 16.4% improvement. Majority-vote rewards provided a 26.5% accuracy gain. Similar trends were observed in Qwen2.5-Math-1.5B, where format rewards boosted accuracy by 17.6%, and incorrect labels by 24.4%. In contrast, the same reward strategies did not yield similar advantages for other model families, such as Llama3 and OLMo2, which experienced minimal or negative changes when trained with spurious rewards.

The research team used RLVR training to fine-tune models under these varied reward signals, including ones that require no ground-truth supervision. They found that Qwen models could still generate high-quality reasoning outputs without access to correct answers. A key insight was that Qwen models exhibited a behavior termed "code reasoning": generating math solutions structured like code, often in Python-like form, regardless of whether the reward signal was meaningful. This code reasoning tendency increased over training, rising from 66.7% to over 90% of responses in Qwen2.5-Math-7B when trained with spurious rewards. Answers that included code reasoning reached accuracy rates of about 64%, compared with just 29% for answers lacking such patterns. These patterns emerged consistently, suggesting that spurious rewards may unlock latent capabilities learned during pretraining rather than introducing new reasoning skills.
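One way to quantify this tendency is to measure how often sampled responses contain Python-like code; the marker strings below are assumptions chosen for illustration and are not the detection criterion used by the authors.

```python
def looks_like_code_reasoning(response: str) -> bool:
    # Heuristically flag responses that reason in Python-like code.
    # These markers are illustrative assumptions, not the paper's exact check.
    markers = ("def ", "print(", "import ", "```python")
    return any(marker in response for marker in markers)

def code_reasoning_rate(responses: list[str]) -> float:
    # Fraction of responses that contain Python-like reasoning.
    if not responses:
        return 0.0
    return sum(looks_like_code_reasoning(r) for r in responses) / len(responses)

# Example: two of the three sampled responses reason in code.
samples = [
    "def solve():\n    return 3 * 7\nprint(solve())",
    "The answer is \\boxed{21}.",
    "x = 3 * 7\nprint(x)",
]
print(code_reasoning_rate(samples))  # 0.666...
```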

Performance data underscored the robustness of Qwen models. Gains from random rewards (21.4% on MATH-500) and incorrect labels (24.6%) were nearly comparable to the 28.8% gain from ground-truth rewards. Similar trends held on other benchmarks such as AMC, where format, incorrect, and random rewards each led to roughly an 18% improvement, only slightly below the 25% improvement from ground-truth or majority-vote rewards. Even on AIME2024, spurious rewards such as format (+13.0%), incorrect (+8.7%), and random (+6.3%) produced meaningful gains, though ground-truth labels (+12.8%) retained a clearer advantage on AIME2025, whose questions were written after the models' pretraining cutoffs.

Key Takeaways

  • Qwen2.5-Math-7B gained 28.8% accuracy on MATH-500 with ground-truth rewards, but also 24.6% with incorrect rewards, 21.4% with random rewards, 16.4% with format rewards, and 26.5% with majority-vote rewards.
  • Code reasoning patterns in Qwen models increased from 66.7% to over 90% of responses under RLVR; responses containing code reasoning reached roughly 64% accuracy versus 29% for those without.
  • Non-Qwen models, such as Llama3 and OLMo2, did not exhibit similar improvements, with Llama3.1-8B experiencing up to 8.5% performance drops on spurious rewards.
  • Gains from spurious signals appeared within 50 training steps in many cases, suggesting rapid elicitation of reasoning abilities.
  • The research cautions against generalizing RLVR results based on Qwen models alone, as spurious reward effectiveness is not universal.

In conclusion, while Qwen models can leverage spurious signals to enhance performance, the same is not true for other model families. Non-Qwen models, like Llama3 and OLMo2, demonstrated flat or negative performance changes when trained with spurious signals. This research emphasizes the importance of validating RLVR methods across a diverse range of models instead of relying solely on Qwen-centric results.

Check out the Paper, Official Release and GitHub Page. All credit for this research goes to the researchers of this project.