
OMEGA: A Structured Math Benchmark to Probe the Reasoning Limits of LLMs


Understanding the Target Audience

The target audience for OMEGA comprises researchers, data scientists, AI practitioners, and business leaders interested in advancing the capabilities of large language models (LLMs) in mathematical reasoning. Their pain points include the limitations of current benchmarks in evaluating model performance, the need for more robust datasets that challenge LLMs, and the desire for practical applications of AI in business contexts.

Goals of this audience include improving the accuracy and creativity of LLMs in solving complex problems, understanding the nuances of generalization in AI, and applying these insights to real-world business challenges. Their interests lie in the intersection of AI, mathematics, and data analytics, with a preference for clear, technical communication that emphasizes empirical findings and actionable insights.

Introduction to Generalization in Mathematical Reasoning

Large-scale language models with long chain-of-thought (CoT) reasoning, such as DeepSeek-R1, have proven effective at Olympiad-level mathematics. However, models trained through supervised fine-tuning or reinforcement learning tend to rely on a narrow repertoire of techniques, such as reapplying known algebra rules or defaulting to coordinate geometry in diagram problems. They often lack genuine mathematical creativity and struggle with complex tasks that require original insights. Current math datasets are also ill-suited to measuring which skills reinforcement learning actually imparts: large-scale corpora mix questions across topics and difficulty levels, which makes it difficult to isolate specific reasoning skills.

Limitations of Current Mathematical Benchmarks

Existing work on out-of-distribution (OOD) generalization focuses on handling test distributions that differ from the training data, a capability essential for mathematical reasoning, physical modeling, and financial forecasting. Compositional generalization techniques aim to help models systematically combine learned skills. Researchers have developed datasets to benchmark mathematical abilities, including GSM8K, MinervaMath, AIME, OlympiadBench, NuminaMath, and BigMath. However, these benchmarks either no longer challenge modern LLMs or fail to support a granular analysis of which reasoning skills break down.

Introducing OMEGA: A Controlled Benchmark for Reasoning Skills

Researchers from the University of California, Ai2, the University of Washington, and dmodel.ai have proposed OMEGA, a benchmark designed to evaluate three dimensions of out-of-distribution generalization, inspired by Boden’s typology of creativity. OMEGA creates matched training and test pairs to isolate specific reasoning skills across three dimensions: Exploratory, Compositional, and Transformative. Its test and training problems are constructed using carefully engineered templates, allowing precise control over diversity, complexity, and the specific reasoning strategies required for solutions. Moreover, it employs 40 templated problem generators across six mathematical domains: arithmetic, algebra, combinatorics, number theory, geometry, and logic & puzzles.
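To make the templated-generator idea concrete, here is a minimal sketch of how one such generator and a matched train/test split might look. The Problem dataclass, the linear-equation template, and the way complexity scales the numbers are illustrative assumptions, not OMEGA's actual generators.

```python
import random
from dataclasses import dataclass

@dataclass
class Problem:
    domain: str        # e.g. "algebra", "combinatorics"
    complexity: int    # controls the size of the instance
    question: str
    answer: str

def linear_equation_generator(complexity: int, rng: random.Random) -> Problem:
    """Hypothetical template: solve a*x + b = c for x, with magnitudes
    scaled by the complexity level."""
    a = rng.randint(2, 5 * complexity)
    x = rng.randint(-10 * complexity, 10 * complexity)
    b = rng.randint(-10 * complexity, 10 * complexity)
    c = a * x + b
    return Problem(
        domain="algebra",
        complexity=complexity,
        question=f"Solve for x: {a}*x + {b} = {c}",
        answer=str(x),
    )

def make_split(generator, complexities, n_per_level, seed):
    """Build a matched split by sampling a template at fixed complexity levels."""
    rng = random.Random(seed)
    return [generator(c, rng) for c in complexities for _ in range(n_per_level)]

# Exploratory-style split: train on low complexity, test on strictly higher complexity.
train = make_split(linear_equation_generator, complexities=[1, 2], n_per_level=100, seed=0)
test = make_split(linear_equation_generator, complexities=[4, 5], n_per_level=50, seed=1)
```

Because every generated problem carries its template and complexity level, a split like this pins down exactly which axis of generalization a model is being asked to cross.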

Evaluation on Frontier LLMs and Reinforcement Learning Setup

Four frontier models are evaluated: DeepSeek-R1, Claude-3.7-Sonnet, OpenAI-o3-mini, and OpenAI-o4-mini, each across different complexity levels. For the reinforcement learning generalization experiments, the framework applies the GRPO algorithm to 1,000 training problems using the Qwen2.5-7B-Instruct and Qwen2.5-Math-7B models. Exploratory generalization trains on restricted complexity levels and evaluates on higher-complexity problems. Compositional generalization trains models on individual skills in isolation and tests their ability to combine and apply those skills. Transformative generalization trains on conventional solution approaches and evaluates performance on problems that require unconventional strategies.
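As a rough sketch of what the exploratory-generalization training run could look like, the snippet below wires the low-complexity split from the previous sketch into Hugging Face TRL's GRPOTrainer with a simple exact-match reward. The reward function, dataset fields, and hyperparameters are assumptions for illustration, not the authors' training code.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Assume `train` is the low-complexity split from the previous sketch
# (exploratory setting: training restricted to low complexity levels).
train_dataset = Dataset.from_list(
    [{"prompt": p.question, "answer": p.answer} for p in train]
)

def exact_match_reward(completions, answer, **kwargs):
    # Reward 1.0 when the reference answer appears in the model's completion.
    return [1.0 if ref in completion else 0.0
            for completion, ref in zip(completions, answer)]

config = GRPOConfig(output_dir="qwen2.5-7b-grpo-omega", num_generations=8)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=exact_match_reward,
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```

The compositional and transformative settings would reuse the same loop, changing only which templates feed the training split and which feed the held-out evaluation.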

Performance Observations and Model Behavior Patterns

Reasoning LLMs tend to perform worse as problem complexity increases, often finding correct solutions early but then consuming excessive tokens on unnecessary verification. Reinforcement learning applied only to low-complexity problems improves generalization to medium-complexity problems, with larger gains on in-domain examples than on out-of-distribution ones, indicating that reinforcement learning mainly reinforces familiar patterns. For instance, in the Zebra Logic domain, the base model achieves only 30% accuracy, yet reinforcement learning training lifts performance by 61 points on in-domain examples and 53 points on out-of-distribution examples without any supervised fine-tuning.
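A small evaluation sketch makes the in-domain versus out-of-distribution comparison explicit. Here, solve is a hypothetical hook that queries a model and returns its final answers; the accuracy and gap computation is just one plausible way to report deltas like those quoted above.

```python
def accuracy(model_answers, problems):
    # Fraction of problems whose reference answer matches the model's output exactly.
    correct = sum(ans.strip() == p.answer for ans, p in zip(model_answers, problems))
    return correct / len(problems)

def rl_gains(base_model, rl_model, splits, solve):
    # splits: {"in_domain": [...], "ood": [...]}; solve(model, problems) -> answers.
    gains = {}
    for name, problems in splits.items():
        base_acc = accuracy(solve(base_model, problems), problems)
        rl_acc = accuracy(solve(rl_model, problems), problems)
        gains[name] = rl_acc - base_acc  # e.g. ~0.61 in-domain vs ~0.53 OOD on Zebra Logic
    return gains
```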

Conclusion: Toward Advancing Transformational Reasoning

In summary, the researchers introduced OMEGA, a benchmark that isolates and evaluates three axes of out-of-distribution generalization in mathematical reasoning: exploratory, compositional, and transformative. The empirical study yields three insights: (a) reinforcement learning fine-tuning significantly improves performance on in-distribution and exploratory generalization tasks, (b) its benefits on compositional tasks are limited, and (c) it fails to induce genuinely new reasoning patterns. These findings underscore a fundamental limitation: while reinforcement learning can amplify problem-solving breadth and depth, it falls short of enabling the creative leaps essential for transformational reasoning. Future work should explore curriculum scaffolding and meta-reasoning controllers.

For further details, check out the Paper, Project Page, and GitHub Page. All credit for this research goes to the researchers of this project.