Polaris-4B and Polaris-7B: Post-Training Reinforcement Learning for Efficient Math and Logic Reasoning
Understanding the Target Audience
The target audience for Polaris-4B and Polaris-7B consists of AI researchers, machine learning engineers, and business leaders interested in scalable reasoning models. These professionals are often focused on enhancing AI capabilities for practical applications in various industries, including finance, education, and technology.
Pain Points
- Difficulty in scaling reasoning models while maintaining efficiency.
- Challenges in balancing training data difficulty with model capability.
- Limited methods for adapting training processes to larger models.
Goals
- To develop AI models that can perform complex reasoning tasks effectively.
- To achieve high accuracy with smaller, resource-efficient models.
- To leverage advanced AI for solving real-world problems.
Interests
- Innovations in reinforcement learning and model training techniques.
- Applications of AI in business and technology.
- Research on efficient data handling and processing for AI models.
Communication Preferences
- Technical reports and peer-reviewed articles.
- Webinars and conferences focused on AI advancements.
- Online forums and communities for knowledge sharing.
The Rising Need for Scalable Reasoning Models in Machine Intelligence
Advanced reasoning models are at the forefront of machine intelligence, particularly in areas such as math problem-solving and symbolic reasoning. These models are engineered to execute multi-step calculations and logical deductions, often producing solutions that closely resemble human reasoning processes. Reinforcement learning techniques enhance accuracy post-pretraining; however, scaling these methods while preserving efficiency remains a significant challenge. As the demand for smaller, resource-efficient models capable of high-level reasoning grows, researchers are exploring strategies that focus on data quality, exploration methods, and long-context generalization.
Challenges in Reinforcement Learning for Large Reasoning Architectures
A persistent issue with reinforcement learning for large-scale reasoning models is the disparity between the model’s capabilities and the complexity of the training data. When a model encounters tasks that are too simple, its learning curve stagnates. Conversely, overly challenging data can overwhelm the model, resulting in a lack of learning signals. This difficulty imbalance is especially evident when applying techniques that are effective for smaller models to larger architectures. Additionally, there is a shortage of methods for efficiently adapting rollout diversity and output length during both training and inference, further limiting a model’s reasoning abilities on complex benchmarks.
Limitations of Existing Post-Training Approaches on Advanced Models
Previous recipes such as DeepScaleR have shown that reinforcement learning (typically with policy-optimization algorithms like GRPO) can enhance the performance of small-scale reasoning models with as few as 1.5 billion parameters. However, applying these same techniques to more advanced models, such as Qwen3-4B or DeepSeek-R1-Distill-Qwen-7B, results in only marginal improvements or even performance declines. A key limitation is the static nature of the data distribution and the restricted diversity of sampling: most existing approaches neither filter data based on model capability nor adjust sampling temperature and response length over time, leading to ineffective scaling on advanced architectures.
Introducing Polaris: A Tailored Recipe for Scalable RL in Reasoning Tasks
Researchers from the University of Hong Kong, ByteDance Seed, and Fudan University have introduced Polaris, a post-training recipe specifically designed to scale reinforcement learning for advanced reasoning tasks. Polaris includes two preview models: Polaris-4B-Preview and Polaris-7B-Preview. Polaris-4B-Preview is fine-tuned from Qwen3-4B, while Polaris-7B-Preview is based on DeepSeek-R1-Distill-Qwen-7B. The researchers focused on creating a model-agnostic framework that modifies data difficulty, promotes diverse exploration through controlled sampling temperatures, and extends inference capabilities through length extrapolation. These strategies were developed using open-source datasets and training pipelines, and both models are optimized to run on consumer-grade graphics processing units (GPUs).
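Because both previews are ordinary causal language models, they can be loaded with Hugging Face transformers. The sketch below is illustrative only: the repository id "POLARIS-Project/Polaris-4B-Preview" is an assumed name, and the temperature of 1.4 simply mirrors the first-stage value quoted later for Polaris-4B rather than an officially recommended inference setting.

```python
# Minimal sketch: running Polaris-4B-Preview for a math prompt with transformers.
# The repo id and the sampling settings are assumptions, not confirmed defaults.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "POLARIS-Project/Polaris-4B-Preview"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Prove that the sum of the first n odd numbers equals n^2."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Polaris is trained with high-temperature rollouts, so sampling (not greedy decoding) is used here.
outputs = model.generate(inputs, max_new_tokens=4096, do_sample=True, temperature=1.4, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```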
Polaris Innovations: Difficulty Balancing, Controlled Sampling, and Long-Context Inference
Polaris implements several innovations:
- The training data is curated by removing problems that are either too easy or unsolvable, producing a mirrored J-shaped difficulty distribution. This ensures that the training data evolves alongside the model’s capabilities (a filtering and temperature-staging sketch follows this list).
- The researchers dynamically adjust the sampling temperature across training stages—using 1.4, 1.45, and 1.5 for Polaris-4B and 0.7, 1.0, and 1.1 for Polaris-7B—to maintain rollout diversity.
- The method employs a YaRN-based extrapolation technique to extend the inference context length to 96K tokens without requiring additional training, enabling a “train-short, test-long” approach (see the configuration sketch after this list).
- Techniques such as the Rollout Rescue Mechanism and Intra-Batch Informative Substitution are used to prevent zero-reward batches and preserve useful training signals, even though the rollout size is kept small at 8 (one plausible reading of these mechanisms is sketched after this list).
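The data-curation and temperature-staging steps can be pictured in a few lines. This is a minimal sketch, not the released pipeline: the pass-rate thresholds, helper names, and the idea of estimating difficulty from sampled rollouts are assumptions; only the stage temperatures repeat figures quoted above.

```python
# Illustrative sketch of Polaris-style difficulty filtering and staged sampling
# temperatures. Thresholds and helper names are assumptions for illustration.
from typing import Callable

def filter_by_difficulty(problems, estimate_pass_rate: Callable, lo: float = 0.0, hi: float = 1.0):
    """Keep problems the current model sometimes solves but has not mastered.

    estimate_pass_rate(problem) is assumed to return the fraction of sampled
    rollouts that reach a correct answer. Dropping pass rates of exactly 0.0
    (unsolvable) and 1.0 (trivial) skews the kept pool toward harder items,
    yielding the mirrored J-shaped difficulty distribution described above.
    """
    kept = []
    for problem in problems:
        rate = estimate_pass_rate(problem)
        if lo < rate < hi:  # discard always-wrong and always-right items
            kept.append((problem, rate))
    return kept

# Stage-wise rollout temperatures reported for the two preview models.
TEMPERATURE_SCHEDULE = {
    "Polaris-4B": [1.40, 1.45, 1.50],
    "Polaris-7B": [0.70, 1.00, 1.10],
}

def sampling_temperature(model_name: str, stage: int) -> float:
    """Return the rollout temperature for a given training stage (0-indexed)."""
    return TEMPERATURE_SCHEDULE[model_name][stage]
```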
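Length extrapolation with YaRN is normally enabled through the model's rope_scaling configuration rather than any further training. The snippet below is a sketch under assumptions: the repository id is hypothetical, and the 3x factor presumes a 32K training context being stretched to the 96K inference window mentioned above.

```python
# Sketch of the "train-short, test-long" setup: extend the inference context with
# YaRN rope scaling at load time. Repo id, factor, and window sizes are assumptions.
import torch
from transformers import AutoModelForCausalLM

yarn_rope_scaling = {
    "rope_type": "yarn",
    "factor": 3.0,                              # assumed 32K trained window -> ~96K at inference
    "original_max_position_embeddings": 32768,  # assumed training context length
}

model = AutoModelForCausalLM.from_pretrained(
    "POLARIS-Project/Polaris-4B-Preview",       # assumed Hugging Face repo id
    torch_dtype=torch.bfloat16,
    device_map="auto",
    rope_scaling=yarn_rope_scaling,             # config override; no retraining involved
    max_position_embeddings=98304,              # 96K-token inference window
)
```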
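The exact mechanics of the Rollout Rescue Mechanism and Intra-Batch Informative Substitution are not spelled out above, so the following is only one plausible reading: when all eight rollouts for a prompt earn zero reward, a cached successful rollout is swapped in, and if none exists the group is replaced by an informative group from the same batch. Every name and data structure here is illustrative.

```python
# Hypothetical sketch of zero-reward handling with a rollout size of 8.
import random

def rescue_and_substitute(batch_groups, reward_fn, rescue_cache):
    """batch_groups: list of {"prompt": str, "rollouts": [str, ...]} dicts.
    reward_fn(prompt, rollout) -> float (e.g., 1.0 for a correct final answer).
    rescue_cache: maps prompt -> a previously successful rollout (Rollout Rescue).
    """
    informative, starved = [], []
    for group in batch_groups:
        rewards = [reward_fn(group["prompt"], r) for r in group["rollouts"]]
        group["rewards"] = rewards
        (informative if any(rewards) else starved).append(group)

    for group in starved:
        cached = rescue_cache.get(group["prompt"])
        if cached is not None:
            # Rollout Rescue: swap one failed rollout for a cached success.
            group["rollouts"][0] = cached
            group["rewards"][0] = reward_fn(group["prompt"], cached)
        elif informative:
            # Intra-batch informative substitution: reuse a group that still
            # carries a learning signal instead of training on all-zero rewards.
            replacement = random.choice(informative)
            group.update({k: replacement[k] for k in ("prompt", "rollouts", "rewards")})

    # Refresh the cache with new successes for future rescues.
    for group in informative:
        for rollout, reward in zip(group["rollouts"], group["rewards"]):
            if reward > 0:
                rescue_cache[group["prompt"]] = rollout
                break
    return batch_groups
```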
Benchmark Results: Polaris Outperforms Larger Commercial Models
Polaris models achieve state-of-the-art results across multiple math benchmarks. Polaris-4B-Preview records 81.2% accuracy on AIME24 and 79.4% on AIME25, outperforming even Qwen3-32B on the same tasks with roughly one-eighth of its parameters. It scores 44.0% on Minerva Math, 69.1% on Olympiad Bench, and 94.8% on AMC23. Polaris-7B-Preview also performs strongly, scoring 72.6% on AIME24 and 52.6% on AIME25. These results demonstrate consistent improvement over models such as Claude-4-Opus and Grok-3-Beta, establishing Polaris as a competitive, lightweight model that narrows the performance gap between small open models and commercial 30B+ systems.
Conclusion: Efficient Reinforcement Learning Through Smart Post-Training Strategies
The researchers have shown that scaling reasoning models effectively hinges not only on larger model sizes but also on intelligent control over training data difficulty, sampling diversity, and inference length. Polaris offers a reproducible recipe that effectively tunes these elements, allowing smaller models to rival the reasoning abilities of large commercial systems.
Check out the Model and Code. All credit for this research goes to the researchers of this project.