
Alibaba Introduces Group Sequence Policy Optimization (GSPO): An Efficient Reinforcement Learning Algorithm that Powers the Qwen3 Models

Understanding the Target Audience

The target audience for Group Sequence Policy Optimization (GSPO) primarily includes AI researchers, data scientists, machine learning engineers, and business leaders at tech companies. These readers are typically involved in developing and deploying large language models (LLMs) and want to improve their performance and training efficiency.

Pain Points: The audience faces challenges such as unstable training dynamics, inefficiencies in existing reinforcement learning algorithms, and the complexity of scaling LLMs. They are particularly concerned about catastrophic failures during model training and the high-variance noise introduced by current algorithms.

Goals: Their primary goals include achieving stable and efficient training of LLMs, reducing computational costs, and improving model performance in complex tasks. They aim to leverage advanced algorithms like GSPO to enhance their AI capabilities.

Interests: The audience is interested in the latest advancements in AI, particularly in reinforcement learning, algorithm optimization, and the practical applications of LLMs in business contexts. They also value peer-reviewed research and case studies demonstrating successful implementations.

Communication Preferences: This audience prefers clear, concise, and technical communication that includes empirical data and practical examples. They appreciate detailed explanations of algorithms and their implications for real-world applications.

Overview of GSPO

Reinforcement learning (RL) is essential for scaling language models, enabling them to tackle complex tasks such as competition-level mathematics and programming through deeper reasoning. However, achieving stable and reliable training dynamics poses challenges, particularly when scaling RL with larger computational resources.

Current state-of-the-art algorithms such as GRPO exhibit serious stability issues when training large language models, often leading to catastrophic failures. These instabilities stem from the misapplication of importance-sampling weights, which introduces high-variance noise. This noise accumulates over longer responses and is amplified by the clipping mechanism, ultimately causing model collapse and stalling progress.

Existing methods such as PPO and GRPO rely on clipping to cope with off-policy learning, where responses are sampled from outdated policies. These approaches are limited, however, by ill-posed objectives, especially for large models on long-response tasks. GRPO’s token-level importance sampling injects high-variance noise that can culminate in irreversible model collapse, and attempts to recover through hyperparameter tuning or checkpoint restoration have proven ineffective, pointing to a fundamental design flaw. The mismatch between token-level corrections and sequence-level rewards underscores the need for an approach that optimizes directly at the sequence level to ensure stability and scalability.
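
To make this mismatch concrete, the following minimal PyTorch sketch contrasts the two ratio definitions. The tensor names, shapes, and random values are illustrative assumptions, not code from the paper.

```python
import torch

# Illustrative setup (assumed, not the authors' code): G responses of up to T tokens,
# with per-token log-probabilities under the current and old policies and a padding mask.
torch.manual_seed(0)
G, T = 4, 8
logp_old = -2.0 + 0.1 * torch.randn(G, T)
logp_new = logp_old + 0.05 * torch.randn(G, T)
mask = torch.ones(G, T)  # 1 for real tokens, 0 for padding

# Token-level importance ratios (GRPO-style): one noisy ratio per token,
# and the noise compounds over long responses.
token_ratios = torch.exp(logp_new - logp_old)  # shape (G, T)

# Sequence-level, length-normalized ratio (GSPO-style): the geometric mean of the
# per-token ratios, i.e. a single ratio for each whole response.
seq_ratios = torch.exp(((logp_new - logp_old) * mask).sum(-1) / mask.sum(-1))  # shape (G,)
```

Because the sequence-level ratio aggregates the per-token log-ratios before exponentiating, a single noisy token no longer dominates the update for a long response.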

Introducing Group Sequence Policy Optimization (GSPO)

Researchers from Alibaba Inc. have proposed Group Sequence Policy Optimization (GSPO), a reinforcement learning algorithm designed to train large language models (LLMs). GSPO’s primary innovation lies in its theoretically grounded importance ratio, derived from sequence likelihood, which aligns with the principles of importance sampling. Additionally, it calculates normalized rewards as advantages for multiple responses to a query, fostering consistency between sequence-level rewards and optimization goals.
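
A compact sketch of how such a sequence-level objective can be assembled is shown below. It follows the description in this article (a length-normalized sequence likelihood ratio, group-normalized rewards as advantages, and sequence-level clipping), but the function signature is an assumption rather than the reference implementation; the default clipping ranges are the values quoted later in the article.

```python
import torch

def gspo_loss(logp_new, logp_old, rewards, mask, eps_low=3e-4, eps_high=4e-4):
    """Sketch of a GSPO-style objective for G responses to a single query.

    logp_new, logp_old: (G, T) per-token log-probs under the current / old policy.
    rewards: (G,) scalar rewards.  mask: (G, T), 1 for real tokens, 0 for padding.
    """
    # Group-normalized rewards act as the advantage of each response.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Length-normalized sequence likelihood ratio: one ratio per response.
    ratio = torch.exp(((logp_new - logp_old) * mask).sum(-1) / mask.sum(-1))

    # Sequence-level clipping: an entire response is kept or clipped as a unit.
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    objective = torch.minimum(ratio * adv, clipped * adv)
    return -objective.mean()
```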

Empirical evaluations indicate that GSPO significantly outperforms GRPO in stability, efficiency, and overall performance. By addressing stability challenges in training large Mixture-of-Experts (MoE) models, GSPO eliminates the need for complex stabilization techniques.

For the experiments, the researchers used a cold-start model fine-tuned from Qwen3-30B-A3B-Base and report training reward curves and model performance on the AIME’24, LiveCodeBench, and CodeForces benchmarks. During training, the rollout data in each batch is split into four mini-batches for gradient updates. GSPO clips entire responses rather than individual tokens, with clipping ranges of 3e-4 and 4e-4 in its formulation, so the fraction of clipped tokens differs from GRPO’s by roughly two orders of magnitude. Despite removing more tokens from the gradient estimate, GSPO achieves higher training efficiency, underscoring how noisy GRPO’s token-level estimates are.
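
As a rough illustration of how the clipped-token fraction could be measured under each scheme, the helpers below count the tokens whose gradients are removed by clipping. The names and shapes are assumptions for the sketch, the token-level range of 0.2 is an illustrative PPO-style value rather than one taken from the paper, and the actual fractions come from the paper’s experiments.

```python
import torch

def clipped_fraction_token_level(token_ratios, mask, eps=0.2):
    # Token-level clipping (GRPO-style): each token is clipped independently.
    clipped = ((token_ratios < 1 - eps) | (token_ratios > 1 + eps)).float() * mask
    return clipped.sum() / mask.sum()

def clipped_fraction_sequence_level(seq_ratios, mask, eps_low=3e-4, eps_high=4e-4):
    # Sequence-level clipping (GSPO-style): clipping a response removes all of
    # its tokens from the gradient estimate at once.
    clipped_seq = (seq_ratios < 1 - eps_low) | (seq_ratios > 1 + eps_high)
    return (clipped_seq.float().unsqueeze(-1) * mask).sum() / mask.sum()
```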

GSPO provides significant advantages for MoE training: it stabilizes the process by keeping expert activations consistent across gradient updates, whereas GRPO struggles with expert-activation volatility. This stability removes the need for workarounds such as Routing Replay, simplifying the infrastructure and allowing models to use their full capacity. On the infrastructure side, GSPO’s sequence-level optimization reduces dependence on token-level likelihoods, making it more robust to precision mismatches. This allows inference-engine likelihoods to be used directly, avoiding costly recomputation and improving efficiency in partial rollouts and multi-turn RL, and it streamlines RL infrastructure for large-scale language model training overall.
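
The robustness claim can be illustrated with a small numerical experiment: if the training and inference engines disagree slightly on each token’s log-probability, the per-token ratios absorb that mismatch directly, while the length-normalized sequence ratio averages it out. The response length and noise scale below are arbitrary assumptions for the sketch.

```python
import torch

torch.manual_seed(0)
T = 512                                    # a long response (assumed length)
logp_train = -2.0 + 0.5 * torch.randn(T)   # per-token log-probs from the training engine
mismatch = 1e-3 * torch.randn(T)           # assumed per-token precision mismatch
logp_infer = logp_train + mismatch         # same policy, evaluated by the inference engine

# Token-level ratios: each token's ratio is hit directly by its own mismatch.
token_dev = (torch.exp(logp_infer - logp_train) - 1.0).abs().max()

# Length-normalized sequence ratio: the mismatch averages out across tokens.
seq_dev = (torch.exp((logp_infer - logp_train).mean()) - 1.0).abs()

print(f"largest token-level deviation: {token_dev.item():.2e}")
print(f"sequence-level deviation:      {seq_dev.item():.2e}")
```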

Conclusion

In summary, researchers introduced Group Sequence Policy Optimization (GSPO), a reinforcement learning algorithm designed to enhance the training of large language models. GSPO builds on the principles of importance sampling and introduces sequence-level clipping, rewarding, and optimization to overcome the instability and inefficiency seen in GRPO. Its superior performance in training stability, efficiency, and scalability—particularly for MoE models—highlights its significance as a robust algorithmic foundation. The advancements enabled by GSPO have played a crucial role in the impressive performance of the Qwen3 models. Building on GSPO as a foundational approach, researchers plan to expand RL methods, paving the way for groundbreaking progress in AI.

Check out the Paper. Feel free to explore our GitHub Page for tutorials, code, and notebooks. You can also follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our newsletter.