
From Exploration Collapse to Predictable Limits: Shanghai AI Lab Proposes Entropy-Based Scaling Laws for Reinforcement Learning in LLMs


Recent advances in reasoning-centric large language models (LLMs) have expanded the scope of reinforcement learning (RL), enabling broader generalization and stronger reasoning capabilities. However, this shift introduces significant challenges: unlike imitation learning, RL learns from the model's own generated experience, which makes scaling the required training compute far more demanding. A central issue is the decline of policy entropy during training, which skews the balance between exploiting known strategies and exploring new ones. This exploration-exploitation trade-off is fundamental to RL, and controlling policy entropy has become critical to maintaining effective exploration during training.

Existing methods address the exploration-exploitation trade-off in RL by leveraging policy entropy. Maximum entropy RL adds an entropy regularization term to the reward, promoting uncertainty in action selection and encouraging broader exploration. While this technique is widely adopted in conventional RL algorithms, its usefulness for LLMs remains debated. Moreover, the predictability of RL training for LLMs is underexplored: neural scaling laws guide LLM pre-training, but comparable predictive frameworks for RL training are scarce. Current RL methods for LLMs with verifiable rewards show promise in improving reasoning, yet offer little insight into their underlying mechanisms.
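To make entropy regularization concrete, here is a minimal sketch of an entropy-regularized policy-gradient loss in PyTorch; the function name, tensor shapes, and the entropy_coef default are illustrative assumptions rather than the setup used in the paper.

```python
# Minimal sketch of a maximum-entropy-style policy-gradient loss (illustrative only).
import torch

def entropy_regularized_pg_loss(log_probs: torch.Tensor,
                                advantages: torch.Tensor,
                                dist: torch.distributions.Categorical,
                                entropy_coef: float = 0.01) -> torch.Tensor:
    # Standard policy-gradient term: reinforce actions with positive advantage.
    pg_loss = -(log_probs * advantages).mean()
    # Entropy bonus: subtracting it from the loss rewards more uncertain, exploratory policies.
    entropy_bonus = dist.entropy().mean()
    return pg_loss - entropy_coef * entropy_bonus
```

The entropy coefficient plays the role of the regularization weight: larger values push the policy toward broader exploration at the cost of slower exploitation.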

Researchers from Shanghai AI Laboratory, Tsinghua University, UIUC, Peking University, Nanjing University, and CUHK propose an approach to address the collapse of policy entropy in RL for reasoning-centric LLMs. They establish a transformation equation, R = −a·exp(H) + b, where H is policy entropy, R is downstream performance, and a and b are fitting coefficients. This empirical law implies that policy performance is traded against policy entropy and is therefore bottlenecked by its exhaustion. The researchers further analyze entropy dynamics, showing that the change in policy entropy is driven by the covariance between an action's probability and the change in its logit. Building on this, they propose two techniques, Clip-Cov and KL-Cov, which respectively clip and apply a KL penalty to tokens with high covariance.
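Since the law R = −a·exp(H) + b is linear in its coefficients, it can be fit directly to logged (entropy, performance) pairs. The sketch below does this with SciPy on placeholder numbers; the data points are invented for illustration and are not results from the paper.

```python
# Hypothetical fit of the entropy-performance law R = -a * exp(H) + b (placeholder data).
import numpy as np
from scipy.optimize import curve_fit

def entropy_law(H, a, b):
    return -a * np.exp(H) + b

# Placeholder training logs: policy entropy and the matched downstream score.
entropies = np.array([1.2, 0.9, 0.6, 0.4, 0.2, 0.1])
scores    = np.array([0.35, 0.44, 0.52, 0.57, 0.61, 0.63])

(a, b), _ = curve_fit(entropy_law, entropies, scores)

# As H -> 0, exp(H) -> 1, so the fitted performance ceiling is b - a.
print(f"a={a:.3f}, b={b:.3f}, predicted ceiling R(H=0) = {b - a:.3f}")
```

Read this way, the law gives a predictable limit: once entropy is exhausted, performance saturates at roughly b − a.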

To validate the entropy collapse phenomenon in RL-tuned LLMs, the researchers applied RL to LLMs on verifiable tasks such as math and coding, using an autoregressive generation setup in which models produce token sequences conditioned on input prompts. The study covered 11 widely adopted open-source models across four families: Qwen2.5, Mistral, LLaMA, and DeepSeek, with parameters ranging from 0.5B to 32B. Evaluations were performed on eight public benchmarks, including MATH500, AIME 2024, AMC, and Eurus-2-RL-Code. RL training followed the veRL framework in a zero-shot setting, using algorithms such as GRPO, REINFORCE++, and PRIME to optimize policy performance while tracking entropy dynamics.
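For intuition on how entropy dynamics can be observed during such training, here is a hedged sketch that estimates mean policy entropy from the logits of sampled responses; the function and tensor names are assumptions and do not reflect veRL's actual API.

```python
# Hedged sketch (not veRL's API): mean policy entropy over generated tokens in a rollout batch.
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """logits: [batch, seq_len, vocab]; response_mask: [batch, seq_len], 1.0 on generated tokens."""
    log_probs = F.log_softmax(logits, dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # [batch, seq_len]
    # Average only over response positions, ignoring prompt and padding tokens.
    return (token_entropy * response_mask).sum() / response_mask.sum().clamp(min=1)
```

Tracking this quantity across training steps is what surfaces the collapse described above: entropy steadily declines as the policy concentrates on a narrow set of strategies.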

The proposed Clip-Cov and KL-Cov techniques were evaluated on the Qwen2.5 models using the DAPOMATH dataset for math tasks. These methods achieved non-trivial performance gains across all benchmarks. Compared to the GRPO baseline, these methods improved performance by an average of 2.0% for the 7B model and 6.4% for the 32B model. Notably, when the baseline’s entropy plateaus, the KL-Cov method maintains an entropy level over ten times higher. The techniques yield more substantial gains on the larger Qwen2.5-32B model, with improvements of 15.0% and 14.6% compared to GRPO on the most challenging benchmarks, AIME24 and AIME25, respectively.
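As a purely illustrative sketch of the covariance-based idea behind Clip-Cov and KL-Cov, the snippet below scores each token by a covariance-style term between its log-probability and its advantage, then either masks the top-scoring tokens out of the policy loss or applies a KL penalty to them; the threshold, names, and KL estimator are assumptions, and the paper's exact rules differ.

```python
# Hypothetical sketch in the spirit of Clip-Cov / KL-Cov; not the authors' implementation.
import torch

def covariance_scores(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Per-token score: centered log-probability times centered advantage (covariance-style term)."""
    return (log_probs - log_probs.mean()) * (advantages - advantages.mean())

def clip_cov_mask(scores: torch.Tensor, clip_fraction: float = 0.002) -> torch.Tensor:
    """Clip-Cov-style: exclude the top fraction of high-covariance tokens from the policy loss."""
    k = max(1, int(clip_fraction * scores.numel()))
    threshold = scores.flatten().topk(k).values.min()
    return (scores < threshold).float()            # 0 on clipped tokens, 1 elsewhere

def kl_cov_penalty(scores: torch.Tensor, log_probs: torch.Tensor,
                   ref_log_probs: torch.Tensor, kl_coef: float = 1.0) -> torch.Tensor:
    """KL-Cov-style: penalize divergence from a reference policy only on high-covariance tokens."""
    high_cov = 1.0 - clip_cov_mask(scores)         # 1 on the high-covariance tokens
    kl_estimate = log_probs - ref_log_probs        # simple per-token log-ratio estimate of KL
    return kl_coef * (high_cov * kl_estimate).mean()
```

By singling out only the small set of high-covariance tokens, both variants aim to curb the updates that drain entropy fastest while leaving the rest of the policy gradient untouched.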

In conclusion, researchers have addressed the challenge of policy entropy collapse in RL for reasoning-centric LLMs. The findings highlight a trade-off between performance improvement and diminished exploration, ultimately limiting further gains. Through theoretical analysis and empirical validation, researchers identify entropy dynamics as a key bottleneck and propose two effective regularization strategies—Clip-Cov and KL-Cov—to manage high-covariance tokens and sustain exploration. As RL becomes crucial for scaling beyond pre-training, addressing entropy collapse is essential. This work provides foundational insights into the role of entropy, guiding future efforts to scale RL toward more intelligent and capable language models.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.