
Google AI Unveils Supervised Reinforcement Learning (SRL): A Step-Wise Framework with Expert Trajectories to Teach Small Language Models to Reason through Hard Problems


Understanding the Target Audience

The target audience for this content encompasses AI researchers, data scientists, and business managers interested in the application of AI technologies, specifically in the context of reinforcement learning and language models. This group is typically characterized by the following attributes:

  • Pain Points: Difficulty in implementing effective learning strategies for smaller language models, challenges in deploying AI solutions that can perform complex reasoning tasks effectively, and concerns about the limitations of existing reinforcement learning methods.
  • Goals: To identify actionable strategies for enhancing AI capabilities and to remain competitive in the rapidly evolving AI landscape. They are particularly interested in frameworks that provide measurable improvements in model performance.
  • Interests: Advancements in AI frameworks, practical applications of AI in business contexts, and new methodologies in machine learning that can drive efficiency and effectiveness.
  • Communication Preferences: Preference for concise, technical content that includes empirical data and real-world applications. They favor detailed explanations with a focus on implementation and results.

Overview of Supervised Reinforcement Learning (SRL)

The research team from Google Cloud AI Research and UCLA has introduced a novel training framework known as Supervised Reinforcement Learning (SRL). The framework teaches small language models to work through difficult problems without falling into rote imitation and without depending on the model producing fully correct rollouts on its own. SRL retains reinforcement-learning-style optimization while injecting supervision from expert trajectories through the reward channel.
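
To make the step-wise idea concrete, below is a minimal sketch of how an expert trajectory could be decomposed into per-step training items, each pairing a teacher-forced prefix of expert steps with the next expert action the policy should reproduce. The names (StepItem, decompose_trajectory) and data layout are illustrative assumptions, not code from the paper.

```python
# Minimal sketch: turning one expert trajectory into step-wise training items.
# Names and structure are illustrative assumptions, not the paper's actual code.
from dataclasses import dataclass

@dataclass
class StepItem:
    prompt: str          # problem statement plus the expert steps seen so far
    expert_action: str   # the next expert action the policy should match

def decompose_trajectory(problem: str, expert_steps: list[str]) -> list[StepItem]:
    """Each expert step becomes one training item whose context is the
    problem plus all earlier expert steps (a teacher-forced prefix)."""
    items = []
    for i, step in enumerate(expert_steps):
        prefix = "\n".join(expert_steps[:i])
        prompt = f"{problem}\n{prefix}".strip()
        items.append(StepItem(prompt=prompt, expert_action=step))
    return items

# Example usage: a two-step expert solution yields two step-wise items.
items = decompose_trajectory(
    problem="Solve: 2x + 3 = 11",
    expert_steps=["Subtract 3 from both sides: 2x = 8", "Divide by 2: x = 4"],
)
print(len(items))  # 2
```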

Key Features of SRL

SRL operates on expert trajectories from datasets such as s1K 1.1, decomposing each trajectory into a sequence of step-wise actions. At every step, the model first generates an intermediate reasoning span, wrapped in dedicated tags, before emitting the action for that step. Because each action receives a dense reward based on its similarity to the expert's action sequence, even rollouts that end in an incorrect final answer still provide a learning signal, rather than the reward depending solely on outcome-level correctness.
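
The dense, step-level reward can be pictured as a similarity score between the model's visible action and the expert's action, computed after stripping the private reasoning span. The sketch below uses difflib string similarity and a <think>-style tag as stand-ins; the actual tag format and similarity metric in the paper may differ.

```python
# Minimal sketch of a dense, step-level reward based on similarity between the
# model's action and the expert's action. The <think> tag and the use of
# difflib are illustrative assumptions, not the paper's exact formulation.
import difflib
import re

def extract_action(generation: str) -> str:
    """Drop the private reasoning span and keep only the visible action."""
    return re.sub(r"<think>.*?</think>", "", generation, flags=re.DOTALL).strip()

def step_reward(generation: str, expert_action: str) -> float:
    """Reward in [0, 1]: similarity of the generated action to the expert
    action, so a step earns partial credit even if the final answer is wrong."""
    action = extract_action(generation)
    return difflib.SequenceMatcher(None, action, expert_action).ratio()

print(step_reward(
    "<think>Isolate x first.</think> Subtract 3 from both sides: 2x = 8",
    "Subtract 3 from both sides: 2x = 8",
))  # close to 1.0
```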

Mathematical Results

All models were initialized from Qwen2.5 7B Instruct and trained on the DeepSeek R1 formatted s1K 1.1 dataset. The performance comparison shows:

  • Base Qwen2.5 7B Instruct: AMC23 greedy 50.0, AIME24 greedy 13.3, AIME25 greedy 6.7.
  • SRL model performance: AMC23 greedy 50.0, AIME24 greedy 16.7, AIME25 greedy 13.3.
  • SRL followed by RLVR performance: AMC23 greedy 57.5, AIME24 greedy 20.0, AIME25 greedy 10.0.

Software Engineering Results

When SRL was applied to the Qwen2.5 Coder 7B Instruct model using verified agent trajectories, the results showed clear improvements:

  • Base model performance: 5.8% in oracle file edit mode, 3.2% end-to-end.
  • SWE Gym 7B performance: 8.4% in oracle mode, 4.2% end-to-end.
  • SRL performance: 14.8% in oracle mode, 8.6% end-to-end.

Key Takeaways

SRL changes how small models can be trained to perform complex reasoning. The key points include:

  • SRL reframes intricate reasoning tasks as step-wise action generation.
  • Internal reasoning processes are decoupled from final outputs, allowing for more flexible learning.
  • The combination of SRL followed by RLVR delivers the best results, suggesting a practical training recipe for other open models (see the sketch after this list).
  • SRL is a versatile solution applicable across various domains, including software engineering.
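
As a high-level picture of that two-stage recipe, the sketch below chains an SRL pass with dense step-level rewards into an RLVR pass with sparse verifier rewards, reusing the hypothetical helpers from the earlier sketches. The trainer interface (model.generate, model.reinforce, problem.verify) is a placeholder, not a real library API.

```python
# Illustrative two-stage curriculum: SRL on expert trajectories, then RLVR on
# verifiable problems. All interfaces here are hypothetical placeholders.
def train_srl_then_rlvr(model, expert_trajectories, verifiable_problems):
    # Stage 1: SRL -- dense, step-wise rewards from expert-action similarity.
    for traj in expert_trajectories:
        for item in decompose_trajectory(traj.problem, traj.steps):
            generation = model.generate(item.prompt)
            reward = step_reward(generation, item.expert_action)
            model.reinforce(item.prompt, generation, reward)

    # Stage 2: RLVR -- sparse, outcome-level rewards from an automatic verifier.
    for problem in verifiable_problems:
        generation = model.generate(problem.prompt)
        reward = 1.0 if problem.verify(generation) else 0.0
        model.reinforce(problem.prompt, generation, reward)
    return model
```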

Conclusion

Supervised Reinforcement Learning provides a practical approach for improving the performance of small language models in solving challenging tasks. The benefits outlined in the research suggest that SRL is an effective bridge between process supervision and reinforcement learning, making it a valuable framework for teams looking to enhance the capabilities of open models.
