Sakana AI Introduces Reinforcement-Learned Teachers (RLTs): Efficiently Distilling Reasoning in LLMs Using Small-Scale Reinforcement Learning
Sakana AI has introduced a new framework for enhancing reasoning in large language models (LLMs) called Reinforcement-Learned Teachers (RLTs). The approach focuses on efficiency and reusability, addressing significant challenges in traditional reinforcement learning (RL) methods.
Understanding the Target Audience
The primary audience for this framework includes:
- Data Scientists and AI Researchers: Interested in improving model performance and efficiency.
- Business Managers: Seeking practical applications of AI to enhance productivity and decision-making.
- Technical Decision-Makers: Responsible for implementing AI solutions within organizations.
- Pain Points: High computational costs and inefficiencies in current RL models.
- Goals: To achieve better performance with lower resource consumption and to enhance model interpretability.
- Interests: Innovations in AI, particularly those that can be applied in business settings.
- Communication Preferences: Technical insights presented clearly, with a focus on practical applications and outcomes.
Revisiting Reinforcement Learning for Teaching
Traditional RL trains models to solve problems autonomously, rewarding them with sparse, correctness-based signals. This creates a disconnect between the objective the model is optimized for (solving tasks on its own) and the actual downstream need: teaching smaller models. RLTs close this gap by prompting the teacher with both the problem and its solution and requiring it to generate a detailed pedagogical explanation. The explanation is scored with a dense, student-aligned reward that measures how well a student model understands the explanation and reproduces the solution.
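To make the setup concrete, below is a minimal sketch of how such a teacher prompt could be constructed. The wording, field layout, and function name are illustrative assumptions, not the exact template from Sakana AI's released code.

```python
# Illustrative sketch only: the prompt wording and layout are assumptions,
# not Sakana AI's actual RLT template.

def build_teacher_prompt(question: str, solution: str) -> str:
    """Condition the teacher on both the problem and its ground-truth solution,
    asking for a step-by-step explanation that connects the two."""
    return (
        "You are a teacher. Explain, step by step, how to get from the "
        "problem to the given solution so a student could reproduce it.\n\n"
        f"Problem:\n{question}\n\n"
        f"Solution:\n{solution}\n\n"
        "Explanation:"
    )

print(build_teacher_prompt(
    question="What is the sum of the first 10 positive integers?",
    solution="55",
))
```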
Core Concept: Dense, Student-Aligned Rewards
The RLT training objective comprises two key reward terms:
- Solution Score (rSS): Assesses the student’s ability to reconstruct the correct solution given the explanation and the problem.
- Explanation Score (rKL): Evaluates the logical coherence of the teacher’s explanation from the student’s perspective.
These components create a dense reward signal that promotes instructive and comprehensible explanations, effectively overcoming traditional RL’s exploration bottleneck.
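As a rough illustration of how such a dense signal could be assembled from per-token student log-probabilities, consider the sketch below. The averaging, the log-probability-gap approximation of the KL-style term, and the weighting factor alpha are assumptions made for exposition, not the paper's exact formulation.

```python
# Hedged sketch of a dense, student-aligned teacher reward.
# The normalization, KL approximation, and alpha weighting are assumptions.

def solution_score(student_logprobs_solution: list[float]) -> float:
    """r_SS: mean log-probability the student assigns to the ground-truth
    solution tokens when conditioned on the problem plus the explanation."""
    return sum(student_logprobs_solution) / max(len(student_logprobs_solution), 1)

def explanation_score(student_logprobs_expl: list[float],
                      teacher_logprobs_expl: list[float]) -> float:
    """r_KL: penalize explanation tokens that look implausible to the student,
    approximated here by the mean teacher-vs-student log-probability gap."""
    gaps = [t - s for t, s in zip(teacher_logprobs_expl, student_logprobs_expl)]
    return -sum(gaps) / max(len(gaps), 1)

def rlt_reward(solution_lps: list[float],
               student_expl_lps: list[float],
               teacher_expl_lps: list[float],
               alpha: float = 1.0) -> float:
    """Dense teacher reward: weighted combination of the two terms."""
    return solution_score(solution_lps) + alpha * explanation_score(
        student_expl_lps, teacher_expl_lps)
```

Because every token of the solution and explanation contributes to the score, the teacher receives graded feedback on each step rather than a single pass/fail signal at the end.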
Surprising Efficacy of Small Teachers
Sakana AI demonstrates that a 7B-parameter RLT can outperform significantly larger LLMs (32B and above) as a distillation teacher, with students evaluated on benchmarks including AIME 2024, MATH 500, and GPQA Diamond. Notably:
- RLT-7B surpasses DeepSeek R1, Bespoke-7B, and even post-processed RL traces on a 17K-question corpus.
- RLT-32B outperforms all 32B baselines, despite being distilled from a smaller teacher.
The advantages extend beyond parameter efficiency, as RLTs achieve better generalization, fewer formatting errors, and increased interpretability.
Cold-Starting Reinforcement Learning with RLTs
RLTs also play a crucial role in RL cold-starting, where an initial model is enhanced with external data before formal RL training. The traces generated by RLTs are more effective cold-start material than those from larger RL-trained models, leading to improved performance gains during RL fine-tuning.
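A minimal sketch of how RLT traces might be converted into cold-start supervised fine-tuning examples appears below; the record fields and prompt/completion format are illustrative assumptions, not the paper's pipeline.

```python
# Hedged sketch: turning RLT explanation traces into cold-start SFT examples.
# Field names ("question", "explanation", "solution") are assumptions.

def traces_to_sft_examples(traces):
    """Each trace pairs a problem with the teacher's explanation and the known
    solution; the student is fine-tuned to emit the explanation followed by
    the solution before any RL training begins."""
    for trace in traces:
        yield {
            "prompt": trace["question"],
            "completion": trace["explanation"] + "\n\nFinal answer: " + trace["solution"],
        }

examples = list(traces_to_sft_examples([
    {"question": "2 + 2 = ?", "explanation": "Add 2 and 2 to get 4.", "solution": "4"},
]))
```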
Out-of-Domain Generalization and Zero-Shot Transfer
RLTs exhibit strong zero-shot transfer capabilities. When applied to a new domain, such as the arithmetic-based “Countdown” task, RLT-generated traces enable student models to outperform direct RL training on that domain. This suggests that the skill of explaining a known solution generalizes across tasks more readily than solving from scratch does.
Training Pipeline: Efficient and Scalable
The training process is computationally efficient, requiring only:
- 250 RL steps (~1 epoch)
- Batch size: 256
- Group size: 64
This setup runs on a single compute node, with Qwen2.5-7B-Instruct as the teacher's base model. Unlike traditional RL pipelines, RLTs need no post-processing, formatting corrections, or verification filters: raw outputs are immediately usable.
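Collected into a single configuration sketch (the key names are illustrative, not any specific trainer's API), the reported settings look roughly like this:

```python
# Reported settings from the article; dictionary keys are illustrative,
# not tied to any particular RL training library.
rlt_training_config = {
    "base_model": "Qwen2.5-7B-Instruct",  # teacher initialization
    "rl_steps": 250,                      # roughly one epoch over the corpus
    "batch_size": 256,                    # prompts per RL step
    "group_size": 64,                     # sampled explanations per prompt
    "hardware": "single node",            # no multi-node setup reported
    "post_processing": None,              # raw traces are used directly
}
```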
Conclusion
Overall, Sakana AI’s RLT framework offers a scalable blueprint for building reasoning-capable LLMs with modest compute resources and open-source tools.
For further technical details, refer to the original research paper.