Shanghai Jiao Tong Researchers Propose OctoThinker for Reinforcement Learning-Scalable LLM Development
Introduction: Reinforcement Learning Progress through Chain-of-Thought Prompting
Large Language Models (LLMs) have demonstrated significant advances on complex reasoning tasks through Chain-of-Thought (CoT) prompting combined with large-scale reinforcement learning (RL). Models such as DeepSeek-R1-Zero have exhibited strong reasoning capabilities by applying RL directly to base models. Similarly, methods like SimpleRL and Open-Reasoner-Zero show improvements in smaller models, including the Qwen series. However, achieving consistent success across different base model families remains a challenge. The difficulty of applying R1-Zero-style training to base models such as the Llama series raises fundamental questions about which underlying factors cause different models to behave so differently during reinforcement learning.
Limitations of RL Scaling on Llama Models
While large-scale RL advances have been demonstrated in models like OpenAI's o1 and o3 and DeepSeek's R1, there is strong motivation to explore RL on smaller models with fewer than 100B parameters. However, these efforts have been largely limited to the Qwen model family, and replicating the results on families such as Llama has proven difficult. The lack of transparency in pre-training pipelines complicates understanding of how pre-training influences RL scaling. Some unconventional findings indicate that one-shot prompting improves reasoning in Qwen models but offers little benefit to Llama models. Efforts to curate high-quality mathematical pre-training corpora through initiatives like OpenWebMath, MathPile, InfiMM-Web-Math, and FineMath have made progress, yet these corpora remain limited in scale to under 100B tokens.
Exploring Mid-Training with Stable-then-Decay Strategy
Researchers from Shanghai Jiao Tong University have investigated how mid-training strategies affect RL dynamics, particularly focusing on Qwen and Llama models. The study presents several findings:
- High-quality mathematical corpora, such as MegaMath-Web-Pro, enhance both base model and RL outcomes.
- Utilizing QA-style data, especially with long CoT reasoning, further boosts RL results.
- Long CoT prompts can introduce verbosity and instability in RL training.
- Scaling up the mid-training token budget leads to improved downstream RL performance.
The researchers introduced a two-stage mid-training strategy called Stable-then-Decay, in which base models are first trained on 200B tokens (the stable stage) and then on 20B tokens across three CoT-focused branches (the decay stage). This approach produced the OctoThinker models, which exhibit strong compatibility with RL.
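To make the two-stage recipe concrete, here is a minimal sketch of how a stable-then-decay learning-rate schedule could be implemented. The 200B/20B token budgets follow the description above; the peak learning rate, final learning rate, and cosine decay shape are illustrative assumptions, not the authors' exact hyperparameters.

```python
# Minimal sketch of a stable-then-decay learning-rate schedule.
# Token budgets (200B stable, 20B decay) follow the article; the LR values
# and cosine decay shape are assumed for illustration.
import math

STABLE_TOKENS = 200e9   # stage 1: constant LR on the mid-training corpus
DECAY_TOKENS = 20e9     # stage 2: one decay run per CoT-focused branch
PEAK_LR = 3e-4          # assumed peak learning rate
FINAL_LR = 3e-5         # assumed final learning rate at the end of decay

def stable_then_decay_lr(tokens_seen: float) -> float:
    """Return the learning rate after `tokens_seen` training tokens."""
    if tokens_seen <= STABLE_TOKENS:
        return PEAK_LR  # stable stage: hold the learning rate constant
    # decay stage: cosine decay from PEAK_LR to FINAL_LR over the decay budget
    progress = min((tokens_seen - STABLE_TOKENS) / DECAY_TOKENS, 1.0)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1 + math.cos(math.pi * progress))
```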
RL Configuration and Benchmark Evaluation
The MATH8K dataset provided the RL training prompts. The configuration included a global training batch size of 128, 16 rollout responses per query, and a PPO mini-batch size of 64, with experiments conducted on Llama-3.2-3B-Base and Qwen2.5-3B-Base. Evaluation used few-shot prompting for base language models and zero-shot prompting for RL-tuned models across indicator tasks including GSM8K, MATH500, OlympiadBench, and AMC23. During RL training, Qwen models showed increasing yet reasonable response lengths, whereas Llama exhibited abnormal behavior, with average response lengths ballooning to 4,096 tokens. Evaluation revealed that the RL-tuned Qwen2.5-3B achieved improvements across benchmarks, while Llama-3.2-3B showed only marginal gains.
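The sketch below collects the reported hyperparameters into a single configuration object. The batch size, rollouts per query, and PPO mini-batch size come from the text; the dataclass fields, learning rate, and KL coefficient are assumptions made to keep the example self-contained, not the authors' exact recipe.

```python
# Illustrative PPO training configuration mirroring the setup described above.
# Values marked "assumed" are not from the article.
from dataclasses import dataclass

@dataclass
class PPOConfig:
    base_model: str = "meta-llama/Llama-3.2-3B"   # or "Qwen/Qwen2.5-3B"
    prompt_dataset: str = "MATH8K"                 # source of RL training prompts
    train_batch_size: int = 128                    # global training batch size
    rollouts_per_query: int = 16                   # sampled responses per prompt
    ppo_mini_batch_size: int = 64                  # PPO update mini-batch size
    max_response_tokens: int = 4096                # cap on generated response length
    learning_rate: float = 1e-6                    # assumed
    kl_coef: float = 0.01                          # assumed KL penalty weight

config = PPOConfig()
print(config)
```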
OctoThinker Outperforms Llama in RL Compatibility
Each OctoThinker branch showed a 10%-20% improvement over the original Llama base model, with consistent gains over the stable-stage model across all model sizes when evaluated on 13 mathematical benchmarks. The OctoThinker-Zero families exhibited varied thinking behaviors during RL scaling, with the OctoThinker-Long variant performing particularly strongly. When the three 3B-scale base models were compared during RL training, OctoThinker-Long-3B surpassed the original Llama-3.2-3B model and reached performance parity with Qwen2.5-3B, a model known for its robust reasoning capabilities and extensive pre-training. The hybrid and short branches showed slightly lower performance, particularly on challenging benchmarks.
Conclusion and Future Work: Toward RL-Ready Foundation Models
This research highlights the reasons behind the divergent behaviors of base models like Llama and Qwen during RL for reasoning, emphasizing the significant role of mid-training in RL scalability. The two-stage mid-training strategy effectively transforms Llama into a foundation model better suited for RL, culminating in the development of OctoThinker models. Future research directions include:
- Curating higher-quality mathematical corpora to enhance mid-training.
- Creating RL-friendly base models using open recipes without distillation from long CoT reasoning models.
- Separating the QA format and content to individually assess their contributions.
- Expanding the OctoThinker family with new branches, such as tool-integrated reasoning.
All credit for this research goes to the researchers of this project. Check out the Paper, Hugging Face Page, and GitHub Page.