Prefix-RFT: A Unified Machine Learning Framework to Blend Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT)

Understanding the Target Audience

The target audience for Prefix-RFT includes machine learning researchers, data scientists, and business leaders interested in the application of advanced machine learning techniques. Their pain points often revolve around the limitations of existing fine-tuning methods, such as the rigidity of SFT and the instability of RFT. They seek solutions that enhance model performance while maintaining adaptability and efficiency. Their goals include improving the accuracy of large language models (LLMs) in real-world applications, optimizing resource use, and achieving better generalization across diverse tasks. This audience prefers clear, technical communication that includes data-driven insights and practical applications.

The Need for a Unified Framework

Large language models are typically refined after pretraining using either supervised fine-tuning (SFT) or reinforcement fine-tuning (RFT), each with distinct strengths and limitations. SFT is effective in teaching instruction-following through example-based learning, but it can lead to rigid behavior and poor generalization. RFT, on the other hand, optimizes models for task success using reward signals, which can improve performance but also introduce instability and reliance on a strong starting policy. While these methods are often used sequentially, their interaction remains poorly understood. This raises an important question: how can we design a unified framework that combines SFT’s structure with RFT’s goal-driven learning?
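To make the contrast concrete, here is a minimal sketch of the two objectives, assuming a Hugging-Face-style causal LM; the function names and data handling are illustrative placeholders, not code from the paper. SFT minimizes cross-entropy against a fixed demonstration, while RFT weights the log-probability of a sampled completion by a scalar reward.

```python
# Illustrative sketch only: SFT imitates a demonstration, RFT reinforces rewarded samples.
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, demo_ids):
    """Supervised fine-tuning: cross-entropy against a reference demonstration."""
    input_ids = torch.cat([prompt_ids, demo_ids], dim=-1)
    logits = model(input_ids).logits[:, :-1, :]                 # predict each next token
    targets = input_ids[:, 1:]
    # Prompt tokens are included here for brevity; in practice they are usually masked out.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

def rft_loss(model, prompt_ids, sampled_ids, reward):
    """Reinforcement fine-tuning (REINFORCE-style): weight sampled tokens by a scalar reward."""
    input_ids = torch.cat([prompt_ids, sampled_ids], dim=-1)
    logits = model(input_ids).logits[:, :-1, :]
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    completion_logp = token_logp[:, prompt_ids.size(1) - 1:]    # only the sampled completion counts
    return -(reward * completion_logp.sum(dim=-1)).mean()       # maximize expected reward
```

The first objective anchors the model to demonstrations but can make it rigid; the second lets it search for its own solutions but depends on a reliable reward and a strong starting policy, which is exactly the trade-off a unified framework needs to reconcile.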

Research Insights

Research at the intersection of reinforcement learning (RL) and LLM post-training has gained momentum, particularly for training reasoning-capable models. Offline RL, which learns from fixed datasets, often yields suboptimal policies due to the limited diversity of the data. This has sparked interest in combining offline and online RL approaches to improve performance. In LLMs, the dominant strategy is to first apply SFT to teach desirable behaviors, then use RFT to optimize outcomes. However, the dynamics between SFT and RFT are still not well understood, and finding effective ways to integrate them remains an open research challenge.

Introducing Prefix-RFT

Researchers from the University of Edinburgh, Fudan University, Alibaba Group, Stepfun, and the University of Amsterdam propose a unified framework called Prefix-RFT. The method guides exploration with partial demonstrations: the model receives the opening portion of a worked solution as a prefix and generates the remainder itself, preserving flexibility and adaptability. Tested on math reasoning tasks, Prefix-RFT consistently outperforms standalone SFT, RFT, and mixed-policy methods. It integrates easily into existing frameworks and is robust to changes in demonstration quality and quantity. The results suggest that blending demonstration-based learning with exploration yields more effective and adaptive training of large language models.
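The core idea can be sketched in a few lines. The following is a simplified illustration rather than the authors' implementation; `policy`, `tokenizer`, `reward_fn`, and `rl_update` are hypothetical placeholders standing in for a policy-gradient training loop.

```python
# Simplified sketch of one Prefix-RFT rollout (names are illustrative placeholders).
def prefix_rft_step(policy, tokenizer, problem, demonstration, reward_fn, prefix_ratio):
    """Guide exploration with a partial demonstration, then let the policy finish it."""
    demo_tokens = tokenizer.encode(demonstration)
    cut = int(len(demo_tokens) * prefix_ratio)              # how much of the demo to reveal
    prefix = tokenizer.decode(demo_tokens[:cut])

    # The model continues from the demonstration prefix instead of starting from scratch.
    completion = policy.generate(problem + prefix)

    # The reward judges the full solution (demonstration prefix + model continuation).
    reward = reward_fn(problem, prefix + completion)

    # Any policy-gradient-style update can consume this rollout.
    rl_update(policy, problem, prefix, completion, reward)
```

Because the prefix comes from a demonstration while the continuation is sampled by the model, each rollout mixes imitation and exploration within a single trajectory.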

Technical Specifications

Prefix-RFT is a reward-based fine-tuning method trained on a high-quality offline math dataset, OpenR1-Math-220K (filtered to 46k problems). It was tested on Qwen2.5-Math-7B, Qwen2.5-Math-1.5B, and LLaMA-3.1-8B, and evaluated on benchmarks including AIME 2024/25, AMC, MATH500, Minerva, and OlympiadBench. Prefix-RFT achieved the highest avg@32 and pass@1 scores across tasks, outperforming RFT, SFT, ReLIFT, and LUFFY. Using Dr. GRPO, it updated only the top 20% highest-entropy prefix tokens, with the prefix length decaying from 95% to 5% over training. Its SFT loss stayed at an intermediate level, indicating a strong balance between imitation and exploration, especially on difficult problems.
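As a rough illustration of the entropy-based selection, the sketch below keeps only the highest-entropy prefix positions in the update while leaving the sampled completion untouched; the function, tensor shapes, and threshold handling are assumptions for clarity rather than the paper's code.

```python
# Sketch of top-k entropy masking over demonstration-prefix tokens (illustrative only).
import torch

def top_entropy_mask(logits, prefix_len, keep_ratio=0.2):
    """Return a boolean mask that is True for the top-`keep_ratio` highest-entropy
    prefix positions and for every completion position after the prefix."""
    probs = torch.softmax(logits, dim=-1)                               # (seq_len, vocab)
    entropy = -(probs * torch.log(probs.clamp_min(1e-9))).sum(dim=-1)   # (seq_len,)

    mask = torch.ones(logits.size(0), dtype=torch.bool)                 # completion tokens stay in
    k = max(1, int(prefix_len * keep_ratio))
    top_idx = torch.topk(entropy[:prefix_len], k).indices

    prefix_mask = torch.zeros(prefix_len, dtype=torch.bool)
    prefix_mask[top_idx] = True                                         # most uncertain prefix tokens
    mask[:prefix_len] = prefix_mask
    return mask
```

A plausible reading is that low-entropy prefix tokens are already well predicted and add little beyond pure imitation, so concentrating the update on uncertain positions keeps the demonstration signal focused where it matters.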

Conclusion

Prefix-RFT combines the strengths of SFT and RFT by using sampled demonstration prefixes to guide learning. Despite its simplicity, it consistently outperforms SFT, RFT, and hybrid baselines across models and datasets. Even with only 1% of the training data (450 prompts), it maintains strong performance (avg@32 drops only from 40.8 to 37.6), demonstrating efficiency and robustness. Its top-20% entropy-based token-update strategy proves the most effective variant, achieving the highest benchmark scores with shorter outputs. Moreover, a cosine decay scheduler for the prefix length improves stability and learning dynamics compared with a uniform strategy, particularly on complex tasks.
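For reference, a cosine decay of the prefix ratio from 95% to 5% could look like the sketch below; the exact schedule and hyperparameters in the paper may differ, and the function name is illustrative.

```python
# Illustrative cosine decay of the revealed-prefix fraction over training (95% -> 5%).
import math

def prefix_ratio(step, total_steps, start=0.95, end=0.05):
    """Cosine-decay the fraction of the demonstration revealed as a prefix."""
    progress = min(step / total_steps, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes 1 -> 0 over training
    return end + (start - end) * cosine

# Early in training the model sees most of the demonstration; late in training it
# must produce nearly the whole solution itself:
#   prefix_ratio(0, 1000)    -> 0.95
#   prefix_ratio(500, 1000)  -> 0.50
#   prefix_ratio(1000, 1000) -> 0.05
```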

Further Reading

Check out the Paper for more detailed insights, and see our GitHub page for tutorials, code, and notebooks.