Understanding Limitations of Current Reward Models
Although reward models are central to Reinforcement Learning from Human Feedback (RLHF), many of today's top-performing open reward models struggle to reflect the full range of complex human preferences. Even with advanced training techniques, meaningful progress has been limited. A major reason is the shortcomings of current preference datasets, which tend to be narrow in scope, synthetically generated, or poorly vetted. While rule-based systems work well for clearly verifiable tasks such as math and coding, they typically fail to capture nuanced human judgment. In addition, common benchmarks like RewardBench are becoming less reliable indicators of real-world RM performance, showing weak correlation with downstream task success.
Challenges in Preference Data Creation and New Approaches
Creating high-quality preference data has traditionally relied on human annotators, a method that is time-consuming, costly, and sometimes inconsistent. Recent techniques such as Reinforcement Learning from AI Feedback (RLAIF) use large language models (LLMs) to automate annotation, and in some settings these automated labels rival or exceed those of human annotators. Newer approaches aim to combine the strengths of both by pairing LLM-generated data with human-verified labels. Reward modeling itself has also evolved, from discriminative scoring based on the Bradley-Terry formulation to more complex frameworks, including generative reward models and direct optimization methods. Yet despite the availability of many capable open models and datasets, accurately capturing nuanced human preferences across diverse tasks and languages remains difficult.
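For readers unfamiliar with the Bradley-Terry formulation, the sketch below shows the pairwise objective that a classic discriminative reward model optimizes: the model assigns a scalar reward to each response, and the loss pushes the chosen response's reward above the rejected one's. The function name and toy tensors are illustrative, not taken from the paper.

```python
# Minimal Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected),
# averaged over a batch of preference pairs.
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the chosen response beats the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: scalar rewards for three preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.4, 0.9, -1.0])
print(bradley_terry_loss(r_chosen, r_rejected))
```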
Introducing SynPref-40M: Large-Scale Human-AI Preference Dataset
Researchers from 2050 Research and Skywork AI introduce SynPref-40M, a massive dataset of 40 million preference pairs curated through a two-stage human-AI pipeline. Human annotators ensure quality through strict verification, while LLMs enhance data curation using human guidance. This effort leads to the development of Skywork-Reward-V2, a family of eight reward models (0.6B–8B parameters) trained on a high-quality subset of 26 million preference pairs. These models achieve state-of-the-art results across seven leading benchmarks, excelling in alignment, safety, objectivity, and robustness. The study emphasizes that success stems not just from data volume but from careful, iterative curation that blends human expertise with AI scalability.
Scalable Two-Stage Human-AI Curation Pipeline
Current open reward models often overfit to narrow benchmarks such as RewardBench, which limits their real-world usefulness. To address this, the researchers introduce a two-stage human-AI pipeline for curating large-scale preference data. In the first stage, human-verified annotations guide LLMs in labeling diverse preference attributes, followed by iterative training and error analysis to refine the reward model. The second stage scales the process with consistency checks between the current best reward model and a "gold" reward model trained on human-verified data, filtering reliable samples without further human input. This approach balances quality with scalability, enabling the creation of tens of millions of high-quality preference pairs.
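To make the second stage more concrete, here is a minimal sketch of what such consistency filtering could look like, assuming each reward model can be called as a scoring function. The helper names and the exact agreement criterion are hypothetical and may differ from the paper's actual procedure.

```python
# Hypothetical stage-2 consistency filter: keep an LLM-labeled preference pair
# only if both the current best RM and the human-trained "gold" RM agree that
# the chosen response scores higher than the rejected one.
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (prompt, chosen, rejected) as labeled by the LLM

def consistency_filter(pairs: List[Pair],
                       best_rm: Callable[[str, str], float],
                       gold_rm: Callable[[str, str], float]) -> List[Pair]:
    """Return only the pairs on which both reward models confirm the label."""
    kept = []
    for prompt, chosen, rejected in pairs:
        best_agrees = best_rm(prompt, chosen) > best_rm(prompt, rejected)
        gold_agrees = gold_rm(prompt, chosen) > gold_rm(prompt, rejected)
        if best_agrees and gold_agrees:
            kept.append((prompt, chosen, rejected))
    return kept
```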
Benchmarking Skywork-Reward-V2: Compact Yet Powerful Models
The Skywork-Reward-V2 series demonstrates strong performance across multiple benchmarks, outperforming both larger models (e.g., 70B parameters) and emerging generative reward models. Trained using Qwen3 (0.6B–8B) and Llama 3.1/3.2 (1B–8B) backbones, these models achieve high scores on RewardBench, PPE, RM-Bench, and JudgeBench, with the best-performing variant (Llama-3.1-8B-40M) surpassing all others with an average score of 88.6. Despite smaller model sizes, Skywork-Reward-V2 models benefit from high-quality preference data (SynPref-40M) and efficient training setups, enabling them to generalize better in real-world RLHF scenarios. Notably, even mid-sized models like the Qwen3-1.7B outperform some 70B models, emphasizing the impact of training data quality and methodology over sheer parameter count.
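As a rough illustration of how a reward model in this family can be queried, the snippet below scores a single conversation with Hugging Face transformers. The repository id and the assumption of a standard sequence-classification head with one scalar logit are assumptions here; consult the released model cards for exact names and recommended settings.

```python
# Hedged sketch: scoring one conversation with a Skywork-Reward-V2 checkpoint,
# assuming it is published as a sequence-classification reward model with a
# single scalar logit. The repo id below is an assumption; check the model card.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Skywork/Skywork-Reward-V2-Llama-3.1-8B-40M"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

conversation = [
    {"role": "user", "content": "Explain what a reward model is in one sentence."},
    {"role": "assistant", "content": "A reward model scores responses by how well they match human preferences."},
]
input_ids = tokenizer.apply_chat_template(
    conversation, tokenize=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    reward = model(input_ids).logits[0][0].item()  # higher means more preferred
print(reward)
```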
Conclusion and Future Outlook: Scaling with Precision
In conclusion, SynPref-40M is a large-scale preference dataset built through a two-stage human-AI collaboration, combining human judgment with LLM-based scalability. Using a curated subset of 26 million preference pairs, the team developed Skywork-Reward-V2, a suite of eight reward models (0.6B–8B parameters) that outperform existing models across seven key benchmarks. These models exhibit strong generalization in aligning with human values, ensuring correctness, safety, and robustness to bias. Extensive studies confirm that both data quality and curation method are key performance drivers. Looking forward, the researchers aim to explore new training strategies as reward models become central to LLM development and alignment.
Further Reading
Check out the Paper, Model on Hugging Face, and GitHub Page. All credit for this research goes to the researchers of this project.