
Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better Alignment


Introduction

Recent advancements in large language models (LLMs) have brought attention to their capabilities in reasoning and judgment. Researchers from Microsoft and Tsinghua University have introduced Reward Reasoning Models (RRMs), which aim to enhance the alignment of LLMs through dynamic scaling of computational resources during test-time evaluations.

The Role of Reinforcement Learning in LLMs

Reinforcement learning (RL) plays a pivotal role in the post-training of LLMs, leveraging either human feedback (RLHF) or verifiable rewards (RLVR). While RLVR shows promise in mathematical reasoning, it depends on training queries with verifiable answers, which prevents its application to general-domain queries where verification is infeasible.

Challenges with Current Reward Models

Current reward models can be classified into scalar and generative types. Scalar models assign numeric scores to query-response pairs, while generative models provide natural language feedback. However, these models often utilize uniform computational resources across inputs, failing to adaptively allocate additional resources to more complex queries.

Introducing Reward Reasoning Models (RRMs)

To address these limitations, RRMs focus on explicit reasoning prior to reward assignment. By performing a reasoning phase, RRMs can adaptively allocate computational resources for evaluating responses to complex tasks. This approach allows for enhanced reward modeling and supports diverse evaluation scenarios.

Technical Specifications and Business Applications

RRMs are built on the Qwen2 model with a Transformer-decoder architecture and frame reward modeling as a text-completion task: the model autoregressively generates a reasoning process followed by a final judgment. Each input consists of a query and two responses, and the model must select the preferred response; ties are not permitted.
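To make the text-completion framing concrete, here is a minimal Python sketch of pairwise judging with a Hugging Face causal LM. The checkpoint name, prompt template, and "Verdict:" parsing are illustrative assumptions rather than the authors' exact setup.

```python
# Minimal sketch: pairwise reward judging framed as text completion.
# Prompt wording, verdict format, and model name are assumptions, not the paper's exact recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2-7B-Instruct"  # placeholder base model; the released RRM checkpoints differ

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def judge(query: str, response_a: str, response_b: str) -> str:
    """Generate a reasoning trace, then return a verdict: 'A' or 'B' (no ties)."""
    prompt = (
        "You are an impartial judge. Reason step by step about which response "
        "better answers the query, then finish with 'Verdict: A' or 'Verdict: B'.\n\n"
        f"Query: {query}\n\nResponse A: {response_a}\n\nResponse B: {response_b}\n\nReasoning:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Sampling keeps judgments stochastic, which later enables majority voting.
    outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)
    text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return "A" if "Verdict: A" in text else "B"
```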

Evaluation criteria adapted from the RewardBench repository guide systematic analysis, covering instruction fidelity, helpfulness, accuracy, harmlessness, and level of detail. For multi-response evaluation, RRMs combine pairwise judgments through ELO rating systems and knockout tournaments, making fuller use of test-time compute.
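As a rough sketch of how a knockout tournament over N candidate responses could be driven by such a pairwise judge: the bracket logic below is a plain single-elimination scheme with majority-decided matches, an assumption on my part rather than the paper's exact procedure, and it reuses the hypothetical judge() helper from the previous sketch.

```python
import random

def knockout_tournament(query, responses, judge, rounds_per_match=3):
    """Pick the best of N responses via single elimination, deciding each
    match by a majority of `rounds_per_match` pairwise judgments."""
    pool = list(responses)
    random.shuffle(pool)  # random seeding of the bracket
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            wins_a = sum(judge(query, a, b) == "A" for _ in range(rounds_per_match))
            next_round.append(a if 2 * wins_a > rounds_per_match else b)
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # odd candidate out gets a bye
        pool = next_round
    return pool[0]
```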

Performance Evaluation

Evaluation results indicate that RRMs achieve competitive performance against robust baselines on RewardBench and PandaLM Test benchmarks. The RRM-32B model attains an accuracy of 98.6% in reasoning categories. Comparisons with DirectJudge models reveal significant performance advantages, underscoring the effectiveness of RRMs in utilizing test-time compute for complex queries.

In reward-guided best-of-N inference, RRMs outperform all baseline models even without additional test-time compute, and majority voting further improves results across the evaluated subsets. Additionally, post-training experiments demonstrate consistent downstream performance improvements on MMLU-Pro and GPQA.
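The two test-time scaling knobs mentioned above, majority voting over repeated judgments and reward-guided best-of-N selection, could be sketched as follows. This is a simple sequential champion-versus-challenger loop of my own devising, not necessarily the selection procedure used in the paper, and it again assumes the stochastic judge() helper from the earlier sketch.

```python
from collections import Counter

def majority_vote(query, response_a, response_b, judge, k=5):
    """Spend more test-time compute by sampling k independent judgments
    and returning the majority verdict (requires a stochastic judge)."""
    votes = Counter(judge(query, response_a, response_b) for _ in range(k))
    return votes.most_common(1)[0][0]

def best_of_n(query, candidates, judge, k=5):
    """Reward-guided best-of-N: keep a running champion and replace it
    whenever a challenger wins the majority-voted pairwise comparison."""
    champion = candidates[0]
    for challenger in candidates[1:]:
        if majority_vote(query, champion, challenger, judge, k) == "B":
            champion = challenger
    return champion
```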

Conclusion

The introduction of RRMs marks a significant step in the evolution of reward modeling in LLMs. By performing explicit reasoning before reward assignment, RRMs address computational inflexibility in existing models. This approach allows for the development of complex reasoning capabilities without relying on explicit reasoning traces as supervision. The adaptability of RRMs in practical applications highlights their potential as a robust alternative to traditional scalar reward models.

For more information, check out the Paper and Models on Hugging Face. All credit for this research goes to the researchers of this project.