
Reinforcement Learning Makes LLMs Search-Savvy: Ant Group Researchers Introduce SEM to Optimize Tool Usage and Reasoning Efficiency

Recent advancements in large language models (LLMs) demonstrate their capability to perform complex reasoning tasks and efficiently utilize external tools such as search engines. A significant challenge remains in teaching models when to rely on internal knowledge versus when to perform a search. While prompt-based methods can guide model behavior, LLMs often struggle with nuanced decision-making, such as recognizing when an initial search has produced inaccurate results and when to initiate a new search.

Reinforcement learning (RL) has been explored to enhance these abilities by rewarding effective use of search tools. However, traditional RL approaches may lead to excessive tool usage, with models executing redundant searches for simple queries, indicating inefficiencies that need to be addressed.

Various RL strategies, including Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO), have been employed to align LLM behavior with human preferences. PPO balances exploration against policy stability through clipped policy updates, DPO optimizes the model directly on preference pairs without a separately trained reward model, and GRPO scores groups of responses sampled for the same prompt and computes advantages relative to the group rather than relying on a learned value function. In parallel, treating LLMs as autonomous agents that plan and execute multi-step reasoning tasks has gained traction, with frameworks such as AutoGPT and LangChain illustrating how agents can refine their outputs through iterative reasoning and search. Existing agent systems, however, often rely on fixed prompts or heuristic tool usage, which limits their adaptability and efficiency.
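To make the group-relative idea concrete, here is a minimal sketch of how GRPO-style advantages are commonly computed: several responses are sampled for the same prompt, each is scored, and each score is normalized against the statistics of its own group. The function below is illustrative and not taken from the paper.

```python
import numpy as np

def grpo_advantages(group_rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: each sampled response's reward is normalized
    against the mean and standard deviation of its own group (one group per prompt)."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four responses sampled for one question, scored mostly on correctness.
rewards = [1.0, 0.0, 1.0, 0.2]
print(grpo_advantages(rewards))  # responses above the group mean get positive advantage
```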

Researchers at Ant Group introduced SEM, a post-training reinforcement learning framework designed to teach LLMs when to use search tools versus relying on internal knowledge. By training on a balanced dataset, which includes questions that require external retrieval and those answerable from prior knowledge, SEM guides models to issue search requests only when necessary. The framework employs a structured reasoning format and GRPO, rewarding accurate answers without search and penalizing unnecessary tool usage. Results indicate that SEM enhances response accuracy and efficiency, enabling models to better assess when external information is required, thus improving reasoning in complex scenarios.
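The reward structure described above can be sketched in a few lines. The branch values and the helper name below are assumptions for illustration, not the paper's exact reward; the point is that correctness alone is not enough, since an otherwise correct answer earns less when it triggers a search the question did not need.

```python
def sem_style_reward(answer_correct: bool, used_search: bool, search_needed: bool) -> float:
    """Illustrative SEM-style reward (constants are hypothetical):
    full credit when the answer is correct and search usage matches the question,
    reduced credit for a correct answer that wasted a search, nothing otherwise."""
    if not answer_correct:
        return 0.0
    if used_search and not search_needed:
        return 0.5  # correct, but the tool call was unnecessary
    return 1.0      # correct, and search was used only when actually required
```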

To integrate search tools effectively into a model’s reasoning process, SEM uses reinforcement learning to teach the model when and how to search. The training dataset combines MuSiQue (questions that need external information) with MMLU (questions answerable from prior knowledge). Under the GRPO framework, the reward favors accurate and efficient responses, discouraging unnecessary searches and encouraging retrieval when internal knowledge is insufficient. A structured response format with dedicated <think>, <answer>, <search>, and <result> segments standardizes training and sharpens reward assignment, improving both reasoning quality and the model’s decisions about when to search.
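A rough sketch of how such a tagged rollout might be parsed before reward assignment is shown below; the tag names follow the format described above, but the parsing code and the example string are illustrative rather than the authors' released implementation.

```python
import re

def parse_rollout(text: str) -> dict:
    """Pull the reasoning, optional search query, retrieved results, and final
    answer out of a tagged rollout string (tags assumed from the format above)."""
    def grab(tag: str) -> str | None:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        return match.group(1).strip() if match else None

    return {
        "think": grab("think"),
        "search": grab("search"),   # None means the model answered from memory
        "result": grab("result"),
        "answer": grab("answer"),
    }

rollout = "<think>2 + 2 is basic arithmetic, no retrieval needed.</think><answer>4</answer>"
parsed = parse_rollout(rollout)
print(parsed["search"] is None)  # True: no search was issued for this easy question
```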

The study evaluates a model trained to decide when to rely on internal knowledge and when to search externally. Trained on the combined MuSiQue and MMLU data, the model was evaluated on benchmarks including HotpotQA, GSM8K, and MMLU. SEM consistently outperformed baselines such as Naive RAG and ReSearch in both accuracy and search efficiency, cutting unnecessary searches on familiar queries while improving reasoning on unfamiliar ones; case studies and training curves confirm stable learning and sensible decisions about when to search.
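Two quantities matter when comparing such methods: how often the model is right, and how often it reaches for the search tool. A minimal way to summarize both over a batch of scored rollouts might look like the following (the record fields are hypothetical).

```python
def summarize(records: list[dict]) -> dict:
    """Each record is assumed to carry 'correct' (bool) and 'used_search' (bool)."""
    n = len(records)
    return {
        "accuracy": sum(r["correct"] for r in records) / n,
        "search_ratio": sum(r["used_search"] for r in records) / n,
    }

records = [
    {"correct": True,  "used_search": False},  # answered from internal knowledge
    {"correct": True,  "used_search": True},   # retrieval genuinely needed
    {"correct": False, "used_search": True},
]
print(summarize(records))  # accuracy and search ratio are both 2/3 here
```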

In conclusion, SEM is a post-training reinforcement learning framework designed to enhance how large language models utilize external search tools. By training on a dataset that includes MuSiQue and MMLU, the model learns to distinguish between questions it can answer internally and those requiring external information. SEM employs a structured reasoning approach and a reward system that penalizes unnecessary searches while encouraging accurate retrievals. Experiments on benchmarks like HotpotQA, GSM8K, and MMLU demonstrate that SEM reduces redundant searches and improves accuracy, thereby enhancing reasoning efficiency and the intelligent use of external knowledge in LLMs.

Check out the Paper. All credit for this research goes to the researchers of this project.