
Teaching AI to Say ‘I Don’t Know’: A New Dataset Mitigates Hallucinations from Reinforcement Finetuning


Reinforcement finetuning (RFT) employs reward signals to guide large language models (LLMs) toward producing desirable outputs. This method enhances the model’s ability to generate logical and structured responses by reinforcing correct answers. However, a significant challenge remains: ensuring these models can recognize when to refrain from responding, particularly in cases of incomplete or misleading questions.

The issue arises when LLMs, after undergoing reinforcement finetuning, begin to lose their capacity to decline answering unclear or ambiguous queries. Instead of indicating uncertainty, these models often produce confidently stated but incorrect responses. This phenomenon, termed the “hallucination tax,” underscores a growing risk. As models are trained to improve performance, they may also become more prone to hallucinating answers when silence would be more appropriate, especially in high-stakes domains requiring accuracy and trust.

Current training pipelines for LLMs largely neglect refusal behavior. RFT frameworks typically reward only correct answers and penalize incorrect ones, overlooking cases where the only valid response is to give no answer at all. As a result, existing reward schemes never reinforce refusal and push models toward overconfidence. The research shows, for instance, that refusal rates dropped to nearly zero across multiple models after standard RFT, evidence that current training methods do little to curb hallucination.
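To make that gap concrete, here is a minimal sketch of what a refusal-aware reward could look like, assuming binary rewards, a per-example answerability flag, and a simple substring check for refusals; the helper names are illustrative and not taken from the paper.

```python
REFUSAL_PHRASE = "I don't know"

def is_correct(response: str, gold: str) -> bool:
    # Placeholder correctness check (substring match); real RFT pipelines
    # typically use a math-answer parser or verifier here.
    return gold.strip() in response

def rft_reward(response: str, gold: str | None, answerable: bool) -> float:
    """Binary reward that also credits refusal on unanswerable questions."""
    refused = REFUSAL_PHRASE.lower() in response.lower()
    if answerable:
        # Standard RFT branch: only a non-refusing, correct answer is rewarded.
        return 1.0 if (not refused and is_correct(response, gold)) else 0.0
    # The branch missing from typical RFT setups: reward an explicit refusal.
    return 1.0 if refused else 0.0
```

Without that second branch, every signal a model receives on an unanswerable question is zero or negative, which is exactly the pressure that drives refusal rates toward zero.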

Researchers from the University of Southern California developed the Synthetic Unanswerable Math (SUM) dataset to tackle this issue. SUM consists of implicitly unanswerable math problems, created by modifying existing questions so that key information is missing or the premises are logically inconsistent. The researchers used DeepScaleR as the base dataset and the o3-mini model to generate high-quality unanswerable questions. The synthetic data is designed to teach models to recognize when a problem lacks sufficient information and to respond accordingly.
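As a rough illustration of that construction step, the sketch below rewrites an answerable problem into an unanswerable variant. The rewriting prompt is paraphrased from the criteria described above, and `call_llm` is a hypothetical stand-in for the o3-mini call; neither is the authors' released pipeline.

```python
REWRITE_PROMPT = """Rewrite the following math problem so that it becomes
unanswerable, either by removing a key piece of information or by introducing
a logical inconsistency. Keep the problem fluent and plausible.

Problem: {problem}
Rewritten problem:"""

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for a chat-completion call (e.g., to o3-mini).
    raise NotImplementedError("plug in an LLM client here")

def make_unanswerable(problem: str) -> dict:
    rewritten = call_llm(REWRITE_PROMPT.format(problem=problem))
    # Unanswerable items carry no gold answer; the training target is refusal.
    return {"question": rewritten, "answer": None, "answerable": False}
```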

The core technique of SUM is to mix answerable and unanswerable problems during training. Questions are altered to become ambiguous or unsolvable while still appearing plausible, and the training prompt instructs models to say “I don’t know” for unanswerable inputs. With SUM making up only 10% of the reinforcement finetuning data, models begin to use inference-time reasoning to assess uncertainty. This lets them refuse more appropriately without compromising performance on solvable problems.
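A minimal sketch of that mixing step follows, reading the 10% figure as SUM’s share of the final training pool; the instruction wording, field names, and sampling scheme are assumptions for illustration, not the paper’s implementation.

```python
import random

SYSTEM_PROMPT = (
    "Solve the problem. If it cannot be answered from the information given, "
    "respond with: I don't know."
)

def build_training_mix(answerable_data, sum_data, sum_share=0.10, seed=0):
    rng = random.Random(seed)
    # Number of SUM items needed so they make up `sum_share` of the final mix.
    n_sum = int(len(answerable_data) * sum_share / (1.0 - sum_share))
    mixed = list(answerable_data) + rng.sample(sum_data, min(n_sum, len(sum_data)))
    rng.shuffle(mixed)
    # Attach the refusal instruction so the model knows when abstaining is valid.
    return [{"system": SYSTEM_PROMPT, **ex} for ex in mixed]
```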

Performance analysis reveals significant improvements. After training with SUM, the Qwen2.5-7B model increased its refusal rate from 0.01 to 0.73 on the SUM benchmark and from 0.01 to 0.81 on the UMWP benchmark. On the SelfAware dataset, refusal accuracy surged from 0.01 to 0.94. Llama-3.1-8B-Instruct showed a similar improvement, with refusal rates rising from 0.00 to 0.75 on SUM and from 0.01 to 0.79 on UMWP. Despite these gains in refusal behavior, accuracy on answerable datasets such as GSM8K and MATH-500 remained stable, with most changes between 0.00 and -0.05. The minimal drop indicates that refusal training can be added without a significant sacrifice in task performance.
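For reference, refusal rate and answerable accuracy of the kind quoted above can be computed with a helper along these lines; the record fields and substring-based checks are assumptions, not the benchmarks’ official scorers.

```python
REFUSAL_PHRASE = "i don't know"

def evaluate(records):
    # Split the evaluation set by the answerability flag.
    unanswerable = [r for r in records if not r["answerable"]]
    answerable = [r for r in records if r["answerable"]]
    # Refusal rate: fraction of unanswerable items the model declines to answer.
    refusal_rate = sum(
        REFUSAL_PHRASE in r["response"].lower() for r in unanswerable
    ) / max(len(unanswerable), 1)
    # Accuracy: fraction of answerable items whose response contains the gold answer.
    accuracy = sum(
        r["gold"] in r["response"] for r in answerable
    ) / max(len(answerable), 1)
    return {"refusal_rate": refusal_rate, "accuracy": accuracy}
```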

This study highlights a clear trade-off between improved reasoning and trustworthiness. While reinforcement finetuning is a powerful tool, it often suppresses cautious behavior. The SUM dataset addresses this by teaching models to recognize their limitations. With only a small addition to training data, language models can better identify the boundaries of their knowledge. This approach represents a significant advancement in making AI systems not only smarter but also more careful and honest.

Check out the Paper and Dataset on Hugging Face. All credit for this research goes to the researchers of this project.
