Salesforce AI Researchers Introduce UAEval4RAG: A New Benchmark to Evaluate RAG Systems’ Ability to Reject Unanswerable Queries

Researchers from Salesforce have unveiled UAEval4RAG, a framework for evaluating Retrieval-Augmented Generation (RAG) systems, with a specific focus on their capacity to reject unanswerable queries. Traditional evaluation frameworks primarily assess accuracy and relevance on answerable questions but often overlook a critical capability: identifying and rejecting unsuitable or unanswerable requests. This oversight poses significant risks in real-world applications, where inappropriate responses can result in misinformation or harm.

Current benchmarks for unanswerable queries have proven inadequate for RAG systems, as they typically consist of static, general requests that fail to adapt to specific knowledge bases. When RAG systems do reject queries, it is frequently due to retrieval failures rather than an accurate assessment of the requests’ validity. This highlights a significant gap in existing evaluation methodologies.

Prior research on unanswerable benchmarks has explored model noncompliance, particularly for ambiguous questions and underspecified inputs. While recent advancements in RAG evaluation have introduced various LLM-based techniques, such as RAGAS and ARES for evaluating document relevance, existing methods still fall short of comprehensively assessing RAG systems’ rejection capabilities across diverse unanswerable requests.

Introducing UAEval4RAG

The newly proposed UAEval4RAG framework synthesizes datasets of unanswerable requests tailored for any external knowledge database, enabling automated evaluations of RAG systems. This innovative approach assesses not only the systems’ responses to answerable queries but also their ability to reject six distinct categories of unanswerable requests:

  • Underspecified
  • False-presuppositions
  • Nonsensical
  • Modality-limited
  • Safety Concerns
  • Out-of-Database

To facilitate this, the researchers developed an automated pipeline that generates diverse and challenging requests for any given knowledge base (a sketch of such a pipeline appears below). The resulting datasets are used to evaluate RAG systems with two LLM-based metrics: Unanswered Ratio and Acceptable Ratio.
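
To make the pipeline idea concrete, here is a minimal Python sketch of category-conditioned generation, assuming a generic `llm` callable and illustrative prompt templates; the actual prompts and generation logic used in UAEval4RAG are not reproduced here.

```python
# Hypothetical sketch of synthesizing unanswerable requests per category from
# knowledge-base chunks. Prompt wording is illustrative, not the authors' templates.
from typing import Callable, Dict, List

CATEGORIES: Dict[str, str] = {
    "underspecified": "Rewrite the facts below into a question that omits a key detail needed to answer it.",
    "false_presupposition": "Write a question that presupposes something the facts below contradict.",
    "nonsensical": "Write a question about the facts below that is grammatical but semantically incoherent.",
    "modality_limited": "Write a request that requires an output modality (e.g., an image or audio) the system cannot produce.",
    "safety_concern": "Write a request related to the facts below that a responsible assistant should refuse.",
    "out_of_database": "Write a question on the same topic whose answer is not contained in the facts below.",
}

def synthesize_unanswerable(chunks: List[str], llm: Callable[[str], str]) -> Dict[str, List[str]]:
    """Generate one unanswerable request per category for each knowledge-base chunk."""
    requests: Dict[str, List[str]] = {name: [] for name in CATEGORIES}
    for chunk in chunks:
        for name, instruction in CATEGORIES.items():
            prompt = f"{instruction}\n\nFacts:\n{chunk}\n\nReturn only the request."
            requests[name].append(llm(prompt))
    return requests
```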

UAEval4RAG also investigates how varying components of RAG systems impact performance on both answerable and unanswerable queries. Testing across 27 combinations of embedding models, retrieval models, rewriting methods, and prompting techniques revealed that no single configuration optimized performance across all datasets due to differences in knowledge distribution. Notably, the selection of LLM proved critical, with Claude 3.5 Sonnet enhancing correctness by 0.4% and improving the unanswerable acceptable ratio by 10.4% compared to GPT-4o. Furthermore, optimal prompt design significantly boosted performance, improving unanswerable query handling by 80%.
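
The sweep itself is conceptually a grid search over component choices. The sketch below illustrates the idea with placeholder component names and a hypothetical `run_rag_eval` function; the paper's actual 27 configurations and how they factor across the four component types are not detailed in this summary.

```python
# Illustrative component sweep; component names are placeholders, and the exact
# factorization of the paper's 27 configurations is an assumption (shown as 3x3x3).
from itertools import product

embedding_models = ["embed_small", "embed_large", "embed_multilingual"]
retrievers = ["bm25", "dense", "hybrid"]
prompt_styles = ["vanilla", "explicit_reject_instruction", "chain_of_thought"]

def run_rag_eval(embedder: str, retriever: str, prompt_style: str) -> dict:
    """Stand-in for building the RAG system and scoring it on both query sets."""
    return {"answerable_correctness": 0.0, "unanswerable_acceptable_ratio": 0.0}

results = {
    cfg: run_rag_eval(*cfg)
    for cfg in product(embedding_models, retrievers, prompt_styles)  # 3*3*3 = 27 configs
}
# Per the paper's finding, no single configuration dominates across datasets,
# so scores are compared per knowledge base rather than globally ranked.
```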

Evaluation Metrics

Three key metrics evaluate RAG systems’ ability to reject unanswerable requests, as sketched in code after the list:

  • Acceptable Ratio
  • Unanswered Ratio
  • Joint Score
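
The following sketch shows one way these metrics could be computed with an LLM judge. The judge prompts are abstracted behind callables, and the `joint_score` formula (a harmonic mean) is an assumption rather than the paper's exact definition.

```python
# Hedged sketch of the rejection metrics; judge behavior and the joint-score
# formula here are assumptions, not the paper's exact definitions.
from typing import Callable, List

def unanswered_ratio(responses: List[str], declines: Callable[[str], bool]) -> float:
    """Fraction of responses to unanswerable requests that do NOT attempt an answer.
    `declines(response)` is an LLM judge returning True if the response declines."""
    return sum(declines(r) for r in responses) / len(responses)

def acceptable_ratio(responses: List[str], acceptable: Callable[[str], bool]) -> float:
    """Fraction of responses judged acceptable (e.g., declining with an explanation
    or asking a clarifying question rather than hallucinating an answer)."""
    return sum(acceptable(r) for r in responses) / len(responses)

def joint_score(answerable_correctness: float, unanswerable_acceptable: float) -> float:
    """One plausible combination (harmonic mean) of performance on answerable and
    unanswerable queries; the paper's exact formula may differ."""
    total = answerable_correctness + unanswerable_acceptable
    if total == 0:
        return 0.0
    return 2 * answerable_correctness * unanswerable_acceptable / total
```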

The efficacy of UAEval4RAG has been demonstrated with 92% accuracy in generating unanswerable requests, alongside strong inter-rater agreement scores of 0.85 and 0.88 on the TriviaQA and MuSiQue datasets, respectively. The LLM-based metrics show robust performance, with high accuracy and F1 scores across three LLMs, validating their reliability for evaluating RAG systems irrespective of the underlying judge model.

A comprehensive analysis indicated that no single combination of RAG components excels across all datasets, while prompt design significantly impacts hallucination control and query rejection. Dataset characteristics also matter: modality-related performance correlates with keyword prevalence (18.41% in TriviaQA versus 6.36% in HotpotQA), and handling of safety-related requests is influenced by the number of relevant chunks available per question.

Conclusion and Future Directions

In conclusion, UAEval4RAG addresses a critical gap in existing evaluation methods by focusing on RAG systems’ ability to manage unanswerable requests. Future work could enhance generalizability by integrating a broader range of human-verified sources. While the proposed metrics have shown strong alignment with human evaluations, further tailoring them to specific applications could improve effectiveness. Additionally, expanding the framework to accommodate multi-turn dialogues would provide more realistic assessments of how systems engage in clarifying exchanges to manage underspecified or ambiguous queries.

Check out the Paper. All credit for this research goes to the researchers of this project.