ServiceNow AI Research Releases DRBench, a Realistic Enterprise Deep-Research Benchmark
Understanding the Target Audience
The primary audience for the DRBench benchmark includes AI researchers, enterprise software developers, data scientists, and business analysts, particularly those focused on improving the capabilities of AI agents in enterprise contexts. Their pain points often revolve around the challenge of accurately synthesizing and reporting information from diverse data sources while minimizing the inclusion of irrelevant or misleading information. These professionals aim to enhance the efficiency and reliability of AI tools to support decision-making processes, and they have a strong interest in standardized evaluation methods for AI systems. They prefer clear, concise communication that is data-driven and technical, allowing for efficient understanding and implementation of new tools.
Overview of DRBench
ServiceNow Research has unveiled DRBench, a benchmark and operational environment specifically designed to assess “deep research” agents tackling open-ended enterprise tasks. These tasks necessitate the synthesis of information from both public web sources and proprietary organizational data, culminating in accurately cited reports. Unlike traditional web-centric testbeds, DRBench immerses agents in complex enterprise workflows that encompass a variety of sources such as files, emails, chat logs, and cloud storage. This multifaceted approach requires agents to effectively retrieve, filter, and attribute insights prior to composing a coherent research report.
Components of DRBench
The initial release of DRBench includes 15 deep research tasks across 10 enterprise domains, such as Sales, Cybersecurity, and Compliance. Each task poses a specific deep research question, contextualizes it within a company and persona, and provides a set of ground-truth insights in three classes: public insights (sourced from stable URLs), internal relevant insights, and internal distractor insights. The dataset was constructed with a combination of large language model (LLM) generation and human verification, and totals 114 ground-truth insights.
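To make the task structure concrete, here is a minimal sketch of what a single DRBench-style task record could look like. The class and field names below are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical representation of one DRBench-style task; names and values are illustrative.
from dataclasses import dataclass, field

@dataclass
class GroundTruthInsight:
    text: str
    category: str          # "public" | "internal_relevant" | "internal_distractor"
    source: str            # stable URL or internal artifact (file, email, chat message)

@dataclass
class DeepResearchTask:
    question: str
    domain: str
    company: str
    persona: str
    insights: list[GroundTruthInsight] = field(default_factory=list)

task = DeepResearchTask(
    question="What factors drove the Q3 decline in EMEA renewals, and what should Sales do next?",
    domain="Sales",
    company="Acme Cloud Inc.",
    persona="Regional Sales Operations Analyst",
    insights=[
        GroundTruthInsight("EMEA churn rose 4% quarter over quarter.",
                           "internal_relevant", "nextcloud://reports/q3_sales_report.docx"),
        GroundTruthInsight("A competitor cut list prices by 10% in September.",
                           "public", "https://example.com/pricing-news"),
        GroundTruthInsight("The APAC team is planning an offsite in November.",
                           "internal_distractor", "mattermost://sales-team"),
    ],
)
```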
Enterprise Environment
A significant feature of DRBench is its containerized enterprise environment that integrates widely-used services behind secure authentication and application-specific APIs. The DRBench Docker image orchestrates the following components:
- Nextcloud (shared documents, WebDAV)
- Mattermost (team chat, REST API)
- Roundcube with SMTP/IMAP (enterprise email)
- FileBrowser (local filesystem)
- A VNC/NoVNC desktop for graphical user interface interactions
Tasks are initialized by distributing data across these services, with documents directed to Nextcloud and FileBrowser, chats routed to Mattermost channels, and emails provisioned through the mail system. Agents can interact with these resources via both web interfaces and programmatic APIs, simulating a "needle-in-a-haystack" environment where relevant and distractor insights coexist within realistic files and communications.
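As an illustration of the programmatic side, a task initializer (or an agent) could seed and query these services over their standard APIs. The snippet below is a minimal sketch assuming a local deployment with default endpoints and hypothetical credentials; it is not DRBench's actual provisioning code.

```python
import requests

# Hypothetical endpoints and credentials for a local DRBench-style deployment.
NEXTCLOUD_URL = "http://localhost:8080"
MATTERMOST_URL = "http://localhost:8065"
NC_AUTH = ("analyst", "password")       # Nextcloud basic auth (placeholder)
MM_TOKEN = "personal-access-token"      # Mattermost personal access token (placeholder)

# Upload a document to Nextcloud via WebDAV (PUT to the files endpoint).
with open("q3_sales_report.docx", "rb") as f:
    requests.put(
        f"{NEXTCLOUD_URL}/remote.php/dav/files/analyst/q3_sales_report.docx",
        data=f,
        auth=NC_AUTH,
    ).raise_for_status()

# Post a chat message carrying an insight (or a distractor) to a Mattermost channel.
requests.post(
    f"{MATTERMOST_URL}/api/v4/posts",
    headers={"Authorization": f"Bearer {MM_TOKEN}"},
    json={"channel_id": "sales-team-channel-id",   # placeholder channel ID
          "message": "EMEA churn rose 4% quarter over quarter."},
).raise_for_status()
```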
Evaluation Criteria
DRBench evaluates agent performance across four key axes aligned with analyst workflows:
- Insight Recall: This metric decomposes the agent’s report into individual cited insights and matches them against the ground-truth insights using an LLM judge (a scoring sketch follows this list).
- Distractor Avoidance: This measures the agent’s ability to exclude distractor insights from its report.
- Factuality: This evaluates the correctness of the insights presented in the report.
- Report Quality: This assesses the structure and clarity of the final report based on a specified rubric.
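To make the first two axes concrete, here is a minimal sketch of how recall and distractor avoidance could be computed once the LLM judge has matched report insights to ground-truth IDs. The functions and formulas are illustrative assumptions; DRBench's exact scoring and matching logic may differ.

```python
# Illustrative scoring for insight recall and distractor avoidance; the
# LLM-judge matching step that produces the ID sets is omitted here.
def insight_recall(matched_relevant_ids, all_relevant_ids):
    """Fraction of ground-truth relevant insights recovered in the report."""
    return len(set(matched_relevant_ids)) / len(all_relevant_ids)

def distractor_avoidance(matched_distractor_ids, all_distractor_ids):
    """Fraction of known distractor insights the report correctly left out."""
    return 1.0 - len(set(matched_distractor_ids)) / len(all_distractor_ids)

# Example: 3 of 5 relevant insights found, 1 of 4 distractors leaked into the report.
print(insight_recall({"r1", "r2", "r3"}, ["r1", "r2", "r3", "r4", "r5"]))   # 0.6
print(distractor_avoidance({"d2"}, ["d1", "d2", "d3", "d4"]))               # 0.75
```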
Baseline Agent and Research Loop
The research team has introduced a task-oriented baseline agent, the DRBench Agent (DRBA), specifically designed to function within the DRBench environment. DRBA consists of four components: research planning, action planning, a research loop incorporating Adaptive Action Planning (AAP), and report writing. The planning phase supports two modes: Complex Research Planning (CRP), which identifies investigation areas and expected sources, and Simple Research Planning (SRP), which generates lightweight sub-queries. The research loop continuously selects tools, processes content (including storage in a vector store), identifies gaps, and iterates until completion or until reaching a specified iteration limit. The report writer synthesizes findings while tracking citations.
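The loop structure described above can be sketched roughly as follows. The interfaces (planner, tools, vector store, LLM helpers) are placeholders invented for illustration, not the official DRBA implementation.

```python
# A minimal, hypothetical sketch of a DRBA-style research loop with placeholder interfaces.
def run_research_loop(question, planner, tools, vector_store, llm, max_iters=10):
    plan = planner.plan(question)               # CRP or SRP: investigation areas / sub-queries
    findings = []
    for _ in range(max_iters):
        sub_query = plan.next_open_item()
        if sub_query is None:                   # no remaining gaps -> stop early
            break
        tool = llm.select_tool(sub_query, tools)        # e.g. web search, Nextcloud, Mattermost, email
        content = tool.fetch(sub_query)
        vector_store.add(content)               # keep retrieved content for later grounding
        insights = llm.extract_insights(sub_query, content)
        findings.extend(insights)
        plan.update_gaps(insights)              # adaptive action planning: revise what is still missing
    return llm.write_report(question, findings, vector_store)   # cited, structured report
```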
Importance for Enterprise Agents
Most existing “deep research” agents perform well on public web question sets; in practice, however, their usefulness depends on accurately finding internal insights while disregarding plausible distractors, under enterprise constraints such as login requirements and user-interface complexity. DRBench addresses these challenges directly by embedding tasks in realistic company/persona contexts, distributing evidence across multiple enterprise applications as well as the web, and scoring agents on how well they extract relevant insights and compile coherent, factual reports. This makes DRBench a useful benchmark for developers seeking comprehensive, end-to-end evaluation of AI systems rather than simplistic, micro-level scoring.
Key Takeaways
- DRBench is designed to evaluate deep research agents on complex, open-ended enterprise tasks that blend public web data with private company information.
- The initial release includes 15 tasks across 10 domains, each tailored to realistic user personas and organizational contexts.
- Tasks engage with diverse enterprise artifacts—productivity software, cloud file systems, emails, and chat—as well as the public web, moving beyond conventional web-only setups.
- Reports are evaluated on insight recall, distractor avoidance, factual accuracy, and overall report quality through rubric-based assessments.
- All code and benchmark assets are available on GitHub, promoting reproducible evaluation and further development.
Conclusion
From an enterprise evaluation perspective, DRBench represents a significant step toward standardized testing of “deep research” agents. Its tasks are grounded in realistic personas, require integrating evidence from both public and private sources, and culminate in coherent, structured reports, which are exactly the workflows production teams prioritize. The clarity of its measurement criteria (recall of relevant insights, factual accuracy, and report quality) further enhances its utility, especially as evaluation moves beyond web-centric setups into the complexities of enterprise environments.
For additional resources, see the DRBench research paper and the GitHub repository for code, tutorials, and notebooks.