
Salesforce AI Introduces CRMArena-Pro: The First Multi-Turn and Enterprise-Grade Benchmark for LLM Agents

AI agents powered by large language models (LLMs) show significant potential for managing complex business tasks, particularly within Customer Relationship Management (CRM). However, evaluating their effectiveness in real-world settings is difficult because publicly accessible, realistic business data is scarce. Prior benchmarks center on simple, single-turn interactions or narrow applications such as customer service, overlooking broader domains like sales, Configure, Price, Quote (CPQ) processes, and B2B operations. They also rarely assess how agents handle sensitive information, making it hard to gauge LLM agents' performance across the full spectrum of business scenarios and communication styles.

Previous benchmarks have concentrated mainly on customer service tasks in B2C scenarios, disregarding essential business operations such as sales and CPQ processes, along with challenges unique to B2B interactions, like long sales cycles. Many existing benchmarks also fail to simulate realistic multi-turn dialogue and skip expert validation of tasks and environments. Another critical shortcoming is the absence of confidentiality evaluations, which are vital in workplaces where AI agents routinely handle sensitive business and customer data. Without attention to data confidentiality, these benchmarks overlook significant practical concerns around privacy, legal risk, and trust.

Researchers from Salesforce AI Research have developed CRMArena-Pro, a benchmark designed to realistically assess LLM agents in professional business environments. It incorporates expert-validated tasks across customer service, sales, and CPQ, spanning both B2B and B2C contexts, and it rigorously tests multi-turn conversations and confidentiality awareness. Findings indicate that even top-performing models, such as Gemini 2.5 Pro, achieve only about 58% accuracy on single-turn tasks, with performance dropping to roughly 35% in multi-turn scenarios. Workflow Execution is the exception, where Gemini 2.5 Pro surpasses 83% accuracy, yet confidentiality handling remains a substantial challenge for every model assessed.

CRMArena-Pro provides a comprehensive framework for evaluating LLM agents in realistic business contexts covering customer service, sales, and CPQ. The benchmark uses synthetic but structurally accurate enterprise data generated with GPT-4 and grounded in Salesforce schemas, and it simulates business environments through sandboxed Salesforce Organizations. It comprises 19 tasks grouped under four core skills: database querying, textual reasoning, workflow execution, and policy compliance. CRMArena-Pro also incorporates multi-turn dialogues with simulated users and rigorously tests confidentiality awareness. Expert reviewers validated the realism of the data and environment, establishing a reliable testbed for LLM agent performance.
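The paper defines the actual task format; purely as an illustration of how such a benchmark task might be structured, here is a minimal Python sketch. The type name, field names, and example values (TaskSpec, skill, confidential, etc.) are assumptions for clarity, not the benchmark's real schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of a CRMArena-Pro-style task record.
# Names and values are illustrative assumptions, not the benchmark's schema.
@dataclass
class TaskSpec:
    task_id: str
    skill: str              # one of: "database_querying", "textual_reasoning",
                            # "workflow_execution", "policy_compliance"
    business_context: str   # "B2B" or "B2C"
    multi_turn: bool        # whether the agent converses with a simulated user
    query: str              # the (initial) user request
    answer: str             # gold answer used for scoring
    confidential: bool = False  # if True, the correct behavior is to refuse

example = TaskSpec(
    task_id="cpq-0042",
    skill="database_querying",
    business_context="B2B",
    multi_turn=True,
    query="Which open quote for Acme Corp has the highest total price?",
    answer="Q-10387",
)
```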

The evaluation compared leading LLM agents across the 19 business tasks, focusing on task completion and confidentiality awareness. Metrics varied by task type: exact match was used for structured outputs and F1 score for generative responses, while a GPT-4o-based LLM judge assessed whether models appropriately refused to disclose sensitive information. Models with advanced reasoning capabilities, such as Gemini 2.5 Pro and o1, significantly outperformed lighter or non-reasoning variants, particularly on intricate tasks. Performance was broadly similar in B2B and B2C settings, though nuanced trends surfaced depending on model strength. Confidentiality-aware prompts raised refusal rates but sometimes reduced task accuracy, illustrating the trade-off between privacy and performance.
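The article names the metrics but not their implementation. A common way these two metrics are computed, shown here as a sketch rather than the benchmark's exact code, looks like this in Python:

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the gold answer, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1, a standard score for free-form generative answers."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Count tokens shared between prediction and gold (with multiplicity).
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# e.g. token_f1("the quote Q-10387", "Q-10387") == 0.5
```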

In conclusion, CRMArena-Pro marks a significant advance in benchmarking LLM agents on real-world CRM tasks. Comprising 19 expert-reviewed tasks across B2B and B2C scenarios, it covers sales, service, and pricing operations. Although top agents achieved reasonable success on single-turn interactions (around 58%), performance dropped sharply to about 35% in multi-turn conversations. Workflow execution was the least challenging area, while most other skills proved considerably harder. Confidentiality awareness was notably low, and attempts to improve it through prompting often reduced task accuracy. These findings reveal a pronounced gap between current LLM capabilities and enterprise requirements.

For more information, check out the Paper, GitHub Page, Hugging Face Page, and Technical Blog. All credit for this research goes to the researchers involved in this project.
