From Clicking to Reasoning: WebChoreArena Benchmark Challenges Agents with Memory-Heavy and Multi-Page Tasks
Web automation agents are becoming increasingly important in artificial intelligence because they can perform human-like actions in digital environments. These agents interact with websites through Graphical User Interfaces (GUIs), mimicking human behaviors such as clicking, typing, and navigating across web pages. This lets them operate without dedicated Application Programming Interfaces (APIs), which are often limited or entirely unavailable for web applications.
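To make this concrete, the sketch below shows one plausible shape for such a GUI-driven action loop, written in Python with Playwright. It is an illustrative assumption, not part of WebChoreArena or any of the agents discussed here; in particular, `propose_next_action` is a hypothetical stand-in for an LLM policy.

```python
# Minimal sketch of a GUI-driven web agent loop (illustrative only).
# `propose_next_action` is a hypothetical placeholder for an LLM policy.
from playwright.sync_api import sync_playwright

def propose_next_action(page_text: str) -> dict:
    # A real agent would send the observation (page text or accessibility
    # tree) plus the task goal to an LLM and parse the proposed action.
    return {"type": "click", "selector": "a"}

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")           # a simulated site in practice
    for _ in range(5):                          # bounded number of steps
        observation = page.inner_text("body")   # what the agent "sees"
        action = propose_next_action(observation)
        target = page.locator(action["selector"])
        if action["type"] == "click" and target.count() > 0:
            target.first.click()                # GUI action instead of an API call
        elif action["type"] == "type" and target.count() > 0:
            target.first.fill(action.get("text", ""))
    browser.close()
```

The point of the sketch is simply that the agent's only interface to the site is low-level browser actions, which is what makes multi-page, memory-heavy chores hard.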
As the capabilities of large language models (LLMs) evolve, the need for more comprehensive evaluations of these agents becomes apparent. Many tasks performed on websites, such as retrieving data from multiple pages or applying complex rules, demand significant cognitive effort. Most existing benchmarks, however, focus on simplified scenarios and fail to assess how well agents handle real-world, memory-intensive tasks.
Previous benchmarks such as WebArena evaluated agents in general terms but did not challenge them deeply: task diversity and complexity were limited, making it hard to measure performance on work that requires complex decision-making and long-term memory. This gap led researchers from the University of Tokyo to introduce WebChoreArena, an expanded framework designed to provide a more rigorous evaluation of agents.
Overview of WebChoreArena
WebChoreArena features a total of 532 curated tasks distributed across simulated websites. These tasks, grouped into four main types, reflect real-world scenarios that require agents to perform demanding operations (a sketch of a possible task record follows this list):
- Massive Memory: 117 tasks that test agents’ ability to extract and remember large volumes of information.
- Calculation: 132 tasks that involve arithmetic operations based on multiple data points.
- Long-Term Memory: 127 tasks designed to assess agents’ ability to connect information across different web pages.
- Others: 65 tasks covering operations that do not fit neatly into the other categories.
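For illustration, the snippet below shows one plausible way to represent and filter such tasks in Python. The field names (`task_id`, `category`, `sites`, `intent`) are assumptions made for this sketch, not the benchmark's actual schema.

```python
# Hypothetical task records; field names are assumptions, not the
# benchmark's actual format.
from collections import Counter

tasks = [
    {"task_id": 1, "category": "Massive Memory",
     "sites": ["shopping"], "intent": "List every order placed in 2022 ..."},
    {"task_id": 2, "category": "Calculation",
     "sites": ["shopping_admin"], "intent": "Compute total revenue for ..."},
    {"task_id": 3, "category": "Long-Term Memory",
     "sites": ["gitlab", "shopping"], "intent": "Cross-reference info from ..."},
]

# Count tasks per category, mirroring the breakdown above.
print(Counter(t["category"] for t in tasks))

# Select tasks whose information is spread over more than one site.
multi_site = [t for t in tasks if len(t["sites"]) > 1]
```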
Evaluation Insights
In testing the benchmark, the researchers used three prominent large language models, GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro, paired with two advanced web agent frameworks, AgentOccam and BrowserGym. The results highlighted how much harder WebChoreArena is than earlier benchmarks:
- GPT-4o scored only 6.8% accuracy on WebChoreArena, a significant drop from 42.8% on WebArena.
- Even the strongest model, Gemini 2.5 Pro, reached only 44.9%, showing that substantial limitations remain on complex tasks (a sketch of how such a success rate is computed follows below).
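These percentages are task-level success rates: each task is scored pass or fail and the results are averaged. The snippet below is a rough illustration of that arithmetic under an assumed string-match check; it is not the benchmark's actual evaluation code.

```python
# Rough sketch of how a success rate like 6.8% or 44.9% is aggregated.
# The string-match check is an assumption, not the benchmark's evaluator.
def task_passed(agent_answer: str, reference: str) -> bool:
    return agent_answer.strip().lower() == reference.strip().lower()

def accuracy(outcomes: list) -> float:
    return 100.0 * sum(outcomes) / max(len(outcomes), 1)

outcomes = [task_passed("42 orders", "42 orders"),
            task_passed("$1,305", "$1,350")]
print(f"{accuracy(outcomes):.1f}%")  # -> 50.0%
```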
WebChoreArena’s design and testing establish it as a valuable tool for assessing web automation agents. It exposes a clearer performance gradient between models, making it a more informative benchmark for tracking ongoing advances in web agent technologies.
Conclusion
This research underscores the disparity between general browsing proficiency and the advanced cognitive abilities necessary for more complex web-based tasks. By focusing on reasoning, memory, and logic, WebChoreArena fills the existing gap in benchmarks, aiming to equip agents for real-world automation challenges.
For further exploration, refer to the Paper, GitHub Page, and Project Page.