Salesforce AI Research Introduces WALT (Web Agents that Learn Tools): Enabling LLM Agents to Automatically Discover Reusable Tools from Any Website



A team of Salesforce AI researchers has introduced WALT (Web Agents that Learn Tools), a framework that reverse-engineers latent website functionality into reusable, invocable tools. The approach reframes browser automation around callable tools rather than long sequences of clicks. Agents execute operations such as search, filter, sort, post_comment, and create_listing, which reduces reliance on step-by-step LLM reasoning and makes execution more deterministic.

What WALT Builds

Web agents often encounter failures when website layouts shift or when tasks require lengthy sequences. WALT addresses this issue by mining site functionality offline and exposing it as tools that encapsulate navigation, selection, extraction, and optional agentic steps. Each tool carries contracts in the form of schemas and examples. During runtime, an agent composes a short program with a few tool calls to complete a task, aiming for higher success rates with fewer steps and reduced reliance on free-form reasoning.
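To make the idea of a tool with a contract concrete, here is a minimal sketch in Python. It is not WALT's implementation: the `Tool` class, the flat name-to-type schema, and the in-memory `LISTINGS` catalogue are all assumptions chosen so the example runs without a browser. It shows a schema check before execution and a two-call "program" of the kind an agent might compose.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Tool:
    """A callable site function plus its contract (hypothetical structure)."""
    name: str
    schema: Dict[str, type]          # parameter name -> expected type
    run: Callable[..., Any]          # deterministic script behind the tool
    examples: List[Dict[str, Any]] = field(default_factory=list)

    def __call__(self, **kwargs: Any) -> Any:
        # Reject calls that do not match the declared inputs.
        missing = set(self.schema) - set(kwargs)
        extra = set(kwargs) - set(self.schema)
        if missing or extra:
            raise ValueError(
                f"{self.name}: missing={sorted(missing)} extra={sorted(extra)}"
            )
        for key, expected in self.schema.items():
            if not isinstance(kwargs[key], expected):
                raise TypeError(f"{self.name}: {key} must be {expected.__name__}")
        return self.run(**kwargs)


# Stand-in catalogue (assumption) replacing a real website.
LISTINGS = [
    {"title": "road bike", "price": 250},
    {"title": "mountain bike", "price": 400},
    {"title": "bike helmet", "price": 30},
]

search = Tool(
    name="search",
    schema={"query": str},
    run=lambda query: [item for item in LISTINGS if query in item["title"]],
    examples=[{"query": "bike"}],
)

sort_by_price = Tool(
    name="sort",
    schema={"items": list, "descending": bool},
    run=lambda items, descending: sorted(
        items, key=lambda item: item["price"], reverse=descending
    ),
)

# The agent's "program": two tool calls instead of a long click sequence.
results = sort_by_price(items=search(query="bike"), descending=False)
cheapest = results[0]["title"]
```

A call such as `search(query=5)` fails the schema check before any script runs, which is the kind of early rejection a contract is meant to provide.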

Pipeline in Two Phases

The WALT pipeline has two phases: discovery, followed by construction and validation. In the discovery phase, WALT explores a website and proposes tool candidates that map to common goals such as discovery, content management, and communication. In the construction and validation phase, WALT converts traces into deterministic scripts, stabilizes selectors, replaces UI interaction with direct URL operations when possible (URL promotion), induces input schemas, and registers a tool only after end-to-end checks pass. This process shifts as much work as possible into stable URL and form operations, reserving agentic grounding for cases that truly require it.
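The "register only after end-to-end checks pass" step can be sketched as a validation gate. This is an illustrative assumption about how such a gate might look, not the paper's code: `register_if_valid`, `REGISTRY`, and the two candidate scripts are hypothetical names, and the checks here run against in-memory data rather than a live site.

```python
from typing import Callable, Dict, List, Tuple

# One check = (inputs to pass to the script, predicate on its output).
Check = Tuple[dict, Callable[[object], bool]]

REGISTRY: Dict[str, Callable[..., object]] = {}


def register_if_valid(
    name: str, script: Callable[..., object], checks: List[Check]
) -> bool:
    """Register `script` under `name` only if every check passes."""
    for inputs, accept in checks:
        try:
            if not accept(script(**inputs)):
                return False
        except Exception:
            # A script that crashes during validation is never registered.
            return False
    REGISTRY[name] = script
    return True


def filter_by_max_price(items, max_price):
    return [item for item in items if item["price"] <= max_price]


def broken_filter(items, max_price):
    # Deliberately wrong key: raises KeyError during validation.
    return [item for item in items if item["cost"] <= max_price]


sample = [{"price": 10}, {"price": 99}]
checks = [
    ({"items": sample, "max_price": 50}, lambda out: out == [{"price": 10}]),
]

accepted = register_if_valid("filter", filter_by_max_price, checks)
rejected = register_if_valid("filter_v2", broken_filter, checks)
```

Only the candidate that survives its checks ends up in the registry, so the agent never sees a tool that failed validation.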

Results on VisualWebArena and WebArena

On VisualWebArena, WALT achieves an average success rate of 52.9 percent, with 64.1 percent on Classifieds, 53.4 percent on Shopping, and 39.0 percent on Reddit. By comparison, the SGV baseline reports 50.2 percent and ExaCT reports 33.7 percent. Human performance averages 88.7 percent.

On WebArena, WALT reaches an average success rate of 50.1 percent across GitLab, Map, Shopping, CMS, Reddit, and Multi, about nine points above the best skill-induction baseline. Human performance on this benchmark is 78.2 percent.

Efficiency and Ablations

Tools reduce the action count by a factor of approximately 1.4 on average compared to a matched agent without tools. On the Classifieds split, ablations show consistent gains from tools across different agent backbones: WALT with GPT-5 mini records a 7 percent higher success rate and 27 percent fewer steps. A human demonstration strategy yields a 66.0 percent success rate, while fully autonomous WALT reaches 64.1 percent with 5 percent fewer steps than the human demonstration case. Multimodal DOM parsing contributes 2.6 percentage points of absolute improvement, and external verification adds 3.3 points at the cost of additional checks. Overall, WALT records 21.3 percent fewer steps than baseline policies.
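The paragraph above quotes efficiency both as a factor (about 1.4) and as a percentage (21.3 percent fewer steps). These are stated against different comparisons, a matched agent without tools and baseline policies respectively, so they need not coincide. The small sketch below only shows how the two units convert; the helper names are hypothetical.

```python
def percent_fewer(factor: float) -> float:
    """Percent reduction implied by an 'N times fewer actions' factor."""
    return (1 - 1 / factor) * 100


def factor_from_percent(pct: float) -> float:
    """Reduction factor implied by 'pct percent fewer steps'."""
    return 1 / (1 - pct / 100)


implied_percent = percent_fewer(1.4)          # roughly 28.6 percent fewer
implied_factor = factor_from_percent(21.3)    # roughly a 1.27x reduction
```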

Design Choices that Enforce Determinism

WALT prioritizes URL-level operations when the site exposes query parameters or routes for search and filtering. For pages requiring dynamic grounding, the tool script incorporates bounded agentic steps such as content extraction or waiting for page load. Selector stabilization and schema validation help reduce drift when sites change. The method maintains a low fraction of agentic operations in discovered tool sets and favors deterministic actions like navigation, input, and click.
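As a rough illustration of a URL-level operation, a search that would otherwise take several clicks and keystrokes can collapse into one URL built from typed inputs. The host `shop.example`, the `/search` route, and the parameter names `q`, `sort`, and `page` are assumptions for this sketch; a real site's parameters would be discovered during exploration.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit


def search_url(
    base: str, query: str, sort: Optional[str] = None, page: int = 1
) -> str:
    """Build a search URL from typed inputs (hypothetical route and params)."""
    params = {"q": query, "page": page}
    if sort is not None:
        params["sort"] = sort
    # urlencode escapes spaces and reserved characters in the values.
    return f"{base}/search?{urlencode(params)}"


url = search_url("https://shop.example", "red bike", sort="price_asc")

# Round-trip: parsing the URL recovers the original inputs.
recovered = parse_qs(urlsplit(url).query)
```

Because the URL is a pure function of its inputs, the same call always navigates to the same state, which is the determinism the section describes.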

Key Takeaways

Approach: WALT discovers and validates website-native functions, exposing them as callable tools with input schemas, selector stabilization, and URL promotion, thereby reducing brittle step sequences to deterministic operations.
Results on VisualWebArena: Average success rate of 52.9 percent, with 64.1 percent on Classifieds, 53.4 percent on Shopping, and 39.0 percent on Reddit, ahead of SGV (50.2 percent) and ExaCT (33.7 percent).
Results on WebArena: Average success rate of 50.1 percent across GitLab, Map, Shopping, CMS, Reddit, and Multi, showing consistent gains over skill-induction and search-based baselines.
Efficiency and Ablations: Tools reduce the action count by a factor of about 1.4 relative to a matched agent without tools, and WALT takes 21.3 percent fewer steps than baseline policies. Multimodal DOM parsing adds 2.6 percentage points of absolute success, and external verification adds 3.3 points.

Editorial Comments

WALT shifts web agents from step-sequence execution to functionality-grounded tools. The framework reverse-engineers latent website functionality into reusable, invocable tools spanning discovery, content management, and communication. By promoting UI traces to deterministic tools with schema validation and URL operations, WALT raises web agent success to 52.9 percent on VisualWebArena and 50.1 percent on WebArena while using 21.3 percent fewer steps than baseline policies.

