
CMU Researchers Introduce Go-Browse: A Graph-Based Framework for Scalable Web Agent Training

Understanding the Target Audience

The primary audience for this research includes AI practitioners, business analysts, and decision-makers in technology firms, particularly those focused on automation and web technologies. Their pain points revolve around the limitations of existing digital agents in navigating complex web interfaces, leading to inefficiencies in task completion. Their goals include enhancing the performance and scalability of web agents while reducing the time and resources required for data collection. They have a strong interest in innovative methodologies and are likely to prefer concise, technical communication that highlights practical applications and results.

Why Web Agents Struggle with Dynamic Web Interfaces

Digital agents designed for web environments aim to automate tasks such as navigating pages, clicking buttons, or submitting forms. These agents depend on interpreting browser data and simulating user interactions to fulfill specified tasks. Success in this domain requires an accurate understanding of dynamic content and the ability to provide adaptable responses, as web interfaces vary widely and continually evolve. While pretrained language models have shown proficiency in other areas, their performance in GUI-based web tasks remains limited, primarily due to the complexities and variability of web pages.

Challenges of Data Collection for Web Agents at Scale

A significant challenge arises from the agents’ limited understanding of the environments in which they are expected to operate. Pretrained models often falter when interacting with unfamiliar or complex interfaces. Unlike static datasets, real-world web environments demand continuous decision-making in response to layout differences and shifting user flows, complicating the agents’ ability to reliably find specific products or complete online forms. Although human-curated data could offer guidance, its collection is labor-intensive and cannot scale to meet the diversity of real-world web scenarios.

Review of Past Approaches: Interaction-First vs. Instruction-First Methods

Researchers have explored various methods of collecting data to train these agents. The interaction-first approach lets an agent explore websites under broad instructions, with its activities later labeled by another model. While this can produce deeper exploration, it often leads to redundant behavior that limits data diversity. Conversely, the instruction-first method generates specific tasks for an agent based on the content of a single web page. Although more focused, these tasks are frequently anchored to visible content and may not be feasible, especially when based on hallucinated elements.
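The contrast between the two paradigms can be sketched as two loops. All names here (`agent`, `label_model`, `task_model` and their methods) are hypothetical stand-ins, not interfaces from any of the cited systems:

```python
def interaction_first(agent, label_model, site, budget=3):
    """Interaction-first: explore under a broad instruction, label afterwards.

    Exploration is unconstrained, so trajectories can go deep but may
    revisit the same behaviors; labels are assigned post hoc by a model.
    """
    trajectory = agent.explore(site, instruction="browse freely", steps=budget)
    return [(label_model.describe(step), step) for step in trajectory]

def instruction_first(task_model, agent, page):
    """Instruction-first: generate page-grounded tasks, then attempt them.

    Tasks are anchored to one page's visible content; a task referencing a
    hallucinated element simply fails at attempt time.
    """
    tasks = task_model.propose(page)
    return [(task, agent.attempt(page, task)) for task in tasks]
```

The key design difference: the first loop labels behavior after the fact, while the second commits to a task up front and only learns whether it was feasible by attempting it.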

Introducing Go-Browse: Structured Graph-Based Web Exploration

Researchers from Carnegie Mellon University have introduced Go-Browse to address these limitations through a structured exploration strategy. Rather than relying on generic exploration or static task prompts, Go-Browse treats data collection as a graph traversal problem. It iteratively builds a graph of visited URLs, using this structure to explore both previously discovered and new pages. This approach allows the agent to reset to known pages and branch out, reducing redundancy while boosting data variety. Each exploration phase proposes and verifies tasks on a selected page, ensuring that only feasible tasks generate training data.
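The graph-traversal framing described above can be sketched as a breadth-first loop over discovered URLs. This is a minimal illustration, not Go-Browse's implementation; `get_links` and `propose_tasks` are hypothetical callbacks standing in for the agent-driven exploration and task-proposal steps:

```python
from collections import deque

def explore_site(start_url, get_links, propose_tasks, max_pages=100):
    """Iteratively build a graph of visited URLs, proposing tasks per page.

    The agent can "reset" to any already-discovered page (the frontier)
    and branch out from there, which avoids re-exploring the same paths.
    """
    graph = {}    # url -> outgoing links discovered from that page
    tasks = {}    # url -> tasks proposed on that page
    frontier = deque([start_url])
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        graph[url] = list(get_links(url))       # discover neighboring pages
        tasks[url] = list(propose_tasks(url))   # propose page-local tasks
        for nxt in graph[url]:
            if nxt not in visited:
                frontier.append(nxt)
    return graph, tasks
```

Because the frontier is persisted across iterations, each new round of exploration starts from known pages rather than from scratch, which is what reduces redundancy while widening coverage.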

How Go-Browse Works: Modular Architecture for Exploration and Validation

Go-Browse operates through multiple modules. The NavExplorer module proposes navigational tasks that connect to new pages: acting as a web agent, it interacts dynamically with each page to identify links leading to unexplored URLs. Simultaneously, the PageExplorer proposes local tasks for the current page. The FeasibilityChecker module tests these tasks using strong pretrained agents and vision-language models to verify that the proposed actions can be completed. Tasks that pass this check are labeled feasible and added to the dataset. The Solvers module then samples additional task completions, both from intermediate states along verified trajectories and from initial states, using lower-cost models to maximize data generation while conserving resources.
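The flow through these modules on a single page can be sketched as follows. The article names the modules (NavExplorer, PageExplorer, FeasibilityChecker, Solvers) but not their interfaces, so the signatures below are assumptions for illustration only:

```python
from dataclasses import dataclass

@dataclass
class Task:
    page: str
    description: str

def collect_page_data(url, nav_explorer, page_explorer,
                      feasibility_checker, solver):
    """One illustrative collection step on a single page.

    Module roles follow the article: proposal, verification with a strong
    model, then cheaper resampling of completions for verified tasks.
    """
    # Propose navigational tasks (toward new pages) and page-local tasks.
    proposals = list(nav_explorer(url)) + list(page_explorer(url))
    # A strong pretrained agent verifies which proposals are actually doable;
    # infeasible (e.g. hallucinated) tasks are filtered out here.
    feasible = [t for t in proposals if feasibility_checker(t)]
    # Lower-cost solver models then sample extra completions per kept task.
    trajectories = [solver(t) for t in feasible]
    return feasible, trajectories
```

The division of labor is the point: expensive models are spent only on the verification bottleneck, while cheap models generate the bulk of the trajectories.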

WebArena Evaluation: Go-Browse Surpasses Previous Baselines

The research team evaluated Go-Browse on the WebArena benchmark, known for its difficulty in evaluating GUI-based agents. They collected a dataset comprising approximately 10,000 successful task trajectories and 17,000 unsuccessful ones across 100 unique URLs. Fine-tuning the Qwen-2.5-7B-Instruct model on this dataset produced a task success rate of 21.7%. This exceeded GPT-4o-mini by 2.4 percentage points and the prior best sub-10B-parameter model, NNetNav, by 2.9 percentage points. Against the baseline human success rate of 78%, this leaves substantial room for improvement but marks a clear advance.
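As a quick sanity check on the reported margins (assuming the quoted deltas are percentage points on WebArena success rate, as the comparison implies):

```python
go_browse = 21.7                          # fine-tuned Qwen-2.5-7B-Instruct (%)
gpt4o_mini = round(go_browse - 2.4, 1)    # score implied by the quoted margin
nnetnav = round(go_browse - 2.9, 1)       # prior sub-10B best, implied
human = 78.0                              # reported human baseline
gap_to_human = round(human - go_browse, 1)
```

So the implied baselines sit around 19.3% (GPT-4o-mini) and 18.8% (NNetNav), with roughly a 56-point gap still separating the best model from human performance.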

Why Structured Exploration Boosts Web Agent Intelligence

The research identifies a critical issue—digital agents struggle to understand complex web environments. The proposed method, Go-Browse, addresses this by implementing a structured yet flexible strategy that combines navigation, task planning, and trajectory validation. By treating exploration as a graph traversal task and employing modular verification and sampling, the approach delivers scalable and diverse training data. These contributions yield measurable performance gains, demonstrating the promise of structured exploration for developing more intelligent web agents.

Conclusion

Go-Browse, developed by Carnegie Mellon researchers, is a structured exploration framework that enhances the training of web-based digital agents. By framing exploration as a graph traversal task, it enables scalable and diverse data collection through systematic website navigation and interaction. Utilizing modular components such as NavExplorer and FeasibilityChecker, it generates high-quality, feasible task trajectories. Evaluations on the WebArena benchmark indicate that Go-Browse-trained models outperform previous sub-10B models and even surpass GPT-4o-mini, underscoring the effectiveness of structured data collection in building robust web agents.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.