Google AI Introduces Gemini 2.5 ‘Computer Use’ (Preview): A Browser-Control Model to Power AI Agents to Interact with User Interfaces
Understanding the Audience
The target audience for Gemini 2.5 Computer Use includes business managers, software developers, and automation specialists looking to enhance productivity through AI-driven solutions. Their pain points typically involve inefficient workflows, high operational costs, and the need for reliable automation tools that integrate with existing systems. Their goals include optimizing processes, reducing manual intervention, and applying AI to complex tasks. They value clear, concise, technical communication with actionable insights and concrete examples.
Overview of Gemini 2.5 Computer Use
Gemini 2.5 Computer Use is a specialized variant of Gemini 2.5 designed to plan and execute real UI actions in a live browser through a constrained action API. The model is currently available in public preview via Google AI Studio and Vertex AI. It is primarily targeted at web automation and UI testing, with documented improvements on standard web/mobile control benchmarks and a safety layer that can require human confirmation for high-risk actions.
Features of the Model
Developers can utilize a new `computer_use` tool that provides function calls such as `click_at`, `type_text_at`, and `drag_and_drop`. The client code executes these actions (for example, using Playwright or Browserbase), captures a fresh screenshot or URL, and continues until the task is completed or a safety rule intervenes. The action space includes 13 predefined UI actions:
- `open_web_browser`
- `wait_5_seconds`
- `go_back`
- `go_forward`
- `search`
- `navigate`
- `click_at`
- `hover_at`
- `type_text_at`
- `key_combination`
- `scroll_document`
- `scroll_at`
- `drag_and_drop`
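The execute-observe loop described above can be sketched in plain Python. This is an illustrative stand-in, not the actual SDK: the action dict shape and the `FakeExecutor` class are assumptions, and a real executor would wrap Playwright or Browserbase calls.

```python
# Illustrative sketch of the client-side agent loop (hypothetical
# response shapes; the real google-genai SDK types differ).

def run_agent_loop(get_model_action, executor, max_steps=20):
    """Repeatedly ask the model for the next UI action, execute it,
    and capture fresh state until the task completes."""
    for _ in range(max_steps):
        action = get_model_action()  # e.g. {"name": "click_at", "args": {...}}
        if action is None:           # model signals the task is finished
            return "done"
        handler = getattr(executor, action["name"], None)
        if handler is None:
            raise ValueError(f"unsupported action: {action['name']}")
        handler(**action.get("args", {}))
        executor.capture_state()     # fresh screenshot / URL for the next turn
    return "max_steps_reached"


class FakeExecutor:
    """Stand-in for a Playwright/Browserbase-backed executor."""
    def __init__(self):
        self.log = []

    def click_at(self, x, y):
        self.log.append(("click_at", x, y))

    def type_text_at(self, x, y, text):
        self.log.append(("type_text_at", x, y, text))

    def capture_state(self):
        self.log.append(("screenshot",))


# Scripted "model" that clicks a field, types a query, then finishes.
actions = iter([
    {"name": "click_at", "args": {"x": 320, "y": 180}},
    {"name": "type_text_at", "args": {"x": 320, "y": 180, "text": "gemini 2.5"}},
    None,
])
executor = FakeExecutor()
status = run_agent_loop(lambda: next(actions), executor)
```

In a production loop, `capture_state` would return the screenshot and URL that get fed back to the model as the next observation.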
This action space can be extended with custom functions such as `open_app`, `long_press_at`, and `go_home` for non-browser actions.
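One way to fold custom actions into the same loop is a dispatch table that extends the predefined set. The registry shape below is an assumption for illustration, not the actual SDK interface; only the action names come from the documentation.

```python
# Minimal action-registry sketch: predefined browser actions and custom
# non-browser actions share one dispatch table (shape assumed, not the
# actual SDK interface).

PREDEFINED = {
    "click_at": lambda args, log: log.append(("click_at", args["x"], args["y"])),
    "navigate": lambda args, log: log.append(("navigate", args["url"])),
}

def register_custom(registry, name, handler):
    """Return a new registry with the custom action added,
    leaving the predefined set untouched."""
    extended = dict(registry)
    extended[name] = handler
    return extended

def dispatch(registry, action, log):
    """Look up and run the handler for a model-proposed action."""
    handler = registry.get(action["name"])
    if handler is None:
        raise ValueError(f"unknown action: {action['name']}")
    handler(action.get("args", {}), log)

# Add a mobile-style custom action alongside the browser ones.
registry = register_custom(
    PREDEFINED, "long_press_at",
    lambda args, log: log.append(("long_press_at", args["x"], args["y"])),
)
log = []
dispatch(registry, {"name": "long_press_at", "args": {"x": 40, "y": 90}}, log)
```

Because custom actions are just extra entries in the same table, the agent loop itself never needs to distinguish browser from non-browser steps.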
Scope and Constraints
The model is optimized primarily for web browsers and is not currently suited to desktop OS-level control. Mobile scenarios can leverage custom actions while remaining within the same execution loop. A built-in safety monitor can block prohibited actions or require user confirmation before executing high-stakes operations such as payments or accessing sensitive records.
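The confirmation requirement means the client must gate execution, not just relay actions. A minimal sketch of that gate, assuming a `safety_decision` field on the proposed action (the field name is hypothetical; the preview API surfaces a comparable signal when confirmation is needed):

```python
# Sketch of gating high-stakes actions on human confirmation.
# The `safety_decision` key is an assumed shape for illustration.

def execute_with_safety(action, log, confirm):
    """Run the action only if it needs no confirmation, or the
    operator explicitly approves it."""
    if action.get("safety_decision") == "require_confirmation":
        if not confirm(action):              # ask the human operator
            log.append(("skipped", action["name"]))
            return False
    log.append(("executed", action["name"]))
    return True

log = []
payment = {"name": "click_at", "safety_decision": "require_confirmation"}
execute_with_safety(payment, log, confirm=lambda a: False)  # operator declines
execute_with_safety({"name": "scroll_document"}, log, confirm=lambda a: True)
```

In practice `confirm` would surface the pending action to a human (UI prompt, ticket, chat message) rather than return a canned answer.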
Performance Metrics
On the official Online-Mind2Web benchmark, the model reports a 69.0% pass rate based on majority-vote human judgments. In Browserbase's matched harness, Gemini 2.5 Computer Use leads other computer-use APIs on both accuracy and latency, scoring 65.7% on Online-Mind2Web and 79.9% on WebVoyager. Performance reflects a latency/quality trade-off: over 70% accuracy at a median latency of roughly 225 seconds in Browserbase runs. For mobile tasks, 69.7% was recorded using the same API loop with custom actions.
Early Production Signals
Google’s payments platform team reports that Gemini 2.5 has recovered over 60% of previously failing automated UI test executions. Additionally, early external tester Poke.com noted that workflows using the model complete approximately 50% faster than with their next-best alternative.
Conclusion
Gemini 2.5 Computer Use is currently in public preview through Google AI Studio and Vertex AI. It features a constrained API with 13 documented UI actions and requires a client-side executor. With state-of-the-art results on web/mobile control benchmarks, this model shows promise for enhancing UI testing and web operations while maintaining user safety through confirmation mechanisms.
Additional Resources
For more technical details, visit the official Google blog post.