Google AI Introduces Gemini 2.5 ‘Computer Use’ (Preview): A Browser-Control Model to Power AI Agents to Interact with User Interfaces
Understanding the Audience
The target audience for Gemini 2.5 Computer Use includes business managers, software developers, and automation specialists looking to enhance productivity through AI-driven solutions. Their pain points typically involve inefficient workflows, high operational costs, and the need for reliable automation tools that integrate with existing systems. Their goals include optimizing processes, reducing manual intervention, and applying AI to complex tasks. They value clear, concise, technical communication with actionable insights and concrete examples.
Overview of Gemini 2.5 Computer Use
Gemini 2.5 Computer Use is a specialized variant of Gemini 2.5 designed to plan and execute real UI actions in a live browser through a constrained action API. The model is currently available in public preview via Google AI Studio and Vertex AI. It is primarily targeted at web automation and UI testing, with documented improvements on standard web/mobile control benchmarks and a safety layer that can require human confirmation for high-risk actions.
Features of the Model
Developers can utilize a new `computer_use` tool that provides function calls such as `click_at`, `type_text_at`, and `drag_and_drop`. The client code executes these actions (for example, using Playwright or Browserbase), captures a fresh screenshot or URL, and continues until the task is completed or a safety rule intervenes. The action space includes 13 predefined UI actions:
- `open_web_browser`
- `wait_5_seconds`
- `go_back`
- `go_forward`
- `search`
- `navigate`
- `click_at`
- `hover_at`
- `type_text_at`
- `key_combination`
- `scroll_document`
- `scroll_at`
- `drag_and_drop`
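The execute-observe loop described above can be sketched in plain Python. This is an illustrative stand-in, not the actual SDK: the action dict shape and the `FakeExecutor` class are assumptions, and a real executor would wrap Playwright or Browserbase calls.

```python
# Illustrative sketch of the client-side agent loop (hypothetical
# response shapes; the real google-genai SDK types differ).

def run_agent_loop(get_model_action, executor, max_steps=20):
    """Repeatedly ask the model for the next UI action, execute it,
    and capture fresh state until the task completes."""
    for _ in range(max_steps):
        action = get_model_action()  # e.g. {"name": "click_at", "args": {...}}
        if action is None:           # model signals the task is finished
            return "done"
        handler = getattr(executor, action["name"], None)
        if handler is None:
            raise ValueError(f"unsupported action: {action['name']}")
        handler(**action.get("args", {}))
        executor.capture_state()     # fresh screenshot / URL for the next turn
    return "max_steps_reached"


class FakeExecutor:
    """Stand-in for a Playwright/Browserbase-backed executor."""
    def __init__(self):
        self.log = []

    def click_at(self, x, y):
        self.log.append(("click_at", x, y))

    def type_text_at(self, x, y, text):
        self.log.append(("type_text_at", x, y, text))

    def capture_state(self):
        self.log.append(("screenshot",))


# Scripted "model" that clicks a field, types a query, then finishes.
actions = iter([
    {"name": "click_at", "args": {"x": 320, "y": 180}},
    {"name": "type_text_at", "args": {"x": 320, "y": 180, "text": "gemini 2.5"}},
    None,
])
executor = FakeExecutor()
status = run_agent_loop(lambda: next(actions), executor)
```

In a production loop, `capture_state` would return the screenshot and URL that get fed back to the model as the next observation.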
This action space can be extended with custom functions such as `open_app`, `long_press_at`, and `go_home` for non-browser actions.
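One way to fold custom actions into the same loop is a dispatch table that extends the predefined set. The registry shape below is an assumption for illustration, not the actual SDK interface; only the action names come from the documentation.

```python
# Minimal action-registry sketch: predefined browser actions and custom
# non-browser actions share one dispatch table (shape assumed, not the
# actual SDK interface).

PREDEFINED = {
    "click_at": lambda args, log: log.append(("click_at", args["x"], args["y"])),
    "navigate": lambda args, log: log.append(("navigate", args["url"])),
}

def register_custom(registry, name, handler):
    """Return a new registry with the custom action added,
    leaving the predefined set untouched."""
    extended = dict(registry)
    extended[name] = handler
    return extended

def dispatch(registry, action, log):
    """Look up and run the handler for a model-proposed action."""
    handler = registry.get(action["name"])
    if handler is None:
        raise ValueError(f"unknown action: {action['name']}")
    handler(action.get("args", {}), log)

# Add a mobile-style custom action alongside the browser ones.
registry = register_custom(
    PREDEFINED, "long_press_at",
    lambda args, log: log.append(("long_press_at", args["x"], args["y"])),
)
log = []
dispatch(registry, {"name": "long_press_at", "args": {"x": 40, "y": 90}}, log)
```

Because custom actions are just extra entries in the same table, the agent loop itself never needs to distinguish browser from non-browser steps.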
Scope and Constraints
The model is optimized primarily for web browsers and is not currently suited to desktop OS-level control. Mobile scenarios can leverage custom actions while remaining within the same execution loop. A built-in safety monitor can block prohibited actions or require user confirmation before executing high-stakes operations such as payments or accessing sensitive records.
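The confirmation requirement means the client must gate execution, not just relay actions. A minimal sketch of that gate, assuming a `safety_decision` field on the proposed action (the field name is hypothetical; the preview API surfaces a comparable signal when confirmation is needed):

```python
# Sketch of gating high-stakes actions on human confirmation.
# The `safety_decision` key is an assumed shape for illustration.

def execute_with_safety(action, log, confirm):
    """Run the action only if it needs no confirmation, or the
    operator explicitly approves it."""
    if action.get("safety_decision") == "require_confirmation":
        if not confirm(action):              # ask the human operator
            log.append(("skipped", action["name"]))
            return False
    log.append(("executed", action["name"]))
    return True

log = []
payment = {"name": "click_at", "safety_decision": "require_confirmation"}
execute_with_safety(payment, log, confirm=lambda a: False)  # operator declines
execute_with_safety({"name": "scroll_document"}, log, confirm=lambda a: True)
```

In practice `confirm` would surface the pending action to a human (UI prompt, ticket, chat message) rather than return a canned answer.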
Performance Metrics
On the official Online-Mind2Web benchmark, the model reports a 69.0% pass rate based on majority-vote human judgments. In Browserbase's matched harness, Gemini 2.5 Computer Use leads other computer-use APIs on both accuracy and latency, scoring 65.7% on Online-Mind2Web and 79.9% on WebVoyager. Performance reflects a latency/quality trade-off: over 70% accuracy at a median latency of roughly 225 seconds in Browserbase runs. For mobile tasks, 69.7% was recorded using the same API loop with custom actions.
Early Production Signals
Google’s payments platform team reports that Gemini 2.5 has recovered over 60% of previously failing automated UI test executions. Additionally, early external tester Poke.com noted that workflows using the model complete approximately 50% faster than with their next-best alternative.
Conclusion
Gemini 2.5 Computer Use is currently in public preview through Google AI Studio and Vertex AI. It features a constrained API with 13 documented UI actions and requires a client-side executor. With state-of-the-art results on web/mobile control benchmarks, this model shows promise for enhancing UI testing and web operations while maintaining user safety through confirmation mechanisms.
Additional Resources
For more technical details, visit the official Google blog post.