AI Agents Gain Ability to Control User Interfaces
Google has unveiled the Gemini 2.5 Computer Use model, a new AI system designed to interact directly with graphical user interfaces (UIs) through clicks, typing, and scrolling. Built on the foundation of Gemini 2.5 Pro’s advanced visual reasoning, this model enables developers to create agents capable of performing digital tasks across websites and applications — such as filling forms or navigating dashboards — without relying solely on APIs. The model is now available via the Gemini API in Google AI Studio and Vertex AI.
According to Google, Gemini 2.5 Computer Use outperforms competing models on web and mobile automation benchmarks, including Online-Mind2Web, WebVoyager, and AndroidWorld, all while maintaining significantly lower latency. The company says these improvements mark a major step toward building general-purpose AI agents that can interact with computers much like humans do.
How the Model Works
Developers can access the model through the new computer_use tool within the Gemini API. The system operates in a continuous loop: the client sends the model the user's request, a screenshot of the current UI, and a history of recent actions. Gemini analyzes these inputs and outputs an action, such as clicking a button or entering text. After the action is executed, the environment sends a new screenshot back to the model, and the cycle repeats until the task is completed or stopped by a safety mechanism.
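In code, that loop has a simple shape. The sketch below is illustrative only; every function in it is a hypothetical placeholder rather than part of the actual Gemini SDK:

```python
# Illustrative sketch of the computer-use loop. All functions here are
# hypothetical placeholders, not the real Gemini SDK: get_model_action stands
# in for a Gemini API call with the computer_use tool enabled, while
# take_screenshot / execute_action stand in for a browser-control layer.

MAX_STEPS = 30  # guard against runaway loops


def take_screenshot() -> bytes:
    """Placeholder: capture the current UI as PNG bytes (e.g., via Playwright)."""
    raise NotImplementedError


def get_model_action(request: str, screenshot: bytes, history: list) -> dict:
    """Placeholder: send request + screenshot + action history to the model
    and receive its next proposed action."""
    raise NotImplementedError


def execute_action(action: dict) -> None:
    """Placeholder: perform the proposed click/type/scroll in the environment."""
    raise NotImplementedError


def run_agent(request: str) -> None:
    history: list[dict] = []
    for _ in range(MAX_STEPS):
        screenshot = take_screenshot()                        # observe current UI state
        action = get_model_action(request, screenshot, history)
        if action["name"] == "done":                          # model signals completion
            break
        execute_action(action)                                # act on the environment
        history.append(action)                                # fed back on the next turn
```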
The tool supports both standard and custom UI actions, allowing agents to handle interactive elements such as dropdown menus, login fields, and form submissions. The model is currently optimized for web browsers; it has also shown promising results in mobile UI control, though it is not yet tuned for desktop operating systems.
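On the execution side, one plausible approach is to translate the model's proposed actions into Playwright calls, giving a concrete version of the execute_action placeholder from the loop above, here taking an explicit Playwright page handle. The action names and payload fields below (click_at, type_text, and so on) are assumptions for illustration, not the official computer_use action schema:

```python
# Mapping model-proposed actions onto Playwright. The action names and
# payload fields are illustrative assumptions, not the official
# computer_use action schema.
from playwright.sync_api import Page

def execute_action(page: Page, action: dict) -> None:
    name = action["name"]
    if name == "click_at":
        page.mouse.click(action["x"], action["y"])       # coordinates from the model
    elif name == "type_text":
        page.keyboard.type(action["text"])               # types into the focused field
    elif name == "scroll":
        page.mouse.wheel(0, action["delta_y"])           # vertical scroll
    elif name == "navigate":
        page.goto(action["url"])                         # direct URL navigation
    else:
        raise ValueError(f"Unsupported action: {name}")  # surface unknown/custom actions
```

Raising on unrecognized action names keeps silent failures out of the loop and leaves room for custom actions to be wired in deliberately.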
Performance and Benchmarks
Gemini 2.5 Computer Use delivers top-tier performance in browser automation. On Browserbase’s Online-Mind2Web benchmark, it achieved over 70% accuracy at approximately 225 seconds of latency, outperforming leading alternatives in both accuracy and speed. Google’s internal tests also confirmed superior results in mobile automation tasks, solidifying its position as one of the most efficient models in its category.
Early demos show the model executing complex, multi-step instructions, such as gathering data from websites and updating CRM records or organizing virtual sticky notes on collaborative platforms. These examples highlight the model’s ability to parse complex web layouts, interpret dynamic content, and complete workflows autonomously.
Safety and Responsible Deployment
Recognizing the potential risks of allowing AI agents to control computers, Google integrated multiple safety layers directly into the Gemini 2.5 Computer Use model. These include protections against malicious prompts, unauthorized actions, and attempts to bypass security mechanisms such as CAPTCHAs. In addition, an external per-step safety service evaluates each proposed action before it is executed.
Developers can customize safety policies through system instructions, specifying which actions require explicit user approval. Google recommends additional best practices for those integrating the tool into production environments, emphasizing that all implementations should undergo thorough safety testing prior to deployment.
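As a sketch of what such a policy can look like on the client side, the gate below pauses for human approval before sensitive steps. The SENSITIVE set and the action fields are hypothetical stand-ins; in practice the policy itself is expressed through system instructions in the Gemini API:

```python
# Hypothetical human-in-the-loop gate for sensitive actions. The SENSITIVE
# set and action fields are illustrative assumptions; the real policy is
# configured through system instructions per Google's documentation.
SENSITIVE = {"submit_form", "make_purchase", "send_email"}

def approved_by_user(action: dict) -> bool:
    """Block until a human explicitly approves a sensitive step."""
    prompt = f"Agent wants to run {action['name']}. Allow? [y/N] "
    return input(prompt).strip().lower() == "y"

def gated_execute(page, action: dict) -> None:
    if action["name"] in SENSITIVE and not approved_by_user(action):
        raise RuntimeError("Action rejected by user")  # stop the loop safely
    execute_action(page, action)  # dispatcher from the earlier sketch
```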
Early Adoption and Use Cases
Google’s internal teams have already deployed the model for UI testing and workflow automation, notably within the Firebase Testing Agent and AI Mode in Search. Early external adopters, including Poke.com and Autotab, report major improvements in task reliability and execution speed, with some noting results up to 50% faster than other automation models.
One Google payments team found the model capable of “rehabilitating” over 60% of previously failed automated UI tests by autonomously identifying and correcting issues. These early successes demonstrate the potential of AI-driven UI control to accelerate testing, data processing, and digital workflow management across industries.
Availability
The Gemini 2.5 Computer Use model is now in public preview and can be accessed via Google AI Studio or Vertex AI. Developers can experiment in Browserbase’s demo environment or build their own automation loops using Playwright or a cloud virtual machine. Documentation and setup guides are available for both individual developers and enterprise users looking to integrate the system into larger production pipelines.
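For a local Playwright setup, the harness can be as small as the following sketch (the viewport size and starting URL are arbitrary choices, not recommendations from the documentation):

```python
# Minimal Playwright harness for a screenshot-in, action-out loop.
# Viewport dimensions and starting URL are arbitrary example choices.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)        # visible browser for debugging
    page = browser.new_page(viewport={"width": 1280, "height": 800})
    page.goto("https://example.com")                   # starting point for the task
    screenshot = page.screenshot()                     # PNG bytes to send to the model
    # ... feed `screenshot` into the agent loop sketched earlier ...
    browser.close()
```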