UltraCUA: A Foundation Model for Computer-Use Agents that Bridges the Gap between General-Purpose GUI Agents and Specialized API-Based Agents
Understanding the Target Audience
The target audience for UltraCUA includes AI researchers, business managers, and software developers interested in enhancing the efficiency of computer-use agents. Their pain points often revolve around the limitations of current agents, which are restricted to basic actions like clicking and typing. They seek solutions that improve reliability and reduce the complexity of task execution. This audience values clear communication, technical specifications, and peer-reviewed data to support decision-making.
Overview of UltraCUA
Computer-use agents have traditionally been limited to primitive actions such as clicking, typing, and scrolling. Long chains of these primitives are prone to grounding errors and inefficiency. Apple researchers have introduced UltraCUA, a foundation model built around a hybrid action space that lets agents interleave low-level GUI actions with high-level programmatic tool calls. At each step, the model selects the most efficient and reliable action available, improving success rates while reducing the number of steps required to complete tasks.
The Hybrid Action Space
The hybrid action space treats tools as first-class actions: a tool call encapsulates a multi-step operation as a single function with a clear signature and documentation. Clicks and key presses remain available when no programmatic path exists, and the agent learns to alternate between the two modes. This design minimizes cascading errors and reduces the total number of steps needed to complete a task, bridging the gap between GUI-only computer-use agents and tool-centric agent frameworks.
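A minimal sketch of what such a hybrid action space might look like in code. The action and tool names here (e.g. `writer_set_font_size`) are illustrative assumptions, not UltraCUA's actual interface:

```python
from dataclasses import dataclass, field
from typing import Union

# Hypothetical hybrid action space: at each step the policy emits either a
# low-level GUI primitive or a high-level programmatic tool call.

@dataclass
class GuiAction:
    kind: str            # "click", "type", "scroll", ...
    x: int = 0
    y: int = 0
    text: str = ""

@dataclass
class ToolCall:
    name: str            # e.g. "writer_set_font_size" (illustrative name)
    args: dict = field(default_factory=dict)

Action = Union[GuiAction, ToolCall]

def describe(action: Action) -> str:
    """Render an action as a single policy-output string."""
    if isinstance(action, ToolCall):
        return f"call {action.name}({action.args})"
    return f"{action.kind} at ({action.x}, {action.y})"

# One multi-step GUI sequence collapsed into a single tool call:
print(describe(ToolCall("writer_set_font_size", {"size": 14})))
# versus a raw GUI primitive:
print(describe(GuiAction("click", x=320, y=118)))
```

The key point the sketch illustrates is that both action types share one output channel, so a single policy can choose between them turn by turn.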
Scaled Tool Acquisition
UltraCUA builds its tool library through an automated pipeline that extracts keyboard shortcuts and commands from software documentation. It integrates open-source implementations from agent toolkits and employs coding agents to synthesize new tools. Each tool serves as a callable interface that simplifies complex GUI sequences. The research team reports coverage across 10 desktop domains with 881 tools, including 135 tools for VS Code and 123 for LibreOffice Writer.
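A callable-interface tool of this kind could be registered as in the sketch below. The tool name, decorator, and keystroke plan are assumptions for illustration; the paper does not specify this API:

```python
# Hypothetical tool registry: each tool is a documented callable that wraps
# a multi-step GUI sequence behind a single named interface.

TOOLS = {}

def tool(name, doc):
    """Register a callable as a named, documented tool."""
    def wrap(fn):
        fn.__doc__ = doc
        TOOLS[name] = fn
        return fn
    return wrap

@tool("vscode_goto_line", "Jump to a line in the active VS Code editor.")
def vscode_goto_line(line: int):
    # A real agent would execute Ctrl+G followed by the line number;
    # this stub just returns the keystroke plan it would run.
    return ["ctrl+g", str(line), "enter"]

print(sorted(TOOLS))                    # ['vscode_goto_line']
print(TOOLS["vscode_goto_line"](42))    # ['ctrl+g', '42', 'enter']
```

A registry like this is what allows shortcuts scraped from documentation and tools synthesized by coding agents to live behind one uniform calling convention.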
Verifiable Synthetic Tasks and Trajectories
Training requires grounded supervision and stable rewards. UltraCUA employs a dual synthetic engine: one pipeline composes atomic verifiers for browsers, files, images, and system states, while another generates context-aligned tasks. This results in 17,864 verifiable tasks across 10 domains, including Chrome, LibreOffice, GIMP, and multi-application workflows.
Multi-Agent Rollout
A multi-agent rollout system produces successful hybrid trajectories: a planner based on OpenAI o3 handles decision-making, while a grounder based on GTA1-7B performs accurate visual localization. The rollout yields approximately 26,800 successful trajectories demonstrating when to call a tool and when to act directly in the GUI; these form the core of the supervised training phase.
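The planner/grounder split can be sketched as below. Both models are stubbed with fixed returns; the step schema and coordinates are illustrative assumptions:

```python
# Hypothetical planner/grounder loop: the planner proposes the next step in
# text, and the grounder resolves element references to screen coordinates.

def planner(observation, goal):
    # Stand-in for the o3 planner: decide the next step toward the goal.
    return {"step": "click", "target": "Save button"}

def grounder(screenshot, target):
    # Stand-in for the GTA1-7B grounder: localize the target element.
    return (512, 88)

def rollout_step(screenshot, goal):
    plan = planner(screenshot, goal)
    if plan["step"] == "click":
        x, y = grounder(screenshot, plan["target"])
        return ("click", x, y)
    return (plan["step"],)

print(rollout_step("screen.png", "save the document"))  # ('click', 512, 88)
```

Separating planning from grounding lets a strong reasoning model drive strategy while a small vision model handles pixel-level localization.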
Training Approach
Training consists of two stages: supervised fine-tuning followed by online reinforcement learning. In Stage 1, models are trained for 3 epochs at a learning rate of 2e-5 on successful trajectories, with loss applied turn-wise to prevent over-weighting early steps. In Stage 2, models are trained for 150 steps of online reinforcement learning at a learning rate of 1e-6 on verified tasks sampled by difficulty. Policy optimization follows a GRPO variant that combines the sparse task-outcome reward with a tool-use term.
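One way such a combined reward could be structured is sketched below. The weighting and the exact form of the tool-use term are assumptions; the paper only states that the two signals are combined:

```python
# Hypothetical reward shaping: a sparse task-outcome reward plus a small bonus
# proportional to the fraction of tool calls that succeed. The 0.1 weight is
# purely illustrative.

def trajectory_reward(task_passed: bool, tool_calls: int, tool_successes: int,
                      tool_weight: float = 0.1) -> float:
    outcome = 1.0 if task_passed else 0.0
    tool_term = tool_weight * (tool_successes / tool_calls) if tool_calls else 0.0
    return outcome + tool_term

print(trajectory_reward(True, tool_calls=4, tool_successes=3))   # 1.075
print(trajectory_reward(False, tool_calls=0, tool_successes=0))  # 0.0
```

Keeping the outcome reward dominant preserves the verifier as the ground-truth signal, while the small tool term nudges the policy toward programmatic actions when they work.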
Results on OSWorld
UltraCUA demonstrates improved success rates at both 7B and 32B scales. Under a 15-step budget, UltraCUA-32B achieves a 41.0% success rate, compared to OpenCUA-32B at 29.7%. The absolute gain is 11.3 points. UltraCUA-7B reaches 28.9%, while UI-TARS-1.5-7B achieves 23.4%. These improvements persist under 50-step budgets, indicating better action selection rather than merely increased attempts.
Cross-Platform Transfer on WindowsAgentArena
UltraCUA is trained solely on Ubuntu-based OSWorld data and evaluated on WindowsAgentArena. UltraCUA-7B achieves a 21.7% success rate, surpassing UI-TARS-1.5-7B at 18.1% and a Qwen2 baseline trained with Windows data at 13.5%. This suggests that hybrid action strategies learned on one platform can effectively transfer to others, highlighting zero-shot platform generalization.
Key Takeaways
- UltraCUA formalizes a hybrid action space that allows a single agent to alternate between GUI primitives and programmatic tool calls, reducing long error-prone action chains.
- The research team scales a reusable tool library through an automated pipeline, paired with a synthetic data engine, yielding over 17,000 verifiable computer-use tasks for training and evaluation.
- Training follows a two-stage approach: supervised fine-tuning on successful hybrid trajectories and online reinforcement learning on verifiable tasks, optimizing the decision-making process between tool calls and GUI actions.
- On OSWorld, UltraCUA reports an average 22% relative improvement over baseline models and 11% fewer steps, indicating gains in reliability and efficiency.
- The 7B model achieves a 21.7% success rate on WindowsAgentArena without Windows-specific training, demonstrating cross-platform generalization of the hybrid action policy.
Editorial Comments
UltraCUA transitions computer-use agents from brittle primitive action chains to a hybrid action policy, integrating GUI primitives with programmatic tool calls to reduce error propagation and step counts. The scalable tool library and synthetic data engine enable effective training and evaluation, leading to significant improvements in task execution efficiency.
Check out the Paper for more details, and explore the GitHub page for tutorials, code, and notebooks.