Gelato-30B-A3B: A State-of-the-Art Grounding Model for GUI Computer-Use Tasks
Researchers from ML Foundations have introduced Gelato-30B-A3B, a grounding model designed to enhance AI agents’ ability to find and click specific on-screen elements based on natural language instructions. This model is trained on the Click 100k dataset and demonstrates significant accuracy improvements over previous models, including GTA1-32B and larger vision-language models like Qwen3-VL-235B-A22B-Instruct.
Understanding the Target Audience
The primary audience for Gelato-30B-A3B includes:
- AI researchers and developers looking for advanced grounding models.
- Business managers interested in implementing AI solutions for GUI tasks.
- Technical teams focused on enhancing user interaction with software applications.
Key pain points for this audience include:
- Difficulty in achieving reliable AI interactions with diverse graphical user interfaces.
- Challenges in integrating AI models into existing workflows.
- Need for improved accuracy in AI-driven tasks to enhance productivity.
Their goals involve:
- Implementing AI solutions that can accurately interpret user commands.
- Reducing the time and effort required for software navigation.
- Enhancing user experience through seamless AI interactions.
Interests include:
- Latest advancements in AI and machine learning.
- Practical applications of AI in business environments.
- Data-driven insights into user behavior and software usage.
Preferred communication methods are likely to be:
- Technical documentation and research papers.
- Webinars and online tutorials.
- Community forums and discussions on platforms like GitHub and Reddit.
What Gelato-30B-A3B Does in an Agent Stack
Gelato-30B-A3B is a 31B-parameter mixture-of-experts model fine-tuned from Qwen3-VL-30B-A3B-Instruct. Given a screenshot and a textual instruction, it outputs a single click coordinate. In an agent stack it acts as a modular grounding component: a planner model such as GPT-5 decides the high-level actions, and Gelato resolves each described target to a precise click across operating systems and applications.
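A minimal sketch of calling such a grounding component is shown below, assuming a Hugging Face-style checkpoint and chat template; the hub identifier, prompt format, and output parsing are illustrative assumptions rather than Gelato's documented interface.
```python
# Sketch only: hub id, prompt template, and coordinate parsing are assumptions.
import re
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "mlfoundations/Gelato-30B-A3B"  # hypothetical hub identifier
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, device_map="auto")

def ground(screenshot_path: str, instruction: str) -> tuple[int, int]:
    """Map a screenshot plus a natural-language instruction to one (x, y) click."""
    image = Image.open(screenshot_path)
    messages = [{"role": "user",
                 "content": [{"type": "image"}, {"type": "text", "text": instruction}]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
    generated = model.generate(**inputs, max_new_tokens=32)
    answer = processor.batch_decode(
        generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
    # Assume the model answers with pixel coordinates, e.g. "(412, 388)".
    x, y = map(int, re.findall(r"-?\d+", answer)[:2])
    return x, y
```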
Click 100k: A Targeted Dataset for GUI Grounding
The Click 100k dataset underpins Gelato. Each sample pairs a computer screen image with a natural language instruction, the bounding box of the target element, the image dimensions, and a normalized bounding box. Instructions are low-level commands, such as “tap on the element between Background and Notifications options,” each tied to a precisely defined target region.
This dataset is constructed by filtering and unifying multiple public sources, including:
- ShowUI
- AutoGUI
- PC Agent E
- WaveUI
- OS Atlas
- UGround
- PixMo Points
- SeeClick
- UI VISION
- JEDI subset focusing on spreadsheet and text cell manipulation
Each source contributes at most 50k samples, all mapped into a shared schema. The research team applies an aggressive filtering pipeline so that only relevant, accurately annotated samples remain.
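For intuition, a record in the unified schema might look roughly like the following; the field names and values are illustrative, not the dataset's exact column names.
```python
# Illustrative Click 100k-style record in a shared schema; field names are assumptions.
sample = {
    "image": "screenshots/settings_page.png",      # computer screen image
    "instruction": "tap on the element between Background and Notifications options",
    "image_width": 1920,
    "image_height": 1080,
    "bbox": [812, 430, 1104, 478],                  # target element in pixels (x1, y1, x2, y2)
    "bbox_norm": [0.4229, 0.3981, 0.5750, 0.4426],  # same box normalized by image size
    "source": "ShowUI",                             # originating public dataset
}
```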
GRPO Training on Top of Qwen3-VL
Gelato-30B-A3B is trained with GRPO (Group Relative Policy Optimization), a reinforcement learning algorithm. Training initializes from Qwen3-VL-30B-A3B-Instruct and runs for 100 reinforcement learning steps on 32 A100 40 GB GPUs. The best checkpoint is selected by mean performance across benchmarks, achieving:
- 63.88% accuracy on ScreenSpot-Pro
- 67.19% on OSWorld-G
- 73.40% on OSWorld-G Refined
A simple refusal prompting strategy further improves the OSWorld-G scores, raising them to:
- 69.15% on OSWorld-G
- 74.65% on OSWorld-G Refined
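As a rough illustration of how GRPO scores sampled clicks in a training stage like the one described above: the exact reward design is not detailed here, so the binary in-box reward and group-relative normalization below are assumptions following the usual GRPO recipe, not confirmed details of Gelato's implementation.
```python
# Sketch of a GRPO-style update signal for click grounding (assumed recipe).
import numpy as np

def click_reward(pred: tuple[int, int] | None, bbox: tuple[int, int, int, int]) -> float:
    """1.0 if the sampled click lands inside the target box, else 0.0 (None = refusal)."""
    if pred is None:
        return 0.0
    x, y = pred
    x1, y1, x2, y2 = bbox
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """GRPO normalizes each sample's reward against its own group of rollouts."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-6)

# Four sampled clicks for one instruction; two hit the target box.
bbox = (812, 430, 1104, 478)
samples = [(900, 450), (50, 50), (1100, 470), None]
advantages = group_relative_advantages([click_reward(p, bbox) for p in samples])
```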
End-to-End Agent Results on OSWorld
When integrated into the GTA1.5 agent framework, Gelato-30B-A3B improves performance on real-world tasks. In this setup, GPT-5 serves as the planner while Gelato provides grounding; a conceptual sketch of this loop follows the results below. The agent achieves:
- 58.71% automated success rate on OSWorld tasks
- 61.85% success rate under human evaluation
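Conceptually, the planner/grounder split works like the loop below; every helper is a stub standing in for the GTA1.5 harness, the GPT-5 planner call, and the Gelato grounding call, so treat it as a sketch rather than either project's actual interface.
```python
# Illustrative planner + grounder control loop; all helpers are stubs, not real APIs.
from dataclasses import dataclass

@dataclass
class Step:
    action: str             # "click", "type", or "done"
    target: str = ""        # natural-language description of the element to click
    text: str = ""          # text to type when action == "type"

def capture_screen() -> bytes: ...                       # stub: grab a screenshot
def plan_next_step(task: str, shot: bytes) -> Step: ...  # stub: GPT-5 planner decision
def gelato_ground(shot: bytes, target: str): ...         # stub: Gelato returns an (x, y) click
def click(x: int, y: int) -> None: ...                   # stub: OS-level mouse click
def type_text(text: str) -> None: ...                    # stub: OS-level keyboard input

def run_episode(task: str, max_steps: int = 15) -> None:
    """The planner chooses high-level actions; Gelato resolves click targets to pixels."""
    for _ in range(max_steps):
        shot = capture_screen()
        step = plan_next_step(task, shot)
        if step.action == "done":
            break
        if step.action == "click":
            x, y = gelato_ground(shot, step.target)
            click(x, y)
        elif step.action == "type":
            type_text(step.text)
```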
Key Takeaways
Gelato-30B-A3B sets a new state of the art for GUI grounding, surpassing previous models such as GTA1-32B and larger vision-language models. Training on the Click 100k dataset, combined with GRPO reinforcement learning, substantially improves grounding accuracy and downstream agent performance.
For further exploration, visit the GitHub repository for tutorials, code, and notebooks.