Aria-UI: Visual Grounding for GUI Instructions

2024-12-20

Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, Junnan Li

Code

github.com/ariaui/aria-ui
PyTorch, ★ 399

Abstract

Digital agents that automate tasks across different platforms by directly manipulating GUIs are increasingly important. For these agents, grounding language instructions to target elements remains a significant challenge due to reliance on HTML or AXTree inputs. In this paper, we introduce Aria-UI, a large multimodal model specifically designed for GUI grounding. Aria-UI adopts a pure-vision approach, eschewing reliance on auxiliary inputs. To adapt to heterogeneous planning instructions, we propose a scalable data pipeline that synthesizes diverse and high-quality instruction samples for grounding. To handle dynamic contexts during task execution, Aria-UI incorporates textual and text-image interleaved action histories, enabling robust context-aware reasoning for grounding. Aria-UI sets new state-of-the-art results across offline and online agent benchmarks, outperforming both vision-only and AXTree-reliant baselines. We release all training data and model checkpoints to foster further research at https://ariaui.github.io.

Tasks

Natural Language Visual Grounding, Visual Grounding

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
ScreenSpot	Aria-UI	Accuracy (%)	81.1	—	Unverified
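
The accuracy in the table is a percentage over grounding samples. A minimal sketch of how such a score is commonly computed follows, assuming a prediction counts as correct when the predicted click point falls inside the target element's bounding box. This listing does not state the exact protocol, so treat that rule as an assumption; `Box` and `grounding_accuracy` are hypothetical names for illustration and are not taken from the Aria-UI code.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a target GUI element (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Points on the box border count as inside (an assumption).
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(
    points: Sequence[Tuple[float, float]],
    boxes: Sequence[Box],
) -> float:
    """Return the percentage of predicted points that land inside their target box."""
    if len(points) != len(boxes):
        raise ValueError("points and boxes must have the same length")
    if not points:
        raise ValueError("at least one sample is required")
    hits = sum(box.contains(x, y) for (x, y), box in zip(points, boxes))
    return 100.0 * hits / len(points)


# Toy data for illustration only; these are not benchmark samples.
toy_points = [(10, 10), (50, 50), (5, 5), (100, 100)]
toy_boxes = [Box(0, 0, 20, 20), Box(0, 0, 20, 20), Box(0, 0, 20, 20), Box(90, 90, 110, 110)]
toy_score = grounding_accuracy(toy_points, toy_boxes)  # 3 of 4 points hit their box
```

Under this rule, a claimed score of 81.1 would mean that roughly four out of five predicted points land on the intended element.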
