TinyClick: Single-Turn Agent for Empowering GUI Automation

2024-10-09

Pawel Pawlowski, Krystian Zawistowski, Wojciech Lapacz, Adam Wiacek, Marcin Skorupa, Sebastien Postansque, Jakub Hoscilowicz

Abstract

We present a UI agent for user interface (UI) interaction tasks, built on the Florence-2-Base vision-language model. The agent's primary task is to identify the screen coordinates of the UI element that corresponds to the user's command. It demonstrates very strong performance on Screenspot and OmniAct annotations while remaining very small (0.27B parameters) and keeping latency minimal. Moreover, training requires a small compute budget of 56 GPU-hours (worth about 40 USD). A notable improvement comes from vision-specific multi-task training and MLLM-based data augmentation. We hope that the reduced need for expensive compute resources and manually annotated data will facilitate more inclusive and sustainable research on UI agents.
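
To make the agent's core task concrete, the following is a minimal sketch of how a model's text output could be turned into a click position. It is not taken from the paper: it assumes the output encodes coordinates as Florence-2-style location tokens (`<loc_0>` to `<loc_999>`, quantized into 1000 bins), and the helper name `loc_tokens_to_click` is hypothetical.

```python
import re

# Assumption: coordinates are quantized into 1000 bins, Florence-2 style.
NUM_BINS = 1000


def loc_tokens_to_click(text, width, height, num_bins=NUM_BINS):
    """Convert location tokens in `text` to an (x, y) pixel click position.

    Hypothetical helper: four tokens are read as a box (x1, y1, x2, y2)
    and its centre is returned; two tokens are read as a point (x, y).
    """
    bins = [int(v) for v in re.findall(r"<loc_(\d+)>", text)]
    if len(bins) >= 4:
        x1, y1, x2, y2 = bins[:4]
        bx, by = (x1 + x2) / 2, (y1 + y2) / 2
    elif len(bins) >= 2:
        bx, by = bins[0], bins[1]
    else:
        raise ValueError("no location tokens found in model output")
    # Scale from bin space to pixel space for the given screenshot size.
    return int(bx * width / num_bins), int(by * height / num_bins)


# Example: a box predicted on a 1920x1080 screenshot.
print(loc_tokens_to_click("<loc_100><loc_200><loc_300><loc_400>", 1920, 1080))
# prints (384, 324)
```

Returning the box centre is one simple design choice for a single-turn click agent; a different convention would be needed if the model emitted points only or a different quantization.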

Tasks

Data Augmentation, GPU, Language Modeling
