SOTAVerified

TinyClick: Single-Turn Agent for Empowering GUI Automation

2024-10-09Unverified0· sign in to hype

Pawel Pawlowski, Krystian Zawistowski, Wojciech Lapacz, Adam Wiacek, Marcin Skorupa, Sebastien Postansque, Jakub Hoscilowicz

Unverified — Be the first to reproduce this paper.

Reproduce

Abstract

We present an UI agent for user interface (UI) interaction tasks, using Vision-Language Model Florence-2-Base. The agent's primary task is identifying the screen coordinates of the UI element corresponding to the user's command. It demonstrates very strong performance on Screenspot and OmniAct annotations, while maintaining a very small size of 0.27B parameters and minimal latency. Moreover, training needs small compute budget of 56 GPU-hours (worth about 40 USD). Relevant improvement comes from vision-specific multi-task training and MLLM-based data augmentation. We hope that decreased needs for expensive compute resources and manually annotated data will allow to facilitate more inclusive and sustainable research of UI agents.

Tasks

Reproductions