Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao
Abstract
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks, e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in a zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: https://github.com/microsoft/SoM.
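To make the pipeline concrete, below is a minimal sketch of SoM-style prompting, assuming the `segment_anything` package for region proposals and the OpenAI Python client for the GPT-4V call; the model names, checkpoint path, and mark style (numeric labels at mask centroids) are illustrative assumptions, not the authors' exact configuration, which is available in the official repository at https://github.com/microsoft/SoM.

```python
# Sketch of Set-of-Mark (SoM) prompting: segment an image, overlay numeric
# marks on the regions, then ask a GPT-4V-class model a grounding question.
import base64

import cv2
import numpy as np
from openai import OpenAI
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry


def mark_image(image_rgb: np.ndarray,
               checkpoint: str = "sam_vit_h_4b8939.pth") -> np.ndarray:
    """Partition the image with SAM and draw a numeric mark on each region."""
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)  # assumed checkpoint
    masks = SamAutomaticMaskGenerator(sam).generate(image_rgb)

    marked = image_rgb.copy()
    # Label larger regions first so smaller, foreground regions stay legible.
    for idx, m in enumerate(sorted(masks, key=lambda m: -m["area"]), start=1):
        ys, xs = np.nonzero(m["segmentation"])       # boolean H x W mask
        cx, cy = int(xs.mean()), int(ys.mean())      # rough centroid for the mark
        cv2.putText(marked, str(idx), (cx, cy), cv2.FONT_HERSHEY_SIMPLEX,
                    0.8, (255, 255, 255), 2, cv2.LINE_AA)
    return marked


def ask_gpt4v(marked_rgb: np.ndarray, question: str) -> str:
    """Send the marked image plus a grounding question to a GPT-4V-class model."""
    _, png = cv2.imencode(".png", cv2.cvtColor(marked_rgb, cv2.COLOR_RGB2BGR))
    b64 = base64.b64encode(png.tobytes()).decode()
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable GPT-4 model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{question} Refer to objects by their numeric marks."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```

In this sketch the marks let the model's answer name regions unambiguously (e.g., "the mug is region 3"), which is what enables the zero-shot referring-expression evaluation described in the abstract.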