Referring Transformer: A One-step Approach to Multi-task Visual Grounding

2021-06-06 · NeurIPS 2021 · Code Available

Muchen Li, Leonid Sigal

Code

github.com/ubc-vision/RefTR
Official · PyTorch · ★ 67

Abstract

As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension/segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or segmentation (RES) either suffer from limited performance, due to a two-stage setup, or require designing complex task-specific one-stage architectures. In this paper, we propose a simple one-stage multi-task framework for visual grounding tasks. Specifically, we leverage a transformer architecture, where two modalities are fused in a visual-lingual encoder. In the decoder, the model learns to generate contextualized lingual queries which are then decoded and used to directly regress the bounding box and produce a segmentation mask for the corresponding referred regions. With this simple but highly contextualized model, we outperform state-of-the-art methods by a large margin on both REC and RES tasks. We also show that a simple pre-training schedule (on an external dataset) further improves the performance. Extensive experiments and ablations illustrate that our model benefits greatly from contextualized information and multi-task training.
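
The data flow described in the abstract (joint encoding of visual and language tokens, a language-derived query, then box and mask heads) can be sketched roughly as below. This is a toy illustration and not the authors' RefTR code: the class name `TinyGroundingSketch`, the 256-dimensional visual features, the mean-pooled query, the layer sizes, and the dot-product mask head are all assumptions made for the example, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn


class TinyGroundingSketch(nn.Module):
    """Hypothetical toy model: fused encoder -> language query -> box and mask heads."""

    def __init__(self, d_model=64, nhead=4, vocab_size=1000, visual_dim=256):
        super().__init__()
        # Assumed inputs: flattened backbone features and integer token ids.
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.word_embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=128, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerDecoderLayer(
            d_model, nhead, dim_feedforward=128, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.box_head = nn.Linear(d_model, 4)         # (cx, cy, w, h), normalized
        self.mask_head = nn.Linear(d_model, d_model)  # embedding compared to visual tokens

    def forward(self, visual_feats, token_ids):
        # visual_feats: (B, N, visual_dim); token_ids: (B, T)
        vis = self.visual_proj(visual_feats)
        txt = self.word_embed(token_ids)
        n_vis = vis.shape[1]

        # Both modalities are fused in a single encoder over the concatenated sequence.
        fused = self.encoder(torch.cat([vis, txt], dim=1))  # (B, N + T, d_model)

        # Assumption: one query per expression, built by pooling the fused text tokens.
        query = fused[:, n_vis:].mean(dim=1, keepdim=True)  # (B, 1, d_model)
        decoded = self.decoder(query, fused)                 # (B, 1, d_model)

        # Multi-task outputs from the same decoded query.
        boxes = self.box_head(decoded).sigmoid().squeeze(1)  # (B, 4)
        mask_embed = self.mask_head(decoded)                 # (B, 1, d_model)
        mask_logits = torch.einsum(
            "bqd,bnd->bqn", mask_embed, fused[:, :n_vis]
        ).squeeze(1)                                         # (B, N), one logit per visual token
        return boxes, mask_logits


model = TinyGroundingSketch().eval()
with torch.no_grad():
    boxes, mask_logits = model(torch.randn(2, 49, 256), torch.randint(0, 1000, (2, 7)))
```

Sharing one decoded query between the box head and the mask head is what makes the sketch multi-task; the official repository listed under Code is the reference for how the paper actually does this.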

Tasks

Decoder, Referring Expression, Referring Expression Comprehension, Referring Expression Segmentation, Segmentation, Visual Grounding, Visual Reasoning

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
RefCOCO val	RefTR	Overall IoU	70.56	—	Unverified
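
This page does not define the Overall IoU metric. In the referring segmentation literature it is commonly computed as the total intersection divided by the total union, both accumulated over the whole evaluation set, rather than as the mean of per-sample IoUs. A minimal sketch under that assumption follows; the function name `overall_iou` and the toy masks are hypothetical.

```python
import numpy as np


def overall_iou(pred_masks, gt_masks):
    """Cumulative intersection over cumulative union across all samples."""
    total_inter = 0
    total_union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        total_inter += int(np.logical_and(pred, gt).sum())
        total_union += int(np.logical_or(pred, gt).sum())
    # Guard against an evaluation set with no foreground pixels at all.
    return total_inter / total_union if total_union > 0 else 0.0


# Toy data: one perfect prediction on a large object, one complete miss on a small one.
preds = [[[1, 1], [1, 1]], [[1, 0], [0, 0]]]
gts = [[[1, 1], [1, 1]], [[0, 1], [0, 0]]]
score = overall_iou(preds, gts)
```

Because pixels are pooled before dividing, large objects weigh more than small ones under this definition: the toy data above gives 4/6, whereas averaging the two per-sample IoUs would give 0.5.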
