GLIPv3

2020-02-02 · CVPR 2020

Jiaxing Zhao

Abstract

In open-set object detection, the alignment of visual and text features is one of the most important factors affecting final detection performance. This paper proposes an enhanced language and vision feature fusion module, which includes a multi-level text-image cross-attention, a text-image cross-attention, and an adapted deformable self-attention. In addition, we add deep supervision to the multi-modal training, which is effective for aligning visual and text features. Experimental results show that our method performs remarkably well on the COCO and LVIS datasets. Specifically, our method achieves *** **********
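
To make the text-image cross-attention idea concrete, the following is a minimal single-head sketch in NumPy, not the paper's module. The function names, token counts, feature width, random projection weights, and the residual connection are all assumptions chosen for illustration; the abstract does not specify them, and the multi-level and deformable components are not modeled here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Scaled dot-product cross-attention (hypothetical helper).

    queries: (n_q, d) tokens of one modality, e.g. image features.
    context: (n_c, d) tokens of the other modality, e.g. text features.
    Returns the fused queries and the attention weights.
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # each query row sums to 1
    fused = queries + weights @ v        # residual connection (an assumption)
    return fused, weights

# Toy inputs: 100 image tokens and 12 text tokens, both of width 16 (assumed sizes).
rng = np.random.default_rng(0)
d = 16
image_tokens = rng.normal(size=(100, d))
text_tokens = rng.normal(size=(12, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

# Image tokens attend to text tokens, so each image feature is updated
# with a weighted mix of text features.
fused_image, attn = cross_attention(image_tokens, text_tokens, w_q, w_k, w_v)
```

Swapping the first two arguments gives the opposite direction, where text tokens attend to image tokens. A fusion module in this spirit would typically apply both directions at several feature levels.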