Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

2021-04-28 · ICLR 2022

Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui

Code

github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild
Official · tf · ★ 0
github.com/hanoonaR/object-centric-ovd
pytorch · ★ 297
github.com/dyabel/detpro
pytorch · ★ 188

Abstract

We aim at advancing open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly to further scale up the number of classes contained in existing object detection datasets. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher model to encode category texts and image regions of object proposals. Then we train a student detector, whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher. We benchmark on LVIS by holding out all rare categories as novel categories that are not seen during training. ViLD obtains 16.1 mask AP_r with a ResNet-50 backbone, even outperforming the supervised counterpart by 3.8. When trained with a stronger teacher model ALIGN, ViLD achieves 26.3 AP_r. The model can directly transfer to other datasets without finetuning, achieving 72.2 AP_50 on PASCAL VOC, 36.6 AP on COCO and 11.8 AP on Objects365. On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP. Code and demo are open-sourced at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild.
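
The alignment described in the abstract can be illustrated with a minimal sketch. This is not the released ViLD code: the function names, the temperature value, the choice of cross-entropy for the text branch, and the choice of an L1 distance for the image branch are all assumptions made for illustration. The abstract itself says only that the student's region embeddings are aligned with the teacher's text and image embeddings.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def text_alignment_loss(region_emb, text_embs, label, temperature=0.01):
    """Cross-entropy over temperature-scaled cosine similarities between one
    region embedding and the category text embeddings.
    The temperature value is a hypothetical default, not taken from the source."""
    logits = [cosine(region_emb, t) / temperature for t in text_embs]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def image_alignment_loss(region_emb, teacher_img_emb):
    """L1 distance between the normalized student region embedding and the
    teacher's image embedding of the same proposal (assumed loss form)."""
    s, t = l2_normalize(region_emb), l2_normalize(teacher_img_emb)
    return sum(abs(x - y) for x, y in zip(s, t))


def classify(region_emb, text_embs):
    """Pick the category whose text embedding is most similar to the region."""
    sims = [cosine(region_emb, t) for t in text_embs]
    return max(range(len(sims)), key=sims.__getitem__)


# Toy 3-d embeddings standing in for teacher outputs (hypothetical values).
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
region = [0.9, 0.1, 0.0]
predicted = classify(region, text_embs)
```

Because classification in this sketch is only a similarity lookup against text embeddings, adding a category at inference time amounts to appending one more text embedding to `text_embs`, with no change to the detector weights. That is the property that makes transfer to unseen categories possible in principle.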

Tasks

Image Classification, Knowledge Distillation, Object Detection, Open Vocabulary Image Classification, Open Vocabulary Object Detection, Zero-Shot Image Classification, Zero-Shot Object Detection

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
LVIS v1.0	ViLD-ensemble w/ ALIGN (Eb7-FPN)	AP novel-LVIS base training	26.3	—	Unverified
LVIS v1.0	ViLD-ensemble (R152-FPN)	AP novel-LVIS base training	18.7	—	Unverified
LVIS v1.0	ViLD-ensemble (R50-FPN)	AP novel-LVIS base training	16.6	—	Unverified
LVIS v1.0	ViLD (R50-FPN)	AP novel-LVIS base training	16.1	—	Unverified
MSCOCO	ViLD	AP 0.5	27.6	—	Unverified
Objects365	ViLD	mask AP50	18.2	—	Unverified
