Few-Shot Object Detection with Foundation Models

2024-01-01 · CVPR 2024

Guangxing Han, Ser-Nam Lim

Abstract

Few-shot object detection (FSOD) aims to detect objects with only a few training examples. Visual feature extraction and query-support similarity learning are its two critical components. Existing methods typically build on ImageNet pre-trained vision backbones and design sophisticated metric-learning networks for few-shot learning, yet their accuracy remains inferior. In this work, we study few-shot object detection using modern foundation models. First, a vision-only, contrastively pre-trained DINOv2 model serves as the vision backbone and shows strong transfer performance without tuning its parameters. Second, a Large Language Model (LLM) performs contextualized few-shot learning, taking all classes and the query image proposals as input. Language instructions are carefully designed to prompt the LLM to classify each proposal in context. The contextual information includes proposal-proposal relations, proposal-class relations, and class-class relations, which can substantially improve few-shot learning. We comprehensively evaluate the proposed model (FM-FSOD) on multiple FSOD benchmarks, achieving state-of-the-art performance.
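To make "query-support similarity learning" concrete, the following is a minimal sketch of the generic prototype-matching baseline that the abstract contrasts against. It is not FM-FSOD and not the authors' code: the function names (`class_prototypes`, `classify_proposals`) are hypothetical, and the short toy vectors merely stand in for features that a frozen backbone such as DINOv2 would produce. Each class is summarized by the mean of its few support features, and each query proposal is assigned to the class with the highest cosine similarity.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def class_prototypes(support):
    """Average the few support features of each class into one unit-length prototype.

    support: dict mapping class name -> list of feature vectors (the "shots").
    """
    protos = {}
    for name, feats in support.items():
        dim = len(feats[0])
        mean = [sum(f[i] for f in feats) / len(feats) for i in range(dim)]
        protos[name] = l2_normalize(mean)
    return protos

def classify_proposals(proposals, protos):
    """Label each proposal feature with the class of the most similar prototype."""
    results = []
    for feat in proposals:
        q = l2_normalize(feat)
        # Cosine similarity reduces to a dot product because both sides are unit length.
        scores = {name: sum(a * b for a, b in zip(q, p)) for name, p in protos.items()}
        best = max(scores, key=scores.get)
        results.append((best, scores[best]))
    return results

# Toy 2-D features (assumed for illustration): two shots per class, two query proposals.
support = {
    "cat": [[1.0, 0.1], [0.9, 0.0]],
    "dog": [[0.0, 1.0], [0.1, 0.9]],
}
protos = class_prototypes(support)
predictions = classify_proposals([[0.8, 0.2], [0.1, 0.7]], protos)
```

Note that this baseline scores every proposal independently against every class. According to the abstract, FM-FSOD instead prompts an LLM with all classes and all proposals together, so that proposal-proposal, proposal-class, and class-class relations can inform each decision.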

Tasks

Few-Shot Learning, Few-Shot Object Detection, Language Modeling, Large Language Model, Metric Learning, Object Detection
