UniFa: A unified feature hallucination framework for any-shot object detection
Hui Nie, Ruiping Wang, Xilin Chen
Abstract
Any-shot object detection seeks to simultaneously detect base (many-shot), few-shot, and zero-shot categories. The primary challenge lies in the insufficient visual data for rare (few-shot and zero-shot) categories, which hinders effective training. Existing methods alleviate this with visual feature generation, but the quality of the generated features is low, and these methods are limited to the zero-shot object detection task (i.e., covering only zero-shot categories). This mainly arises because the semantic information used for feature generation is trained on unimodal data and thus lacks visual awareness, and because the generated features of different categories are overly distinct from one another. To tackle these issues, we introduce the Unified Feature Hallucination (UniFa) framework, which generates high-quality features for both kinds of rare categories. Utilizing CLIP’s text encoder, we transform category names into visually aware semantic embeddings for generating visual features, facilitating better visual-semantic alignment. A semantically blended feature enhancer merges features from any two categories, producing denser and more realistic features. The effectiveness of our approach is confirmed through extensive experiments on the MSCOCO dataset.
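To make the two ingredients of the abstract concrete, the following is a minimal PyTorch sketch of the general idea: category names are encoded with CLIP's text encoder (trained on image-text pairs, hence visually grounded), a conditional generator hallucinates region features from these embeddings, and a mixup-style blend between two categories densifies the synthetic feature space. Note that the prompt template, generator architecture, feature dimension, and blending rule below are illustrative assumptions for exposition, not UniFa's exact design.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Step 1: visually-aware semantic embeddings from category names.
# CLIP's text encoder is trained on image-text pairs, so its embeddings
# carry visual grounding, unlike word vectors trained on text alone.
category_names = ["cat", "umbrella", "giraffe"]  # example rare categories
tokens = clip.tokenize([f"a photo of a {c}" for c in category_names]).to(device)
with torch.no_grad():
    sem = model.encode_text(tokens).float()        # (C, 512)
    sem = sem / sem.norm(dim=-1, keepdim=True)     # unit-normalize

# Step 2 (hypothetical stand-in for the feature generator): hallucinate
# visual features for a category from its semantic embedding plus noise.
feat_dim = 1024  # assumed detector RoI-feature dimension
generator = torch.nn.Sequential(
    torch.nn.Linear(sem.shape[1] + 64, 2048),
    torch.nn.LeakyReLU(0.2),
    torch.nn.Linear(2048, feat_dim),
).to(device)

def hallucinate(cat_idx: int, n: int = 8) -> torch.Tensor:
    """Generate n synthetic region features for one category."""
    z = torch.randn(n, 64, device=device)          # noise for diversity
    cond = sem[cat_idx].expand(n, -1)              # repeat class embedding
    return generator(torch.cat([cond, z], dim=1))

# Step 3 (hypothetical sketch of the "semantically blended feature
# enhancer"): convexly mix features of any two categories so the synthetic
# feature space is denser and less sharply separated per class.
def blend(idx_a: int, idx_b: int, n: int = 8) -> torch.Tensor:
    lam = torch.rand(n, 1, device=device)          # per-sample mixing weight
    return lam * hallucinate(idx_a, n) + (1 - lam) * hallucinate(idx_b, n)

mixed = blend(0, 1)   # features blended between "cat" and "umbrella"
print(mixed.shape)    # torch.Size([8, 1024])
```

In this reading, the blended features would train the detector's classifier alongside real base-category features; how UniFa supervises the blend (e.g., with interpolated labels or a separate enhancer network) is not specified in the abstract, so the convex mixing above is only one plausible instantiation.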