OVMR: Open-Vocabulary Recognition with Multi-Modal References

2024-06-07 · CVPR 2024 · Code Available

Zehong Ma, Shiliang Zhang, Longhui Wei, Qi Tian


Abstract

The challenge of open-vocabulary recognition lies in that the model has no clue about the new categories it will be applied to. Existing works have proposed different methods to embed category cues into the model, e.g., through few-shot fine-tuning or by providing category names or textual descriptions to Vision-Language Models. Fine-tuning is time-consuming and degrades the generalization capability. Textual descriptions can be ambiguous and fail to depict visual details. This paper tackles open-vocabulary recognition from a different perspective by referring to multi-modal clues composed of textual descriptions and exemplar images. Our method, named OVMR, adopts two innovative components to pursue more robust category-cue embedding. A multi-modal classifier is first generated by dynamically complementing textual descriptions with image exemplars. A preference-based refinement module is then applied to fuse uni-modal and multi-modal classifiers, aiming to alleviate issues of low-quality exemplar images or textual descriptions. The proposed OVMR is a plug-and-play module and works well with exemplar images randomly crawled from the Internet. Extensive experiments demonstrate the promising performance of OVMR, e.g., it outperforms existing methods across various scenarios and setups. Code is publicly available at https://github.com/Zehong-Ma/OVMR.
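The fusion idea from the abstract can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration, not the authors' implementation: it assumes per-category text and exemplar embeddings from a frozen vision-language model such as CLIP, replaces the paper's learned multi-modal classifier generation with simple averaging, and uses the maximum softmax probability as a stand-in preference score for the refinement step; all function and variable names (`build_classifiers`, `fuse_logits`) are hypothetical.

```python
import torch
import torch.nn.functional as F

def build_classifiers(text_emb: torch.Tensor, exemplar_emb: torch.Tensor):
    """Build textual, visual, and multi-modal classifiers.

    text_emb:     (C, D) one embedding per category description.
    exemplar_emb: (C, K, D) K exemplar-image embeddings per category.
    """
    visual_proto = exemplar_emb.mean(dim=1)  # (C, D) average exemplars into a visual prototype
    textual = F.normalize(text_emb, dim=-1)
    visual = F.normalize(visual_proto, dim=-1)
    # Complement the textual classifier with visual evidence; a simple additive
    # stand-in for the paper's learned multi-modal generation module.
    multimodal = F.normalize(textual + visual, dim=-1)
    return [textual, visual, multimodal]

def fuse_logits(query: torch.Tensor, classifiers, temperature: float = 0.01):
    """Preference-based fusion: weight each classifier per sample by how
    peaked (confident) its own prediction is, then mix the logits."""
    q = F.normalize(query, dim=-1)                            # (N, D) query features
    logits = [q @ w.t() / temperature for w in classifiers]   # three (N, C) logit maps
    prefs = torch.stack([l.softmax(-1).amax(-1) for l in logits], dim=-1)  # (N, 3)
    weights = prefs.softmax(dim=-1)                           # per-sample preference weights
    return sum(w.unsqueeze(-1) * l for w, l in zip(weights.unbind(-1), logits))

# Usage with random features standing in for CLIP embeddings:
C, K, D, N = 10, 5, 512, 4
cls = build_classifiers(torch.randn(C, D), torch.randn(C, K, D))
scores = fuse_logits(torch.randn(N, D), cls)  # (N, C) fused class scores
print(scores.argmax(-1))
```

In the paper both the multi-modal classifier generation and the preference weighting are learned modules; the heuristics above only convey the structure of combining uni-modal and multi-modal classifiers per test sample.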

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| LVIS v1.0 | OVMR | AP novel (LVIS base training) | 34.4 | – | Unverified |

Reproductions

No reproductions yet. Be the first to reproduce this paper.