OVMR: Open-Vocabulary Recognition with Multi-Modal References
Zehong Ma, Shiliang Zhang, Longhui Wei, Qi Tian
Code
- github.com/zehong-ma/ovmr (official implementation, PyTorch, ★ 36)
Abstract
The challenge of open-vocabulary recognition lies in the fact that the model has no clue about the new categories it will be applied to. Existing works have proposed different methods to embed category cues into the model, e.g., through few-shot fine-tuning or by providing category names or textual descriptions to Vision-Language Models. Fine-tuning is time-consuming and degrades the generalization capability. Textual descriptions can be ambiguous and fail to depict visual details. This paper tackles open-vocabulary recognition from a different perspective by referring to multi-modal clues composed of textual descriptions and exemplar images. Our method, named OVMR, adopts two innovative components to pursue more robust category cue embedding. A multi-modal classifier is first generated by dynamically complementing textual descriptions with image exemplars. A preference-based refinement module is then applied to fuse uni-modal and multi-modal classifiers, with the aim of alleviating issues caused by low-quality exemplar images or textual descriptions. The proposed OVMR is a plug-and-play module and works well with exemplar images randomly crawled from the Internet. Extensive experiments have demonstrated the promising performance of OVMR, e.g., it outperforms existing methods across various scenarios and setups. Codes are publicly available at https://github.com/Zehong-Ma/OVMR.
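To make the two components concrete, below is a minimal PyTorch sketch of the ideas the abstract describes: building a multi-modal classifier by complementing per-class text embeddings with exemplar image embeddings, and a preference-based fusion of the text-only, vision-only, and multi-modal classifiers. This is an illustrative toy example, not the authors' implementation; the mean-pooling of exemplars, the confidence-based preference weights, and the temperature `tau` are assumptions made here for clarity (the actual OVMR modules are learned; see the repository for the real code).

```python
# Hedged sketch of the two OVMR ideas described in the abstract.
# Assumptions (not from the paper): exemplars are mean-pooled, preference
# weights come from each classifier's confidence on its own class exemplars.
import torch
import torch.nn.functional as F


def build_multimodal_classifier(text_emb, exemplar_emb):
    """text_emb: (C, D) one embedding per class description.
    exemplar_emb: (C, K, D) K exemplar image embeddings per class.
    Returns a (C, D) multi-modal classifier (one weight vector per class)."""
    visual_proto = exemplar_emb.mean(dim=1)              # average the K exemplars
    return F.normalize(text_emb + visual_proto, dim=-1)  # complement text with vision


def preference_fuse(classifiers, exemplar_emb, tau=0.01):
    """Weight each candidate classifier by how confidently it recognizes the
    exemplars of its own class, then average the classifiers with those weights."""
    C, K, D = exemplar_emb.shape
    prefs = []
    for W in classifiers:                                   # each W: (C, D)
        logits = exemplar_emb.reshape(-1, D) @ W.t() / tau  # (C*K, C)
        probs = logits.softmax(dim=-1).reshape(C, K, C)
        # probability assigned to the correct class for each class's own exemplars
        conf = probs[torch.arange(C), :, torch.arange(C)].mean(dim=1)  # (C,)
        prefs.append(conf)
    weights = torch.stack(prefs).softmax(dim=0).unsqueeze(-1)  # per-class preference
    fused = (weights * torch.stack(classifiers)).sum(dim=0)
    return F.normalize(fused, dim=-1)


if __name__ == "__main__":
    C, K, D = 5, 4, 512                                    # toy sizes
    text_cls = F.normalize(torch.randn(C, D), dim=-1)      # e.g. CLIP text features
    exemplars = F.normalize(torch.randn(C, K, D), dim=-1)  # e.g. CLIP image features
    visual_cls = F.normalize(exemplars.mean(dim=1), dim=-1)
    mm_cls = build_multimodal_classifier(text_cls, exemplars)
    final_cls = preference_fuse([text_cls, visual_cls, mm_cls], exemplars)
    query = F.normalize(torch.randn(2, D), dim=-1)         # embeddings of test images
    print((query @ final_cls.t()).argmax(dim=-1))          # predicted class indices
```

Because the fused classifier is just a set of per-class weight vectors in the shared vision-language embedding space, it can be dropped into any pipeline that scores image features against class embeddings, which is what makes the module plug-and-play.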
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| LVIS v1.0 | OVMR | AP_novel (LVIS-base training) | 34.4 | — | Unverified |