HPT++: Hierarchically Prompting Vision-Language Models with Multi-Granularity Knowledge Generation and Improved Structure Modeling
Yubin Wang, Xinyang Jiang, De Cheng, Wenli Sun, Dongsheng Li, Cairong Zhao
Code
- github.com/vill-lab/2024-aaai-hptpytorch
- github.com/ThomasWangY/2024-AAAI-HPTpytorch
Abstract
Prompt learning has become a prevalent strategy for adapting vision-language foundation models (VLMs) such as CLIP to downstream tasks. With the emergence of large language models (LLMs), recent studies have explored the potential of using category-related descriptions to enhance prompt effectiveness. However, conventional descriptions lack the explicit structured information needed to represent the interconnections among key elements, such as entities or attributes, in relation to a particular category. Since existing prompt tuning methods give little consideration to structured knowledge, this paper advocates leveraging LLMs to construct a graph for each description, making such structured knowledge explicit. Consequently, we propose a novel approach called Hierarchical Prompt Tuning (HPT), which simultaneously models both structured and conventional linguistic knowledge. Specifically, we introduce a relationship-guided attention module that captures pair-wise associations among entities and attributes for low-level prompt learning. In addition, by incorporating high-level and global-level prompts that model overall semantics, the proposed hierarchical structure forges cross-level interlinks and empowers the model to handle more complex and long-term relationships. Finally, by enhancing multi-granularity knowledge generation, redesigning the relationship-driven attention re-weighting module, and imposing consistency constraints on the hierarchical text encoder, we propose HPT++, which further improves on HPT. Our experiments cover a wide range of evaluation settings, including base-to-new generalization, cross-dataset evaluation, and domain generalization. Extensive results and ablation studies demonstrate the effectiveness of our methods, which consistently outperform existing state-of-the-art (SOTA) methods.
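The abstract's "relationship-guided attention" can be illustrated with a minimal sketch: standard scaled dot-product self-attention whose logits receive an additive pair-wise bias derived from the LLM-built description graph. This is a hypothetical simplification for intuition only, not the paper's implementation; the function name and the additive form of `rel_bias` are assumptions.

```python
import numpy as np

def relationship_guided_attention(tokens, rel_bias):
    """Single-head self-attention re-weighted by a pair-wise relationship bias.

    tokens:   (n, d) embeddings of entities/attributes from one description
    rel_bias: (n, n) graph-derived scores; rel_bias[i, j] > 0 boosts attention
              from token i to token j, 0 leaves the score unchanged
    """
    d = tokens.shape[1]
    logits = tokens @ tokens.T / np.sqrt(d)      # scaled dot-product scores
    logits = logits + rel_bias                   # graph-derived re-weighting
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens                      # aggregated low-level features
```

Tokens strongly linked in the graph thus attend to each other more, which is the intended effect of prioritizing structured knowledge in low-level prompt learning.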
Benchmark Results
Results below are as reported by the authors; none have been independently verified.

| Dataset | Model | Metric | Reported result |
|---|---|---|---|
| Caltech-101 | HPT++ | Harmonic mean | 96.96 |
| DTD | HPT++ | Harmonic mean | 74.23 |
| EuroSAT | HPT++ | Harmonic mean | 87.36 |
| FGVC-Aircraft | HPT++ | Harmonic mean | 41.33 |
| Food-101 | HPT++ | Harmonic mean | 91.09 |
| ImageNet | HPT++ | Harmonic mean | 74.24 |
| ImageNet-A | HPT++ | Top-1 accuracy % | 51.18 |
| ImageNet-R | HPT++ | Top-1 accuracy % | 77.52 |
| ImageNet-S | HPT++ | Top-1 accuracy % | 49.28 |
| ImageNet V2 | HPT++ | Top-1 accuracy % | 65.31 |
| Oxford 102 Flower | HPT++ | Harmonic mean | 85.85 |
| Oxford-IIIT Pet Dataset | HPT++ | Harmonic mean | 96.91 |
| Stanford Cars | HPT++ | Harmonic mean | 75.59 |
| SUN397 | HPT++ | Harmonic mean | 81.11 |
| UCF101 | HPT++ | Harmonic mean | 83.81 |
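The "Harmonic mean" metric in the table is the standard base-to-new generalization score: the harmonic mean of accuracy on base (seen) classes and new (unseen) classes, which penalizes a large gap between the two. A one-line computation (with illustrative inputs, not values from the table):

```python
def harmonic_mean(base_acc, new_acc):
    """Harmonic mean of base-class and new-class accuracy (%),
    as used in base-to-new generalization benchmarks."""
    return 2 * base_acc * new_acc / (base_acc + new_acc)
```

For example, a model with 80% base accuracy and 60% new-class accuracy scores about 68.6, below the arithmetic mean of 70, reflecting the penalty for imbalance.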