ESA: External Space Attention Aggregation for Image-Text Retrieval

2023-10-10 · journal 2023 · Code Available

Hongguang Zhu; Chunjie Zhang; Yunchao Wei; Shujuan Huang; Yao Zhao

Abstract

Due to the large gap between the vision and language modalities, effective and efficient image-text retrieval is still an unsolved problem. Recent work pursues retrieval accuracy unilaterally, either through entangled image-text interaction or through brute-force large-scale vision-language pre-training. However, the former often leads to an unacceptable explosion in retrieval time when deployed on large-scale databases. The latter relies heavily on an extra corpus to learn better alignment in the feature space, while obscuring the contribution of the network architecture. In this work, we investigate a trade-off that balances effectiveness and efficiency. To this end, on the premise of efficient retrieval, we propose the plug-and-play External Space attention Aggregation (ESA) module, which enables element-wise fusion of modal features under spatial dimensional attention. Based on flexible spatial awareness, we further propose the Self-Expanding triplet Loss (SEL) to expand the representation space of samples and optimize the alignment of the embedding space. Extensive experiments demonstrate the effectiveness of our method on two benchmark datasets. With identical visual and textual backbones, our single model outperforms the ensemble models of similar methods, and our ensemble model further widens the advantage. Meanwhile, compared with the vision-language pre-training embedding-based method that used 83× as many image-text pairs as ours, our approach not only achieves better performance but is also 3× faster at retrieval. Code and pre-trained models are available at https://github.com/KevinLight831/ESA
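
To illustrate why embedding-based retrieval stays efficient on large databases, the following is a minimal sketch of the general technique, not the ESA implementation from the paper's repository. It assumes each image and each caption has already been encoded into a fixed-length vector. The gallery is normalized once offline, so answering a query costs only one dot product per gallery item, whereas entangled interaction methods must run a joint model on every image-text pair. The function names (`l2_normalize`, `build_index`, `retrieve`) and the toy 3-dimensional vectors are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_index(embeddings):
    """Normalize all gallery embeddings once, ahead of query time."""
    return [l2_normalize(e) for e in embeddings]


def retrieve(query, index, top_k=1):
    """Return the indices of the top_k gallery items by cosine similarity."""
    q = l2_normalize(query)
    scores = [sum(a * b for a, b in zip(q, item)) for item in index]
    ranked = sorted(range(len(index)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Made-up embeddings standing in for the outputs of an image encoder.
image_index = build_index([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])

# Made-up embedding standing in for an encoded caption.
text_query = [0.9, 0.1, 0.0]
top = retrieve(text_query, image_index, top_k=2)
print(top)
```

In this scheme any aggregation module sits inside the encoders, so it changes how the vectors are produced but leaves the retrieval step itself unchanged.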
