VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler
Code
| Repository | Framework | Stars | Notes |
|---|---|---|---|
| github.com/fartashf/vsepp | PyTorch | 521 | Official; referenced in paper |
| github.com/cshizhe/hgr_v2t | PyTorch | 211 | |
| github.com/leolee99/CLIP_ITM | PyTorch | 19 | |
| github.com/salanueva/UniVSE | PyTorch | 10 | |
| github.com/armandvilalta/Full-network-multimodal-embeddings | none | 2 | |
| github.com/kadarakos/mulisera | PyTorch | 0 | |
| github.com/mitjanikolaus/compositional-image-captioning | PyTorch | 0 | |
| github.com/Cadene/recipe1m.bootstrap.pytorch | PyTorch | 0 | |
| github.com/rohitbhaskar/online-ads-repository | PyTorch | 0 | |
| github.com/gorjanradevski/vsepp_tensorflow | TensorFlow | 0 | |
Abstract
We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. That, combined with fine-tuning and use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
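The "simple change" the abstract refers to is replacing the sum over negatives in the standard triplet ranking loss with a max, so that only the hardest in-batch negative for each positive pair contributes to the loss. Below is a minimal PyTorch sketch of such a max-of-hinges loss; the function name, the default margin of 0.2, and the use of in-batch negatives are our assumptions for illustration, not the paper's exact implementation.

```python
import torch

def vse_max_hinge_loss(im, cap, margin=0.2):
    """Max-of-hinges triplet loss with in-batch hard negatives (sketch).
    `im` and `cap` are L2-normalized embeddings of shape (batch, dim);
    row i of each matrix forms a positive image-caption pair."""
    scores = im @ cap.t()                    # similarity s(i, c) for all pairs
    pos = scores.diag().view(-1, 1)          # s(i, c) of the matching pairs

    # hinge costs: captions as negatives (rows), images as negatives (columns)
    cost_cap = (margin + scores - pos).clamp(min=0)
    cost_im = (margin + scores - pos.t()).clamp(min=0)

    # zero out the positive pairs on the diagonal
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_cap = cost_cap.masked_fill(mask, 0)
    cost_im = cost_im.masked_fill(mask, 0)

    # keep only the hardest negative per positive pair, instead of summing
    return cost_cap.max(dim=1)[0].sum() + cost_im.max(dim=0)[0].sum()
```

Summing the hinge terms instead of taking the max recovers the conventional ranking loss; the official repository exposes the analogous switch via a `max_violation` flag on its contrastive loss.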
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| Flickr30K | VSE++ (ResNet) | Image-to-text R@1 | 52.9 | — | Unverified |