
Image-text Retrieval

Papers

Showing 201-225 of 248 papers

Title | Status | Hype
CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval | - | 0
Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark | Code | 0
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | Code | 5
Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences for Image-Text Retrieval | - | 0
Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation | - | 0
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning | - | 0
Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval | - | 0
SwAMP: Swapped Assignment of Multi-Modal Pairs for Cross-Modal Retrieval | - | 0
FILIP: Fine-grained Interactive Language-Image Pre-Training | Code | 1
Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences for Image-Text Retrieval | Code | 0
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | Code | 1
Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval | - | 0
Multi-stage Pre-training over Simplified Multimodal Pre-training Models | Code | 0
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | Code | 1
Dynamic Modality Interaction Modeling for Image-Text Retrieval | Code | 1
Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training | - | 0
CoSMo: Content-Style Modulation for Image Retrieval With Text Feedback | Code | 1
A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval | Code | 1
Learning Relation Alignment for Calibrated Cross-modal Retrieval | Code | 1
Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval | - | 0
Playing Lottery Tickets with Vision and Language | - | 0
Continual learning in cross-modal retrieval | - | 0
UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training | - | 0
LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval | Code | 1
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning | Code | 2

No leaderboard results yet.