AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities | Nov 12, 2022 | Contrastive Learning; Cross-Modal Retrieval | Code Available | 45
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Jan 30, 2023 | Generative Visual Question Answering; Image Captioning | Code Available | 45
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | May 18, 2023 | 1 Image, 2*2 Stitchi; Action Classification | Code Available | 35
Sigmoid Loss for Language Image Pre-Training | Mar 27, 2023 | Contrastive Learning; Disentanglement | Code Available | 35
Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment | Jan 1, 2024 | cross-modal alignment; Cross-Modal Retrieval | Code Available | 25
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs | Jun 9, 2022 | Image Captioning; Image Classification | Code Available | 25
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing | Jun 20, 2023 | Cross-Modal Retrieval; Image Retrieval | Code Available | 25
Learning Transferable Visual Models From Natural Language Supervision | Feb 26, 2021 | Action Recognition; Benchmarking | Code Available | 25
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | Apr 13, 2020 | Cross-Modal Retrieval; Image Captioning | Code Available | 25
Efficient Remote Sensing with Harmonized Transfer Learning and Modality Alignment | Apr 28, 2024 | Cross-Modal Retrieval; Image Retrieval | Code Available | 25
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers | May 27, 2023 | Image Captioning; Image Retrieval | Code Available | 15
A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval | Jun 4, 2021 | Graph Matching; Image Retrieval | Code Available | 15
A Differentiable Semantic Metric Approximation in Probabilistic Embedding for Cross-Modal Retrieval | Dec 6, 2022 | Cross-Modal Retrieval; Image-text matching | Code Available | 15
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | Jul 16, 2021 | Cross-Modal Retrieval; Grounded language learning | Code Available | 15
Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models | Jun 10, 2025 | Contrastive Learning; Image-text matching | Code Available | 15
FETA: Towards Specializing Foundation Models for Expert Task Applications | Sep 8, 2022 | Domain Generalization; Few-Shot Learning | Code Available | 15
FLAVA: A Foundational Language And Vision Alignment Model | Dec 8, 2021 | Image Retrieval; Image-to-Text Retrieval | Code Available | 15
IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages | Jan 27, 2022 | Cross-Modal Retrieval; Few-Shot Learning | Code Available | 15
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Dec 21, 2023 | Image Retrieval; Image-to-Text Retrieval | Code Available | 15
Learning Relation Alignment for Calibrated Cross-modal Retrieval | May 28, 2021 | Cross-Modal Retrieval; Image-text Retrieval | Code Available | 15
Vision-Language Dataset Distillation | Aug 15, 2023 | Dataset Distillation; image-classification | Code Available | 15
Negative Pre-aware for Noisy Cross-modal Matching | Dec 10, 2023 | Cross-modal retrieval with noisy correspondence; Image-text matching | Code Available | 15
PRIOR: Prototype Representation Joint Learning from Medical Images and Reports | Jul 24, 2023 | Contrastive Learning; Image to text | Code Available | 15
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval | Sep 29, 2023 | Cross-Modal Retrieval; Image-text matching | Code Available | 15
Rethinking Benchmarks for Cross-modal Image-text Retrieval | Apr 21, 2023 | Cross-Modal Retrieval; Image-text Retrieval | Code Available | 15
UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers | Jan 31, 2023 | Image Captioning; Image Classification | Code Available | 15
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training | Mar 11, 2021 | Contrastive Learning; GPU | Code Available | 15
HADA: A Graph-based Amalgamation Framework in Image-text Retrieval | Jan 11, 2023 | Graph Neural Network; Image Retrieval | Code Available | 05
Towards a text-based quantitative and explainable histopathology image analysis | Jul 10, 2024 | image-classification; Image Classification | Code Available | 05
BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval | Jun 14, 2024 | Image Retrieval; Image to text | Code Available | 05
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation | Jul 1, 2021 | Audio to Text Retrieval; Cross-Modal Retrieval | Code Available | 05
Deep Visual-Semantic Alignments for Generating Image Descriptions | Dec 7, 2014 | Cross-Modal Retrieval; Image Captioning | Code Available | 05
Design of the topology for contrastive visual-textual alignment | Sep 5, 2022 | Contrastive Learning; Image-to-Text Retrieval | Code Available | 05
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training | Sep 30, 2022 | Computational Efficiency; Contrastive Learning | Code Available | 05
Exploring Models and Data for Remote Sensing Image Caption Generation | Dec 21, 2017 | Caption Generation; Image-to-Text Retrieval | Code Available | 05
Aligning Multilingual Word Embeddings for Cross-Modal Retrieval Task | Oct 8, 2019 | Cross-Modal Retrieval; Image to text | Code Available | 05
GABInsight: Exploring Gender-Activity Binding Bias in Vision-Language Models | Jul 30, 2024 | Image to text; Image-to-Text Retrieval | Code Available | 05
Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning | May 26, 2024 | Image to text; Image-to-Text Retrieval | Unverified | 00
DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding | Dec 2, 2024 | Caption Generation; Domain Generalization | Unverified | 00
Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization | Sep 26, 2024 | Image to text; Image-to-Text Retrieval | Unverified | 00
A survey on knowledge-enhanced multimodal learning | Nov 19, 2022 | Conditional Image Generation; Factual Visual Question Answering | Unverified | 00
Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval | Jul 29, 2022 | Cross-Modal Retrieval; Data Augmentation | Unverified | 00
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training | Aug 16, 2019 | Image-text matching; Image-text Retrieval | Unverified | 00
ABC: Achieving Better Control of Multimodal Embeddings using VLMs | Mar 1, 2025 | Image to text; Image-to-Text Retrieval | Unverified | 00
Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation | Jan 1, 2025 | image-classification; Image Classification | Unverified | 00
When are Lemons Purple? The Concept Association Bias of Vision-Language Models | Dec 22, 2022 | Attribute; image-classification | Unverified | 00
Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution | May 16, 2025 | Cross-Modal Retrieval; Image to text | Unverified | 00
Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset | May 25, 2022 | Image Captioning; Image Retrieval | Unverified | 00
DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation | Apr 16, 2025 | Contrastive Learning; Image to text | Unverified | 00
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval | Apr 15, 2022 | Contrastive Learning; Cross-Modal Retrieval | Unverified | 00