AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities Nov 12, 2022 Contrastive Learning Cross-Modal Retrieval
Code Code Available 4BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models Jan 30, 2023 Generative Visual Question Answering Image Captioning
Code Code Available 4ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities May 18, 2023 1 Image, 2*2 Stitchi Action Classification
Code Code Available 3Sigmoid Loss for Language Image Pre-Training Mar 27, 2023 Contrastive Learning Disentanglement
Code Code Available 3Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment Jan 1, 2024 cross-modal alignment Cross-Modal Retrieval
Code Code Available 2Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs Jun 9, 2022 Image Captioning Image Classification
Code Code Available 2RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing Jun 20, 2023 Cross-Modal Retrieval Image Retrieval
Code Code Available 2Learning Transferable Visual Models From Natural Language Supervision Feb 26, 2021 Action Recognition Benchmarking
Code Code Available 2Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks Apr 13, 2020 Cross-Modal Retrieval Image Captioning
Code Code Available 2Efficient Remote Sensing with Harmonized Transfer Learning and Modality Alignment Apr 28, 2024 Cross-Modal Retrieval Image Retrieval
Code Code Available 2CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers May 27, 2023 Image Captioning Image Retrieval
Code Code Available 1A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval Jun 4, 2021 Graph Matching Image Retrieval
Code Code Available 1A Differentiable Semantic Metric Approximation in Probabilistic Embedding for Cross-Modal Retrieval Dec 6, 2022 Cross-Modal Retrieval Image-text matching
Code Code Available 1Align before Fuse: Vision and Language Representation Learning with Momentum Distillation Jul 16, 2021 Cross-Modal Retrieval Grounded language learning
Code Code Available 1Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models Jun 10, 2025 Contrastive Learning Image-text matching
Code Code Available 1FETA: Towards Specializing Foundation Models for Expert Task Applications Sep 8, 2022 Domain Generalization Few-Shot Learning
Code Code Available 1FLAVA: A Foundational Language And Vision Alignment Model Dec 8, 2021 Image Retrieval Image-to-Text Retrieval
Code Code Available 1IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages Jan 27, 2022 Cross-Modal Retrieval Few-Shot Learning
Code Code Available 1InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks Dec 21, 2023 Image Retrieval Image-to-Text Retrieval
Code Code Available 1Learning Relation Alignment for Calibrated Cross-modal Retrieval May 28, 2021 Cross-Modal Retrieval Image-text Retrieval
Code Code Available 1Vision-Language Dataset Distillation Aug 15, 2023 Dataset Distillation image-classification
Code Code Available 1Negative Pre-aware for Noisy Cross-modal Matching Dec 10, 2023 Cross-modal retrieval with noisy correspondence Image-text matching
Code Code Available 1PRIOR: Prototype Representation Joint Learning from Medical Images and Reports Jul 24, 2023 Contrastive Learning Image to text
Code Code Available 1Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval Sep 29, 2023 Cross-Modal Retrieval Image-text matching
Code Code Available 1Rethinking Benchmarks for Cross-modal Image-text Retrieval Apr 21, 2023 Cross-Modal Retrieval Image-text Retrieval
Code Code Available 1UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers Jan 31, 2023 Image Captioning Image Classification
Code Code Available 1WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training Mar 11, 2021 Contrastive Learning GPU
Code Code Available 1Accept the Modality Gap: An Exploration in the Hyperbolic Space Jan 1, 2024 Image to text Image-to-Text Retrieval
— Unverified 0Is Cross-modal Information Retrieval Possible without Training? Apr 20, 2023 Contrastive Learning Cross-Modal Information Retrieval
— Unverified 0DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding Dec 2, 2024 Caption Generation Domain Generalization
— Unverified 0Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization Sep 26, 2024 Image to text Image-to-Text Retrieval
— Unverified 0COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval Apr 15, 2022 Contrastive Learning Cross-Modal Retrieval
— Unverified 0Hierarchical Gumbel Attention Network for Text-based Person Search Oct 10, 2020 Image Retrieval Image to text
— Unverified 0CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? Mar 7, 2024 Image to text Image-to-Text Retrieval
— Unverified 0Robotic State Recognition with Image-to-Text Retrieval Task of Pre-Trained Vision-Language Model and Black-Box Optimization Oct 30, 2024 Image to text Image-to-Text Retrieval
— Unverified 0Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images Mar 13, 2023 Common Sense Reasoning Explanation Generation
— Unverified 0Towards a Visual-Language Foundation Model for Computational Pathology Jul 24, 2023 Contrastive Learning image-classification
— Unverified 0Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning May 26, 2024 Image to text Image-to-Text Retrieval
— Unverified 0SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs Apr 17, 2025 Cross-Modal Retrieval Image Retrieval
— Unverified 0Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration Jun 12, 2025 cross-modal alignment Image to text
— Unverified 0A survey on knowledge-enhanced multimodal learning Nov 19, 2022 Conditional Image Generation Factual Visual Question Answering
— Unverified 0Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval Jul 29, 2022 Cross-Modal Retrieval Data Augmentation
— Unverified 0Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training Aug 16, 2019 Image-text matching Image-text Retrieval
— Unverified 0ABC: Achieving Better Control of Multimodal Embeddings using VLMs Mar 1, 2025 Image to text Image-to-Text Retrieval
— Unverified 0Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation Jan 1, 2025 image-classification Image Classification
— Unverified 0When are Lemons Purple? The Concept Association Bias of Vision-Language Models Dec 22, 2022 Attribute image-classification
— Unverified 0Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution May 16, 2025 Cross-Modal Retrieval Image to text
— Unverified 0Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset May 25, 2022 Image Captioning Image Retrieval
— Unverified 0DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation Apr 16, 2025 Contrastive Learning Image to text
— Unverified 0GABInsight: Exploring Gender-Activity Binding Bias in Vision-Language Models Jul 30, 2024 Image to text Image-to-Text Retrieval
Code Code Available 0