BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Jan 30, 2023 | Generative Visual Question Answering; Image Captioning | Code Available | 4
AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities | Nov 12, 2022 | Contrastive Learning; Cross-Modal Retrieval | Code Available | 4
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | May 18, 2023 | 1 Image, 2*2 Stitchi; Action Classification | Code Available | 3
Sigmoid Loss for Language Image Pre-Training | Mar 27, 2023 | Contrastive Learning; Disentanglement | Code Available | 3
Efficient Remote Sensing with Harmonized Transfer Learning and Modality Alignment | Apr 28, 2024 | Cross-Modal Retrieval; Image Retrieval | Code Available | 2
Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment | Jan 1, 2024 | cross-modal alignment; Cross-Modal Retrieval | Code Available | 2
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing | Jun 20, 2023 | Cross-Modal Retrieval; Image Retrieval | Code Available | 2
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs | Jun 9, 2022 | Image Captioning; Image Classification | Code Available | 2
Learning Transferable Visual Models From Natural Language Supervision | Feb 26, 2021 | Action Recognition; Benchmarking | Code Available | 2
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | Apr 13, 2020 | Cross-Modal Retrieval; Image Captioning | Code Available | 2
Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models | Jun 10, 2025 | Contrastive Learning; Image-text matching | Code Available | 1
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | Dec 21, 2023 | Image Retrieval; Image-to-Text Retrieval | Code Available | 1
Negative Pre-aware for Noisy Cross-modal Matching | Dec 10, 2023 | Cross-modal retrieval with noisy correspondence; Image-text matching | Code Available | 1
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval | Sep 29, 2023 | Cross-Modal Retrieval; Image-text matching | Code Available | 1
Vision-Language Dataset Distillation | Aug 15, 2023 | Dataset Distillation; image-classification | Code Available | 1
PRIOR: Prototype Representation Joint Learning from Medical Images and Reports | Jul 24, 2023 | Contrastive Learning; Image to text | Code Available | 1
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers | May 27, 2023 | Image Captioning; Image Retrieval | Code Available | 1
Rethinking Benchmarks for Cross-modal Image-text Retrieval | Apr 21, 2023 | Cross-Modal Retrieval; Image-text Retrieval | Code Available | 1
UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers | Jan 31, 2023 | Image Captioning; Image Classification | Code Available | 1
A Differentiable Semantic Metric Approximation in Probabilistic Embedding for Cross-Modal Retrieval | Dec 6, 2022 | Cross-Modal Retrieval; Image-text matching | Code Available | 1
FETA: Towards Specializing Foundation Models for Expert Task Applications | Sep 8, 2022 | Domain Generalization; Few-Shot Learning | Code Available | 1
IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages | Jan 27, 2022 | Cross-Modal Retrieval; Few-Shot Learning | Code Available | 1
FLAVA: A Foundational Language And Vision Alignment Model | Dec 8, 2021 | Image Retrieval; Image-to-Text Retrieval | Code Available | 1
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | Jul 16, 2021 | Cross-Modal Retrieval; Grounded language learning | Code Available | 1
A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval | Jun 4, 2021 | Graph Matching; Image Retrieval | Code Available | 1
Learning Relation Alignment for Calibrated Cross-modal Retrieval | May 28, 2021 | Cross-Modal Retrieval; Image-text Retrieval | Code Available | 1
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training | Mar 11, 2021 | Contrastive Learning; GPU | Code Available | 1
Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration | Jun 12, 2025 | cross-modal alignment; Image to text | Unverified | 0
Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution | May 16, 2025 | Cross-Modal Retrieval; Image to text | Unverified | 0
SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs | Apr 17, 2025 | Cross-Modal Retrieval; Image Retrieval | Unverified | 0
DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation | Apr 16, 2025 | Contrastive Learning; Image to text | Unverified | 0
ABC: Achieving Better Control of Multimodal Embeddings using VLMs | Mar 1, 2025 | Image to text; Image-to-Text Retrieval | Unverified | 0
Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation | Jan 1, 2025 | image-classification; Image Classification | Unverified | 0
DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding | Dec 2, 2024 | Caption Generation; Domain Generalization | Unverified | 0
Robotic State Recognition with Image-to-Text Retrieval Task of Pre-Trained Vision-Language Model and Black-Box Optimization | Oct 30, 2024 | Image to text; Image-to-Text Retrieval | Unverified | 0
Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization | Sep 26, 2024 | Image to text; Image-to-Text Retrieval | Unverified | 0
GABInsight: Exploring Gender-Activity Binding Bias in Vision-Language Models | Jul 30, 2024 | Image to text; Image-to-Text Retrieval | Code Available | 0
Towards a text-based quantitative and explainable histopathology image analysis | Jul 10, 2024 | image-classification; Image Classification | Code Available | 0
BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval | Jun 14, 2024 | Image Retrieval; Image to text | Code Available | 0
Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning | May 26, 2024 | Image to text; Image-to-Text Retrieval | Unverified | 0
CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? | Mar 7, 2024 | Image to text; Image-to-Text Retrieval | Unverified | 0
Accept the Modality Gap: An Exploration in the Hyperbolic Space | Jan 1, 2024 | Image to text; Image-to-Text Retrieval | Unverified | 0
Towards a Visual-Language Foundation Model for Computational Pathology | Jul 24, 2023 | Contrastive Learning; image-classification | Unverified | 0
Is Cross-modal Information Retrieval Possible without Training? | Apr 20, 2023 | Contrastive Learning; Cross-Modal Information Retrieval | Unverified | 0
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | Mar 13, 2023 | Common Sense Reasoning; Explanation Generation | Unverified | 0
HADA: A Graph-based Amalgamation Framework in Image-text Retrieval | Jan 11, 2023 | Graph Neural Network; Image Retrieval | Code Available | 0
When are Lemons Purple? The Concept Association Bias of Vision-Language Models | Dec 22, 2022 | Attribute; image-classification | Unverified | 0
A survey on knowledge-enhanced multimodal learning | Nov 19, 2022 | Conditional Image Generation; Factual Visual Question Answering | Unverified | 0
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training | Sep 30, 2022 | Computational Efficiency; Contrastive Learning | Code Available | 0
Design of the topology for contrastive visual-textual alignment | Sep 5, 2022 | Contrastive Learning; Image-to-Text Retrieval | Code Available | 0