| ColPali: Efficient Document Retrieval with Vision Language Models | Jun 27, 2024 | document understanding, RAG | Code Available | 7 |
| Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching | Mar 19, 2025 | Image-text matching, Text Matching | Code Available | 2 |
| FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization | Jan 17, 2025 | Anomaly Detection, Image-text matching | Code Available | 2 |
| LLaQo: Towards a Query-Based Coach in Expressive Music Performance Assessment | Sep 13, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval | Apr 11, 2024 | Decoder, Dense Video Captioning | Code Available | 2 |
| MouSi: Poly-Visual-Expert Vision-Language Models | Jan 30, 2024 | Image Segmentation, Image-text matching | Code Available | 2 |
| 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment | Aug 8, 2023 | 3D Question Answering (3D-QA), Dense Captioning | Code Available | 2 |
| A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models | Jul 24, 2023 | Image Generation, Image-text matching | Code Available | 2 |
| Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval | Mar 22, 2023 | Image-text matching, Language Modeling | Code Available | 2 |
| DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting | Nov 19, 2022 | Decoder, Scene Text Detection | Code Available | 2 |
| Language Models Can See: Plugging Visual Controls in Text Generation | May 5, 2022 | Image Captioning, Image-text matching | Code Available | 2 |
| Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models | Jun 10, 2025 | Contrastive Learning, Image-text matching | Code Available | 1 |
| CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP | Mar 5, 2025 | Adversarial Robustness, Image-text matching | Code Available | 1 |
| IteRPrimE: Zero-shot Referring Image Segmentation with Iterative Grad-CAM Refinement and Primary Word Emphasis | Mar 2, 2025 | Image Segmentation, Image-text matching | Code Available | 1 |
| CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation | Feb 27, 2025 | Image-text matching, Object | Code Available | 1 |
| TDSM: Triplet Diffusion for Skeleton-Text Matching in Zero-Shot Action Recognition | Nov 16, 2024 | Action Recognition, Skeleton Based Action Recognition | Code Available | 1 |
| Teach CLIP to Develop a Number Sense for Ordinal Regression | Aug 7, 2024 | regression, Text Matching | Code Available | 1 |
| Image-text matching for large-scale book collections | Jul 29, 2024 | Image-text matching, Optical Character Recognition (OCR) | Code Available | 1 |
| Composing Object Relations and Attributes for Image-Text Matching | Jun 17, 2024 | Attribute, Graph Attention | Code Available | 1 |
| Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation | May 16, 2024 | AudioCaps, Event Detection | Code Available | 1 |
| Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching | Apr 28, 2024 | Contrastive Learning, Image-text matching | Code Available | 1 |
| Narrative Action Evaluation with Prompt-Guided Multimodal Interaction | Apr 22, 2024 | Action Quality Assessment, multimodal interaction | Code Available | 1 |
| RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language-Image Pre-training | Mar 15, 2024 | Diagnostic, image-classification | Code Available | 1 |
| ColorSwap: A Color and Word Order Dataset for Multimodal Evaluation | Feb 7, 2024 | Image Generation, Image-text matching | Code Available | 1 |
| Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models | Nov 28, 2023 | Image Captioning, Image-text matching | Code Available | 1 |
| MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts | Nov 16, 2023 | Binary Classification, Descriptive | Code Available | 1 |
| Cross-modal Active Complementary Learning with Self-refining Correspondence | Oct 26, 2023 | Cross-modal retrieval with noisy correspondence, Image-text matching | Code Available | 1 |
| 3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation | Aug 31, 2023 | Navigate, Referring Expression | Code Available | 1 |
| Text Matching Improves Sequential Recommendation by Reducing Popularity Biases | Aug 27, 2023 | Recommendation Systems, Sequential Recommendation | Code Available | 1 |
| KETM:A Knowledge-Enhanced Text Matching method | Aug 11, 2023 | Common Sense Reasoning, Question Answering | Code Available | 1 |
| Your Negative May not Be True Negative: Boosting Image-Text Matching with False Negative Elimination | Aug 8, 2023 | Image-text matching, Representation Learning | Code Available | 1 |
| Advancing Visual Grounding with Scene Knowledge: Benchmark and Method | Jul 21, 2023 | Image-text matching, Text Matching | Code Available | 1 |
| UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding | Jul 3, 2023 | Image-text matching, Sentence | Code Available | 1 |
| Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark | Jun 5, 2023 | Attribute, Image-text matching | Code Available | 1 |
| Revisiting the Role of Language Priors in Vision-Language Models | Jun 2, 2023 | Image-text matching, Image-text Retrieval | Code Available | 1 |
| Improved Probabilistic Image-Text Representations | May 29, 2023 | Data Augmentation, Image-text matching | Code Available | 1 |
| Are Diffusion Models Vision-And-Language Reasoners? | May 25, 2023 | Denoising, Image Generation | Code Available | 1 |
| UniTRec: A Unified Text-to-Text Transformer and Joint Contrastive Learning Framework for Text-based Recommendation | May 25, 2023 | Contrastive Learning, Text Matching | Code Available | 1 |
| Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners | May 18, 2023 | Image Generation, Image-text matching | Code Available | 1 |
| LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation | May 18, 2023 | Attribute, Image Generation | Code Available | 1 |
| Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations | May 6, 2023 | Image-text matching, Text Matching | Code Available | 1 |
| Multimodal Image-Text Matching Improves Retrieval-based Chest X-Ray Report Generation | Mar 29, 2023 | Image Captioning, Image-text matching | Code Available | 1 |
| Plug-and-Play Regulators for Image-Text Matching | Mar 23, 2023 | Cross-Modal Retrieval, Image Retrieval | Code Available | 1 |
| BiCro: Noisy Correspondence Rectification for Multi-modality Data via Bi-directional Cross-modal Similarity Consistency | Mar 22, 2023 | Cross-modal retrieval with noisy correspondence, Image-text matching | Code Available | 1 |
| BrainCLIP: Bridging Brain and Visual-Linguistic Representation Via CLIP for Generic Natural Visual Stimulus Decoding | Feb 25, 2023 | Brain Decoding, Image Generation | Code Available | 1 |
| Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network | Jan 1, 2023 | Image-text matching, Retrieval | Code Available | 1 |
| Learning Semantic Relationship Among Instances for Image-Text Matching | Jan 1, 2023 | Cross-Modal Retrieval, Image Retrieval | Code Available | 1 |
| ComCLIP: Training-Free Compositional Image and Text Matching | Nov 25, 2022 | Image-text matching, Image-text Retrieval | Code Available | 1 |
| Self-supervised vision-language pretraining for Medical visual question answering | Nov 24, 2022 | Contrastive Learning, Image-text matching | Code Available | 1 |
| MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Oct 11, 2022 | Contrastive Learning, Image-text matching | Code Available | 1 |