| ColPali: Efficient Document Retrieval with Vision Language Models | Jun 27, 2024 | document understanding, RAG | Code Available | 7 |
| Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval | Mar 22, 2023 | Image-text matching, Language Modeling | Code Available | 2 |
| Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching | Mar 19, 2025 | Image-text matching, Text Matching | Code Available | 2 |
| 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment | Aug 8, 2023 | 3D Question Answering (3D-QA), Dense Captioning | Code Available | 2 |
| Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval | Apr 11, 2024 | Decoder, Dense Video Captioning | Code Available | 2 |
| MouSi: Poly-Visual-Expert Vision-Language Models | Jan 30, 2024 | Image Segmentation, Image-text matching | Code Available | 2 |
| LLaQo: Towards a Query-Based Coach in Expressive Music Performance Assessment | Sep 13, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization | Jan 17, 2025 | Anomaly Detection, Image-text matching | Code Available | 2 |
| A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models | Jul 24, 2023 | Image Generation, Image-text matching | Code Available | 2 |
| Language Models Can See: Plugging Visual Controls in Text Generation | May 5, 2022 | Image Captioning, Image-text matching | Code Available | 2 |
| DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting | Nov 19, 2022 | Decoder, Scene Text Detection | Code Available | 2 |
| ActionCLIP: A New Paradigm for Video Action Recognition | Sep 17, 2021 | Action Classification, Action Recognition | Code Available | 1 |
| KETM:A Knowledge-Enhanced Text Matching method | Aug 11, 2023 | Common Sense Reasoning, Question Answering | Code Available | 1 |
| Knowledge Guided Text Retrieval and Reading for Open Domain Question Answering | Nov 10, 2019 | Natural Questions, Open-Domain Question Answering | Code Available | 1 |
| Image-text matching for large-scale book collections | Jul 29, 2024 | Image-text matching, Optical Character Recognition (OCR) | Code Available | 1 |
| Identifying Machine-Paraphrased Plagiarism | Mar 22, 2021 | Articles, Text Matching | Code Available | 1 |
| Improved Probabilistic Image-Text Representations | May 29, 2023 | Data Augmentation, Image-text matching | Code Available | 1 |
| Lattice CNNs for Matching Based Chinese Question Answering | Feb 25, 2019 | Diversity, Question Answering | Code Available | 1 |
| Extractive Summarization as Text Matching | Apr 19, 2020 | Document Summarization, Extractive Summarization | Code Available | 1 |
| Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning | Mar 1, 2020 | Cross-Modal Retrieval, Retrieval | Code Available | 1 |
| Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners | May 18, 2023 | Image Generation, Image-text matching | Code Available | 1 |
| Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network | Jan 1, 2023 | Image-text matching, Retrieval | Code Available | 1 |
| Composing Object Relations and Attributes for Image-Text Matching | Jun 17, 2024 | Attribute, Graph Attention | Code Available | 1 |
| HANet: Hierarchical Alignment Networks for Video-Text Retrieval | Jul 26, 2021 | Retrieval, Text Matching | Code Available | 1 |
| A Dense Representation Framework for Lexical and Semantic Matching | Jun 20, 2022 | Retrieval, Semantic Text Matching | Code Available | 1 |
| Are Diffusion Models Vision-And-Language Reasoners? | May 25, 2023 | Denoising, Image Generation | Code Available | 1 |
| DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis | Aug 13, 2020 | Image Generation, Text Matching | Code Available | 1 |
| IteRPrimE: Zero-shot Referring Image Segmentation with Iterative Grad-CAM Refinement and Primary Word Emphasis | Mar 2, 2025 | Image Segmentation, Image-text matching | Code Available | 1 |
| Keyword-Attentive Deep Semantic Matching | Mar 11, 2020 | Retrieval, Text Matching | Code Available | 1 |
| ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO | Apr 7, 2022 | Image-text matching, Text Matching | Code Available | 1 |
| Graph Structured Network for Image-Text Matching | Apr 1, 2020 | Attribute, Cross-Modal Retrieval | Code Available | 1 |
| Deep Multimodal Neural Architecture Search | Apr 25, 2020 | Decoder, Image-text matching | Code Available | 1 |
| ColorSwap: A Color and Word Order Dataset for Multimodal Evaluation | Feb 7, 2024 | Image Generation, Image-text matching | Code Available | 1 |
| Declaration-based Prompt Tuning for Visual Question Answering | May 5, 2022 | Image-text matching, Language Modeling | Code Available | 1 |
| ComCLIP: Training-Free Compositional Image and Text Matching | Nov 25, 2022 | Image-text matching, Image-text Retrieval | Code Available | 1 |
| Advancing Visual Grounding with Scene Knowledge: Benchmark and Method | Jul 21, 2023 | Image-text matching, Text Matching | Code Available | 1 |
| Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching | Apr 28, 2024 | Contrastive Learning, Image-text matching | Code Available | 1 |
| BiCro: Noisy Correspondence Rectification for Multi-modality Data via Bi-directional Cross-modal Similarity Consistency | Mar 22, 2023 | Cross-modal retrieval with noisy correspondence, Image-text matching | Code Available | 1 |
| BrainCLIP: Bridging Brain and Visual-Linguistic Representation Via CLIP for Generic Natural Visual Stimulus Decoding | Feb 25, 2023 | Brain Decoding, Image Generation | Code Available | 1 |
| A Comparison of Supervised Learning to Match Methods for Product Search | Jul 20, 2020 | ARC, Attribute | Code Available | 1 |
| Consensus-Aware Visual-Semantic Embedding for Image-Text Matching | Jul 17, 2020 | Image Captioning, Image-text matching | Code Available | 1 |
| AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks | Nov 28, 2017 | Generative Adversarial Network, Image Generation | Code Available | 1 |
| Adaptive Offline Quintuplet Loss for Image-Text Matching | Mar 7, 2020 | Image-text matching, Text Matching | Code Available | 1 |
| Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models | Jun 10, 2025 | Contrastive Learning, Image-text matching | Code Available | 1 |
| Cross-modal Active Complementary Learning with Self-refining Correspondence | Oct 26, 2023 | Cross-modal retrieval with noisy correspondence, Image-text matching | Code Available | 1 |
| CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP | Mar 5, 2025 | Adversarial Robustness, Image-text matching | Code Available | 1 |
| A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval | Jun 4, 2021 | Graph Matching, Image Retrieval | Code Available | 1 |
| CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation | Feb 27, 2025 | Image-text matching, Object | Code Available | 1 |
| 3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation | Aug 31, 2023 | Navigate, Referring Expression | Code Available | 1 |
| DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting | Dec 2, 2021 | Image-text matching, Instance Segmentation | Code Available | 1 |