SOTAVerified

Text Matching

Matching a target text to a source text based on their meaning.

Papers

Showing 1-50 of 364 papers

Title | Status | Hype
ColPali: Efficient Document Retrieval with Vision Language Models | Code | 7
Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching | Code | 2
FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization | Code | 2
LLaQo: Towards a Query-Based Coach in Expressive Music Performance Assessment | Code | 2
Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval | Code | 2
MouSi: Poly-Visual-Expert Vision-Language Models | Code | 2
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment | Code | 2
A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models | Code | 2
Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval | Code | 2
DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting | Code | 2
Language Models Can See: Plugging Visual Controls in Text Generation | Code | 2
Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models | Code | 1
CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP | Code | 1
IteRPrimE: Zero-shot Referring Image Segmentation with Iterative Grad-CAM Refinement and Primary Word Emphasis | Code | 1
CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation | Code | 1
TDSM: Triplet Diffusion for Skeleton-Text Matching in Zero-Shot Action Recognition | Code | 1
Teach CLIP to Develop a Number Sense for Ordinal Regression | Code | 1
Image-text matching for large-scale book collections | Code | 1
Composing Object Relations and Attributes for Image-Text Matching | Code | 1
Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation | Code | 1
Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching | Code | 1
Narrative Action Evaluation with Prompt-Guided Multimodal Interaction | Code | 1
RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language-Image Pre-training | Code | 1
ColorSwap: A Color and Word Order Dataset for Multimodal Evaluation | Code | 1
Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models | Code | 1
MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts | Code | 1
Cross-modal Active Complementary Learning with Self-refining Correspondence | Code | 1
3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation | Code | 1
Text Matching Improves Sequential Recommendation by Reducing Popularity Biases | Code | 1
KETM:A Knowledge-Enhanced Text Matching method | Code | 1
Your Negative May not Be True Negative: Boosting Image-Text Matching with False Negative Elimination | Code | 1
Advancing Visual Grounding with Scene Knowledge: Benchmark and Method | Code | 1
UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding | Code | 1
Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark | Code | 1
Revisiting the Role of Language Priors in Vision-Language Models | Code | 1
Improved Probabilistic Image-Text Representations | Code | 1
Are Diffusion Models Vision-And-Language Reasoners? | Code | 1
UniTRec: A Unified Text-to-Text Transformer and Joint Contrastive Learning Framework for Text-based Recommendation | Code | 1
Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners | Code | 1
LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation | Code | 1
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations | Code | 1
Multimodal Image-Text Matching Improves Retrieval-based Chest X-Ray Report Generation | Code | 1
Plug-and-Play Regulators for Image-Text Matching | Code | 1
BiCro: Noisy Correspondence Rectification for Multi-modality Data via Bi-directional Cross-modal Similarity Consistency | Code | 1
BrainCLIP: Bridging Brain and Visual-Linguistic Representation Via CLIP for Generic Natural Visual Stimulus Decoding | Code | 1
Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network | Code | 1
Learning Semantic Relationship Among Instances for Image-Text Matching | Code | 1
ComCLIP: Training-Free Compositional Image and Text Matching | Code | 1
Self-supervised vision-language pretraining for Medical visual question answering | Code | 1
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model | Code | 1

No leaderboard results yet.