SOTAVerified

Text Retrieval

Text Retrieval is the task of finding the most relevant text result (such as an answer, paragraph, or passage) for a given query (which could be a question, keywords, or any related text).
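The task definition above can be sketched with a minimal bag-of-words retriever. This is a generic TF-IDF illustration, not code from any of the papers listed below; the function names (`tokenize`, `retrieve`) and sample passages are invented for the example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenization (deliberately naive)."""
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency per term
    n = len(tokenized)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=3):
    """Rank passages by cosine similarity to the query; return the top k."""
    # The query is appended to the corpus so its terms get IDF weights too.
    vecs = tfidf_vectors(list(docs) + [query])
    qvec, dvecs = vecs[-1], vecs[:-1]
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine(qvec, dvecs[i]),
                    reverse=True)
    return [docs[i] for i in ranked[:k]]
```

For example, `retrieve("which animal chases mice", passages, k=1)` returns the passage sharing the most discriminative terms with the query. Production systems replace this scoring with BM25 or dense embedding models, but the query-scoring-and-ranking shape is the same.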

Papers

Showing 1–50 of 671 papers

Title | Status | Hype
Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval | - | 0
Tree-Based Text Retrieval via Hierarchical Clustering in RAG Frameworks: Application on Taiwanese Regulations | Code | 0
GLAP: General contrastive audio-text pretraining across domains and languages | Code | 2
MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling | Code | 0
Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration | - | 0
TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning | Code | 2
Adding simple structure at inference improves Vision-Language Compositionality | Code | 0
FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation | Code | 2
DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval | Code | 1
Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models | Code | 1
Beyond Cropped Regions: New Benchmark and Corresponding Baseline for Chinese Scene Text Retrieval in Diverse Layouts | - | 0
Attacking Attention of Foundation Models Disrupts Downstream Tasks | Code | 0
ERU-KG: Efficient Reference-aligned Unsupervised Keyphrase Generation | Code | 0
One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory | Code | 2
MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval | - | 0
Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation | - | 0
EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models | - | 0
Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval | - | 0
LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | Code | 1
Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models | - | 0
mmRAG: A Modular Benchmark for Retrieval-Augmented Generation over Text, Tables, and Knowledge Graphs | Code | 1
Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution | - | 0
Reproducibility, Replicability, and Insights into Visual Document Retrieval with Late Interaction | Code | 0
A Vision-Language Foundation Model for Leaf Disease Identification | Code | 0
FG-CLIP: Fine-Grained Visual and Textual Alignment | Code | 4
QBD-RankedDataGen: Generating Custom Ranked Datasets for Improving Query-By-Document Search Using LLM-Reranking with Reduced Human Effort | - | 0
AGATE: Stealthy Black-box Watermarking for Multimodal Model Copyright Protection | - | 0
Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs | - | 0
Towards Understanding Camera Motions in Any Video | - | 0
SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs | - | 0
DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation | - | 0
FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations | - | 0
Bridging Queries and Tables through Entities in Table Retrieval | - | 0
LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders | - | 0
Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval | - | 0
M2D2: Exploring General-purpose Audio-Language Representations Beyond CLAP | Code | 0
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models | Code | 1
SeLIP: Similarity Enhanced Contrastive Language Image Pretraining for Multi-modal Head MRI | - | 0
Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis | Code | 2
GOAL: Global-local Object Alignment Learning | Code | 1
Anatomy-Aware Conditional Image-Text Retrieval | - | 0
Bridging Classical and Quantum String Matching: A Computational Reformulation of Bit-Parallelism | - | 0
Variance-Aware Loss Scheduling for Multimodal Alignment in Low-Data Settings | - | 0
Tailoring Table Retrieval from a Field-aware Hybrid Matching Perspective | - | 0
LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning | - | 0
V^2Dial: Unification of Video and Visual Dialog via Multimodal Experts | - | 0
MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations | - | 0
ABC: Achieving Better Control of Multimodal Embeddings using VLMs | - | 0
How Vital is the Jurisprudential Relevance: Law Article Intervened Legal Case Retrieval and Matching | - | 0
Progressive Local Alignment for Medical Multimodal Pre-training | - | 0
Page 1 of 14

No leaderboard results yet.