SOTAVerified

Referring Expression Comprehension

Papers

Showing 26–50 of 167 papers

Title | Status | Hype
Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE | Code | 1
FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Code | 1
MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension | Code | 1
LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension | Code | 1
Multi-branch Collaborative Learning Network for 3D Visual Grounding | Code | 1
Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension | Code | 1
DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM | Code | 1
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | Code | 1
An Open and Comprehensive Pipeline for Unified Object Grounding and Detection | Code | 1
Tune-An-Ellipse: CLIP Has Potential to Find What You Want | Code | 1
Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions | Code | 1
GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs | Code | 1
InstructDET: Diversifying Referring Object Detection with Generalized Instructions | Code | 1
RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D | Code | 1
A Unified Framework for 3D Point Cloud Visual Grounding | Code | 1
Described Object Detection: Liberating Object Detection with Flexible Expressions | Code | 1
Kosmos-2: Grounding Multimodal Large Language Models to the World | Code | 1
NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations | Code | 1
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation | Code | 1
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding | Code | 1
TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation | Code | 1
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment | Code | 1
Learning to Evaluate Performance of Multi-modal Semantic Localization | Code | 1
Correspondence Matters for Video Referring Expression Comprehension | Code | 1
Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations | Code | 1

No leaderboard results yet.