SOTAVerified

Referring Expression

Referring expression comprehension localizes the object instance described by a natural-language expression, placing a bounding box around it in the provided image.
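A prediction for this task is typically scored as correct when its intersection-over-union (IoU) with the ground-truth box reaches 0.5, yielding the Acc@0.5 metric. A minimal sketch of that criterion (function names are illustrative, not this site's evaluation code):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def acc_at_05(predictions, ground_truths):
    """Fraction of predicted boxes whose IoU with the ground truth is >= 0.5."""
    hits = sum(iou(p, g) >= 0.5 for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)

# A perfect match scores IoU 1.0; disjoint boxes score 0.0.
print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # 1.0
print(iou((0, 0, 2, 2), (5, 5, 8, 8)))      # 0.0
```

The IoU threshold of 0.5 is the conventional choice for 2D grounding benchmarks; individual papers listed below may report stricter thresholds or, for the segmentation variants, mask-based IoU instead.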

Papers

Showing 51–100 of 364 papers

Title | Status | Hype
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding | Code | 1
Modeling Context in Referring Expressions | Code | 1
Correspondence Matters for Video Referring Expression Comprehension | Code | 1
Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations | Code | 1
RG-SAN: Rule-Guided Spatial Awareness Network for End-to-End 3D Referring Expression Segmentation | Code | 1
Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding | Code | 1
Relationship-Embedded Representation Learning for Grounding Referring Expressions | Code | 1
Airbert: In-domain Pretraining for Vision-and-Language Navigation | Code | 1
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints | Code | 1
New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration | Code | 1
RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation | Code | 1
Room-and-Object Aware Knowledge Reasoning for Remote Embodied Referring Expression | Code | 1
Refer360°: A Referring Expression Recognition Dataset in 360° Images | Code | 1
GSVA: Generalized Segmentation via Multimodal Large Language Models | Code | 1
Referring Atomic Video Action Recognition | Code | 1
RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes | Code | 1
3D-GRES: Generalized 3D Referring Expression Segmentation | Code | 1
RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D | Code | 1
Referring Expression Counting | Code | 1
Discriminative Triad Matching and Reconstruction for Weakly Referring Expression Grounding | Code | 1
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs | Code | 1
CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation | Code | 1
ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension | Code | 1
OCID-Ref: A 3D Robotic Dataset with Embodied Language for Clutter Scene Grounding | Code | 1
GRIT: General Robust Image Task Benchmark | Code | 1
Graph-Structured Referring Expression Reasoning in The Wild | Code | 1
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models | Code | 1
PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models? | Code | 1
DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM | Code | 1
Human-centric Spatio-Temporal Video Grounding With Visual Transformers | Code | 1
Described Object Detection: Liberating Object Detection with Flexible Expressions | Code | 1
3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation | Code | 1
GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs | Code | 1
Exploring Contextual Attribute Density in Referring Expression Counting | Code | 1
Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation | Code | 1
Advancing Referring Expression Segmentation Beyond Single Image | Code | 1
IPDN: Image-enhanced Prompt Decoding Network for 3D Referring Expression Segmentation | Code | 1
Iterative Shrinking for Referring Expression Grounding Using Deep Reinforcement Learning | Code | 1
IteRPrimE: Zero-shot Referring Image Segmentation with Iterative Grad-CAM Refinement and Primary Word Emphasis | Code | 1
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation | Code | 1
FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Code | 1
Learning to Evaluate Performance of Multi-modal Semantic Localization | Code | 1
Image Segmentation Using Text and Image Prompts | Code | 1
Referring Transformer: A One-step Approach to Multi-task Visual Grounding | Code | 1
Multi-branch Collaborative Learning Network for 3D Visual Grounding | Code | 1
LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension | Code | 1
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | Code | 1
A Fast and Accurate One-Stage Approach to Visual Grounding | Code | 1
VL-BERT: Pre-training of Generic Visual-Linguistic Representations | Code | 1
Page 2 of 8

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Random | Acc@0.5m | 14.6 | — | Unverified