SOTAVerified

Referring Expression

Referring expression comprehension places a bounding box around the image instance that matches the provided description.

Papers

Showing 76–100 of 364 papers

Title | Status | Hype
IteRPrimE: Zero-shot Referring Image Segmentation with Iterative Grad-CAM Refinement and Primary Word Emphasis | Code | 1
Kosmos-2: Grounding Multimodal Large Language Models to the World | Code | 1
Multi-branch Collaborative Learning Network for 3D Visual Grounding | Code | 1
IPDN: Image-enhanced Prompt Decoding Network for 3D Referring Expression Segmentation | Code | 1
Graph-Structured Referring Expression Reasoning in The Wild | Code | 1
New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration | Code | 1
Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations | Code | 1
Iterative Shrinking for Referring Expression Grounding Using Deep Reinforcement Learning | Code | 1
Exploring Contextual Attribute Density in Referring Expression Counting | Code | 1
Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation | Code | 1
Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Code | 1
DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM | Code | 1
Described Object Detection: Liberating Object Detection with Flexible Expressions | Code | 1
Image Segmentation Using Text and Image Prompts | Code | 1
RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes | Code | 1
FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Code | 1
3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation | Code | 1
Refer360°: A Referring Expression Recognition Dataset in 360° Images | Code | 1
GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs | Code | 1
Human-centric Spatio-Temporal Video Grounding With Visual Transformers | Code | 1
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation | Code | 1
NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning | Code | 1
RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation | Code | 1
VL-BERT: Pre-training of Generic Visual-Linguistic Representations | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | Random | Acc@0.5m | 14.6 | | Unverified