SOTAVerified

Referring Expression Comprehension

Papers

Showing 1-50 of 167 papers

Title | Status | Hype
Referring Expression Instance Retrieval and A Strong End-to-End Baseline | - | 0
Synthetic Visual Genome | - | 0
TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models | Code | 2
WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation | Code | 0
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model | Code | 9
Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding | - | 0
GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | - | 0
New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration | Code | 1
Exploring Spatial Language Grounding Through Referring Expressions | - | 0
RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes | Code | 1
FLORA: Formal Language Model Enables Robust Training-free Zero-shot Object Referring Analysis | - | 0
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks | - | 0
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints | Code | 1
Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension | - | 0
Task-aware Cross-modal Feature Refinement Transformer with Large Language Models for Visual Grounding | - | 0
DViN: Dynamic Visual Routing Network for Weakly Supervised Referring Expression Comprehension | - | 0
Towards Visual Grounding: A Survey | Code | 3
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | Code | 9
Harlequin: Color-driven Generation of Synthetic Data for Referring Expression Comprehension | - | 0
Frontiers in Intelligent Colonoscopy | Code | 2
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models | Code | 0
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion | Code | 2
Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE | Code | 1
FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Code | 1
MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension | Code | 1
LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension | Code | 1
Make Graph-based Referring Expression Comprehension Great Again through Expression-guided Dynamic Gating and Regression | - | 0
A Lightweight Modular Framework for Low-Cost Open-Vocabulary Object Detection Training | Code | 0
Revisiting Multi-Modal LLM Evaluation | - | 0
MaskInversion: Localized Embeddings via Optimization of Explainability Maps | - | 0
Learning Visual Grounding from Generative Vision and Language Model | - | 0
Multi-branch Collaborative Learning Network for 3D Visual Grounding | Code | 1
The Solution for the 5th GCAIAC Zero-shot Referring Expression Comprehension Challenge | - | 0
M^2IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension | - | 0
Segment Anything Model for automated image data annotation: empirical studies using text prompts from Grounding DINO | - | 0
ScanFormer: Referring Expression Comprehension by Iteratively Scanning | - | 0
Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models | Code | 2
Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension | Code | 1
Adversarial Robustness for Visual Grounding of Multimodal Large Language Models | Code | 0
Text-driven Affordance Learning from Egocentric Vision | - | 0
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | Code | 7
PropTest: Automatic Property Testing for Improved Visual Programming | - | 0
Elysium: Exploring Object-level Perception in Videos via MLLM | Code | 2
DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM | Code | 1
WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar | - | 0
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training | - | 0
Efficient Multimodal Learning from Data-centric Perspective | Code | 5
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | Code | 1
An Open and Comprehensive Pipeline for Unified Object Grounding and Detection | Code | 1
Revisiting Counterfactual Problems in Referring Expression Comprehension | Code | 0

No leaderboard results yet.