| Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval | Jun 28, 2025 | Cross-Modal Retrieval, Image Captioning | Unverified | 0 |
| Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models | Jun 26, 2025 | Language Modeling, Language Modelling | Code Available | 0 |
| Referring Expression Instance Retrieval and A Strong End-to-End Baseline | Jun 23, 2025 | Image Retrieval, Referring Expression | Unverified | 0 |
| Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation | Jun 12, 2025 | Referring Expression | Unverified | 0 |
| Synthetic Visual Genome | Jun 9, 2025 | Referring Expression, Referring Expression Comprehension | Unverified | 0 |
| Refer to Anything with Vision-Language Prompts | Jun 5, 2025 | Benchmarking, Generalized Referring Expression Segmentation | Unverified | 0 |
| From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes | Jun 5, 2025 | 3D visual grounding, Object | Unverified | 0 |
| Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning | Jun 4, 2025 | Object, Referring Expression | Unverified | 0 |
| RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions | Jun 3, 2025 | Referring Expression, Synthetic Data Generation | Unverified | 0 |
| TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models | May 29, 2025 | Referring Expression, Referring Expression Comprehension | Code Available | 2 |
| Improving Contrastive Learning for Referring Expression Counting | May 28, 2025 | Contrastive Learning, Object Counting | Code Available | 0 |
| Deformable Attentive Visual Enhancement for Referring Segmentation Using Vision-Language Model | May 25, 2025 | cross-modal alignment, Image Segmentation | Unverified | 0 |
| WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation | May 24, 2025 | Contrastive Learning, Referring Expression | Code Available | 0 |
| RemoteSAM: Towards Segment Anything for Earth Observation | May 23, 2025 | Attribute, Earth Observation | Code Available | 3 |
| Learning to Reason and Navigate: Parameter Efficient Action Planning with Large Language Models | May 12, 2025 | Navigate, Referring Expression | Unverified | 0 |
| RESAnything: Attribute Prompting for Arbitrary Referring Segmentation | May 3, 2025 | Attribute, Image Segmentation | Unverified | 0 |
| Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation | Apr 22, 2025 | Referring Expression, Referring expression generation | Code Available | 0 |
| LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation | Apr 20, 2025 | Attribute, Image Segmentation | Unverified | 0 |
| 3DResT: A Strong Baseline for Semi-Supervised 3D Referring Expression Segmentation | Apr 17, 2025 | Referring Expression, Referring Expression Segmentation | Unverified | 0 |
| Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities | Apr 2, 2025 | Descriptive, Large Language Model | Code Available | 0 |
| 4th PVUW MeViS 3rd Place Report: Sa2VA | Apr 1, 2025 | Language Modeling, Language Modelling | Code Available | 5 |
| MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing | Mar 31, 2025 | Object, object-detection | Code Available | 0 |
| Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding | Mar 25, 2025 | Attribute, Object | Unverified | 0 |
| Exploring Contextual Attribute Density in Referring Expression Counting | Mar 16, 2025 | Attribute, Referring Expression | Code Available | 1 |
| GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | Mar 16, 2025 | Change Detection, Image Captioning | Unverified | 0 |
| Cognitive Disentanglement for Referring Multi-Object Tracking | Mar 14, 2025 | Disentanglement, Multi-Object Tracking | Unverified | 0 |
| GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding | Mar 13, 2025 | Diversity, Language Modeling | Code Available | 2 |
| IteRPrimE: Zero-shot Referring Image Segmentation with Iterative Grad-CAM Refinement and Primary Word Emphasis | Mar 2, 2025 | Image Segmentation, Image-text matching | Code Available | 1 |
| New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration | Feb 27, 2025 | Image Comprehension, Referring Expression | Code Available | 1 |
| PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models? | Feb 6, 2025 | Question Answering, Referring Expression | Code Available | 1 |
| Exploring Spatial Language Grounding Through Referring Expressions | Feb 4, 2025 | Image Captioning, Negation | Unverified | 0 |
| RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes | Feb 1, 2025 | Referring Expression, Referring Expression Comprehension | Code Available | 1 |
| NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning | Feb 1, 2025 | Referring Expression, Visual Grounding | Code Available | 1 |
| Implicit Causality-biases in humans and LLMs as a tool for benchmarking LLM discourse capabilities | Jan 22, 2025 | Benchmarking, Referring Expression | Unverified | 0 |
| FLORA: Formal Language Model Enables Robust Training-free Zero-shot Object Referring Analysis | Jan 17, 2025 | Bayesian Inference, Language Modeling | Unverified | 0 |
| Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks | Jan 14, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints | Jan 12, 2025 | Image Segmentation, Referring Expression | Code Available | 1 |
| IPDN: Image-enhanced Prompt Decoding Network for 3D Referring Expression Segmentation | Jan 9, 2025 | Decoder, Referring Expression | Code Available | 1 |
| Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension | Jan 2, 2025 | Generalized Referring Expression Comprehension, Generalized Referring Expression Segmentation | Unverified | 0 |
| Exploring Contextual Attribute Density in Referring Expression Counting | Jan 1, 2025 | Attribute, Referring Expression | Code Available | 1 |
| DViN: Dynamic Visual Routing Network for Weakly Supervised Referring Expression Comprehension | Jan 1, 2025 | Descriptive, Referring Expression | Unverified | 0 |
| Task-aware Cross-modal Feature Refinement Transformer with Large Language Models for Visual Grounding | Jan 1, 2025 | Referring Expression, Referring Expression Comprehension | Unverified | 0 |
| Towards Visual Grounding: A Survey | Dec 28, 2024 | Phrase Grounding, Referring Expression | Code Available | 3 |
| RG-SAN: Rule-Guided Spatial Awareness Network for End-to-End 3D Referring Expression Segmentation | Dec 3, 2024 | Referring Expression, Referring Expression Segmentation | Code Available | 1 |
| Harlequin: Color-driven Generation of Synthetic Data for Referring Expression Comprehension | Nov 22, 2024 | Referring Expression, Referring Expression Comprehension | Unverified | 0 |
| Instance-Aware Generalized Referring Expression Segmentation | Nov 22, 2024 | Generalized Referring Expression Segmentation, Object | Unverified | 0 |
| Finding NeMo: Negative-mined Mosaic Augmentation for Referring Image Segmentation | Nov 3, 2024 | Data Augmentation, Image Segmentation | Unverified | 0 |
| SegLLM: Multi-round Reasoning Segmentation | Oct 24, 2024 | Reasoning Segmentation, Referring Expression | Unverified | 0 |
| Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models | Oct 21, 2024 | Instruction Following, object-detection | Code Available | 0 |
| Text4Seg: Reimagining Image Segmentation as Text Generation | Oct 13, 2024 | Image Segmentation, Referring Expression | Code Available | 2 |