| 4th PVUW MeViS 3rd Place Report: Sa2VA | Apr 1, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 5 |
| Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection | Mar 9, 2023 | DecoderObject Detection | CodeCode Available | 5 |
| Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V | Oct 17, 2023 | Interactive SegmentationReferring Expression | CodeCode Available | 4 |
| RemoteSAM: Towards Segment Anything for Earth Observation | May 23, 2025 | AttributeEarth Observation | CodeCode Available | 3 |
| Towards Visual Grounding: A Survey | Dec 28, 2024 | Phrase GroundingReferring Expression | CodeCode Available | 3 |
| EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model | Jun 28, 2024 | Interactive SegmentationLanguage Modeling | CodeCode Available | 3 |
| PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model | Mar 21, 2024 | DecoderGeneralized Referring Expression Segmentation | CodeCode Available | 3 |
| Universal Instance Perception as Object Discovery and Retrieval | Mar 12, 2023 | Described Object DetectionGeneralized Referring Expression Comprehension | CodeCode Available | 3 |
| TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models | May 29, 2025 | Referring ExpressionReferring Expression Comprehension | CodeCode Available | 2 |
| GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding | Mar 13, 2025 | DiversityLanguage Modeling | CodeCode Available | 2 |
| Text4Seg: Reimagining Image Segmentation as Text Generation | Oct 13, 2024 | Image SegmentationReferring Expression | CodeCode Available | 2 |
| SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation | Sep 1, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 2 |
| Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models | Jun 24, 2024 | Referring ExpressionReferring Expression Comprehension | CodeCode Available | 2 |
| F-LMM: Grounding Frozen Large Multimodal Models | Jun 9, 2024 | General KnowledgeInstruction Following | CodeCode Available | 2 |
| Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation | Apr 4, 2024 | Contrastive LearningReferring Expression | CodeCode Available | 2 |
| Elysium: Exploring Object-level Perception in Videos via MLLM | Mar 25, 2024 | ObjectObject Tracking | CodeCode Available | 2 |
| Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation | Jan 1, 2024 | DescriptiveObject | CodeCode Available | 2 |
| NExT-Chat: An LMM for Chat, Detection and Segmentation | Nov 8, 2023 | Referring ExpressionReferring Expression Segmentation | CodeCode Available | 2 |
| GLaMM: Pixel Grounding Large Multimodal Model | Nov 6, 2023 | Conversational Question AnsweringImage Captioning | CodeCode Available | 2 |
| GREC: Generalized Referring Expression Comprehension | Aug 30, 2023 | Generalized Referring Expression ComprehensionReferring Expression | CodeCode Available | 2 |
| GRES: Generalized Referring Expression Segmentation | Jun 1, 2023 | Generalized Referring Expression SegmentationReferring Expression | CodeCode Available | 2 |
| MDETR - Modulated Detection for End-to-End Multi-Modal Understanding | Jan 1, 2021 | Phrase GroundingQuestion Answering | CodeCode Available | 2 |
| Exploring Contextual Attribute Density in Referring Expression Counting | Mar 16, 2025 | AttributeReferring Expression | CodeCode Available | 1 |
| IteRPrimE: Zero-shot Referring Image Segmentation with Iterative Grad-CAM Refinement and Primary Word Emphasis | Mar 2, 2025 | Image SegmentationImage-text matching | CodeCode Available | 1 |
| New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration | Feb 27, 2025 | Image ComprehensionReferring Expression | CodeCode Available | 1 |
| PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models? | Feb 6, 2025 | Question AnsweringReferring Expression | CodeCode Available | 1 |
| RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes | Feb 1, 2025 | Referring ExpressionReferring Expression Comprehension | CodeCode Available | 1 |
| NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning | Feb 1, 2025 | Referring ExpressionVisual Grounding | CodeCode Available | 1 |
| Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints | Jan 12, 2025 | Image SegmentationReferring Expression | CodeCode Available | 1 |
| IPDN: Image-enhanced Prompt Decoding Network for 3D Referring Expression Segmentation | Jan 9, 2025 | DecoderReferring Expression | CodeCode Available | 1 |
| Exploring Contextual Attribute Density in Referring Expression Counting | Jan 1, 2025 | AttributeReferring Expression | CodeCode Available | 1 |
| RG-SAN: Rule-Guided Spatial Awareness Network for End-to-End 3D Referring Expression Segmentation | Dec 3, 2024 | Referring ExpressionReferring Expression Segmentation | CodeCode Available | 1 |
| Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation | Oct 11, 2024 | BenchmarkingImage Segmentation | CodeCode Available | 1 |
| Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE | Sep 26, 2024 | image-classificationImage Classification | CodeCode Available | 1 |
| FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Sep 23, 2024 | Image ComprehensionReferring Expression | CodeCode Available | 1 |
| Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation | Sep 20, 2024 | Image SegmentationReferring Expression | CodeCode Available | 1 |
| MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension | Sep 20, 2024 | cross-modal alignmentReferring Expression | CodeCode Available | 1 |
| LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension | Sep 18, 2024 | Referring ExpressionReferring Expression Comprehension | CodeCode Available | 1 |
| 3D-GRES: Generalized 3D Referring Expression Segmentation | Jul 30, 2024 | ObjectReferring Expression | CodeCode Available | 1 |
| Multi-branch Collaborative Learning Network for 3D Visual Grounding | Jul 7, 2024 | 3D visual groundingReferring Expression | CodeCode Available | 1 |
| Referring Atomic Video Action Recognition | Jul 2, 2024 | Action LocalizationAction Recognition | CodeCode Available | 1 |
| SAM as the Guide: Mastering Pseudo-Label Refinement in Semi-Supervised Referring Expression Segmentation | Jun 3, 2024 | Pseudo LabelReferring Expression | CodeCode Available | 1 |
| CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation | May 24, 2024 | Generalized Referring Expression SegmentationObject | CodeCode Available | 1 |
| Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension | May 21, 2024 | 3D visual groundingReferring Expression | CodeCode Available | 1 |
| DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM | Mar 19, 2024 | Objectobject-detection | CodeCode Available | 1 |
| Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception | Mar 5, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | Feb 15, 2024 | Grounded Multimodal Named Entity RecognitionMulti-modal Named Entity Recognition | CodeCode Available | 1 |
| An Open and Comprehensive Pipeline for Unified Object Grounding and Detection | Jan 4, 2024 | Described Object DetectionPhrase Grounding | CodeCode Available | 1 |
| Referring Expression Counting | Jan 1, 2024 | 8kobject-detection | CodeCode Available | 1 |
| Tune-An-Ellipse: CLIP Has Potential to Find What You Want | Jan 1, 2024 | ObjectReferring Expression | CodeCode Available | 1 |