| Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection | Mar 9, 2023 | Decoder, Object Detection | Code Available | 5 | 5 |
| 4th PVUW MeViS 3rd Place Report: Sa2VA | Apr 1, 2025 | Language Modeling, Language Modelling | Code Available | 5 | 5 |
| Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V | Oct 17, 2023 | Interactive Segmentation, Referring Expression | Code Available | 4 | 5 |
| PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model | Mar 21, 2024 | Decoder, Generalized Referring Expression Segmentation | Code Available | 3 | 5 |
| RemoteSAM: Towards Segment Anything for Earth Observation | May 23, 2025 | Attribute, Earth Observation | Code Available | 3 | 5 |
| Towards Visual Grounding: A Survey | Dec 28, 2024 | Phrase Grounding, Referring Expression | Code Available | 3 | 5 |
| Universal Instance Perception as Object Discovery and Retrieval | Mar 12, 2023 | Described Object Detection, Generalized Referring Expression Comprehension | Code Available | 3 | 5 |
| EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model | Jun 28, 2024 | Interactive Segmentation, Language Modeling | Code Available | 3 | 5 |
| GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding | Mar 13, 2025 | Diversity, Language Modeling | Code Available | 2 | 5 |
| Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation | Apr 4, 2024 | Contrastive Learning, Referring Expression | Code Available | 2 | 5 |
| F-LMM: Grounding Frozen Large Multimodal Models | Jun 9, 2024 | General Knowledge, Instruction Following | Code Available | 2 | 5 |
| GLaMM: Pixel Grounding Large Multimodal Model | Nov 6, 2023 | Conversational Question Answering, Image Captioning | Code Available | 2 | 5 |
| Elysium: Exploring Object-level Perception in Videos via MLLM | Mar 25, 2024 | Object, Object Tracking | Code Available | 2 | 5 |
| TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models | May 29, 2025 | Referring Expression, Referring Expression Comprehension | Code Available | 2 | 5 |
| Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models | Jun 24, 2024 | Referring Expression, Referring Expression Comprehension | Code Available | 2 | 5 |
| MDETR - Modulated Detection for End-to-End Multi-Modal Understanding | Jan 1, 2021 | Phrase Grounding, Question Answering | Code Available | 2 | 5 |
| GRES: Generalized Referring Expression Segmentation | Jun 1, 2023 | Generalized Referring Expression Segmentation, Referring Expression | Code Available | 2 | 5 |
| Text4Seg: Reimagining Image Segmentation as Text Generation | Oct 13, 2024 | Image Segmentation, Referring Expression | Code Available | 2 | 5 |
| GREC: Generalized Referring Expression Comprehension | Aug 30, 2023 | Generalized Referring Expression Comprehension, Referring Expression | Code Available | 2 | 5 |
| NExT-Chat: An LMM for Chat, Detection and Segmentation | Nov 8, 2023 | Referring Expression, Referring Expression Segmentation | Code Available | 2 | 5 |
| SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation | Sep 1, 2024 | Language Modeling, Language Modelling | Code Available | 2 | 5 |
| Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation | Jan 1, 2024 | Descriptive, Object | Code Available | 2 | 5 |
| MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension | Sep 20, 2024 | cross-modal alignment, Referring Expression | Code Available | 1 | 5 |
| LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | Feb 15, 2024 | Grounded Multimodal Named Entity Recognition, Multi-modal Named Entity Recognition | Code Available | 1 | 5 |
| March in Chat: Interactive Prompting for Remote Embodied Referring Expression | Aug 20, 2023 | Referring Expression, Vision and Language Navigation | Code Available | 1 | 5 |
| LAVT: Language-Aware Vision Transformer for Referring Image Segmentation | Dec 4, 2021 | Decoder, Generalized Referring Expression Segmentation | Code Available | 1 | 5 |
| A Unified Framework for 3D Point Cloud Visual Grounding | Aug 23, 2023 | CPU, GPU | Code Available | 1 | 5 |
| Airbert: In-domain Pretraining for Vision-and-Language Navigation | Aug 20, 2021 | Navigate, Referring Expression | Code Available | 1 | 5 |
| LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension | Sep 18, 2024 | Referring Expression, Referring Expression Comprehension | Code Available | 1 | 5 |
| GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs | Nov 8, 2023 | Question Answering, Referring Expression | Code Available | 1 | 5 |
| FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Sep 23, 2024 | Image Comprehension, Referring Expression | Code Available | 1 | 5 |
| Layout-aware Dreamer for Embodied Referring Expression Grounding | Nov 30, 2022 | Common Sense Reasoning, Navigate | Code Available | 1 | 5 |
| A Fast and Accurate One-Stage Approach to Visual Grounding | Aug 18, 2019 | Referring Expression, Referring Expression Comprehension | Code Available | 1 | 5 |
| A Recurrent Vision-and-Language BERT for Navigation | Nov 26, 2020 | Decision Making, Decoder | Code Available | 1 | 5 |
| Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation | Sep 20, 2024 | Image Segmentation, Referring Expression | Code Available | 1 | 5 |
| Learning to Evaluate Performance of Multi-modal Semantic Localization | Sep 14, 2022 | Cross-Modal Retrieval, Referring Expression | Code Available | 1 | 5 |
| MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding | Apr 26, 2021 | Generalized Referring Expression Comprehension, Phrase Grounding | Code Available | 1 | 5 |
| Iterative Shrinking for Referring Expression Grounding Using Deep Reinforcement Learning | Mar 9, 2021 | Deep Reinforcement Learning, Referring Expression | Code Available | 1 | 5 |
| Human-centric Spatio-Temporal Video Grounding With Visual Transformers | Nov 10, 2020 | Referring Expression, Sentence | Code Available | 1 | 5 |
| IteRPrimE: Zero-shot Referring Image Segmentation with Iterative Grad-CAM Refinement and Primary Word Emphasis | Mar 2, 2025 | Image Segmentation, Image-text matching | Code Available | 1 | 5 |
| An Open and Comprehensive Pipeline for Unified Object Grounding and Detection | Jan 4, 2024 | Described Object Detection, Phrase Grounding | Code Available | 1 | 5 |
| Colors in Context: A Pragmatic Neural Model for Grounded Language Understanding | Mar 29, 2017 | Referring Expression | Code Available | 1 | 5 |
| Advancing Referring Expression Segmentation Beyond Single Image | May 21, 2023 | Co-Salient Object Detection, Object | Code Available | 1 | 5 |
| DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM | Mar 19, 2024 | Object, object-detection | Code Available | 1 | 5 |
| Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations | Jun 30, 2022 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| IPDN: Image-enhanced Prompt Decoding Network for 3D Referring Expression Segmentation | Jan 9, 2025 | Decoder, Referring Expression | Code Available | 1 | 5 |
| Discriminative Triad Matching and Reconstruction for Weakly Referring Expression Grounding | Jun 8, 2021 | Referring Expression, Sentence | Code Available | 1 | 5 |
| Exploring Contextual Attribute Density in Referring Expression Counting | Mar 16, 2025 | Attribute, Referring Expression | Code Available | 1 | 5 |
| CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation | May 24, 2024 | Generalized Referring Expression Segmentation, Object | Code Available | 1 | 5 |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Jun 26, 2023 | Image Captioning, In-Context Learning | Code Available | 1 | 5 |