| PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models | May 23, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension | May 21, 2024 | 3D visual grounding, Referring Expression | Code Available | 1 |
| Correspondence Matters for Video Referring Expression Comprehension | Jul 21, 2022 | Contrastive Learning, Referring Expression | Code Available | 1 |
| NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations | Mar 23, 2023 | Question Answering, Referring Expression | Code Available | 1 |
| GRIT: General Robust Image Task Benchmark | Apr 28, 2022 | Instance Segmentation, Keypoint Detection | Code Available | 1 |
| Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation | Oct 11, 2024 | Benchmarking, Image Segmentation | Code Available | 1 |
| OCID-Ref: A 3D Robotic Dataset with Embodied Language for Clutter Scene Grounding | Mar 13, 2021 | Referring Expression, Referring Expression Segmentation | Code Available | 1 |
| Airbert: In-domain Pretraining for Vision-and-Language Navigation | Aug 20, 2021 | Navigate, Referring Expression | Code Available | 1 |
| Tune-An-Ellipse: CLIP Has Potential to Find What You Want | Jan 1, 2024 | Object, Referring Expression | Code Available | 1 |
| Unifying Vision-and-Language Tasks via Text Generation | Feb 4, 2021 | Conditional Text Generation, Decoder | Code Available | 1 |
| Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs | Oct 1, 2023 | Referring Expression | Code Available | 1 |
| Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding | Sep 3, 2020 | Referring Expression, Vocal Bursts Valence Prediction | Code Available | 1 |
| Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception | Mar 5, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Modeling Context in Referring Expressions | Jul 31, 2016 | Referring Expression, Referring expression generation | Code Available | 1 |
| Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation | Mar 19, 2020 | Generalized Referring Expression Comprehension, Referring Expression | Code Available | 1 |
| MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension | Sep 20, 2024 | cross-modal alignment, Referring Expression | Code Available | 1 |
| 3D-GRES: Generalized 3D Referring Expression Segmentation | Jul 30, 2024 | Object, Referring Expression | Code Available | 1 |
| Learning to Evaluate Performance of Multi-modal Semantic Localization | Sep 14, 2022 | Cross-Modal Retrieval, Referring Expression | Code Available | 1 |
| March in Chat: Interactive Prompting for Remote Embodied Referring Expression | Aug 20, 2023 | Referring Expression, Vision and Language Navigation | Code Available | 1 |
| Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints | Jan 12, 2025 | Image Segmentation, Referring Expression | Code Available | 1 |
| Discriminative Triad Matching and Reconstruction for Weakly Referring Expression Grounding | Jun 8, 2021 | Referring Expression, Sentence | Code Available | 1 |
| Layout-aware Dreamer for Embodied Referring Expression Grounding | Nov 30, 2022 | Common Sense Reasoning, Navigate | Code Available | 1 |
| LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | Feb 15, 2024 | Grounded Multimodal Named Entity Recognition, Multi-modal Named Entity Recognition | Code Available | 1 |
| LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension | Sep 18, 2024 | Referring Expression, Referring Expression Comprehension | Code Available | 1 |
| MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding | Apr 26, 2021 | Generalized Referring Expression Comprehension, Phrase Grounding | Code Available | 1 |
| IteRPrimE: Zero-shot Referring Image Segmentation with Iterative Grad-CAM Refinement and Primary Word Emphasis | Mar 2, 2025 | Image Segmentation, Image-text matching | Code Available | 1 |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Jun 26, 2023 | Image Captioning, In-Context Learning | Code Available | 1 |
| Multi-branch Collaborative Learning Network for 3D Visual Grounding | Jul 7, 2024 | 3D visual grounding, Referring Expression | Code Available | 1 |
| IPDN: Image-enhanced Prompt Decoding Network for 3D Referring Expression Segmentation | Jan 9, 2025 | Decoder, Referring Expression | Code Available | 1 |
| Graph-Structured Referring Expression Reasoning in The Wild | Apr 19, 2020 | Referring Expression | Code Available | 1 |
| New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration | Feb 27, 2025 | Image Comprehension, Referring Expression | Code Available | 1 |
| Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations | Jun 30, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| Iterative Shrinking for Referring Expression Grounding Using Deep Reinforcement Learning | Mar 9, 2021 | Deep Reinforcement Learning, Referring Expression | Code Available | 1 |
| Exploring Contextual Attribute Density in Referring Expression Counting | Jan 1, 2025 | Attribute, Referring Expression | Code Available | 1 |
| Exploring Contextual Attribute Density in Referring Expression Counting | Mar 16, 2025 | Attribute, Referring Expression | Code Available | 1 |
| Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation | Sep 20, 2024 | Image Segmentation, Referring Expression | Code Available | 1 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Jun 11, 2020 | Image-text Retrieval, Question Answering | Code Available | 1 |
| DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM | Mar 19, 2024 | Object, object-detection | Code Available | 1 |
| Described Object Detection: Liberating Object Detection with Flexible Expressions | Jul 24, 2023 | Binary Classification, Described Object Detection | Code Available | 1 |
| Image Segmentation Using Text and Image Prompts | Dec 18, 2021 | Decoder, Image Segmentation | Code Available | 1 |
| RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes | Feb 1, 2025 | Referring Expression, Referring Expression Comprehension | Code Available | 1 |
| FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension | Sep 23, 2024 | Image Comprehension, Referring Expression | Code Available | 1 |
| 3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation | Aug 31, 2023 | Navigate, Referring Expression | Code Available | 1 |
| Refer360°: A Referring Expression Recognition Dataset in 360° Images | Jul 1, 2020 | Referring Expression | Code Available | 1 |
| GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs | Nov 8, 2023 | Question Answering, Referring Expression | Code Available | 1 |
| Human-centric Spatio-Temporal Video Grounding With Visual Transformers | Nov 10, 2020 | Referring Expression, Sentence | Code Available | 1 |
| LAVT: Language-Aware Vision Transformer for Referring Image Segmentation | Dec 4, 2021 | Decoder, Generalized Referring Expression Segmentation | Code Available | 1 |
| NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning | Feb 1, 2025 | Referring Expression, Visual Grounding | Code Available | 1 |
| RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation | Jul 3, 2023 | Image Segmentation, Referring Expression | Code Available | 1 |
| VL-BERT: Pre-training of Generic Visual-Linguistic Representations | Aug 22, 2019 | Image-text matching, Language Modelling | Code Available | 1 |