| GSVA: Generalized Segmentation via Multimodal Large Language Models | Dec 15, 2023 | Decoder, Generalized Referring Expression Segmentation | Code Available | 1 |
| Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation | Dec 13, 2023 | Descriptive, Object | Code Available | 1 |
| Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions | Nov 28, 2023 | Disentanglement, Referring Expression | Code Available | 1 |
| GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs | Nov 8, 2023 | Question Answering, Referring Expression | Code Available | 1 |
| Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs | Oct 1, 2023 | Referring Expression | Code Available | 1 |
| 3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation | Aug 31, 2023 | Navigate, Referring Expression | Code Available | 1 |
| A Unified Framework for 3D Point Cloud Visual Grounding | Aug 23, 2023 | CPU, GPU | Code Available | 1 |
| RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D | Aug 23, 2023 | Object, Object Tracking | Code Available | 1 |
| March in Chat: Interactive Prompting for Remote Embodied Referring Expression | Aug 20, 2023 | Referring Expression, Vision and Language Navigation | Code Available | 1 |
| Described Object Detection: Liberating Object Detection with Flexible Expressions | Jul 24, 2023 | Binary Classification, Described Object Detection | Code Available | 1 |
| RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation | Jul 3, 2023 | Image Segmentation, Referring Expression | Code Available | 1 |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Jun 26, 2023 | Image Captioning, In-Context Learning | Code Available | 1 |
| Advancing Referring Expression Segmentation Beyond Single Image | May 21, 2023 | Co-Salient Object Detection, Object | Code Available | 1 |
| Zero-shot Referring Image Segmentation with Global-Local Context Features | Mar 31, 2023 | Image Segmentation, Referring Expression | Code Available | 1 |
| NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations | Mar 23, 2023 | Question Answering, Referring Expression | Code Available | 1 |
| Layout-aware Dreamer for Embodied Referring Expression Grounding | Nov 30, 2022 | Common Sense Reasoning, Navigate | Code Available | 1 |
| TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation | Oct 19, 2022 | Instance Segmentation, Referring Expression | Code Available | 1 |
| SQA3D: Situated Question Answering in 3D Scenes | Oct 14, 2022 | Question Answering, Referring Expression | Code Available | 1 |
| VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment | Oct 9, 2022 | object-detection, Object Detection | Code Available | 1 |
| Learning to Evaluate Performance of Multi-modal Semantic Localization | Sep 14, 2022 | Cross-Modal Retrieval, Referring Expression | Code Available | 1 |
| Correspondence Matters for Video Referring Expression Comprehension | Jul 21, 2022 | Contrastive Learning, Referring Expression | Code Available | 1 |
| Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations | Jun 30, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models | May 23, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| GRIT: General Robust Image Task Benchmark | Apr 28, 2022 | Instance Segmentation, Keypoint Detection | Code Available | 1 |
| A Survivor in the Era of Large-Scale Pretraining: An Empirical Study of One-Stage Referring Expression Comprehension | Apr 17, 2022 | Data Augmentation, Referring Expression | Code Available | 1 |
| The Project Dialogism Novel Corpus: A Dataset for Quotation Attribution in Literary Texts | Apr 12, 2022 | Referring Expression | Code Available | 1 |
| ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension | Apr 12, 2022 | image-classification, Image Classification | Code Available | 1 |
| SeqTR: A Simple yet Universal Network for Visual Grounding | Mar 30, 2022 | Decoder, Referring Expression | Code Available | 1 |
| Image Segmentation Using Text and Image Prompts | Dec 18, 2021 | Decoder, Image Segmentation | Code Available | 1 |
| LAVT: Language-Aware Vision Transformer for Referring Image Segmentation | Dec 4, 2021 | Decoder, Generalized Referring Expression Segmentation | Code Available | 1 |
| Airbert: In-domain Pretraining for Vision-and-Language Navigation | Aug 20, 2021 | Navigate, Referring Expression | Code Available | 1 |
| Room-and-Object Aware Knowledge Reasoning for Remote Embodied Referring Expression | Jun 19, 2021 | Instruction Following, Navigate | Code Available | 1 |
| Discriminative Triad Matching and Reconstruction for Weakly Referring Expression Grounding | Jun 8, 2021 | Referring Expression, Sentence | Code Available | 1 |
| Referring Transformer: A One-step Approach to Multi-task Visual Grounding | Jun 6, 2021 | Decoder, Referring Expression | Code Available | 1 |
| MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding | Apr 26, 2021 | Generalized Referring Expression Comprehension, Phrase Grounding | Code Available | 1 |
| OCID-Ref: A 3D Robotic Dataset with Embodied Language for Clutter Scene Grounding | Mar 13, 2021 | Referring Expression, Referring Expression Segmentation | Code Available | 1 |
| Iterative Shrinking for Referring Expression Grounding Using Deep Reinforcement Learning | Mar 9, 2021 | Deep Reinforcement Learning, Referring Expression | Code Available | 1 |
| Unifying Vision-and-Language Tasks via Text Generation | Feb 4, 2021 | Conditional Text Generation, Decoder | Code Available | 1 |
| TRAR: Routing the Attention Spans in Transformer for Visual Question Answering | Jan 1, 2021 | Question Answering, Referring Expression | Code Available | 1 |
| A Recurrent Vision-and-Language BERT for Navigation | Nov 26, 2020 | Decision Making, Decoder | Code Available | 1 |
| Human-centric Spatio-Temporal Video Grounding With Visual Transformers | Nov 10, 2020 | Referring Expression, Sentence | Code Available | 1 |
| Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding | Sep 3, 2020 | Referring Expression, Vocal Bursts Valence Prediction | Code Available | 1 |
| URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark | Aug 1, 2020 | Object, One-shot visual object segmentation | Code Available | 1 |
| Weakly supervised one-stage vision and language disease detection using large scale pneumonia and pneumothorax studies | Jul 31, 2020 | Head Detection, Referring Expression | Code Available | 1 |
| Refer360°: A Referring Expression Recognition Dataset in 360° Images | Jul 1, 2020 | Referring Expression | Code Available | 1 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Jun 11, 2020 | Image-text Retrieval, Question Answering | Code Available | 1 |
| Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions | May 4, 2020 | Contrastive Learning, Multi-Task Learning | Code Available | 1 |
| Graph-Structured Referring Expression Reasoning in The Wild | Apr 19, 2020 | Referring Expression | Code Available | 1 |
| Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation | Mar 19, 2020 | Generalized Referring Expression Comprehension, Referring Expression | Code Available | 1 |
| UNITER: UNiversal Image-TExt Representation Learning | Sep 25, 2019 | Image-text matching, Image-text Retrieval | Code Available | 1 |