| PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models | May 23, 2022 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| PolyFormer: Referring Image Segmentation as Sequential Polygon Generation | Feb 14, 2023 | DecoderImage Segmentation | CodeCode Available | 1 | 5 |
| ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension | Apr 12, 2022 | image-classificationImage Classification | CodeCode Available | 1 | 5 |
| RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes | Feb 1, 2025 | Referring ExpressionReferring Expression Comprehension | CodeCode Available | 1 | 5 |
| RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D | Aug 23, 2023 | ObjectObject Tracking | CodeCode Available | 1 | 5 |
| Referring Transformer: A One-step Approach to Multi-task Visual Grounding | Jun 6, 2021 | DecoderReferring Expression | CodeCode Available | 1 | 5 |
| Talk2Car: Taking Control of Your Self-Driving Car | Sep 24, 2019 | Autonomous DrivingObject | CodeCode Available | 1 | 5 |
| Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension | May 21, 2024 | 3D visual groundingReferring Expression | CodeCode Available | 1 | 5 |
| TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation | Oct 19, 2022 | Instance SegmentationReferring Expression | CodeCode Available | 1 | 5 |
| TransVG: End-to-End Visual Grounding with Transformers | Apr 17, 2021 | Referring Expression ComprehensionVisual Grounding | CodeCode Available | 1 | 5 |
| TRAR: Routing the Attention Spans in Transformer for Visual Question Answering | Jan 1, 2021 | Question AnsweringReferring Expression | CodeCode Available | 1 | 5 |
| Tune-An-Ellipse: CLIP Has Potential to Find What You Want | Jan 1, 2024 | ObjectReferring Expression | CodeCode Available | 1 | 5 |
| Unifying Vision-and-Language Tasks via Text Generation | Feb 4, 2021 | Conditional Text GenerationDecoder | CodeCode Available | 1 | 5 |
| Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE | Sep 26, 2024 | image-classificationImage Classification | CodeCode Available | 1 | 5 |
| UNITER: UNiversal Image-TExt Representation Learning | Sep 25, 2019 | Image-text matchingImage-text Retrieval | CodeCode Available | 1 | 5 |
| ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | Aug 6, 2019 | Image RetrievalQuestion Answering | CodeCode Available | 1 | 5 |
| VL-BERT: Pre-training of Generic Visual-Linguistic Representations | Aug 22, 2019 | Image-text matchingLanguage Modelling | CodeCode Available | 1 | 5 |
| VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment | Oct 9, 2022 | object-detectionObject Detection | CodeCode Available | 1 | 5 |
| A Survivor in the Era of Large-Scale Pretraining: An Empirical Study of One-Stage Referring Expression Comprehension | Apr 17, 2022 | Data AugmentationReferring Expression | CodeCode Available | 1 | 5 |
| Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions | Nov 28, 2023 | DisentanglementReferring Expression | CodeCode Available | 1 | 5 |
| Towards Language-guided Visual Recognition via Dynamic Convolutions | Oct 17, 2021 | Question AnsweringReferring Expression | CodeCode Available | 0 | 5 |
| Continual Referring Expression Comprehension via Dual Modular Memorization | Nov 25, 2023 | MemorizationReferring Expression | CodeCode Available | 0 | 5 |
| Revisiting Counterfactual Problems in Referring Expression Comprehension | Jan 1, 2024 | AttributeContrastive Learning | CodeCode Available | 0 | 5 |
| OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | Feb 7, 2022 | Image Captioningimage-classification | CodeCode Available | 0 | 5 |
| CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions | Jan 3, 2019 | DiagnosticImage Segmentation | CodeCode Available | 0 | 5 |
| Collecting Visually-Grounded Dialogue with A Game Of Sorts | Sep 10, 2023 | Coreference ResolutionImage Retrieval | CodeCode Available | 0 | 5 |
| Scene-Text Oriented Reffering Expression Comprehension | Nov 4, 2022 | Object LocalizationReferring Expression | CodeCode Available | 0 | 5 |
| Give Me Something to Eat: Referring Expression Comprehension with Commonsense Knowledge | Jun 2, 2020 | 16kReferring Expression | CodeCode Available | 0 | 5 |
| MAttNet: Modular Attention Network for Referring Expression Comprehension | Jan 24, 2018 | Generalized Referring Expression SegmentationReferring Expression | CodeCode Available | 0 | 5 |
| Cosine meets Softmax: A tough-to-beat baseline for visual grounding | Sep 13, 2020 | Autonomous DrivingMetric Learning | CodeCode Available | 0 | 5 |
| Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models | Oct 21, 2024 | Instruction Followingobject-detection | CodeCode Available | 0 | 5 |
| Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models | Nov 24, 2023 | AllReferring Expression | CodeCode Available | 0 | 5 |
| Adversarial Robustness for Visual Grounding of Multimodal Large Language Models | May 16, 2024 | Adversarial AttackAdversarial Robustness | CodeCode Available | 0 | 5 |
| CK-Transformer: Commonsense Knowledge Enhanced Transformers for Referring Expression Comprehension | Feb 17, 2023 | Referring ExpressionReferring Expression Comprehension | CodeCode Available | 0 | 5 |
| A Lightweight Modular Framework for Low-Cost Open-Vocabulary Object Detection Training | Aug 20, 2024 | Autonomous VehiclesComputational Efficiency | CodeCode Available | 0 | 5 |
| HuBo-VLM: Unified Vision-Language Model designed for HUman roBOt interaction tasks | Aug 24, 2023 | Language ModelingLanguage Modelling | CodeCode Available | 0 | 5 |
| A Joint Speaker-Listener-Reinforcer Model for Referring Expressions | Dec 30, 2016 | Referring ExpressionReferring Expression Comprehension | CodeCode Available | 0 | 5 |
| Whether you can locate or not? Interactive Referring Expression Generation | Aug 19, 2023 | Referring ExpressionReferring Expression Comprehension | CodeCode Available | 0 | 5 |
| Enhancing Visual Grounding and Generalization: A Multi-Task Cycle Training Approach for Vision-Language Models | Nov 21, 2023 | Image SegmentationLanguage Modelling | CodeCode Available | 0 | 5 |
| A Real-time Global Inference Network for One-stage Referring Expression Comprehension | Dec 7, 2019 | Diversityfeature selection | CodeCode Available | 0 | 5 |
| Language Adaptive Weight Generation for Multi-task Visual Grounding | Jun 6, 2023 | Referring ExpressionReferring Expression Comprehension | CodeCode Available | 0 | 5 |
| Language-Conditioned Feature Pyramids for Visual Selection Tasks | Nov 1, 2020 | Referring ExpressionReferring Expression Comprehension | CodeCode Available | 0 | 5 |
| Language-Conditioned Graph Networks for Relational Reasoning | May 10, 2019 | ObjectReferring Expression Comprehension | CodeCode Available | 0 | 5 |
| Understanding Synonymous Referring Expressions via Contrastive Features | Apr 20, 2021 | ObjectReferring Expression | CodeCode Available | 0 | 5 |
| Referring Expression Comprehension Using Language Adaptive Inference | Jun 6, 2023 | object-detectionObject Detection | CodeCode Available | 0 | 5 |
| WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation | May 24, 2025 | Contrastive LearningReferring Expression | CodeCode Available | 0 | 5 |
| Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation | May 24, 2021 | Referring ExpressionReferring Expression Comprehension | CodeCode Available | 0 | 5 |
| Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos | Sep 21, 2022 | Action DetectionAction Recognition | CodeCode Available | 0 | 5 |
| AttnGrounder: Talking to Cars with Attention | Sep 11, 2020 | Referring Expression ComprehensionVisual Grounding | CodeCode Available | 0 | 5 |
| Natural Language Object Retrieval | Nov 13, 2015 | Image CaptioningImage Retrieval | CodeCode Available | 0 | 5 |