| Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Jun 15, 2022 | Described Object Detection, Image Captioning | Code Available | 1 |
| PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models | May 23, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| A Survivor in the Era of Large-Scale Pretraining: An Empirical Study of One-Stage Referring Expression Comprehension | Apr 17, 2022 | Data Augmentation, Referring Expression | Code Available | 1 |
| ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension | Apr 12, 2022 | image-classification, Image Classification | Code Available | 1 |
| SeqTR: A Simple yet Universal Network for Visual Grounding | Mar 30, 2022 | Decoder, Referring Expression | Code Available | 1 |
| Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds | Dec 16, 2021 | Object, object-detection | Code Available | 1 |
| Referring Transformer: A One-step Approach to Multi-task Visual Grounding | Jun 6, 2021 | Decoder, Referring Expression | Code Available | 1 |
| MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding | Apr 26, 2021 | Generalized Referring Expression Comprehension, Phrase Grounding | Code Available | 1 |
| TransVG: End-to-End Visual Grounding with Transformers | Apr 17, 2021 | Referring Expression Comprehension, Visual Grounding | Code Available | 1 |
| Unifying Vision-and-Language Tasks via Text Generation | Feb 4, 2021 | Conditional Text Generation, Decoder | Code Available | 1 |
| TRAR: Routing the Attention Spans in Transformer for Visual Question Answering | Jan 1, 2021 | Question Answering, Referring Expression | Code Available | 1 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Jun 11, 2020 | Image-text Retrieval, Question Answering | Code Available | 1 |
| Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation | Mar 19, 2020 | Generalized Referring Expression Comprehension, Referring Expression | Code Available | 1 |
| UNITER: UNiversal Image-TExt Representation Learning | Sep 25, 2019 | Image-text matching, Image-text Retrieval | Code Available | 1 |
| Talk2Car: Taking Control of Your Self-Driving Car | Sep 24, 2019 | Autonomous Driving, Object | Code Available | 1 |
| VL-BERT: Pre-training of Generic Visual-Linguistic Representations | Aug 22, 2019 | Image-text matching, Language Modelling | Code Available | 1 |
| A Fast and Accurate One-Stage Approach to Visual Grounding | Aug 18, 2019 | Referring Expression, Referring Expression Comprehension | Code Available | 1 |
| ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | Aug 6, 2019 | Image Retrieval, Question Answering | Code Available | 1 |
| Explainable Neural Computation via Stack Neural Module Networks | Jul 23, 2018 | Decision Making, Question Answering | Code Available | 1 |
| Compositional Attention Networks for Machine Reasoning | Mar 8, 2018 | Referring Expression Comprehension, Visual Question Answering (VQA) | Code Available | 1 |
| Referring Expression Instance Retrieval and A Strong End-to-End Baseline | Jun 23, 2025 | Image Retrieval, Referring Expression | —Unverified | 0 |
| Synthetic Visual Genome | Jun 9, 2025 | Referring Expression, Referring Expression Comprehension | —Unverified | 0 |
| WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation | May 24, 2025 | Contrastive Learning, Referring Expression | Code Available | 0 |
| Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding | Mar 25, 2025 | Attribute, Object | —Unverified | 0 |
| GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | Mar 16, 2025 | Change Detection, Image Captioning | —Unverified | 0 |
| Exploring Spatial Language Grounding Through Referring Expressions | Feb 4, 2025 | Image Captioning, Negation | —Unverified | 0 |
| FLORA: Formal Language Model Enables Robust Training-free Zero-shot Object Referring Analysis | Jan 17, 2025 | Bayesian Inference, Language Modeling | —Unverified | 0 |
| Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks | Jan 14, 2025 | Language Modeling, Language Modelling | —Unverified | 0 |
| Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension | Jan 2, 2025 | Generalized Referring Expression Comprehension, Generalized Referring Expression Segmentation | —Unverified | 0 |
| DViN: Dynamic Visual Routing Network for Weakly Supervised Referring Expression Comprehension | Jan 1, 2025 | Descriptive, Referring Expression | —Unverified | 0 |
| Task-aware Cross-modal Feature Refinement Transformer with Large Language Models for Visual Grounding | Jan 1, 2025 | Referring Expression, Referring Expression Comprehension | —Unverified | 0 |
| Harlequin: Color-driven Generation of Synthetic Data for Referring Expression Comprehension | Nov 22, 2024 | Referring Expression, Referring Expression Comprehension | —Unverified | 0 |
| Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models | Oct 21, 2024 | Instruction Following, object-detection | Code Available | 0 |
| Make Graph-based Referring Expression Comprehension Great Again through Expression-guided Dynamic Gating and Regression | Sep 5, 2024 | Referring Expression, Referring Expression Comprehension | —Unverified | 0 |
| A Lightweight Modular Framework for Low-Cost Open-Vocabulary Object Detection Training | Aug 20, 2024 | Autonomous Vehicles, Computational Efficiency | Code Available | 0 |
| Revisiting Multi-Modal LLM Evaluation | Aug 9, 2024 | Chart Understanding, Optical Character Recognition | —Unverified | 0 |
| MaskInversion: Localized Embeddings via Optimization of Explainability Maps | Jul 29, 2024 | Image Generation, Referring Expression | —Unverified | 0 |
| Learning Visual Grounding from Generative Vision and Language Model | Jul 18, 2024 | Attribute, Language Modeling | —Unverified | 0 |
| The Solution for the 5th GCAIAC Zero-shot Referring Expression Comprehension Challenge | Jul 6, 2024 | Referring Expression, Referring Expression Comprehension | —Unverified | 0 |
| M^2IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension | Jul 1, 2024 | GPU, Referring Expression | —Unverified | 0 |
| Segment Anything Model for automated image data annotation: empirical studies using text prompts from Grounding DINO | Jun 27, 2024 | Image Segmentation, Medical Image Segmentation | —Unverified | 0 |
| ScanFormer: Referring Expression Comprehension by Iteratively Scanning | Jun 26, 2024 | Informativeness, Referring Expression | —Unverified | 0 |
| Adversarial Robustness for Visual Grounding of Multimodal Large Language Models | May 16, 2024 | Adversarial Attack, Adversarial Robustness | Code Available | 0 |
| Text-driven Affordance Learning from Egocentric Vision | Apr 3, 2024 | Referring Expression, Referring Expression Comprehension | —Unverified | 0 |
| PropTest: Automatic Property Testing for Improved Visual Programming | Mar 25, 2024 | Question Answering, Referring Expression | —Unverified | 0 |
| WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar | Mar 19, 2024 | Autonomous Navigation, Referring Expression | —Unverified | 0 |
| Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training | Mar 4, 2024 | Math, Phrase Grounding | —Unverified | 0 |
| Revisiting Counterfactual Problems in Referring Expression Comprehension | Jan 1, 2024 | Attribute, Contrastive Learning | Code Available | 0 |
| Compositional Zero-Shot Learning for Attribute-Based Object Reference in Human-Robot Interaction | Dec 21, 2023 | 16k, Attribute | —Unverified | 0 |
| Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects | Dec 8, 2023 | Image Captioning, object-detection | —Unverified | 0 |