| PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models | May 23, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| PolyFormer: Referring Image Segmentation as Sequential Polygon Generation | Feb 14, 2023 | Decoder, Image Segmentation | Code Available | 1 |
| ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension | Apr 12, 2022 | image-classification, Image Classification | Code Available | 1 |
| RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes | Feb 1, 2025 | Referring Expression, Referring Expression Comprehension | Code Available | 1 |
| RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D | Aug 23, 2023 | Object, Object Tracking | Code Available | 1 |
| Referring Transformer: A One-step Approach to Multi-task Visual Grounding | Jun 6, 2021 | Decoder, Referring Expression | Code Available | 1 |
| Talk2Car: Taking Control of Your Self-Driving Car | Sep 24, 2019 | Autonomous Driving, Object | Code Available | 1 |
| Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension | May 21, 2024 | 3D visual grounding, Referring Expression | Code Available | 1 |
| TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation | Oct 19, 2022 | Instance Segmentation, Referring Expression | Code Available | 1 |
| TransVG: End-to-End Visual Grounding with Transformers | Apr 17, 2021 | Referring Expression Comprehension, Visual Grounding | Code Available | 1 |
| TRAR: Routing the Attention Spans in Transformer for Visual Question Answering | Jan 1, 2021 | Question Answering, Referring Expression | Code Available | 1 |
| Tune-An-Ellipse: CLIP Has Potential to Find What You Want | Jan 1, 2024 | Object, Referring Expression | Code Available | 1 |
| Unifying Vision-and-Language Tasks via Text Generation | Feb 4, 2021 | Conditional Text Generation, Decoder | Code Available | 1 |
| Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE | Sep 26, 2024 | image-classification, Image Classification | Code Available | 1 |
| UNITER: UNiversal Image-TExt Representation Learning | Sep 25, 2019 | Image-text matching, Image-text Retrieval | Code Available | 1 |
| ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | Aug 6, 2019 | Image Retrieval, Question Answering | Code Available | 1 |
| VL-BERT: Pre-training of Generic Visual-Linguistic Representations | Aug 22, 2019 | Image-text matching, Language Modelling | Code Available | 1 |
| VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment | Oct 9, 2022 | object-detection, Object Detection | Code Available | 1 |
| A Survivor in the Era of Large-Scale Pretraining: An Empirical Study of One-Stage Referring Expression Comprehension | Apr 17, 2022 | Data Augmentation, Referring Expression | Code Available | 1 |
| Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions | Nov 28, 2023 | Disentanglement, Referring Expression | Code Available | 1 |
| Learning Visual Grounding from Generative Vision and Language Model | Jul 18, 2024 | Attribute, Language Modeling | Unverified | 0 |
| Revisiting Multi-Modal LLM Evaluation | Aug 9, 2024 | Chart Understanding, Optical Character Recognition | Unverified | 0 |
| ScanFormer: Referring Expression Comprehension by Iteratively Scanning | Jun 26, 2024 | Informativeness, Referring Expression | Unverified | 0 |
| Segment Anything Model for automated image data annotation: empirical studies using text prompts from Grounding DINO | Jun 27, 2024 | Image Segmentation, Medical Image Segmentation | Unverified | 0 |
| Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding | Mar 25, 2025 | Attribute, Object | Unverified | 0 |
| Dynamic Graph Attention for Referring Expression Comprehension | Sep 18, 2019 | Graph Attention, Referring Expression | Unverified | 0 |
| Dynamic Inference With Grounding Based Vision and Language Models | Jan 1, 2023 | Language Modelling, Referring Expression | Unverified | 0 |
| DViN: Dynamic Visual Routing Network for Weakly Supervised Referring Expression Comprehension | Jan 1, 2025 | Descriptive, Referring Expression | Unverified | 0 |
| Differentiated Relevances Embedding for Group-based Referring Expression Comprehension | Mar 12, 2022 | Attribute, Object | Unverified | 0 |
| ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph | Jun 30, 2020 | Attribute, Prediction | Unverified | 0 |
| Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input | Jun 25, 2023 | Diversity, Image-text Retrieval | Unverified | 0 |
| Exploring Spatial Language Grounding Through Referring Expressions | Feb 4, 2025 | Image Captioning, Negation | Unverified | 0 |
| FindIt: Generalized Localization with Natural Language Queries | Mar 31, 2022 | Natural Language Queries, Object | Unverified | 0 |
| Switching Head-Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks | Jul 14, 2023 | Object, Referring Expression | Unverified | 0 |
| FLORA: Formal Language Model Enables Robust Training-free Zero-shot Object Referring Analysis | Jan 17, 2025 | Bayesian Inference, Language Modeling | Unverified | 0 |
| Deep Fragment Embeddings for Bidirectional Image Sentence Mapping | Jun 22, 2014 | Referring Expression Comprehension, Retrieval | Unverified | 0 |
| CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding | Nov 6, 2023 | CoLA, Question Answering | Unverified | 0 |
| Synthetic Visual Genome | Jun 9, 2025 | Referring Expression, Referring Expression Comprehension | Unverified | 0 |
| GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | Mar 16, 2025 | Change Detection, Image Captioning | Unverified | 0 |
| Giving Commands to a Self-driving Car: A Multimodal Reasoner for Visual Grounding | Mar 19, 2020 | Object, Referring Expression Comprehension | Unverified | 0 |
| Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension | Mar 1, 2020 | Referring Expression, Referring Expression Comprehension | Unverified | 0 |
| Harlequin: Color-driven Generation of Synthetic Data for Referring Expression Comprehension | Nov 22, 2024 | Referring Expression, Referring Expression Comprehension | Unverified | 0 |
| Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension | Jan 2, 2025 | Generalized Referring Expression Comprehension, Generalized Referring Expression Segmentation | Unverified | 0 |
| Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training | Mar 4, 2024 | Math, Phrase Grounding | Unverified | 0 |
| Video Referring Expression Comprehension via Transformer with Content-conditioned Query | Oct 25, 2023 | cross-modal alignment, Referring Expression | Unverified | 0 |
| Task-aware Cross-modal Feature Refinement Transformer with Large Language Models for Visual Grounding | Jan 1, 2025 | Referring Expression, Referring Expression Comprehension | Unverified | 0 |
| Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving | May 25, 2023 | 3D Object Detection, Autonomous Driving | Unverified | 0 |
| Language-Mediated, Object-Centric Representation Learning | Dec 31, 2020 | Object, Object Discovery | Unverified | 0 |
| Text-driven Affordance Learning from Egocentric Vision | Apr 3, 2024 | Referring Expression, Referring Expression Comprehension | Unverified | 0 |
| Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection | Dec 4, 2023 | Image to text, object-detection | Unverified | 0 |