| Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations | Jun 30, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks | Jun 17, 2022 | Depth Estimation, Image Generation | Unverified | 0 |
| RefCrowd: Grounding the Target in Crowd with Referring Expressions | Jun 16, 2022 | Attribute, Referring Expression | Unverified | 0 |
| Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Jun 15, 2022 | Described Object Detection, Image Captioning | Code Available | 1 |
| PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models | May 23, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension | Apr 21, 2022 | Diversity, Informativeness | Unverified | 0 |
| A Survivor in the Era of Large-Scale Pretraining: An Empirical Study of One-Stage Referring Expression Comprehension | Apr 17, 2022 | Data Augmentation, Referring Expression | Code Available | 1 |
| ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension | Apr 12, 2022 | image-classification, Image Classification | Code Available | 1 |
| FindIt: Generalized Localization with Natural Language Queries | Mar 31, 2022 | Natural Language Queries, Object | Unverified | 0 |
| SeqTR: A Simple yet Universal Network for Visual Grounding | Mar 30, 2022 | Decoder, Referring Expression | Code Available | 1 |
| Differentiated Relevances Embedding for Group-based Referring Expression Comprehension | Mar 12, 2022 | Attribute, Object | Unverified | 0 |
| OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | Feb 7, 2022 | Image Captioning, image-classification | Code Available | 0 |
| Webly Supervised Concept Expansion for General Purpose Vision Models | Feb 4, 2022 | Human-Object Interaction Detection, Image Retrieval | Unverified | 0 |
| Lite-MDETR: A Lightweight Multi-Modal Detector | Jan 1, 2022 | object-detection, Object Detection | Unverified | 0 |
| Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds | Dec 16, 2021 | Object, object-detection | Code Available | 1 |
| ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension | Nov 16, 2021 | image-classification, Image Classification | Unverified | 0 |
| Evaluating and Improving Interactions with Hazy Oracles | Oct 19, 2021 | Object Tracking, Referring Expression | Unverified | 0 |
| Towards Language-guided Visual Recognition via Dynamic Convolutions | Oct 17, 2021 | Question Answering, Referring Expression | Code Available | 0 |
| Referring Transformer: A One-step Approach to Multi-task Visual Grounding | Jun 6, 2021 | Decoder, Referring Expression | Code Available | 1 |
| Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation | May 24, 2021 | Referring Expression, Referring Expression Comprehension | Code Available | 0 |
| Proposal-free One-stage Referring Expression via Grid-Word Cross-Attention | May 5, 2021 | Question Answering, Referring Expression | Unverified | 0 |
| MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding | Apr 26, 2021 | Generalized Referring Expression Comprehension, Phrase Grounding | Code Available | 1 |
| Playing Lottery Tickets with Vision and Language | Apr 23, 2021 | Image-text Retrieval, Question Answering | Unverified | 0 |
| Understanding Synonymous Referring Expressions via Contrastive Features | Apr 20, 2021 | Object, Referring Expression | Code Available | 0 |
| TransVG: End-to-End Visual Grounding with Transformers | Apr 17, 2021 | Referring Expression Comprehension, Visual Grounding | Code Available | 1 |
| Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos | Mar 23, 2021 | Referring Expression, Referring Expression Comprehension | Unverified | 0 |
| Unifying Vision-and-Language Tasks via Text Generation | Feb 4, 2021 | Conditional Text Generation, Decoder | Code Available | 1 |
| MDETR - Modulated Detection for End-to-End Multi-Modal Understanding | Jan 1, 2021 | Phrase Grounding, Question Answering | Code Available | 2 |
| TRAR: Routing the Attention Spans in Transformer for Visual Question Answering | Jan 1, 2021 | Question Answering, Referring Expression | Code Available | 1 |
| Language-Mediated, Object-Centric Representation Learning | Dec 31, 2020 | Object, Object Discovery | Unverified | 0 |
| PPGN: Phrase-Guided Proposal Generation Network For Referring Expression Comprehension | Dec 20, 2020 | Referring Expression, Referring Expression Comprehension | Unverified | 0 |
| Modular Graph Attention Network for Complex Visual Relational Reasoning | Nov 22, 2020 | Graph Attention, Question Answering | Unverified | 0 |
| ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments | Nov 15, 2020 | Referring Expression, Referring Expression Comprehension | Unverified | 0 |
| Language-Conditioned Feature Pyramids for Visual Selection Tasks | Nov 1, 2020 | Referring Expression, Referring Expression Comprehension | Code Available | 0 |
| Commands 4 Autonomous Vehicles (C4AV) Workshop Summary | Sep 18, 2020 | Autonomous Vehicles, Referring Expression Comprehension | Unverified | 0 |
| Cosine meets Softmax: A tough-to-beat baseline for visual grounding | Sep 13, 2020 | Autonomous Driving, Metric Learning | Code Available | 0 |
| AttnGrounder: Talking to Cars with Attention | Sep 11, 2020 | Referring Expression Comprehension, Visual Grounding | Code Available | 0 |
| Referring Expression Comprehension: A Survey of Methods and Datasets | Jul 19, 2020 | object-detection, Object Detection | Unverified | 0 |
| ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph | Jun 30, 2020 | Attribute, Prediction | Unverified | 0 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Jun 11, 2020 | Image-text Retrieval, Question Answering | Code Available | 1 |
| Give Me Something to Eat: Referring Expression Comprehension with Commonsense Knowledge | Jun 2, 2020 | 16k, Referring Expression | Code Available | 0 |
| Leveraging Non-Specialists for Accurate and Time Efficient AMR Annotation | May 1, 2020 | Referring Expression, Referring Expression Comprehension | Unverified | 0 |
| Giving Commands to a Self-driving Car: A Multimodal Reasoner for Visual Grounding | Mar 19, 2020 | Object, Referring Expression Comprehension | Unverified | 0 |
| Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation | Mar 19, 2020 | Generalized Referring Expression Comprehension, Referring Expression | Code Available | 1 |
| MUTATT: Visual-Textual Mutual Guidance for Referring Expression Comprehension | Mar 18, 2020 | Referring Expression, Referring Expression Comprehension | Unverified | 0 |
| Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension | Mar 1, 2020 | Referring Expression, Referring Expression Comprehension | Unverified | 0 |
| A Real-time Global Inference Network for One-stage Referring Expression Comprehension | Dec 7, 2019 | Diversity, feature selection | Code Available | 0 |
| UNITER: Learning UNiversal Image-TExt Representations | Sep 25, 2019 | Image-text matching, Image-text Retrieval | Unverified | 0 |
| UNITER: UNiversal Image-TExt Representation Learning | Sep 25, 2019 | Image-text matching, Image-text Retrieval | Code Available | 1 |
| Talk2Car: Taking Control of Your Self-Driving Car | Sep 24, 2019 | Autonomous Driving, Object | Code Available | 1 |