| Webly Supervised Concept Expansion for General Purpose Vision Models | Feb 4, 2022 | Human-Object Interaction Detection, Image Retrieval | —Unverified | 0 |
| MUTATT: Visual-Textual Mutual Guidance for Referring Expression Comprehension | Mar 18, 2020 | Referring Expression, Referring Expression Comprehension | —Unverified | 0 |
| Neighbourhood Watch: Referring Expression Comprehension via Language-guided Graph Attention Networks | Dec 12, 2018 | Graph Attention, Object | —Unverified | 0 |
| Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks | Jun 17, 2022 | Depth Estimation, Image Generation | —Unverified | 0 |
| Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks | Jan 14, 2025 | Language Modeling, Language Modelling | —Unverified | 0 |
| One for All: One-stage Referring Expression Comprehension with Dynamic Reasoning | Jul 31, 2022 | All, Referring Expression | —Unverified | 0 |
| Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos | Mar 23, 2021 | Referring Expression, Referring Expression Comprehension | —Unverified | 0 |
| Parallel Attention: A Unified Framework for Visual Object Discovery through Dialogs and Queries | Nov 17, 2017 | Object, Object Discovery | —Unverified | 0 |
| Playing Lottery Tickets with Vision and Language | Apr 23, 2021 | Image-text Retrieval, Question Answering | —Unverified | 0 |
| VQD: Visual Query Detection in Natural Scenes | Apr 4, 2019 | Referring Expression, Referring Expression Comprehension | —Unverified | 0 |