| Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input | Jun 25, 2023 | Diversity, Image-text Retrieval | —Unverified | 0 |
| Exploring Spatial Language Grounding Through Referring Expressions | Feb 4, 2025 | Image Captioning, Negation | —Unverified | 0 |
| FindIt: Generalized Localization with Natural Language Queries | Mar 31, 2022 | Natural Language Queries, Object | —Unverified | 0 |
| Switching Head-Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks | Jul 14, 2023 | Object, Referring Expression | —Unverified | 0 |
| FLORA: Formal Language Model Enables Robust Training-free Zero-shot Object Referring Analysis | Jan 17, 2025 | Bayesian Inference, Language Modeling | —Unverified | 0 |
| Deep Fragment Embeddings for Bidirectional Image Sentence Mapping | Jun 22, 2014 | Referring Expression Comprehension, Retrieval | —Unverified | 0 |
| CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding | Nov 6, 2023 | CoLA, Question Answering | —Unverified | 0 |
| Synthetic Visual Genome | Jun 9, 2025 | Referring Expression, Referring Expression Comprehension | —Unverified | 0 |
| GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | Mar 16, 2025 | Change Detection, Image Captioning | —Unverified | 0 |
| Giving Commands to a Self-driving Car: A Multimodal Reasoner for Visual Grounding | Mar 19, 2020 | Object, Referring Expression Comprehension | —Unverified | 0 |