| Learning to Evaluate Performance of Multi-modal Semantic Localization | Sep 14, 2022 | Cross-Modal RetrievalReferring Expression | CodeCode Available | 1 | 5 |
| PolyFormer: Referring Image Segmentation as Sequential Polygon Generation | Feb 14, 2023 | DecoderImage Segmentation | CodeCode Available | 1 | 5 |
| New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration | Feb 27, 2025 | Image ComprehensionReferring Expression | CodeCode Available | 1 | 5 |
| NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations | Mar 23, 2023 | Question AnsweringReferring Expression | CodeCode Available | 1 | 5 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Jun 11, 2020 | Image-text RetrievalQuestion Answering | CodeCode Available | 1 | 5 |
| An Open and Comprehensive Pipeline for Unified Object Grounding and Detection | Jan 4, 2024 | Described Object DetectionPhrase Grounding | CodeCode Available | 1 | 5 |
| Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations | Jun 30, 2022 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| InstructDET: Diversifying Referring Object Detection with Generalized Instructions | Oct 8, 2023 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Jun 26, 2023 | Image CaptioningIn-Context Learning | CodeCode Available | 1 | 5 |
| MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding | Apr 26, 2021 | Generalized Referring Expression ComprehensionPhrase Grounding | CodeCode Available | 1 | 5 |