| MDETR - Modulated Detection for End-to-End Multi-Modal Understanding | Jan 1, 2021 | Phrase GroundingQuestion Answering | CodeCode Available | 2 |
| Frontiers in Intelligent Colonoscopy | Oct 22, 2024 | Image Captioning | CodeCode Available | 2 |
| LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | Feb 15, 2024 | Grounded Multimodal Named Entity RecognitionMulti-modal Named Entity Recognition | CodeCode Available | 1 |
| Learning to Evaluate Performance of Multi-modal Semantic Localization | Sep 14, 2022 | Cross-Modal RetrievalReferring Expression | CodeCode Available | 1 |
| LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension | Sep 18, 2024 | Referring ExpressionReferring Expression Comprehension | CodeCode Available | 1 |
| Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Jun 15, 2022 | Described Object DetectionImage Captioning | CodeCode Available | 1 |
| Described Object Detection: Liberating Object Detection with Flexible Expressions | Jul 24, 2023 | Binary ClassificationDescribed Object Detection | CodeCode Available | 1 |
| DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM | Mar 19, 2024 | Objectobject-detection | CodeCode Available | 1 |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Jun 26, 2023 | Image CaptioningIn-Context Learning | CodeCode Available | 1 |
| Explainable Neural Computation via Stack Neural Module Networks | Jul 23, 2018 | Decision MakingQuestion Answering | CodeCode Available | 1 |