| PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models | May 23, 2022 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| PolyFormer: Referring Image Segmentation as Sequential Polygon Generation | Feb 14, 2023 | DecoderImage Segmentation | CodeCode Available | 1 | 5 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Jun 11, 2020 | Image-text RetrievalQuestion Answering | CodeCode Available | 1 | 5 |
| Multi-branch Collaborative Learning Network for 3D Visual Grounding | Jul 7, 2024 | 3D visual groundingReferring Expression | CodeCode Available | 1 | 5 |
| An Open and Comprehensive Pipeline for Unified Object Grounding and Detection | Jan 4, 2024 | Described Object DetectionPhrase Grounding | CodeCode Available | 1 | 5 |
| MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding | Apr 26, 2021 | Generalized Referring Expression ComprehensionPhrase Grounding | CodeCode Available | 1 | 5 |
| Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations | Jun 30, 2022 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| InstructDET: Diversifying Referring Object Detection with Generalized Instructions | Oct 8, 2023 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Jun 26, 2023 | Image CaptioningIn-Context Learning | CodeCode Available | 1 | 5 |
| ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension | Apr 12, 2022 | image-classificationImage Classification | CodeCode Available | 1 | 5 |