VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding | Jun 18, 2024 | Image Captioning, Question Answering
The Revolution of Multimodal Large Language Models: A Survey | Feb 19, 2024 | Image Generation, Instruction Following
InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition | May 21, 2025 | Earth Observation, Object
Interpreting Object-level Foundation Models via Visual Precision Search | Nov 25, 2024 | Explainable Artificial Intelligence (XAI), Object
SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding | Jul 3, 2024 | object-detection, Object Detection
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories | Mar 11, 2025 | Decision Making, Interactive Segmentation
Aligning and Prompting Everything All at Once for Universal Visual Perception | Dec 4, 2023 | All, Object
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language | Jun 9, 2024 | Contrastive Learning, Cross-Modal Retrieval
Referring Image Matting | Jun 10, 2022 | Domain Generalization, Image Matting
F-LMM: Grounding Frozen Large Multimodal Models | Jun 9, 2024 | General Knowledge, Instruction Following
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | Apr 25, 2024 | Visual Grounding, Visual Question Answering
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | Jul 17, 2023 | Instruction Following, Sentence
RefMask3D: Language-Guided Transformer for 3D Referring Segmentation | Jul 25, 2024 | 3D visual grounding, Image Segmentation
NExT-Chat: An LMM for Chat, Detection and Segmentation | Nov 8, 2023 | Referring Expression, Referring Expression Segmentation
One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts | Dec 28, 2023 | All, Anatomy
ChatterBox: Multi-round Multimodal Referring and Grounding | Jan 24, 2024 | Language Modeling, Language Modelling
DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World | Jun 30, 2025 | Caption Generation, Object
MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis | Mar 22, 2024 | Medical Diagnosis, Medical Visual Question Answering
Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention | Jan 1, 2025 | Hallucination, Response Generation
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory | Mar 19, 2024 | Adversarial Text, Diversity
Visual Grounding Methods for VQA are Working for the Wrong Reasons! | Apr 12, 2020 | Question Answering, Visual Grounding
Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling | Mar 21, 2024 | Grounded language learning, Language Acquisition
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | Feb 15, 2024 | Grounded Multimodal Named Entity Recognition, Multi-modal Named Entity Recognition
Learning Cross-modal Context Graph for Visual Grounding | Nov 20, 2019 | Graph Matching, Graph Neural Network
Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions | Feb 17, 2024 | Visual Grounding
Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter | Nov 9, 2023 | Object, Visual Grounding
Learning Cross-modal Context Graph for Visual Grounding | Feb 13, 2020 | Graph Matching, Graph Neural Network
Local-Global Context Aware Transformer for Language-Guided Video Segmentation | Mar 18, 2022 | Referring Expression Segmentation, Referring Video Object Segmentation
Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision | Jul 23, 2023 | Decoder, Visual Grounding
Instruction-Following Agents with Multimodal Transformer | Oct 24, 2022 | Instruction Following, Visual Grounding
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding | Nov 28, 2022 | object-detection, Object Detection
Instruction-Guided Visual Masking | May 30, 2024 | Instruction Following, Visual Grounding
Joint Visual Grounding and Tracking with Natural Language Specification | Mar 21, 2023 | Visual Grounding, Visual Tracking
Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation | Jul 3, 2020 | Contrastive Learning, Knowledge Distillation
InfMLLM: A Unified Framework for Visual-Language Tasks | Nov 12, 2023 | GPU, Image Captioning
Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations | Jun 30, 2022 | Language Modeling, Language Modelling
Improving One-stage Visual Grounding by Recursive Sub-query Construction | Aug 3, 2020 | Sentence, Sentence Embedding
Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning | Apr 30, 2022 | Attribute, Decoder
InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring | Mar 1, 2021 | 3D visual grounding, Attribute
Kosmos-2: Grounding Multimodal Large Language Models to the World | Jun 26, 2023 | Image Captioning, In-Context Learning
Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding | Nov 25, 2022 | 3D visual grounding, Knowledge Distillation
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation | Jul 1, 2024 | Image-text Retrieval, Question Answering
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling | Nov 23, 2021 | Image Captioning, Image Description
A Unified Framework for 3D Point Cloud Visual Grounding | Aug 23, 2023 | CPU, GPU
Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans | May 23, 2023 | 3D Reconstruction, 3D visual grounding
IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities | Aug 23, 2024 | Language Modeling, Language Modelling
Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation | Apr 5, 2021 | Object, Visual Grounding
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding | Aug 2, 2024 | Decoder, Reasoning Segmentation
CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models | Sep 24, 2021 | Visual Grounding
A Fast and Accurate One-Stage Approach to Visual Grounding | Aug 18, 2019 | Referring Expression, Referring Expression Comprehension