F-LMM: Grounding Frozen Large Multimodal Models Jun 9, 2024 General Knowledge Instruction Following
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language Jun 9, 2024 Contrastive Learning Cross-Modal Retrieval
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs Apr 25, 2024 Visual Grounding Visual Question Answering
HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding Apr 20, 2024 cross-modal alignment Visual Grounding
VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis Mar 29, 2024 Hallucination Image Captioning
MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis Mar 22, 2024 Medical Diagnosis Medical Visual Question Answering
The Revolution of Multimodal Large Language Models: A Survey Feb 19, 2024 Image Generation Instruction Following
ChatterBox: Multi-round Multimodal Referring and Grounding Jan 24, 2024 Language Modeling Language Modelling
SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model Jan 18, 2024 Instruction Following Language Modeling
Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation Jan 1, 2024 Descriptive Object
One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts Dec 28, 2023 All Anatomy
Aligning and Prompting Everything All at Once for Universal Visual Perception Dec 4, 2023 All Object
NExT-Chat: An LMM for Chat, Detection and Segmentation Nov 8, 2023 Referring Expression Referring Expression Segmentation
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models Oct 13, 2023 Hallucination Image Captioning
LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent Sep 21, 2023 3D visual grounding Language Modeling
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment Aug 8, 2023 3D Question Answering (3D-QA) Dense Captioning
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs Jul 17, 2023 Instruction Following Sentence
X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks Nov 22, 2022 All Cross-Modal Retrieval
Referring Image Matting Jun 10, 2022 Domain Generalization Image Matting
Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs Jun 11, 2025 Hallucination Object Hallucination
GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents May 21, 2025 Answer Generation Reinforcement Learning (RL)
Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving May 13, 2025 3D visual grounding Autonomous Driving
STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection Apr 3, 2025 Instruction Following Language Modeling
RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning Mar 29, 2025 Chart Question Answering Chart Understanding
Visual Position Prompt for MLLM based Visual Grounding Mar 19, 2025 Position Visual Grounding
How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game Mar 13, 2025 Multimodal Reasoning Question Answering
SwimVG: Step-wise Multimodal Fusion and Adaption for Visual Grounding Feb 24, 2025 cross-modal alignment Visual Grounding
Evolving Symbolic 3D Visual Grounder with Weakly Supervised Reflection Feb 3, 2025 3D visual grounding Visual Grounding
NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning Feb 1, 2025 Referring Expression Visual Grounding
PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model Jan 21, 2025 Hallucination Image Captioning
When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis Jan 17, 2025 Large Language Model Multimodal Large Language Model
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints Jan 12, 2025 Image Segmentation Referring Expression
Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs Jan 11, 2025 Math Mathematical Problem-Solving
Solving Zero-Shot 3D Visual Grounding as Constraint Satisfaction Problems Nov 21, 2024 3D visual grounding Negation
VividMed: Vision Language Model with Versatile Visual Grounding for Medicine Oct 16, 2024 Language Modeling Language Modelling
Visual Grounding with Multi-modal Conditional Adaptation Sep 8, 2024 object-detection Object Detection
IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities Aug 23, 2024 Language Modeling Language Modelling
Visual Grounding for Object-Level Generalization in Reinforcement Learning Aug 4, 2024 Language Modelling Object
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding Aug 2, 2024 Decoder Reasoning Segmentation
3D Vision and Language Pretraining with Large-Scale Synthetic Data Jul 8, 2024 Dense Captioning Diversity
Multi-branch Collaborative Learning Network for 3D Visual Grounding Jul 7, 2024 3D visual grounding Referring Expression
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation Jul 1, 2024 Image-text Retrieval Question Answering
Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation Jun 11, 2024 Grounded Multimodal Named Entity Recognition named-entity-recognition
Instruction-Guided Visual Masking May 30, 2024 Instruction Following Visual Grounding
Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension May 21, 2024 3D visual grounding Referring Expression
DARA: Domain- and Relation-aware Adapters Make Parameter-efficient Tuning for Visual Grounding May 10, 2024 Relation Spatial Reasoning
Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling Mar 21, 2024 Grounded language learning Language Acquisition
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory Mar 19, 2024 Adversarial Text Diversity
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning Mar 19, 2024 Reinforcement Learning (RL) Visual Grounding
MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding Mar 5, 2024 3D visual grounding Decision Making