VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model Apr 10, 2025 Language Modeling Language Modelling
Code Code Available 95 DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding Dec 13, 2024 Chart Understanding Mixture-of-Experts
Code Code Available 95 MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning Oct 14, 2023 Image Classification Image Description
Code Code Available 75 Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives Jan 7, 2025 Autonomous Driving General Knowledge
Code Code Available 55 Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Jun 24, 2024 Representation Learning Visual Grounding
Code Code Available 55 Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond Aug 24, 2023 Chart Question Answering FS-MEVQA
Code Code Available 55 Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models Apr 19, 2024 Language Modeling Language Modelling
Code Code Available 45 V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs Jan 1, 2024 Visual Grounding World Knowledge
Code Code Available 45 Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V Oct 17, 2023 Interactive Segmentation Referring Expression
Code Code Available 45 mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video Feb 1, 2023 Action Classification Image Classification
Code Code Available 45 MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations Jun 13, 2024 3D visual grounding Attribute
Code Code Available 45 OrionBench: A Benchmark for Chart and Human-Recognizable Object Detection in Infographics May 23, 2025 Chart Understanding object-detection
Code Code Available 35 Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents Oct 7, 2024 Natural Language Visual Grounding Navigate
Code Code Available 35 AgentStudio: A Toolkit for Building General Virtual Agents Mar 26, 2024 Visual Grounding
Code Code Available 35 DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models Jun 17, 2024 Document Classification Visual Grounding
Code Code Available 35 Champion Solution for the WSDM2023 Toloka VQA Challenge Jan 22, 2023 Question Answering Visual Grounding
Code Code Available 35 A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions Jun 9, 2024 3D visual grounding Survey
Code Code Available 35 Vision-Language Pre-training: Basics, Recent Advances, and Future Trends Oct 17, 2022 Few-Shot Learning Image Captioning
Code Code Available 35 Aria-UI: Visual Grounding for GUI Instructions Dec 20, 2024 Natural Language Visual Grounding Visual Grounding
Code Code Available 35 Towards Visual Grounding: A Survey Dec 28, 2024 Phrase Grounding Referring Expression
Code Code Available 35 ShapeLLM: Universal 3D Object Understanding for Embodied Interaction Feb 27, 2024 3D geometry 3D Object Captioning
Code Code Available 35 BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence Nov 22, 2024 3D visual grounding Visual Grounding
Code Code Available 35 Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs Jan 11, 2024 Representation Learning Self-Supervised Learning
Code Code Available 35 Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding Feb 14, 2025 3D Object Detection 3D visual grounding
Code Code Available 35 Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning May 18, 2025 Reinforcement Learning (RL) Visual Grounding
Code Code Available 35 RefMask3D: Language-Guided Transformer for 3D Referring Segmentation Jul 25, 2024 3D visual grounding Image Segmentation
Code Code Available 25 Aligning and Prompting Everything All at Once for Universal Visual Perception Dec 4, 2023 All Object
Code Code Available 25 SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories Mar 11, 2025 Decision Making Interactive Segmentation
Code Code Available 25 Reasoning to Attend: Try to Understand How <SEG> Token Works Dec 23, 2024 Semantic Similarity Semantic Textual Similarity
Code Code Available 25 Referring Image Matting Jun 10, 2022 Domain Generalization Image Matting
Code Code Available 25 NExT-Chat: An LMM for Chat, Detection and Segmentation Nov 8, 2023 Referring Expression Referring Expression Segmentation
Code Code Available 25 AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention Jun 18, 2024 Object Response Generation
Code Code Available 25 DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World Jun 30, 2025 Caption Generation Object
Code Code Available 25 One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts Dec 28, 2023 All Anatomy
Code Code Available 25 LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent Sep 21, 2023 3D visual grounding Language Modeling
Code Code Available 25 ChatterBox: Multi-round Multimodal Referring and Grounding Jan 24, 2024 Language Modeling Language Modelling
Code Code Available 25 MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis Mar 22, 2024 Medical Diagnosis Medical Visual Question Answering
Code Code Available 25 InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition May 21, 2025 Earth Observation Object
Code Code Available 25 In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation Aug 9, 2024 Image to text Object
Code Code Available 25 Interpreting Object-level Foundation Models via Visual Precision Search Nov 25, 2024 Explainable Artificial Intelligence (XAI) Object
Code Code Available 25 BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs Jul 17, 2023 Instruction Following Sentence
Code Code Available 25 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment Aug 8, 2023 3D Question Answering (3D-QA) Dense Captioning
Code Code Available 25 DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding Mar 17, 2025 Domain Generalization Multimodal Reasoning
Code Code Available 25 Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding Sep 5, 2024 Question Answering Scene Understanding
Code Code Available 25 List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs Apr 25, 2024 Visual Grounding Visual Question Answering
Code Code Available 25 VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis Mar 29, 2024 Hallucination Image Captioning
Code Code Available 25 A Simple Aerial Detection Baseline of Multimodal Language Models Jan 16, 2025 object-detection Object Detection
Code Code Available 25 High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning Jul 8, 2025 MME Reinforcement Learning (RL)
Code Code Available 25 GTA1: GUI Test-time Scaling Agent Jul 8, 2025 Reinforcement Learning (RL) Task Planning
Code Code Available 25