Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives | Jan 7, 2025 | Autonomous Driving, General Knowledge | Code Available | 5
EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in Instructional Multimodal Models | Jan 6, 2025 | Hallucination, Visual Grounding | Unverified | 0
ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding | Jan 2, 2025 | 3D visual grounding, Diagnostic | Unverified | 0
Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes | Jan 1, 2025 | Cross-Modal Retrieval, Disentanglement | Unverified | 0
Beyond Human Perception: Understanding Multi-Object World from Monocular View | Jan 1, 2025 | 3D visual grounding, Denoising | Code Available | 0
VideoGLaMM : A Large Multimodal Model for Pixel-Level Visual Grounding in Videos | Jan 1, 2025 | Large Language Model, Video Segmentation | Unverified | 0
Ges3ViG : Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding | Jan 1, 2025 | 3D visual grounding, Data Augmentation | Code Available | 0
Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention | Jan 1, 2025 | Hallucination, Response Generation | Code Available | 2
Task-aware Cross-modal Feature Refinement Transformer with Large Language Models for Visual Grounding | Jan 1, 2025 | Referring Expression, Referring Expression Comprehension | Unverified | 0
Towards Visual Grounding: A Survey | Dec 28, 2024 | Phrase Grounding, Referring Expression | Code Available | 3
Referencing Where to Focus: Improving VisualGrounding with Referential Query | Dec 26, 2024 | Decoder, Visual Grounding | Unverified | 0
Reasoning to Attend: Try to Understand How <SEG> Token Works | Dec 23, 2024 | Semantic Similarity, Semantic Textual Similarity | Code Available | 2
CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models | Dec 22, 2024 | Language Modeling, Language Modelling | Code Available | 0
Aria-UI: Visual Grounding for GUI Instructions | Dec 20, 2024 | Natural Language Visual Grounding, Visual Grounding | Code Available | 3
FiVL: A Framework for Improved Vision-Language Alignment | Dec 19, 2024 | Answer Generation, Multimodal Reasoning | Code Available | 0
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues | Dec 19, 2024 | Change Detection, Disaster Response | Unverified | 0
GAGS: Granularity-Aware Feature Distillation for Language Gaussian Splatting | Dec 18, 2024 | Scene Understanding, Semantic Segmentation | Unverified | 0
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | Dec 13, 2024 | Chart Understanding, Mixture-of-Experts | Code Available | 9
Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses | Dec 11, 2024 | Image-text Retrieval, Question Answering | Unverified | 0
Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models | Dec 11, 2024 | Question Answering, Visual Grounding | Code Available | 0
3D Spatial Understanding in MLLMs: Disambiguation and Evaluation | Dec 9, 2024 | 3D dense captioning, 3D visual grounding | Unverified | 0
TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action | Dec 7, 2024 | Depth Estimation, Mathematical Reasoning | Code Available | 2
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | Dec 6, 2024 | document understanding, Hallucination | Code Available | 0
M^3D: A Multimodal, Multilingual and Multitask Dataset for Grounded Document-level Information Extraction | Dec 5, 2024 | Relation Extraction, Visual Grounding | Code Available | 0
SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding | Dec 5, 2024 | 3D visual grounding, Object Localization | Unverified | 0
Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding | Dec 1, 2024 | Visual Grounding | Unverified | 0
3D Scene Graph Guided Vision-Language Pre-training | Nov 27, 2024 | 3D dense captioning, 3D visual grounding | Unverified | 0
Interpreting Object-level Foundation Models via Visual Precision Search | Nov 25, 2024 | Explainable Artificial Intelligence (XAI), Object | Code Available | 2
BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence | Nov 22, 2024 | 3D visual grounding, Visual Grounding | Code Available | 3
Solving Zero-Shot 3D Visual Grounding as Constraint Satisfaction Problems | Nov 21, 2024 | 3D visual grounding, Negation | Code Available | 1
Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset | Nov 21, 2024 | Question Answering, Visual Grounding | Code Available | 0
GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding | Nov 16, 2024 | Instruction Following, Language Modeling | Code Available | 2
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level | Nov 15, 2024 | Benchmarking, counterfactual | Unverified | 0
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos | Nov 7, 2024 | Decoder, Language Modeling | Unverified | 0
LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers | Nov 7, 2024 | 3D visual grounding, Autonomous Driving | Unverified | 0
Fine-Grained Spatial and Verbal Losses for 3D Visual Grounding | Nov 5, 2024 | 3D visual grounding, Visual Grounding | Unverified | 0
Phrase Decoupling Cross-Modal Hierarchical Matching and Progressive Position Correction for Visual Grounding | Oct 31, 2024 | Object, Position | Code Available | 0
Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding | Oct 31, 2024 | parameter-efficient fine-tuning, Visual Grounding | Unverified | 0
Few-Shot Multimodal Explanation for Visual Question Answering | Oct 28, 2024 | Explainable artificial intelligence, Explainable Artificial Intelligence (XAI) | Code Available | 0
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models | Oct 21, 2024 | Instruction Following, object-detection | Code Available | 0
Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding | Oct 21, 2024 | 3D visual grounding, Object | Unverified | 0
VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding | Oct 17, 2024 | 3D geometry, 3D visual grounding | Code Available | 2
VividMed: Vision Language Model with Versatile Visual Grounding for Medicine | Oct 16, 2024 | Language Modeling, Language Modelling | Code Available | 1
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs | Oct 16, 2024 | Visual Grounding | Code Available | 0
Context-Infused Visual Grounding for Art | Oct 16, 2024 | object-detection, Object Detection | Code Available | 0
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI | Oct 15, 2024 | Question Answering, Video Question Answering | Code Available | 2
Learning to Ground VLMs without Forgetting | Oct 14, 2024 | Decoder, Language Modelling | Unverified | 0
Neural Material Adaptor for Visual Grounding of Intrinsic Dynamics | Oct 10, 2024 | Visual Grounding | Unverified | 0
GRAPPA: Generalizing and Adapting Robot Policies via Online Agentic Guidance | Oct 9, 2024 | Visual Grounding | Unverified | 0
Context-Aware Command Understanding for Tabletop Scenarios | Oct 8, 2024 | Decision Making, Visual Grounding | Unverified | 0