Towards Visual Text Grounding of Multimodal Large Language Model (Apr 7, 2025). Tasks: Benchmarking, Language Modeling.
Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities (Apr 2, 2025). Tasks: Descriptive, Large Language Model.
Image Difference Grounding with Natural Language (Apr 2, 2025). Tasks: Visual Grounding. Code available.
Multimodal Reference Visual Grounding (Apr 2, 2025). Tasks: Few-Shot Object Detection, Visual Grounding.
MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing (Mar 31, 2025). Tasks: Object, object-detection.
ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning (Mar 30, 2025). Tasks: 3D visual grounding, Feature Splatting. Code available.
Efficient Adaptation For Remote Sensing Visual Grounding (Mar 29, 2025). Tasks: parameter-efficient fine-tuning, Visual Grounding.
NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving (Mar 28, 2025). Tasks: 3D visual grounding, Autonomous Driving.
Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding (Mar 25, 2025). Tasks: Attribute, Object.
Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes (Mar 24, 2025). Tasks: Cross-Modal Retrieval, Disentanglement.
A Vision Centric Remote Sensing Benchmark (Mar 20, 2025). Tasks: Question Answering, Representation Learning.
LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation (Mar 18, 2025). Tasks: Decoder, Object.
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding (Mar 8, 2025). Tasks: Language Modeling, Language Modelling. Code available.
Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions (Mar 5, 2025). Tasks: Anomaly Detection, Visual Grounding.
Teaching Metric Distance to Autoregressive Multimodal Foundational Models (Mar 4, 2025). Tasks: Image Generation, Visual Grounding.
Structured Preference Optimization for Vision-Language Long-Horizon Task Planning (Feb 28, 2025). Tasks: Task Planning, Visual Grounding.
ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding (Feb 26, 2025). Tasks: 3D visual grounding, Visual Grounding.
Programming with Pixels: Computer-Use Meets Software Engineering (Feb 24, 2025). Tasks: Visual Grounding.
GroundCap: A Visually Grounded Image Captioning Dataset (Feb 19, 2025). Tasks: Image Captioning, Object Detection.
Leveraging Multimodal-LLMs Assisted by Instance Segmentation for Intelligent Traffic Monitoring (Feb 16, 2025). Tasks: Instance Segmentation, Language Modeling.
TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation (Feb 11, 2025). Tasks: Retrieval, Vision and Language Navigation.
RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception (Jan 31, 2025). Tasks: Reinforcement Learning (RL), Spatial Reasoning.
ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations (Jan 24, 2025). Tasks: Decoder, Object.
FLORA: Formal Language Model Enables Robust Training-free Zero-shot Object Referring Analysis (Jan 17, 2025). Tasks: Bayesian Inference, Language Modeling.
AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring (Jan 16, 2025). Tasks: 3D visual grounding, Decoder.
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing (Jan 12, 2025). Tasks: Image Captioning, Language Modeling.
EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in Instructional Multimodal Models (Jan 6, 2025). Tasks: Hallucination, Visual Grounding.
ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding (Jan 2, 2025). Tasks: 3D visual grounding, Diagnostic.
Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes (Jan 1, 2025). Tasks: Cross-Modal Retrieval, Disentanglement.
Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding (Jan 1, 2025). Tasks: 3D visual grounding, Data Augmentation.
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos (Jan 1, 2025). Tasks: Large Language Model, Video Segmentation. Code available.
Beyond Human Perception: Understanding Multi-Object World from Monocular View (Jan 1, 2025). Tasks: 3D visual grounding, Denoising.
Task-aware Cross-modal Feature Refinement Transformer with Large Language Models for Visual Grounding (Jan 1, 2025). Tasks: Referring Expression, Referring Expression Comprehension. Code available.
Referencing Where to Focus: Improving Visual Grounding with Referential Query (Dec 26, 2024). Tasks: Decoder, Visual Grounding.
CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models (Dec 22, 2024). Tasks: Language Modeling, Language Modelling.
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues (Dec 19, 2024). Tasks: Change Detection, Disaster Response. Code available.
FiVL: A Framework for Improved Vision-Language Alignment (Dec 19, 2024). Tasks: Answer Generation, Multimodal Reasoning.
GAGS: Granularity-Aware Feature Distillation for Language Gaussian Splatting (Dec 18, 2024). Tasks: Scene Understanding, Semantic Segmentation. Code available.
Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses (Dec 11, 2024). Tasks: Image-text Retrieval, Question Answering.
Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models (Dec 11, 2024). Tasks: Question Answering, Visual Grounding.
3D Spatial Understanding in MLLMs: Disambiguation and Evaluation (Dec 9, 2024). Tasks: 3D dense captioning, 3D visual grounding. Code available.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling (Dec 6, 2024). Tasks: document understanding, Hallucination.
M^3D: A Multimodal, Multilingual and Multitask Dataset for Grounded Document-level Information Extraction (Dec 5, 2024). Tasks: Relation Extraction, Visual Grounding.
SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding (Dec 5, 2024). Tasks: 3D visual grounding, Object Localization. Code available.
Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding (Dec 1, 2024). Tasks: Visual Grounding.
3D Scene Graph Guided Vision-Language Pre-training (Nov 27, 2024). Tasks: 3D dense captioning, 3D visual grounding.
Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset (Nov 21, 2024). Tasks: Question Answering, Visual Grounding.
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level (Nov 15, 2024). Tasks: Benchmarking, counterfactual. Code available.
LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers (Nov 7, 2024). Tasks: 3D visual grounding, Autonomous Driving.
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos (Nov 7, 2024). Tasks: Decoder, Language Modeling.