Visually Consistent Hierarchical Image Classification | Jun 17, 2024 | Classification; image-classification | Unverified | 0
Learning Language Structures through Grounding | Jun 14, 2024 | Automatic Speech Recognition; Dependency Parsing | Unverified | 0
Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding | Jun 13, 2024 | 3D visual grounding; Attribute | Unverified | 0
MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations | Jun 13, 2024 | 3D visual grounding; Attribute | Code Available | 4
Towards Vision-Language Geo-Foundation Model: A Survey | Jun 13, 2024 | Earth Observation; Image Captioning | Code Available | 2
Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation | Jun 11, 2024 | Grounded Multimodal Named Entity Recognition; named-entity-recognition | Code Available | 1
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language | Jun 9, 2024 | Contrastive Learning; Cross-Modal Retrieval | Code Available | 2
A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions | Jun 9, 2024 | 3D visual grounding; Survey | Code Available | 3
F-LMM: Grounding Frozen Large Multimodal Models | Jun 9, 2024 | General Knowledge; Instruction Following | Code Available | 2
HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task | Jun 4, 2024 | Head Pose Estimation; Language Modelling | Unverified | 0
HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model | Jun 1, 2024 | Action Recognition; Activity Recognition | Unverified | 0
Instruction-Guided Visual Masking | May 30, 2024 | Instruction Following; Visual Grounding | Code Available | 1
Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention | May 28, 2024 | 3D Object Detection; 3D visual grounding | Unverified | 0
LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding | May 27, 2024 | Visual Grounding | Unverified | 0
Talk to Parallel LiDARs: A Human-LiDAR Interaction Method Based on 3D Visual Grounding | May 24, 2024 | 3D visual grounding; Autonomous Driving | Unverified | 0
Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension | May 21, 2024 | 3D visual grounding; Referring Expression | Code Available | 1
Adversarial Robustness for Visual Grounding of Multimodal Large Language Models | May 16, 2024 | Adversarial Attack; Adversarial Robustness | Code Available | 0
DARA: Domain- and Relation-aware Adapters Make Parameter-efficient Tuning for Visual Grounding | May 10, 2024 | Relation; Spatial Reasoning | Code Available | 1
Visual grounding for desktop graphical user interfaces | May 5, 2024 | Language Modeling; Language Modelling | Unverified | 0
Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners | Apr 30, 2024 | 3D visual grounding; Visual Grounding | Unverified | 0
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models | Apr 26, 2024 | Game Design; Image Generation | Unverified | 0
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | Apr 25, 2024 | Visual Grounding; Visual Question Answering | Code Available | 2
HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding | Apr 20, 2024 | cross-modal alignment; Visual Grounding | Code Available | 2
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | Apr 19, 2024 | Language Modeling; Language Modelling | Code Available | 4
Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization | Apr 17, 2024 | 3D dense captioning; 3D visual grounding | Code Available | 0
MedRG: Medical Report Grounding with Multi-modal Large Language Model | Apr 10, 2024 | Decoder; Language Modeling | Unverified | 0
VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis | Mar 29, 2024 | Hallucination; Image Captioning | Code Available | 2
AgentStudio: A Toolkit for Building General Virtual Agents | Mar 26, 2024 | Visual Grounding | Code Available | 3
Data-Efficient 3D Visual Grounding via Order-Aware Referring | Mar 25, 2024 | 3D visual grounding; Object | Unverified | 0
Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery | Mar 22, 2024 | Language Modeling; Language Modelling | Unverified | 0
MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis | Mar 22, 2024 | Medical Diagnosis; Medical Visual Question Answering | Code Available | 2
VidLA: Video-Language Alignment at Scale | Mar 21, 2024 | Language Modelling; Visual Grounding | Unverified | 0
Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling | Mar 21, 2024 | Grounded language learning; Language Acquisition | Code Available | 1
Learning from Synthetic Data for Visual Grounding | Mar 20, 2024 | Language Modelling; Large Language Model | Unverified | 0
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory | Mar 19, 2024 | Adversarial Text; Diversity | Code Available | 1
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | Mar 19, 2024 | Reinforcement Learning (RL); Visual Grounding | Code Available | 1
WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar | Mar 19, 2024 | Autonomous Navigation; Referring Expression | Unverified | 0
Right Place, Right Time! Dynamizing Topological Graphs for Embodied Navigation | Mar 14, 2024 | Decision Making; Language Modeling | Unverified | 0
SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention | Mar 13, 2024 | 3D visual grounding; cross-modal alignment | Code Available | 0
Detecting Concrete Visual Tokens for Multimodal Machine Translation | Mar 5, 2024 | Machine Translation; Multimodal Machine Translation | Unverified | 0
MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding | Mar 5, 2024 | 3D visual grounding; Decision Making | Code Available | 1
Adversarial Testing for Visual Grounding via Image-Aware Property Reduction | Mar 2, 2024 | Visual Grounding | Unverified | 0
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | Feb 27, 2024 | 3D geometry; 3D Object Captioning | Code Available | 3
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web | Feb 27, 2024 | Language Modeling; Language Modelling | Unverified | 0
Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding | Feb 23, 2024 | Hallucination; Object | Code Available | 1
The Revolution of Multimodal Large Language Models: A Survey | Feb 19, 2024 | Image Generation; Instruction Following | Code Available | 2
Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions | Feb 17, 2024 | Visual Grounding | Code Available | 1
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | Feb 15, 2024 | Grounded Multimodal Named Entity Recognition; Multi-modal Named Entity Recognition | Code Available | 1
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling | Feb 9, 2024 | Hallucination; Natural Language Understanding | Code Available | 0
Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot Representations | Feb 2, 2024 | Contrastive Learning; Object | Unverified | 0