VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI Oct 15, 2024 Question Answering Video Question Answering
Code Code Available 25 SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories Mar 11, 2025 Decision Making Interactive Segmentation
Code Code Available 25 DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World Jun 30, 2025 Caption Generation Object
Code Code Available 25 Reasoning to Attend: Try to Understand How <SEG> Token Works Dec 23, 2024 Semantic Similarity Semantic Textual Similarity
Code Code Available 25 VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis Mar 29, 2024 Hallucination Image Captioning
Code Code Available 25 Aligning and Prompting Everything All at Once for Universal Visual Perception Dec 4, 2023 All Object
Code Code Available 25 Referring Image Matting Jun 10, 2022 Domain Generalization Image Matting
Code Code Available 25 LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent Sep 21, 2023 3D visual grounding Language Modeling
Code Code Available 25 SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding Jul 3, 2024 object-detection Object Detection
Code Code Available 25 NExT-Chat: An LMM for Chat, Detection and Segmentation Nov 8, 2023 Referring Expression Referring Expression Segmentation
Code Code Available 25 BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs Jul 17, 2023 Instruction Following Sentence
Code Code Available 25 Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention Jan 1, 2025 Hallucination Response Generation
Code Code Available 25 One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts Dec 28, 2023 All Anatomy
Code Code Available 25 ChatterBox: Multi-round Multimodal Referring and Grounding Jan 24, 2024 Language Modeling Language Modelling
Code Code Available 25 MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis Mar 22, 2024 Medical Diagnosis Medical Visual Question Answering
Code Code Available 25 GTA1: GUI Test-time Scaling Agent Jul 8, 2025 Reinforcement Learning (RL) Task Planning
Code Code Available 25 HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding Apr 20, 2024 cross-modal alignment Visual Grounding
Code Code Available 25 Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language Jun 9, 2024 Contrastive Learning Cross-Modal Retrieval
Code Code Available 25 GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models Dec 6, 2023 Autonomous Driving Autonomous Vehicles
Code Code Available 15 GRAVL-BERT: Graphical Visual-Linguistic Representations for Multimodal Coreference Resolution Oct 1, 2022 coreference-resolution Coreference Resolution
Code Code Available 15 Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory Mar 19, 2024 Adversarial Text Diversity
Code Code Available 15 Visual Grounding Methods for VQA are Working for the Wrong Reasons! Apr 12, 2020 Question Answering Visual Grounding
Code Code Available 15 LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition Feb 15, 2024 Grounded Multimodal Named Entity Recognition Multi-modal Named Entity Recognition
Code Code Available 15 Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions Feb 17, 2024 Visual Grounding
Code Code Available 15 Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling Mar 21, 2024 Grounded language learning Language Acquisition
Code Code Available 15 Local-Global Context Aware Transformer for Language-Guided Video Segmentation Mar 18, 2022 Referring Expression Segmentation Referring Video Object Segmentation
Code Code Available 15 Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter Nov 9, 2023 Object Visual Grounding
Code Code Available 15 Learning Cross-modal Context Graph for Visual Grounding Feb 13, 2020 Graph Matching Graph Neural Network
Code Code Available 15 Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks Nov 10, 2023 Diversity Multi-Task Learning
Code Code Available 15 PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model Jan 21, 2025 Hallucination Image Captioning
Code Code Available 15 Kosmos-2: Grounding Multimodal Large Language Models to the World Jun 26, 2023 Image Captioning In-Context Learning
Code Code Available 15 Learning Cross-modal Context Graph for Visual Grounding Nov 20, 2019 Graph Matching Graph Neural Network
Code Code Available 15 Fine-Grained Semantically Aligned Vision-Language Pre-Training Aug 4, 2022 cross-modal alignment object-detection
Code Code Available 15 Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision Jul 23, 2023 Decoder Visual Grounding
Code Code Available 15 An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding Aug 2, 2024 Decoder Reasoning Segmentation
Code Code Available 15 Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation Apr 5, 2021 Object Visual Grounding
Code Code Available 15 Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving May 13, 2025 3D visual grounding Autonomous Driving
Code Code Available 15 Joint Visual Grounding and Tracking with Natural Language Specification Mar 21, 2023 Visual Grounding Visual Tracking
Code Code Available 15 Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding Nov 25, 2022 3D visual grounding Knowledge Distillation
Code Code Available 15 CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation Jul 1, 2024 Image-text Retrieval Question Answering
Code Code Available 15 UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling Nov 23, 2021 Image Captioning Image Description
Code Code Available 15 A Unified Framework for 3D Point Cloud Visual Grounding Aug 23, 2023 CPU GPU
Code Code Available 15 Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans May 23, 2023 3D Reconstruction 3D visual grounding
Code Code Available 15 CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models Sep 24, 2021 Visual Grounding
Code Code Available 15 A Fast and Accurate One-Stage Approach to Visual Grounding Aug 18, 2019 Referring Expression Referring Expression Comprehension
Code Code Available 15 Evolving Symbolic 3D Visual Grounder with Weakly Supervised Reflection Feb 3, 2025 3D visual grounding Visual Grounding
Code Code Available 15 EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding Sep 29, 2022 3D visual grounding Object
Code Code Available 15 CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding Oct 10, 2023 3D visual grounding Visual Grounding
Code Code Available 15 Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment Aug 29, 2022 cross-modal alignment Image-text Retrieval
Code Code Available 15