CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation (Jul 1, 2024) [Image-text Retrieval, Question Answering]
Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning (Apr 30, 2022) [Attribute, Decoder]
Deep Multimodal Neural Architecture Search (Apr 25, 2020) [Decoder, Image-text matching]
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling (Nov 23, 2021) [Image Captioning, Image Description]
A Unified Framework for 3D Point Cloud Visual Grounding (Aug 23, 2023) [CPU, GPU]
InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring (Mar 1, 2021) [3D visual grounding, Attribute]
Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans (May 23, 2023) [3D Reconstruction, 3D visual grounding]
Instruction-Guided Visual Masking (May 30, 2024) [Instruction Following, Visual Grounding]
CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models (Sep 24, 2021) [Visual Grounding]
A Fast and Accurate One-Stage Approach to Visual Grounding (Aug 18, 2019) [Referring Expression, Referring Expression Comprehension]
Learning Point-Language Hierarchical Alignment for 3D Visual Grounding (Oct 22, 2022) [3D visual grounding, Sentence]
CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding (Oct 10, 2023) [3D visual grounding, Visual Grounding]
Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions (Feb 17, 2024) [Visual Grounding]
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding (Jul 18, 2023) [3D visual grounding, Object]
OCID-Ref: A 3D Robotic Dataset with Embodied Language for Clutter Scene Grounding (Mar 13, 2021) [Referring Expression, Referring Expression Segmentation]
Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation (Sep 17, 2021) [Dialogue Generation, Visual Grounding]
GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents (May 21, 2025) [Answer Generation, Reinforcement Learning (RL)]
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints (Jan 12, 2025) [Image Segmentation, Referring Expression]
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts (Nov 16, 2021) [Cross-Modal Retrieval, Image Captioning]
Visual Grounding Methods for VQA are Working for the Wrong Reasons! (Apr 12, 2020) [Question Answering, Visual Grounding]
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory (Mar 19, 2024) [Adversarial Text, Diversity]
Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models (Oct 9, 2023) [Language Modelling, Question Answering]
How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game (Mar 13, 2025) [Multimodal Reasoning, Question Answering]
Learning Cross-modal Context Graph for Visual Grounding (Nov 20, 2019) [Graph Matching, Graph Neural Network]
Multi-Modal Dynamic Graph Transformer for Visual Grounding (Jan 1, 2022) [Visual Grounding]
SAT: 2D Semantics Assisted Training for 3D Visual Grounding (May 24, 2021) [3D visual grounding, Object]
Multi-View Transformer for 3D Visual Grounding (Apr 5, 2022) [3D visual grounding, Visual Grounding]
Context Disentangling and Prototype Inheriting for Robust Visual Grounding (Dec 19, 2023) [Visual Grounding]
3D Vision and Language Pretraining with Large-Scale Synthetic Data (Jul 8, 2024) [Dense Captioning, Diversity]
Guessing State Tracking for Visual Dialogue (Feb 24, 2020) [Visual Grounding]
Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training (Jan 1, 2023) [3D dense captioning, 3D visual grounding]
Connecting What to Say With Where to Look by Modeling Human Attention Traces (May 12, 2021) [Caption Generation, Image Captioning]
NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning (Feb 1, 2025) [Referring Expression, Visual Grounding]
Visual Grounding for Object-Level Generalization in Reinforcement Learning (Aug 4, 2024) [Language Modelling, Object]
Confidence-aware Pseudo-label Learning for Weakly Supervised Visual Grounding (Jan 1, 2023) [Descriptive, Object]
Grounded Situation Recognition with Transformers (Nov 19, 2021) [Decoder, Grounded Situation Recognition]
Mono3DVG: 3D Visual Grounding in Monocular Images (Dec 13, 2023) [3D Object Detection, 3D visual grounding]
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections (May 24, 2022) [Computational Efficiency, cross-modal alignment]
MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding (Mar 5, 2024) [3D visual grounding, Decision Making]
MixGen: A New Multi-Modal Data Augmentation (Jun 16, 2022) [Data Augmentation, Image-text Retrieval]
Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving (May 13, 2025) [3D visual grounding, Autonomous Driving]
Collaborative Transformers for Grounded Situation Recognition (Mar 30, 2022) [Grounded Situation Recognition, Image Classification]
GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection (Nov 5, 2023) [Anomaly Detection, Question Answering]
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection (Dec 22, 2023) [Attribute, object-detection]
Multi3DRefer: Grounding Text Description to Multiple 3D Objects (Sep 11, 2023) [3D visual grounding, Contrastive Learning]
CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding (May 15, 2023) [Diversity, Transfer Learning]
Advancing Visual Grounding with Scene Knowledge: Benchmark and Method (Jul 21, 2023) [Image-text matching, Text Matching]
Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding (Nov 25, 2022) [3D visual grounding, Knowledge Distillation]
CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision (Dec 14, 2021) [Contrastive Learning, Representation Learning]
Local-Global Context Aware Transformer for Language-Guided Video Segmentation (Mar 18, 2022) [Referring Expression Segmentation, Referring Video Object Segmentation]