Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards Jun 7, 2023 Diversity Image Captioning
Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving May 13, 2025 3D visual grounding Autonomous Driving
Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision Jul 23, 2023 Decoder Visual Grounding
Fine-Grained Semantically Aligned Vision-Language Pre-Training Aug 4, 2022 cross-modal alignment object-detection
Collaborative Transformers for Grounded Situation Recognition Mar 30, 2022 Grounded Situation Recognition Image Classification
Joint Visual Grounding and Tracking with Natural Language Specification Mar 21, 2023 Visual Grounding Visual Tracking
GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection Nov 5, 2023 Anomaly Detection Question Answering
Visual Grounding for Object-Level Generalization in Reinforcement Learning Aug 4, 2024 Language Modelling Object
Referring Transformer: A One-step Approach to Multi-task Visual Grounding Jun 6, 2021 Decoder Referring Expression
REX: Reasoning-aware and Grounded Explanation Mar 11, 2022 Decision Making Explanation Generation
SeqTR: A Simple yet Universal Network for Visual Grounding Mar 30, 2022 Decoder Referring Expression
Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training Jan 1, 2023 3D dense captioning 3D visual grounding
CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding May 15, 2023 Diversity Transfer Learning
InfMLLM: A Unified Framework for Visual-Language Tasks Nov 12, 2023 GPU Image Captioning
CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision Dec 14, 2021 Contrastive Learning Representation Learning
Learning Cross-modal Context Graph for Visual Grounding Feb 13, 2020 Graph Matching Graph Neural Network
Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation Jul 3, 2020 Contrastive Learning Knowledge Distillation
InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring Mar 1, 2021 3D visual grounding Attribute
Solving Zero-Shot 3D Visual Grounding as Constraint Satisfaction Problems Nov 21, 2024 3D visual grounding Negation
Spatially Aware Multimodal Transformers for TextVQA Jul 23, 2020 Optical Character Recognition (OCR) Spatial Reasoning
Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation Jun 11, 2024 Grounded Multimodal Named Entity Recognition named-entity-recognition
Evolving Symbolic 3D Visual Grounder with Weakly Supervised Reflection Feb 3, 2025 3D visual grounding Visual Grounding
Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations Jun 30, 2022 Language Modeling Language Modelling
Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension May 21, 2024 3D visual grounding Referring Expression
CityRefer: Geography-aware 3D Visual Grounding Dataset on City-scale Point Cloud Data Oct 28, 2023 3D visual grounding Autonomous Vehicles
Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning Apr 30, 2022 Attribute Decoder
Instruction-Following Agents with Multimodal Transformer Oct 24, 2022 Instruction Following Visual Grounding
GRAVL-BERT: Graphical Visual-Linguistic Representations for Multimodal Coreference Resolution Oct 1, 2022 coreference-resolution Coreference Resolution
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition Feb 15, 2024 Grounded Multimodal Named Entity Recognition Multi-modal Named Entity Recognition
Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images Mar 14, 2021 3D visual grounding Object
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning Mar 19, 2024 Reinforcement Learning (RL) Visual Grounding
Local-Global Context Aware Transformer for Language-Guided Video Segmentation Mar 18, 2022 Referring Expression Segmentation Referring Video Object Segmentation
Grounded Situation Recognition with Transformers Nov 19, 2021 Decoder Grounded Situation Recognition
CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models Sep 24, 2021 Visual Grounding
Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans May 23, 2023 3D Reconstruction 3D visual grounding
Mask Grounding for Referring Image Segmentation Dec 19, 2023 cross-modal alignment Image Segmentation
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection Dec 22, 2023 Attribute object-detection
Instruction-Guided Visual Masking May 30, 2024 Instruction Following Visual Grounding
Guessing State Tracking for Visual Dialogue Feb 24, 2020 Visual Grounding
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation Jul 1, 2024 Image-text Retrieval Question Answering
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment Aug 29, 2022 cross-modal alignment Image-text Retrieval
IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities Aug 23, 2024 Language Modeling Language Modelling
MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding Mar 5, 2024 3D visual grounding Decision Making
3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection Apr 13, 2022 3D visual grounding Visual Grounding
EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding Sep 29, 2022 3D visual grounding Object
Learning Point-Language Hierarchical Alignment for 3D Visual Grounding Oct 22, 2022 3D visual grounding Sentence
How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game Mar 13, 2025 Multimodal Reasoning Question Answering
MixGen: A New Multi-Modal Data Augmentation Jun 16, 2022 Data Augmentation Image-text Retrieval
Improving One-stage Visual Grounding by Recursive Sub-query Construction Aug 3, 2020 Sentence Sentence Embedding
RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning Mar 29, 2025 Chart Question Answering Chart Understanding