Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents | Oct 7, 2024 | Natural Language Visual Grounding Navigate | Code Available | 3
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks | Oct 7, 2024 | Information Retrieval Language Modeling | Unverified | 0
Adaptive Masking Enhances Visual Grounding | Oct 4, 2024 | Few-Shot Learning Visual Grounding | Code Available | 0
World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering | Sep 30, 2024 | Optical Character Recognition (OCR) Question Answering | Code Available | 0
Individuation in Neural Models with and without Visual Grounding | Sep 27, 2024 | Visual Grounding | Unverified | 0
ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue | Sep 26, 2024 | Medical Visual Question Answering Question Answering | Unverified | 0
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion | Sep 26, 2024 | Descriptive Generalized Referring Expression Comprehension | Code Available | 2
HiFi-CS: Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models | Sep 16, 2024 | Attribute Decoder | Code Available | 0
Bayesian Self-Training for Semi-Supervised 3D Segmentation | Sep 12, 2024 | 3D Instance Segmentation 3D Semantic Segmentation | Unverified | 0
Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling | Sep 9, 2024 | Language Modeling Language Modelling | Code Available | 0
Visual Grounding with Multi-modal Conditional Adaptation | Sep 8, 2024 | object-detection Object Detection | Code Available | 1
Visual Prompting in Multimodal Large Language Models: A Survey | Sep 5, 2024 | In-Context Learning Prompt Learning | Unverified | 0
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding | Sep 5, 2024 | Question Answering Scene Understanding | Code Available | 2
NanoMVG: USV-Centric Low-Power Multi-Task Visual Grounding based on Prompt-Guided Camera and 4D mmWave Radar | Aug 30, 2024 | Autonomous Driving Visual Grounding | Unverified | 0
ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding | Aug 29, 2024 | Data Augmentation Image Generation | Code Available | 0
M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation | Aug 29, 2024 | Instruction Following Medical Report Generation | Unverified | 0
MMR: Evaluating Reading Ability of Large Multimodal Models | Aug 26, 2024 | Font Recognition MMR total | Unverified | 0
IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities | Aug 23, 2024 | Language Modeling Language Modelling | Code Available | 1
Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models | Aug 15, 2024 | Pose Estimation Visual Grounding | Unverified | 0
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation | Aug 9, 2024 | Image to text Object | Code Available | 2
Task-oriented Sequential Grounding in 3D Scenes | Aug 7, 2024 | 3D visual grounding Visual Grounding | Unverified | 0
Visual Grounding for Object-Level Generalization in Reinforcement Learning | Aug 4, 2024 | Language Modelling Object | Code Available | 1
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding | Aug 2, 2024 | Decoder Reasoning Segmentation | Code Available | 1
UOUO: Uncontextualized Uncommon Objects for Measuring Knowledge Horizons of Vision Language Models | Jul 25, 2024 | Computational Efficiency Question Answering | Unverified | 0
RefMask3D: Language-Guided Transformer for 3D Referring Segmentation | Jul 25, 2024 | 3D visual grounding Image Segmentation | Code Available | 2
Unveiling and Mitigating Bias in Audio Visual Segmentation | Jul 23, 2024 | Attribute Visual Grounding | Unverified | 0
PD-APE: A Parallel Decoding Framework with Adaptive Position Encoding for 3D Visual Grounding | Jul 19, 2024 | 3D visual grounding Attribute | Unverified | 0
Learning Visual Grounding from Generative Vision and Language Model | Jul 18, 2024 | Attribute Language Modeling | Unverified | 0
Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models | Jul 18, 2024 | 3D Semantic Segmentation Semantic Segmentation | Unverified | 0
VIMI: Grounding Video Generation through Multi-modal Instruction | Jul 8, 2024 | Text-to-Video Generation Video Generation | Unverified | 0
3D Vision and Language Pretraining with Large-Scale Synthetic Data | Jul 8, 2024 | Dense Captioning Diversity | Code Available | 1
Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model | Jul 7, 2024 | Segmentation Sentence | Code Available | 0
Multi-branch Collaborative Learning Network for 3D Visual Grounding | Jul 7, 2024 | 3D visual grounding Referring Expression | Code Available | 1
Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge | Jul 5, 2024 | Cross-Modal Retrieval Question Answering | Unverified | 0
Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition | Jul 5, 2024 | Visual Grounding Visual Storytelling | Code Available | 0
Smart Vision-Language Reasoners | Jul 5, 2024 | Math Mathematical Reasoning | Code Available | 0
ACTRESS: Active Retraining for Semi-supervised Visual Grounding | Jul 3, 2024 | Binary Classification Visual Grounding | Unverified | 0
Visual Grounding with Attention-Driven Constraint Balancing | Jul 3, 2024 | Object object-detection | Unverified | 0
SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding | Jul 3, 2024 | object-detection Object Detection | Code Available | 2
The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA | Jul 2, 2024 | Grounded Video Question Answering Object Tracking | Unverified | 0
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation | Jul 1, 2024 | Image-text Retrieval Question Answering | Code Available | 1
ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities | Jul 1, 2024 | 3D visual grounding Language Modeling | Unverified | 0
From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models | Jun 28, 2024 | Diversity Retrieval | Unverified | 0
FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts | Jun 27, 2024 | Decision Making Logical Reasoning | Unverified | 0
On the Role of Visual Grounding in VQA | Jun 26, 2024 | Visual Grounding Visual Question Answering (VQA) | Unverified | 0
Towards Open-World Grasping with Large Vision-Language Models | Jun 26, 2024 | Robotic Grasping Visual Grounding | Unverified | 0
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | Jun 24, 2024 | Representation Learning Visual Grounding | Code Available | 5
AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention | Jun 18, 2024 | Object Response Generation | Code Available | 2
VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding | Jun 18, 2024 | Image Captioning Question Answering | Code Available | 2
Visually Consistent Hierarchical Image Classification | Jun 17, 2024 | Classification image-classification | Unverified | 0