SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling | Feb 1, 2024 | Diversity, Image Captioning | Unverified | 0
LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering | Jan 29, 2024 | Language Modeling, Language Modelling | Unverified | 0
ChatterBox: Multi-round Multimodal Referring and Grounding | Jan 24, 2024 | Language Modeling, Language Modelling | Code Available | 2
Unifying Visual and Vision-Language Tracking via Contrastive Learning | Jan 20, 2024 | Contrastive Learning, Object Tracking | Code Available | 1
Veagle: Advancements in Multimodal Representation Learning | Jan 18, 2024 | Image Captioning, Language Modelling | Code Available | 1
SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model | Jan 18, 2024 | Instruction Following, Language Modeling | Code Available | 2
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding | Jan 17, 2024 | 3D visual grounding, Scene Understanding | Unverified | 0
Uncovering the Full Potential of Visual Grounding Methods in VQA | Jan 15, 2024 | Question Answering, Visual Grounding | Code Available | 0
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | Jan 11, 2024 | Representation Learning, Self-Supervised Learning | Code Available | 3
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers | Jan 3, 2024 | Question Answering, Visual Grounding | Unverified | 0
Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation | Jan 1, 2024 | Descriptive, Object | Code Available | 2
Investigating Compositional Challenges in Vision-Language Models for Visual Grounding | Jan 1, 2024 | Attribute, Relation | Code Available | 0
Towards CLIP-driven Language-free 3D Visual Grounding via 2D-3D Relational Enhancement and Consistency | Jan 1, 2024 | 3D visual grounding, Relation | Code Available | 0
LQMFormer: Language-aware Query Mask Transformer for Referring Image Segmentation | Jan 1, 2024 | Image Segmentation, Semantic Segmentation | Unverified | 0
G^3-LQ: Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding | Jan 1, 2024 | 3D visual grounding, Visual Grounding | Unverified | 0
When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach | Jan 1, 2024 | Scene Understanding, Visual Grounding | Unverified | 0
Multi-Attribute Interactions Matter for 3D Visual Grounding | Jan 1, 2024 | 3D visual grounding, Attribute | Code Available | 0
Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding | Jan 1, 2024 | Scene Understanding, Visual Grounding | Unverified | 0
Viewpoint-Aware Visual Grounding in 3D Scenes | Jan 1, 2024 | 3D visual grounding, Referring Expression | Unverified | 0
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs | Jan 1, 2024 | Visual Grounding, World Knowledge | Code Available | 4
Bridging Modality Gap for Visual Grounding with Effecitve Cross-modal Distillation | Dec 29, 2023 | Visual Grounding | Unverified | 0
One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts | Dec 28, 2023 | All, Anatomy | Code Available | 2
Cycle-Consistency Learning for Captioning and Grounding | Dec 23, 2023 | Image Captioning, Visual Grounding | Unverified | 0
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection | Dec 22, 2023 | Attribute, object-detection | Code Available | 1
Mask Grounding for Referring Image Segmentation | Dec 19, 2023 | cross-modal alignment, Image Segmentation | Code Available | 1
Context Disentangling and Prototype Inheriting for Robust Visual Grounding | Dec 19, 2023 | Visual Grounding | Code Available | 1
Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment | Dec 15, 2023 | 3D visual grounding, Natural Language Queries | Unverified | 0
Mono3DVG: 3D Visual Grounding in Monocular Images | Dec 13, 2023 | 3D Object Detection, 3D visual grounding | Code Available | 1
Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation | Dec 13, 2023 | Descriptive, Object | Code Available | 1
Visual Grounding of Whole Radiology Reports for 3D CT Images | Dec 8, 2023 | Segmentation, Visual Grounding | Unverified | 0
Improved Visual Grounding through Self-Consistent Explanations | Dec 7, 2023 | Language Modelling, Large Language Model | Unverified | 0
GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models | Dec 6, 2023 | Autonomous Driving, Autonomous Vehicles | Code Available | 1
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment | Dec 5, 2023 | Explanation Generation, Visual Grounding | Code Available | 0
Uni3DL: Unified Model for 3D and Language Understanding | Dec 5, 2023 | Cross-Modal Retrieval, Instance Segmentation | Unverified | 0
Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment | Dec 4, 2023 | Grounded language learning, Language Modeling | Unverified | 0
Aligning and Prompting Everything All at Once for Universal Visual Perception | Dec 4, 2023 | All, Object | Code Available | 2
Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models | Dec 3, 2023 | Hallucination, Visual Grounding | Code Available | 0
G2D: From Global to Dense Radiography Representation Learning via Vision-Language Pre-training | Dec 3, 2023 | object-detection, Object Detection | Code Available | 0
Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions | Nov 28, 2023 | Disentanglement, Referring Expression | Code Available | 1
Context-Aware Indoor Point Cloud Object Generation through User Instructions | Nov 26, 2023 | Position, Visual Grounding | Unverified | 0
Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding | Nov 26, 2023 | 3D visual grounding, Object | Code Available | 1
Enhancing Visual Grounding and Generalization: A Multi-Task Cycle Training Approach for Vision-Language Models | Nov 21, 2023 | Image Segmentation, Language Modelling | Code Available | 0
InfMLLM: A Unified Framework for Visual-Language Tasks | Nov 12, 2023 | GPU, Image Captioning | Code Available | 1
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks | Nov 10, 2023 | Diversity, Multi-Task Learning | Code Available | 1
Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter | Nov 9, 2023 | Object, Visual Grounding | Code Available | 1
NExT-Chat: An LMM for Chat, Detection and Segmentation | Nov 8, 2023 | Referring Expression, Referring Expression Segmentation | Code Available | 2
GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection | Nov 5, 2023 | Anomaly Detection, Question Answering | Code Available | 1
A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis | Oct 31, 2023 | Descriptive, Medical Image Analysis | Unverified | 0
CityRefer: Geography-aware 3D Visual Grounding Dataset on City-scale Point Cloud Data | Oct 28, 2023 | 3D visual grounding, Autonomous Vehicles | Code Available | 1
GROOViST: A Metric for Grounding Objects in Visual Storytelling | Oct 26, 2023 | Visual Grounding, Visual Storytelling | Code Available | 0