V-RoAst: Visual Road Assessment. Can VLM be a Road Safety Assessor Using the iRAP Standard? Aug 20, 2024 Few-Shot Learning In-Context Learning
FFAA: Multimodal Large Language Model based Explainable Open-World Face Forgery Analysis Assistant Aug 19, 2024 Descriptive Face Swapping
Visual Agents as Fast and Slow Thinkers Aug 16, 2024 Question Answering Reasoning Segmentation
Surgical-VQLA++: Adversarial Contrastive Learning for Calibrated Robust Visual Question-Localized Answering in Robotic Surgery Aug 9, 2024 Contrastive Learning Medical Visual Question Answering
Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark Jul 18, 2024 GPU Image Retrieval
ReLaX-VQA: Residual Fragment and Layer Stack Extraction for Enhancing Video Quality Assessment Jul 16, 2024 Optical Flow Estimation Video Compression
FlowLearn: Evaluating Large Vision-Language Models on Flowchart Understanding Jul 6, 2024 Optical Character Recognition (OCR) Visual Question Answering (VQA)
From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis Jun 28, 2024 Visual Question Answering (VQA) Visual Reasoning
STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering Jun 28, 2024 Medical Diagnosis Medical Question Answering
LIVE: Learnable In-Context Vector for Visual Question Answering Jun 19, 2024 In-Context Learning Question Answering
MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model Jun 17, 2024 Language Modeling Language Modelling
FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture Jun 16, 2024 Diversity Multiple-choice
Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps Jun 14, 2024 Question Answering Visual Question Answering
Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA May 30, 2024 Diagnostic Medical Diagnosis
Instruction-Guided Visual Masking May 30, 2024 Instruction Following Visual Grounding
Reverse Image Retrieval Cues Parametric Memory in Multimodal LLMs May 29, 2024 Image Retrieval Question Answering
PitVQA: Image-grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery May 22, 2024 Question Answering Visual Question Answering
Light-VQA+: A Video Quality Assessment Model for Exposure Correction with Vision-Language Guidance May 6, 2024 Exposure Correction Video Enhancement
ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images Apr 29, 2024 Optical Character Recognition Optical Character Recognition (OCR)
LaPA: Latent Prompt Assist Model For Medical Visual Question Answering Apr 19, 2024 Medical Visual Question Answering Question Answering
TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding Apr 15, 2024 Question Answering Visual Question Answering (VQA)
Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts Apr 12, 2024 Image Captioning Question Answering
JDocQA: Japanese Document Question Answering Dataset for Generative Language Models Mar 28, 2024 Hallucination Question Answering
Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective Mar 27, 2024 Question Answering Visual Question Answering
IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models Mar 23, 2024 Common Sense Reasoning In-Context Learning
Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering Mar 21, 2024 object-detection Object Detection
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning Mar 19, 2024 Reinforcement Learning (RL) Visual Grounding
PhD: A ChatGPT-Prompted Visual hallucination Evaluation Dataset Mar 17, 2024 Attribute Common Sense Reasoning
Multi-modal Auto-regressive Modeling via Visual Words Mar 12, 2024 Visual Question Answering Visual Question Answering (VQA)
Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models Mar 12, 2024 Concept Alignment Instruction Following
Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA Feb 24, 2024 3D Question Answering (3D-QA) Question Answering
Uncertainty-Aware Evaluation for Vision-Language Models Feb 22, 2024 Conformal Prediction Language Modeling
Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment Feb 21, 2024 Language Modelling Question Answering
Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models Feb 16, 2024 Diversity Instruction Following
Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy Feb 11, 2024 Language Modeling Open Vocabulary Attribute Detection
Gemini Goes to Med School: Exploring the Capabilities of Multimodal Large Language Models on Medical Challenge Problems & Hallucinations Feb 10, 2024 Diagnostic Hallucination
Text-Guided Image Clustering Feb 5, 2024 Clustering Image Captioning
Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge Jan 19, 2024 Question Answering Question Generation
Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation Jan 18, 2024 Contrastive Learning Prompt Engineering
Veagle: Advancements in Multimodal Representation Learning Jan 18, 2024 Image Captioning Language Modelling
Cross-modal Retrieval for Knowledge-based Visual Question Answering Jan 11, 2024 Cross-Modal Retrieval Question Answering
MISS: A Generative Pretraining and Finetuning Approach for Med-VQA Jan 10, 2024 Medical Visual Question Answering Multi-Task Learning
3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding Jan 6, 2024 Scene Understanding Visual Question Answering (VQA)
Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training Jan 4, 2024 Descriptive Image Captioning
Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA Dec 21, 2023 Contrastive Learning counterfactual
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks Dec 21, 2023 Image Retrieval Image-to-Text Retrieval
EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering Dec 19, 2023 Object Object Counting
HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles Dec 18, 2023 Question Answering Visual Question Answering
ViLA: Efficient Video-Language Alignment for Video Question Answering Dec 13, 2023 cross-modal alignment Language Modeling
Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator Dec 11, 2023 Image Captioning Question Answering