HAUR: Human Annotation Understanding and Recognition Through Text-Heavy Images | Dec 24, 2024 | Optical Character Recognition (OCR), Question Answering | Unverified | 0
LININ: Logic Integrated Neural Inference Network for Explanatory Visual Question Answering | Dec 24, 2024 | Explanatory Visual Question Answering, Multimodal Reasoning | Code Available | 0
Multi-Agents Based on Large Language Models for Knowledge-based Visual Question Answering | Dec 24, 2024 | Question Answering, Visual Question Answering | Unverified | 0
TextMatch: Enhancing Image-Text Consistency Through Multimodal Optimization | Dec 24, 2024 | In-Context Learning, Question Answering | Unverified | 0
Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective | Dec 23, 2024 | Question Answering, Visual Question Answering | Code Available | 0
Prompting Large Language Models with Rationale Heuristics for Knowledge-based Visual Question Answering | Dec 22, 2024 | Question Answering, Visual Question Answering | Unverified | 0
Application of Multimodal Large Language Models in Autonomous Driving | Dec 21, 2024 | Autonomous Driving, Decision Making | Unverified | 0
InstructOCR: Instruction Boosting Scene Text Spotting | Dec 20, 2024 | Optical Character Recognition (OCR), Text Spotting | Code Available | 0
NeSyCoCo: A Neuro-Symbolic Concept Composer for Compositional Generalization | Dec 20, 2024 | Compositional Generalization (AVG), Novel Concepts | Code Available | 0
Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage | Dec 20, 2024 | Attribute, Benchmarking | Unverified | 0
Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering | Dec 19, 2024 | Contrastive Learning, Language Modeling | Code Available | 0
OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization | Dec 19, 2024 | Video Quality Assessment, Visual Question Answering (VQA) | Unverified | 0
What makes a good metric? Evaluating automatic metrics for text-to-image consistency | Dec 18, 2024 | Sensitivity, Visual Question Answering (VQA) | Unverified | 0
Optimizing Vision-Language Interactions Through Decoder-Only Models | Dec 14, 2024 | Decoder, Image Captioning | Unverified | 0
Selective State Space Memory for Large Vision-Language Models | Dec 13, 2024 | Mamba, Visual Question Answering (VQA) | Unverified | 0
VLR-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation | Dec 13, 2024 | Instruction Following, Question Answering | Unverified | 0
Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions | Dec 11, 2024 | Benchmarking, Question Answering | Code Available | 0
Can We Generate Visual Programs Without Prompting LLMs? | Dec 11, 2024 | Data Augmentation, Question Answering | Unverified | 0
Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in Multimodal Large Language Models | Dec 6, 2024 | Hallucination, Optical Character Recognition (OCR) | Unverified | 0
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | Dec 6, 2024 | document understanding, Hallucination | Unverified | 0
T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts | Dec 5, 2024 | Benchmarking, Image Generation | Unverified | 0
AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations? | Dec 4, 2024 | Benchmarking, Visual Question Answering (VQA) | Unverified | 0
Copy-Move Forgery Detection and Question Answering for Remote Sensing Image | Dec 3, 2024 | Question Answering, Visual Question Answering | Code Available | 0
WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image | Dec 3, 2024 | Diagnostic, Language Modeling | Unverified | 0
CEGI: Measuring the trade-off between efficiency and carbon emissions for SLMs and VLMs | Dec 3, 2024 | Image Captioning, Quantization | Unverified | 0
DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness | Nov 29, 2024 | Optical Character Recognition (OCR), Question Answering | Code Available | 0
Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark | Nov 29, 2024 | Benchmarking, Grounded Video Question Answering | Unverified | 0
SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks | Nov 29, 2024 | Question Answering, Visual Question Answering | Code Available | 0
Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers | Nov 28, 2024 | Image Captioning, image-classification | Unverified | 0
ElectroVizQA: How well do Multi-modal LLMs perform in Electronics Visual Question Answering? | Nov 27, 2024 | Question Answering, Visual Question Answering | Unverified | 0
Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey | Nov 26, 2024 | Natural Language Understanding, Question Answering | Unverified | 0
Task Progressive Curriculum Learning for Robust Visual Question Answering | Nov 26, 2024 | Data Augmentation, Ensemble Learning | Unverified | 0
Video-Text Dataset Construction from Multi-AI Feedback: Promoting Weak-to-Strong Preference Learning for Video Large Language Models | Nov 25, 2024 | Visual Question Answering (VQA) | Unverified | 0
GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis | Nov 25, 2024 | Medical Visual Question Answering, Multiple-choice | Unverified | 0
Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents | Nov 23, 2024 | Question Answering, RAG | Code Available | 0
ReWind: Understanding Long Videos with Instructed Learnable Memory | Nov 23, 2024 | Large Language Model, Question Answering | Unverified | 0
Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy | Nov 23, 2024 | Instruction Following, MME | Unverified | 0
Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains | Nov 22, 2024 | Benchmarking, Caption Generation | Unverified | 0
mR^2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA | Nov 22, 2024 | RAG, Retrieval | Unverified | 0
Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset | Nov 21, 2024 | Question Answering, Visual Grounding | Code Available | 0
Uni-Mlip: Unified Self-supervision for Medical Vision Language Pre-training | Nov 20, 2024 | Contrastive Learning, image-classification | Unverified | 0
Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving | Nov 20, 2024 | Autonomous Driving, Multimodal Reasoning | Unverified | 0
Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios | Nov 20, 2024 | Question Answering, Visual Question Answering (VQA) | Unverified | 0
LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement | Nov 20, 2024 | Autonomous Driving, Computational Efficiency | Unverified | 0
Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model | Nov 19, 2024 | Language Modeling, Language Modelling | Unverified | 0
Value-Spectrum: Quantifying Preferences of Vision-Language Models via Value Decomposition in Social Media Contexts | Nov 18, 2024 | Benchmarking, Multimodal Large Language Model | Code Available | 0
Memory-Augmented Multimodal LLMs for Surgical VQA via Self-Contained Inquiry | Nov 17, 2024 | Question Answering, Scene Understanding | Unverified | 0
Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering | Nov 17, 2024 | Hallucination, In-Context Learning | Code Available | 0
F^3OCUS -- Federated Finetuning of Vision-Language Foundation Models with Optimal Client Layer Updating Strategy via Multi-objective Meta-Heuristics | Nov 17, 2024 | Diversity, Federated Learning | Unverified | 0
A Comprehensive Survey on Visual Question Answering Datasets and Algorithms | Nov 17, 2024 | Diagnostic, Miscellaneous | Unverified | 0