ErgoChat: a Visual Query System for the Ergonomic Risk Assessment of Construction Workers | Dec 27, 2024 | Image Captioning, Question Answering | Unverified | 0
FineVQ: Fine-Grained User Generated Content Video Quality Assessment | Dec 26, 2024 | Video Quality Assessment, Visual Question Answering (VQA) | Unverified | 0
Multi-Agents Based on Large Language Models for Knowledge-based Visual Question Answering | Dec 24, 2024 | Question Answering, Visual Question Answering | Unverified | 0
TextMatch: Enhancing Image-Text Consistency Through Multimodal Optimization | Dec 24, 2024 | In-Context Learning, Question Answering | Unverified | 0
EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation | Dec 24, 2024 | Image Captioning, Image Generation | Code Available | 2
LININ: Logic Integrated Neural Inference Network for Explanatory Visual Question Answering | Dec 24, 2024 | Explanatory Visual Question Answering, Multimodal Reasoning | Code Available | 0
HAUR: Human Annotation Understanding and Recognition Through Text-Heavy Images | Dec 24, 2024 | Optical Character Recognition (OCR), Question Answering | Unverified | 0
Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective | Dec 23, 2024 | Question Answering, Visual Question Answering | Code Available | 0
Prompting Large Language Models with Rationale Heuristics for Knowledge-based Visual Question Answering | Dec 22, 2024 | Question Answering, Visual Question Answering | Unverified | 0
Application of Multimodal Large Language Models in Autonomous Driving | Dec 21, 2024 | Autonomous Driving, Decision Making | Unverified | 0
Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage | Dec 20, 2024 | Attribute, Benchmarking | Unverified | 0
NeSyCoCo: A Neuro-Symbolic Concept Composer for Compositional Generalization | Dec 20, 2024 | Compositional Generalization (AVG), Novel Concepts | Code Available | 0
InstructOCR: Instruction Boosting Scene Text Spotting | Dec 20, 2024 | Optical Character Recognition (OCR), Text Spotting | Code Available | 0
Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering | Dec 19, 2024 | Contrastive Learning, Language Modeling | Code Available | 0
OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization | Dec 19, 2024 | Video Quality Assessment, Visual Question Answering (VQA) | Unverified | 0
MedCoT: Medical Chain of Thought via Hierarchical Expert | Dec 18, 2024 | Diagnostic, Medical Visual Question Answering | Code Available | 1
What makes a good metric? Evaluating automatic metrics for text-to-image consistency | Dec 18, 2024 | Sensitivity, Visual Question Answering (VQA) | Unverified | 0
Optimizing Vision-Language Interactions Through Decoder-Only Models | Dec 14, 2024 | Decoder, Image Captioning | Unverified | 0
Selective State Space Memory for Large Vision-Language Models | Dec 13, 2024 | Mamba, Visual Question Answering (VQA) | Unverified | 0
VLR-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation | Dec 13, 2024 | Instruction Following, Question Answering | Unverified | 0
Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine | Dec 12, 2024 | Language Modeling, Language Modelling | Code Available | 2
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | Dec 12, 2024 | EgoSchema | Code Available | 3
Fast Prompt Alignment for Text-to-Image Generation | Dec 11, 2024 | Image Generation, In-Context Learning | Code Available | 1
Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions | Dec 11, 2024 | Benchmarking, Question Answering | Code Available | 0
Can We Generate Visual Programs Without Prompting LLMs? | Dec 11, 2024 | Data Augmentation, Question Answering | Unverified | 0
IMPACT: A Large-scale Integrated Multimodal Patent Analysis and Creation Dataset for Design Patents | Dec 10, 2024 | Cross-Modal Retrieval, Image Classification | Code Available | 1
MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization | Dec 9, 2024 | Visual Question Answering (VQA) | Code Available | 2
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | Dec 6, 2024 | document understanding, Hallucination | Unverified | 0
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale | Dec 6, 2024 | Multimodal Reasoning, Visual Question Answering | Code Available | 1
Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in Multimodal Large Language Models | Dec 6, 2024 | Hallucination, Optical Character Recognition (OCR) | Unverified | 0
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion | Dec 5, 2024 | Contrastive Learning, Hallucination | Code Available | 3
T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts | Dec 5, 2024 | Benchmarking, Image Generation | Unverified | 0
Video Quality Assessment: A Comprehensive Survey | Dec 4, 2024 | Benchmarking, Survey | Code Available | 2
AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations? | Dec 4, 2024 | Benchmarking, Visual Question Answering (VQA) | Unverified | 0
WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image | Dec 3, 2024 | Diagnostic, Language Modeling | Unverified | 0
Copy-Move Forgery Detection and Question Answering for Remote Sensing Image | Dec 3, 2024 | Question Answering, Visual Question Answering | Code Available | 0
CEGI: Measuring the trade-off between efficiency and carbon emissions for SLMs and VLMs | Dec 3, 2024 | Image Captioning, Quantization | Unverified | 0
DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness | Nov 29, 2024 | Optical Character Recognition (OCR), Question Answering | Code Available | 0
SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks | Nov 29, 2024 | Question Answering, Visual Question Answering | Code Available | 0
Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark | Nov 29, 2024 | Benchmarking, Grounded Video Question Answering | Unverified | 0
Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers | Nov 28, 2024 | Image Captioning, image-classification | Unverified | 0
ElectroVizQA: How well do Multi-modal LLMs perform in Electronics Visual Question Answering? | Nov 27, 2024 | Question Answering, Visual Question Answering | Unverified | 0
Path-RAG: Knowledge-Guided Key Region Retrieval for Open-ended Pathology Visual Question Answering | Nov 26, 2024 | Prognosis, Question Answering | Code Available | 2
Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment | Nov 26, 2024 | Image Quality Assessment, Question Answering | Code Available | 2
AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM | Nov 26, 2024 | Benchmarking, Text-to-Video Generation | Code Available | 1
Task Progressive Curriculum Learning for Robust Visual Question Answering | Nov 26, 2024 | Data Augmentation, Ensemble Learning | Unverified | 0
Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey | Nov 26, 2024 | Natural Language Understanding, Question Answering | Unverified | 0
GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis | Nov 25, 2024 | Medical Visual Question Answering, Multiple-choice | Unverified | 0
Video-Text Dataset Construction from Multi-AI Feedback: Promoting Weak-to-Strong Preference Learning for Video Large Language Models | Nov 25, 2024 | Visual Question Answering (VQA) | Unverified | 0
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration | Nov 25, 2024 | AI Agent, Visual Question Answering | Code Available | 2