Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review Mar 4, 2024 Medical Report Generation Question Answering
PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers Feb 13, 2024 Question Answering Retrieval
Common Sense Reasoning for Deepfake Detection Jan 31, 2024 Binary Classification Common Sense Reasoning
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones Dec 28, 2023 Computational Efficiency Image Captioning
DriveLM: Driving with Graph Visual Question Answering Dec 21, 2023 Autonomous Driving Question Answering
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models Nov 11, 2023 Image Captioning MMR total
Emu: Generative Pretraining in Multimodality Jul 11, 2023 Image Captioning Image Generation
CausalVLR: A Toolbox and Benchmark for Visual-Linguistic Causal Reasoning Jun 30, 2023 Causal Inference Medical Report Generation
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities May 18, 2023 1 Image, 2*2 Stitchi Action Classification
Champion Solution for the WSDM2023 Toloka VQA Challenge Jan 22, 2023 Question Answering Visual Grounding
Unifying Vision, Text, and Layout for Universal Document Processing Dec 5, 2022 Document AI document understanding
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends Oct 17, 2022 Few-Shot Learning Image Captioning
All You May Need for VQA are Image Captions May 4, 2022 All Image Captioning
OCR-free Document Understanding Transformer Nov 30, 2021 Document Image Classification document understanding
Ludwig: a type-based declarative deep learning toolbox Sep 17, 2019 Decoder Deep Learning
Towards VQA Models That Can Read Apr 18, 2019 TextVQA Visual Question Answering (VQA)
Pythia v0.1: the Winning Entry to the VQA Challenge 2018 Jul 26, 2018 Data Augmentation Visual Question Answering (VQA)
Bilinear Attention Networks May 21, 2018 Visual Question Answering Visual Question Answering (VQA)
CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models Jun 11, 2025 counterfactual Descriptive
Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis Mar 25, 2025 Contrastive Learning Image-text Retrieval
DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding Mar 13, 2025 4k Autonomous Driving
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories Mar 11, 2025 Decision Making Interactive Segmentation
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning Mar 10, 2025 Language Modeling Language Modelling
Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model Mar 8, 2025 Image Quality Assessment Language Modeling
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference Feb 25, 2025 Visual Question Answering (VQA)
Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization Feb 18, 2025 Image Retrieval Question Answering
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding Jan 9, 2025 Visual Question Answering (VQA) Visual Reasoning
EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation Dec 24, 2024 Image Captioning Image Generation
Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine Dec 12, 2024 Language Modeling Language Modelling
MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization Dec 9, 2024 Visual Question Answering (VQA)
Video Quality Assessment: A Comprehensive Survey Dec 4, 2024 Benchmarking Survey
Path-RAG: Knowledge-Guided Key Region Retrieval for Open-ended Pathology Visual Question Answering Nov 26, 2024 Prognosis Question Answering
Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment Nov 26, 2024 Image Quality Assessment Question Answering
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration Nov 25, 2024 AI Agent Visual Question Answering
VQA^2: Visual Question Answering for Video Quality Assessment Nov 6, 2024 Question Answering Video Quality Assessment
Frontiers in Intelligent Colonoscopy Oct 22, 2024 Image Captioning
VE-Bench: Subjective-Aligned Benchmark Suite for Text-Driven Video Editing Quality Assessment Aug 21, 2024 Video Alignment Video Editing
PA-LLaVA: A Large Language-Vision Assistant for Human Pathology Image Understanding Aug 18, 2024 Language Modelling Question Answering
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI Aug 6, 2024 Question Answering Visual Question Answering
Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos Jul 23, 2024 Image Generation Point Tracking
SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers Jul 12, 2024 Articles Question Answering
WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering Jul 8, 2024 Diagnostic Generative Visual Question Answering
RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models Jul 6, 2024 Medical Diagnosis RAG
MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis Jul 4, 2024 Diagnostic Language Modeling
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding Jul 2, 2024 document understanding Key Information Extraction
TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy Jun 3, 2024 Language Modelling Question Answering
Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models Jun 3, 2024 Image Captioning Language Modelling
DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models May 31, 2024 cross-modal alignment Visual Localization
LM4LV: A Frozen Large Language Model for Low-level Vision Tasks May 24, 2024 Language Modeling Language Modelling