Vision-Language Pre-training: Basics, Recent Advances, and Future Trends Oct 17, 2022 Few-Shot Learning Image Captioning
Code Code Available 35 HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale Jun 27, 2024 Visual Question Answering (VQA)
Code Code Available 35 CausalVLR: A Toolbox and Benchmark for Visual-Linguistic Causal Reasoning Jun 30, 2023 Causal Inference Medical Report Generation
Code Code Available 35 Unifying Vision, Text, and Layout for Universal Document Processing Dec 5, 2022 Document AI document understanding
Code Code Available 35 TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones Dec 28, 2023 Computational Efficiency Image Captioning
Code Code Available 35 Towards VQA Models That Can Read Apr 18, 2019 TextVQA Visual Question Answering (VQA)
Code Code Available 35 Emu: Generative Pretraining in Multimodality Jul 11, 2023 Image Captioning Image Generation
Code Code Available 35 An Empirical Study on Prompt Compression for Large Language Models Apr 24, 2025 Articles Math
Code Code Available 35 Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion Dec 5, 2024 Contrastive Learning Hallucination
Code Code Available 35 Evaluating Text-to-Visual Generation with Image-to-Text Generation Apr 1, 2024 Image to text Question Answering
Code Code Available 35 Bilinear Attention Networks May 21, 2018 Visual Question Answering Visual Question Answering (VQA)
Code Code Available 35 OCR-free Document Understanding Transformer Nov 30, 2021 Document Image Classification document understanding
Code Code Available 35 MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models Oct 16, 2024 Diagnostic Hallucination
Code Code Available 35 PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers Feb 13, 2024 Question Answering Retrieval
Code Code Available 35 DriveLM: Driving with Graph Visual Question Answering Dec 21, 2023 Autonomous Driving Question Answering
Code Code Available 35 MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine Aug 6, 2024 Medical Visual Question Answering Organ Detection
Code Code Available 35 Pythia v0.1: the Winning Entry to the VQA Challenge 2018 Jul 26, 2018 Data Augmentation Visual Question Answering (VQA)
Code Code Available 35 Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review Mar 4, 2024 Medical Report Generation Question Answering
Code Code Available 35 PaLM-E: An Embodied Multimodal Language Model Mar 6, 2023 Language Modeling Language Modelling
Code Code Available 25 PA-LLaVA: A Large Language-Vision Assistant for Human Pathology Image Understanding Aug 18, 2024 Language Modelling Question Answering
Code Code Available 25 Path-RAG: Knowledge-Guided Key Region Retrieval for Open-ended Pathology Visual Question Answering Nov 26, 2024 Prognosis Question Answering
Code Code Available 25 Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks Apr 13, 2020 Cross-Modal Retrieval Image Captioning
Code Code Available 25 PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging Jan 5, 2024 Medical Report Generation Medical Visual Question Answering
Code Code Available 25 OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models May 13, 2023 Key Information Extraction Nutrition
Code Code Available 25 OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference Feb 25, 2025 Visual Question Answering (VQA)
Code Code Available 25 Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding Oct 7, 2022 Chart Question Answering Diversity
Code Code Available 25 Neighbourhood Representative Sampling for Efficient End-to-end Video Quality Assessment Oct 11, 2022 Video Quality Assessment Visual Question Answering (VQA)
Code Code Available 25 Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model Mar 8, 2025 Image Quality Assessment Language Modeling
Code Code Available 25 CoLLaVO: Crayon Large Language and Vision mOdel Feb 17, 2024 Large Language Model model
Code Code Available 25 NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results Apr 17, 2024 Form valid
Code Code Available 25 MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering May 20, 2024 Benchmarking Question Answering
Code Code Available 25 All in One: Exploring Unified Video-Language Pre-training Mar 14, 2022 All Language Modelling
Code Code Available 25 Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models Apr 16, 2024 image-classification Image Classification
Code Code Available 25 NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario May 24, 2023 Autonomous Driving Question Answering
Code Code Available 25 PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents Mar 13, 2023 image-classification Image Classification
Code Code Available 25 MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis Mar 22, 2024 Medical Diagnosis Medical Visual Question Answering
Code Code Available 25 Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning Jun 26, 2023 Hallucination Visual Question Answering
Code Code Available 25 MDETR - Modulated Detection for End-to-End Multi-Modal Understanding Jan 1, 2021 Phrase Grounding Question Answering
Code Code Available 25 LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning Jun 13, 2022 Transfer Learning Visual Question Answering (VQA)
Code Code Available 25 LM4LV: A Frozen Large Language Model for Low-level Vision Tasks May 24, 2024 Language Modeling Language Modelling
Code Code Available 25 A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding Jul 2, 2024 document understanding Key Information Extraction
Code Code Available 25 MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis Jul 4, 2024 Diagnostic Language Modeling
Code Code Available 25 Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis Mar 25, 2025 Contrastive Learning Image-text Retrieval
Code Code Available 25 Med-Flamingo: a Multimodal Medical Few-shot Learner Jul 27, 2023 Medical Visual Question Answering Question Answering
Code Code Available 25 Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering Sep 20, 2022 Multimodal Deep Learning Multimodal Reasoning
Code Code Available 25 CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models Jun 11, 2025 counterfactual Descriptive
Code Code Available 25 Learning to Compose Dynamic Tree Structures for Visual Contexts Dec 5, 2018 Graph Generation Panoptic Scene Graph Generation
Code Code Available 25 InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning May 11, 2023 1 Image, 2*2 Stitching Diversity
Code Code Available 25 KVQ: Kwai Video Quality Assessment for Short-form Videos Feb 11, 2024 Form Video Quality Assessment
Code Code Available 25 LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding Jun 29, 2023 16k Image Captioning
Code Code Available 25