PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging Jan 5, 2024 Medical Report Generation Medical Visual Question Answering
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks Apr 13, 2020 Cross-Modal Retrieval Image Captioning
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding Jul 2, 2024 document understanding Key Information Extraction
Retrieval Augmented Visual Question Answering with Outside Knowledge Oct 7, 2022 Answer Generation Diagnostic
ScreenAI: A Vision-Language Model for UI and Infographics Understanding Feb 7, 2024 Chart Question Answering Language Modeling
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories Mar 11, 2025 Decision Making Interactive Segmentation
VE-Bench: Subjective-Aligned Benchmark Suite for Text-Driven Video Editing Quality Assessment Aug 21, 2024 Video Alignment Video Editing
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models May 13, 2023 Key Information Extraction Nutrition
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding Oct 7, 2022 Chart Question Answering Diversity
NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario May 24, 2023 Autonomous Driving Question Answering
NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results Apr 17, 2024 Form valid
Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine Dec 12, 2024 Language Modeling Language Modelling
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference Feb 25, 2025 Visual Question Answering (VQA)
Neighbourhood Representative Sampling for Efficient End-to-end Video Quality Assessment Oct 11, 2022 Video Quality Assessment Visual Question Answering (VQA)
DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models May 31, 2024 cross-modal alignment Visual Localization
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset Apr 17, 2023 Audio captioning Audio-Video Question Answering (AVQA)
Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model Mar 8, 2025 Image Quality Assessment Language Modeling
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering May 20, 2024 Benchmarking Question Answering
PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents Mar 13, 2023 image-classification Image Classification
MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization Dec 9, 2024 Visual Question Answering (VQA)
MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis Jul 4, 2024 Diagnostic Language Modeling
Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models Apr 16, 2024 image-classification Image Classification
Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis Mar 25, 2025 Contrastive Learning Image-text Retrieval
Visual Programming: Compositional visual reasoning without training Nov 18, 2022 In-Context Learning Question Answering
VQA^2: Visual Question Answering for Video Quality Assessment Nov 6, 2024 Question Answering Video Quality Assessment
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning Mar 10, 2025 Language Modeling Language Modelling
Med-Flamingo: a Multimodal Medical Few-shot Learner Jul 27, 2023 Medical Visual Question Answering Question Answering
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding Jun 29, 2023 16k Image Captioning
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts May 9, 2024 Image Captioning Instruction Following
LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning Jun 13, 2022 Transfer Learning Visual Question Answering (VQA)
MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis Mar 22, 2024 Medical Diagnosis Medical Visual Question Answering
Learning to Compose Dynamic Tree Structures for Visual Contexts Dec 5, 2018 Graph Generation Panoptic Scene Graph Generation
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning May 11, 2023 1 Image, 2*2 Stitching Diversity
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering Sep 20, 2022 Multimodal Deep Learning Multimodal Reasoning
Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment Nov 26, 2024 Image Quality Assessment Question Answering
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models Oct 23, 2023 Diagnostic Hallucination
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI Aug 6, 2024 Question Answering Visual Question Answering
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest Jul 7, 2023 Attribute Common Sense Reasoning
GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering Feb 4, 2024 Language Modeling Language Modelling
Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos Jul 23, 2024 Image Generation Point Tracking
Frontiers in Intelligent Colonoscopy Oct 22, 2024 Image Captioning
KVQ: Kwai Video Quality Assessment for Short-form Videos Feb 11, 2024 Form Video Quality Assessment
GIT: A Generative Image-to-text Transformer for Vision and Language May 27, 2022 Decoder Image Captioning
Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering Sep 29, 2023 Image to text Passage Retrieval
CoLLaVO: Crayon Large Language and Vision mOdel Feb 17, 2024 Large Language Model model
LM4LV: A Frozen Large Language Model for Low-level Vision Tasks May 24, 2024 Language Modeling Language Modelling
PoseScript: Linking 3D Human Poses and Natural Language Oct 21, 2022 Cross-Modal Retrieval Image Captioning
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning Dec 20, 2016 Diagnostic Question Answering
Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation Dec 22, 2021 Common Sense Reasoning Question Answering
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations Nov 21, 2022 Contrastive Learning Representation Learning