Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks Apr 13, 2020 Cross-Modal Retrieval Image Captioning
PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging Jan 5, 2024 Medical Report Generation Medical Visual Question Answering
Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization Feb 18, 2025 Image Retrieval Question Answering
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding Jan 9, 2025 Visual Question Answering (VQA) Visual Reasoning
RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models Jul 6, 2024 Medical Diagnosis RAG
ScreenAI: A Vision-Language Model for UI and Infographics Understanding Feb 7, 2024 Chart Question Answering Language Modeling
NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario May 24, 2023 Autonomous Driving Question Answering
DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models May 31, 2024 cross-modal alignment Visual Localization
SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers Jul 12, 2024 Articles Question Answering
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference Feb 25, 2025 Visual Question Answering (VQA)
Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model Mar 8, 2025 Image Quality Assessment Language Modeling
NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results Apr 17, 2024 Form valid
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering May 20, 2024 Benchmarking Question Answering
CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models Jun 11, 2025 counterfactual Descriptive
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models Mar 29, 2024 Question Answering Visual Question Answering
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset Apr 17, 2023 Audio captioning Audio-Video Question Answering (AVQA)
Neighbourhood Representative Sampling for Efficient End-to-end Video Quality Assessment Oct 11, 2022 Video Quality Assessment Visual Question Answering (VQA)
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding Oct 7, 2022 Chart Question Answering Diversity
MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization Dec 9, 2024 Visual Question Answering (VQA)
MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis Jul 4, 2024 Diagnostic Language Modeling
Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models Apr 16, 2024 image-classification Image Classification
Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis Mar 25, 2025 Contrastive Learning Image-text Retrieval
Visual Programming: Compositional visual reasoning without training Nov 18, 2022 In-Context Learning Question Answering
VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning Mar 19, 2024 Benchmarking Image Captioning
Med-Flamingo: a Multimodal Medical Few-shot Learner Jul 27, 2023 Medical Visual Question Answering Question Answering
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning Mar 10, 2025 Language Modeling Language Modelling
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding Jun 29, 2023 16k Image Captioning
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts May 9, 2024 Image Captioning Instruction Following
LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning Jun 13, 2022 Transfer Learning Visual Question Answering (VQA)
MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis Mar 22, 2024 Medical Diagnosis Medical Visual Question Answering
Learning to Compose Dynamic Tree Structures for Visual Contexts Dec 5, 2018 Graph Generation Panoptic Scene Graph Generation
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning May 11, 2023 1 Image, 2*2 Stitching Diversity
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering Sep 20, 2022 Multimodal Deep Learning Multimodal Reasoning
CoLLaVO: Crayon Large Language and Vision mOdel Feb 17, 2024 Large Language Model model
Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment Nov 26, 2024 Image Quality Assessment Question Answering
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models Oct 23, 2023 Diagnostic Hallucination
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest Jul 7, 2023 Attribute Common Sense Reasoning
GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering Feb 4, 2024 Language Modeling Language Modelling
GIT: A Generative Image-to-text Transformer for Vision and Language May 27, 2022 Decoder Image Captioning
Frontiers in Intelligent Colonoscopy Oct 22, 2024 Image Captioning
Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos Jul 23, 2024 Image Generation Point Tracking
KVQ: Kwai Video Quality Assessment for Short-form Videos Feb 11, 2024 Form Video Quality Assessment
Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering Sep 29, 2023 Image to text Passage Retrieval
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions Aug 19, 2023 MME Optical Character Recognition (OCR)
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI Aug 6, 2024 Question Answering Visual Question Answering
LM4LV: A Frozen Large Language Model for Low-level Vision Tasks May 24, 2024 Language Modeling Language Modelling
PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents Mar 13, 2023 image-classification Image Classification
CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning Aug 10, 2022 Math Mathematical Reasoning
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations Nov 21, 2022 Contrastive Learning Representation Learning
CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations Apr 5, 2022 Explanation Generation Question Answering