PaLM-E: An Embodied Multimodal Language Model | Mar 6, 2023 | Language Modeling; Language Modelling
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference | Feb 25, 2025 | Visual Question Answering (VQA)
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding | Jul 2, 2024 | document understanding; Key Information Extraction
RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models | Jul 6, 2024 | Medical Diagnosis; RAG
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories | Mar 11, 2025 | Decision Making; Interactive Segmentation
Exploring Video Quality Assessment on User Generated Contents from Aesthetic and Technical Perspectives | Nov 9, 2022 | Disentanglement; Video Generation
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models | May 13, 2023 | Key Information Extraction; Nutrition
Path-RAG: Knowledge-Guided Key Region Retrieval for Open-ended Pathology Visual Question Answering | Nov 26, 2024 | Prognosis; Question Answering
Neighbourhood Representative Sampling for Efficient End-to-end Video Quality Assessment | Oct 11, 2022 | Video Quality Assessment; Visual Question Answering (VQA)
Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model | Mar 8, 2025 | Image Quality Assessment; Language Modeling
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering | May 20, 2024 | Benchmarking; Question Answering
NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results | Apr 17, 2024 | Form; valid
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | May 9, 2024 | Image Captioning; Instruction Following
CoLLaVO: Crayon Large Language and Vision mOdel | Feb 17, 2024 | Large Language Model; model
VE-Bench: Subjective-Aligned Benchmark Suite for Text-Driven Video Editing Quality Assessment | Aug 21, 2024 | Video Alignment; Video Editing
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Apr 17, 2023 | Audio captioning; Audio-Video Question Answering (AVQA)
NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario | May 24, 2023 | Autonomous Driving; Question Answering
PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging | Jan 5, 2024 | Medical Report Generation; Medical Visual Question Answering
MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis | Jul 4, 2024 | Diagnostic; Language Modeling
MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization | Dec 9, 2024 | Visual Question Answering (VQA)
Med-Flamingo: a Multimodal Medical Few-shot Learner | Jul 27, 2023 | Medical Visual Question Answering; Question Answering
Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis | Mar 25, 2025 | Contrastive Learning; Image-text Retrieval
VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning | Mar 19, 2024 | Benchmarking; Image Captioning
EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation | Dec 24, 2024 | Image Captioning; Image Generation
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning | Mar 10, 2025 | Language Modeling; Language Modelling
MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis | Mar 22, 2024 | Medical Diagnosis; Medical Visual Question Answering
LM4LV: A Frozen Large Language Model for Low-level Vision Tasks | May 24, 2024 | Language Modeling; Language Modelling
LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning | Jun 13, 2022 | Transfer Learning; Visual Question Answering (VQA)
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Sep 20, 2022 | Multimodal Deep Learning; Multimodal Reasoning
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Jun 29, 2023 | 16k; Image Captioning
MDETR - Modulated Detection for End-to-End Multi-Modal Understanding | Jan 1, 2021 | Phrase Grounding; Question Answering
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | May 11, 2023 | 1 Image, 2*2 Stitching; Diversity
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models | Oct 23, 2023 | Diagnostic; Hallucination
Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment | Nov 26, 2024 | Image Quality Assessment; Question Answering
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Aug 19, 2023 | MME; Optical Character Recognition (OCR)
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | Jul 7, 2023 | Attribute; Common Sense Reasoning
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI | Aug 6, 2024 | Question Answering; Visual Question Answering
KVQ: Kwai Video Quality Assessment for Short-form Videos | Feb 11, 2024 | Form; Video Quality Assessment
Frontiers in Intelligent Colonoscopy | Oct 22, 2024 | Image Captioning
GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering | Feb 4, 2024 | Language Modeling; Language Modelling
Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos | Jul 23, 2024 | Image Generation; Point Tracking
GIT: A Generative Image-to-text Transformer for Vision and Language | May 27, 2022 | Decoder; Image Captioning
Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | Sep 29, 2023 | Image to text; Passage Retrieval
CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models | Jun 11, 2025 | counterfactual; Descriptive
Learning to Compose Dynamic Tree Structures for Visual Contexts | Dec 5, 2018 | Graph Generation; Panoptic Scene Graph Generation
Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models | Apr 16, 2024 | image-classification; Image Classification
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | Oct 7, 2022 | Chart Question Answering; Diversity
Evaluating Multimodal Representations on Visual Semantic Textual Similarity | Apr 4, 2020 | Benchmarking; Image Captioning
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | Nov 21, 2022 | Contrastive Learning; Representation Learning
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | Oct 12, 2022 | document-image-classification; Document Image Classification