Qwen2.5-VL Technical Report (Feb 19, 2025). Tasks: document understanding. Code available: 11.
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (Sep 18, 2024). Tasks: Natural Language Visual Grounding. Code available: 11.
SWIFT:A Scalable lightWeight Infrastructure for Fine-Tuning (Aug 10, 2024). Tasks: Hallucination; Optical Character Recognition. Code available: 11.
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data (Oct 24, 2024). Tasks: Image Generation; Question Generation. Code available: 7.
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models (Aug 9, 2024). Tasks: Language Modeling; Language Modelling. Code available: 7.
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models (Apr 20, 2023). Tasks: Image Description; Language Modelling. Code available: 7.
Improved Baselines with Visual Instruction Tuning (Oct 5, 2023). Tasks: Factual Inconsistency Detection in Chart Captioning; Image Classification. Code available: 6.
GPT-4 Technical Report (Mar 15, 2023). Tasks: answerability prediction; Arithmetic Reasoning. Code available: 6.
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks (Jun 12, 2024). Tasks: Image Generation; Language Modeling. Code available: 5.
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs (Jun 11, 2024). Tasks: Multiple-choice; Question Answering. Code available: 5.
Ovis: Structural Embedding Alignment for Multimodal Large Language Model (May 31, 2024). Tasks: Language Modeling; Multimodal Large Language Model. Code available: 5.
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document (Mar 7, 2024). Tasks: document understanding; Key Information Extraction. Code available: 5.
CogAgent: A Visual Language Model for GUI Agents (Dec 14, 2023). Tasks: Language Modeling. Code available: 5.
CogVLM: Visual Expert for Pretrained Language Models (Nov 6, 2023). Tasks: 1 Image, 2*2 Stitching; FS-MEVQA. Code available: 5.
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (Aug 24, 2023). Tasks: Chart Question Answering; FS-MEVQA. Code available: 5.
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model (Apr 28, 2023). Tasks: Instruction Following; model. Code available: 5.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention (Mar 28, 2023). Tasks: Instruction Following; Language Modelling. Code available: 5.
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (Jan 28, 2022). Tasks: Image Captioning; Image-text matching. Code available: 5.
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token (Jan 7, 2025). Tasks: GPU; Visual Question Answering (VQA). Code available: 4.
Multi-label Cluster Discrimination for Visual Representation Learning (Jul 24, 2024). Tasks: Contrastive Learning; Image-text Retrieval. Code available: 4.
Tarsier: Recipes for Training and Evaluating Large Video Description Models (Jun 30, 2024). Tasks: Video Captioning; Video Description. Code available: 4.
Long Context Transfer from Language to Vision (Jun 24, 2024). Tasks: Language Modeling; Language Modelling. Code available: 4.
Exploring the Capabilities of Large Multimodal Models on Dense Text (May 9, 2024). Tasks: Prompt Engineering; Visual Question Answering (VQA). Code available: 4.
OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning (May 2, 2024). Tasks: Autonomous Driving; counterfactual. Code available: 4.
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM (Feb 14, 2024). Tasks: Medical Visual Question Answering; Question Answering. Code available: 4.
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (Feb 5, 2024). Tasks: Science Question Answering; Text-to-Video Generation. Code available: 4.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection (Nov 16, 2023). Tasks: Language Modeling; Language Modelling. Code available: 4.
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models (Nov 13, 2023). Tasks: Described Object Detection; Language Modeling. Code available: 4.
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration (Nov 7, 2023). Tasks: 1 Image, 2*2 Stitching; Decoder. Code available: 4.
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models (Aug 2, 2023). Tasks: Visual Question Answering; Visual Question Answering (VQA). Code available: 4.
Otter: A Multi-Modal Model with In-Context Instruction Tuning (May 5, 2023). Tasks: GPU; In-Context Learning. Code available: 4.
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality (Apr 27, 2023). Tasks: Visual Question Answering (VQA); Zero-Shot Video Question Answer. Code available: 4.
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video (Feb 1, 2023). Tasks: Action Classification; Image Classification. Code available: 4.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (Jan 30, 2023). Tasks: Generative Visual Question Answering; Image Captioning. Code available: 4.
InternVideo: General Video Foundation Models via Generative and Discriminative Learning (Dec 6, 2022). Tasks: Action Classification; Action Recognition. Code available: 4.
GLIPv2: Unifying Localization and Vision-Language Understanding (Jun 12, 2022). Tasks: 2D Object Detection; Contrastive Learning. Code available: 4.
Flamingo: a Visual Language Model for Few-Shot Learning (Apr 29, 2022). Tasks: Few-Shot Learning; Generative Visual Question Answering. Code available: 4.
MMSearch-R1: Incentivizing LMMs to Search (Jun 25, 2025). Tasks: RAG; Retrieval-augmented Generation. Code available: 3.
An Empirical Study on Prompt Compression for Large Language Models (Apr 24, 2025). Tasks: Articles; Math. Code available: 3.
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition (Dec 12, 2024). Tasks: EgoSchema. Code available: 3.
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion (Dec 5, 2024). Tasks: Contrastive Learning; Hallucination. Code available: 3.
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent (Nov 5, 2024). Tasks: Benchmarking; Hallucination. Code available: 3.
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models (Oct 16, 2024). Tasks: Diagnostic; Hallucination. Code available: 3.
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine (Aug 6, 2024). Tasks: Medical Visual Question Answering; Organ Detection. Code available: 3.
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale (Jun 27, 2024). Tasks: Visual Question Answering (VQA). Code available: 3.
AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models (Jun 16, 2024). Tasks: Hallucination; Hallucination Evaluation. Code available: 3.
MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making (Apr 22, 2024). Tasks: Decision Making; Medical Diagnosis. Code available: 3.
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding (Apr 8, 2024). Tasks: GPU; Multiple-choice. Code available: 3.
Evaluating Text-to-Visual Generation with Image-to-Text Generation (Apr 1, 2024). Tasks: Image to text; Question Answering. Code available: 3.
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning (Mar 25, 2024). Tasks: Visual Question Answering (VQA). Code available: 3.