Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution Sep 18, 2024 Natural Language Visual Grounding
Qwen2.5-VL Technical Report Feb 19, 2025 document understanding
SWIFT:A Scalable lightWeight Infrastructure for Fine-Tuning Aug 10, 2024 Hallucination Optical Character Recognition
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models Aug 9, 2024 Language Modeling Language Modelling
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data Oct 24, 2024 Image Generation Question Generation
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models Apr 20, 2023 Image Description Language Modelling
GPT-4 Technical Report Mar 15, 2023 answerability prediction Arithmetic Reasoning
Improved Baselines with Visual Instruction Tuning Oct 5, 2023 Factual Inconsistency Detection in Chart Captioning Image Classification
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs Jun 11, 2024 Multiple-choice Question Answering
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document Mar 7, 2024 document understanding Key Information Extraction
Ovis: Structural Embedding Alignment for Multimodal Large Language Model May 31, 2024 Language Modeling Multimodal Large Language Model
CogAgent: A Visual Language Model for GUI Agents Dec 14, 2023 Language Modeling
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation Jan 28, 2022 Image Captioning Image-text matching
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks Jun 12, 2024 Image Generation Language Modeling
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention Mar 28, 2023 Instruction Following Language Modelling
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model Apr 28, 2023 Instruction Following model
CogVLM: Visual Expert for Pretrained Language Models Nov 6, 2023 1 Image, 2*2 Stitching FS-MEVQA
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond Aug 24, 2023 Chart Question Answering FS-MEVQA
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models Aug 2, 2023 Visual Question Answering Visual Question Answering (VQA)
Otter: A Multi-Modal Model with In-Context Instruction Tuning May 5, 2023 GPU In-Context Learning
GLIPv2: Unifying Localization and Vision-Language Understanding Jun 12, 2022 2D Object Detection Contrastive Learning
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality Apr 27, 2023 Visual Question Answering (VQA) Zero-Shot Video Question Answer
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration Nov 7, 2023 1 Image, 2*2 Stitching Decoder
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM Feb 14, 2024 Medical Visual Question Answering Question Answering
Exploring the Capabilities of Large Multimodal Models on Dense Text May 9, 2024 Prompt Engineering Visual Question Answering (VQA)
OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning May 2, 2024 Autonomous Driving counterfactual
Flamingo: a Visual Language Model for Few-Shot Learning Apr 29, 2022 Few-Shot Learning Generative Visual Question Answering
Multi-label Cluster Discrimination for Visual Representation Learning Jul 24, 2024 Contrastive Learning Image-text Retrieval
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection Nov 16, 2023 Language Modeling Language Modelling
Tarsier: Recipes for Training and Evaluating Large Video Description Models Jun 30, 2024 Video Captioning Video Description
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization Feb 5, 2024 Science Question Answering Text-to-Video Generation
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token Jan 7, 2025 GPU Visual Question Answering (VQA)
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models Nov 13, 2023 Described Object Detection Language Modeling
InternVideo: General Video Foundation Models via Generative and Discriminative Learning Dec 6, 2022 Action Classification Action Recognition
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video Feb 1, 2023 Action Classification Image Classification
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models Jan 30, 2023 Generative Visual Question Answering Image Captioning
Long Context Transfer from Language to Vision Jun 24, 2024 Language Modeling Language Modelling
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale Jun 27, 2024 Visual Question Answering (VQA)
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities May 18, 2023 1 Image, 2*2 Stitchi Action Classification
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent Nov 5, 2024 Benchmarking Hallucination
All You May Need for VQA are Image Captions May 4, 2022 All Image Captioning
Emu: Generative Pretraining in Multimodality Jul 11, 2023 Image Captioning Image Generation
Evaluating Text-to-Visual Generation with Image-to-Text Generation Apr 1, 2024 Image to text Question Answering
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models Oct 16, 2024 Diagnostic Hallucination
MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making Apr 22, 2024 Decision Making Medical Diagnosis
MMSearch-R1: Incentivizing LMMs to Search Jun 25, 2025 RAG Retrieval-augmented Generation
DriveLM: Driving with Graph Visual Question Answering Dec 21, 2023 Autonomous Driving Question Answering
Ludwig: a type-based declarative deep learning toolbox Sep 17, 2019 Decoder Deep Learning
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition Dec 12, 2024 EgoSchema
OCR-free Document Understanding Transformer Nov 30, 2021 Document Image Classification document understanding