| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | Sep 18, 2024 | Natural Language Visual Grounding | CodeCode Available | 11 |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | Feb 8, 2024 | BenchmarkingDiversity | CodeCode Available | 7 |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | Aug 9, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 7 |
| V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning | Jun 11, 2025 | Action AnticipationLarge Language Model | CodeCode Available | 7 |
| LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | Jul 10, 2024 | Video Question AnsweringZero-Shot Video Question Answer | CodeCode Available | 7 |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | Mar 22, 2024 | Action ClassificationAction Recognition | CodeCode Available | 7 |
| PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | Apr 17, 2025 | Video Question AnsweringVideo Understanding | CodeCode Available | 7 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Apr 20, 2023 | Image DescriptionLanguage Modelling | CodeCode Available | 7 |
| NVILA: Efficient Frontier Visual Language Models | Dec 5, 2024 | Video Question Answering | CodeCode Available | 7 |
| Visual Instruction Tuning | Apr 17, 2023 | 1 Image, 2*2 Stitching3D Question Answering (3D-QA) | CodeCode Available | 6 |
| Aria: An Open Multimodal Native Mixture-of-Experts Model | Oct 8, 2024 | Instruction FollowingMixture-of-Experts | CodeCode Available | 5 |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | Apr 28, 2023 | Instruction Followingmodel | CodeCode Available | 5 |
| VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding | Jan 22, 2025 | PhilosophyVideo Question Answering | CodeCode Available | 5 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | Jun 11, 2024 | Multiple-choiceQuestion Answering | CodeCode Available | 5 |
| PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | Apr 25, 2024 | Dense CaptioningMVBench | CodeCode Available | 4 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Feb 1, 2023 | Action ClassificationImage Classification | CodeCode Available | 4 |
| Flamingo: a Visual Language Model for Few-Shot Learning | Apr 29, 2022 | Few-Shot LearningGenerative Visual Question Answering | CodeCode Available | 4 |
| MovieChat+: Question-aware Sparse Memory for Long Video Question Answering | Apr 26, 2024 | 2kQuestion Answering | CodeCode Available | 4 |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Nov 16, 2023 | Language ModelingLanguage Modelling | CodeCode Available | 4 |
| VideoChat: Chat-Centric Video Understanding | May 10, 2023 | Question AnsweringVideo-based Generative Performance Benchmarking | CodeCode Available | 4 |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | Jun 5, 2023 | Language ModelingLanguage Modelling | CodeCode Available | 4 |
| Tarsier: Recipes for Training and Evaluating Large Video Description Models | Jun 30, 2024 | Video CaptioningVideo Description | CodeCode Available | 4 |
| MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | Apr 4, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 4 |
| Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding | Jan 14, 2025 | Embodied Question AnsweringHallucination | CodeCode Available | 4 |
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | Dec 6, 2022 | Action ClassificationAction Recognition | CodeCode Available | 4 |
| Long Context Transfer from Language to Vision | Jun 24, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 4 |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | Mar 8, 2024 | 1 Image, 2*2 StitchingCode Generation | CodeCode Available | 3 |
| Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams | Jun 12, 2024 | cross-modal alignmentLanguage Modelling | CodeCode Available | 3 |
| Emu: Generative Pretraining in Multimodality | Jul 11, 2023 | Image CaptioningImage Generation | CodeCode Available | 3 |
| Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | Sep 19, 2024 | document understandingVideo Question Answering | CodeCode Available | 3 |
| Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Oct 17, 2022 | Few-Shot LearningImage Captioning | CodeCode Available | 3 |
| Visual Causal Scene Refinement for Video Question Answering | May 7, 2023 | Contrastive LearningQuestion Answering | CodeCode Available | 3 |
| ViperGPT: Visual Inference via Python Execution for Reasoning | Mar 14, 2023 | Code GenerationVideo Question Answering | CodeCode Available | 3 |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | Apr 8, 2024 | GPUMultiple-choice | CodeCode Available | 3 |
| LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | Oct 22, 2024 | Token ReductionVideo Question Answering | CodeCode Available | 3 |
| VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | Mar 17, 2025 | Grounded Video Question AnsweringQuestion Answering | CodeCode Available | 3 |
| VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | Jun 13, 2024 | Dense Video CaptioningMVBench | CodeCode Available | 3 |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Jun 8, 2023 | Question AnsweringVCGBench-Diverse | CodeCode Available | 3 |
| ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for Robot Navigation | Sep 20, 2024 | DescriptiveQuestion Answering | CodeCode Available | 3 |
| VoCo-LLaMA: Towards Vision Compression with Large Language Models | Jun 18, 2024 | Computational EfficiencyQuestion Answering | CodeCode Available | 3 |
| Perception Test: A Diagnostic Benchmark for Multimodal Video Models | May 23, 2023 | DiagnosticGrounded Video Question Answering | CodeCode Available | 2 |
| Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs | Oct 14, 2024 | Computational EfficiencyQuestion Answering | CodeCode Available | 2 |
| Perception Test: A Diagnostic Benchmark for Multimodal Models | Oct 19, 2022 | DiagnosticMultiple-choice | CodeCode Available | 2 |
| PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | Nov 4, 2024 | Caption GenerationMultiple-choice | CodeCode Available | 2 |
| An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | Mar 27, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 2 |
| OmniVid: A Generative Framework for Universal Video Understanding | Mar 26, 2024 | Action RecognitionDecoder | CodeCode Available | 2 |
| Online Video Understanding: OVBench and VideoChat-Online | Dec 31, 2024 | Autonomous DrivingQuestion Answering | CodeCode Available | 2 |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | Nov 14, 2023 | Image-based Generative Performance BenchmarkingLanguage Modeling | CodeCode Available | 2 |
| FreeVA: Offline MLLM as Training-Free Video Assistant | May 13, 2024 | FairnessQuestion Answering | CodeCode Available | 2 |
| CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models | Jun 11, 2025 | counterfactualDescriptive | CodeCode Available | 2 |