| MiniCPM-V: A GPT-4V Level MLLM on Your Phone | Aug 3, 2024 | Hallucination; Multiple-choice | Code Available | 12 | 5 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | Sep 18, 2024 | Natural Language Visual Grounding | Code Available | 11 | 5 |
| InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | Mar 22, 2024 | Action Classification; Action Recognition | Code Available | 7 | 5 |
| Qwen2.5-Omni Technical Report | Mar 26, 2025 | Automatic Speech Recognition (ASR); GSM8K | Code Available | 7 | 5 |
| LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | Jul 10, 2024 | Video Question Answering; Zero-Shot Video Question Answer | Code Available | 7 | 5 |
| Mistral 7B | Oct 10, 2023 | answerability prediction; Arithmetic Reasoning | Code Available | 6 | 5 |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | Apr 28, 2023 | Instruction Following; model | Code Available | 5 | 5 |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | Jun 11, 2024 | Multiple-choice; Question Answering | Code Available | 5 | 5 |
| Flamingo: a Visual Language Model for Few-Shot Learning | Apr 29, 2022 | Few-Shot Learning; Generative Visual Question Answering | Code Available | 4 | 5 |
| VILA: On Pre-training for Visual Language Models | Dec 12, 2023 | In-Context Learning; Language Modelling | Code Available | 4 | 5 |
| MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | Apr 4, 2024 | Language Modeling; Language Modelling | Code Available | 4 | 5 |
| Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | Feb 5, 2024 | Science Question Answering; Text-to-Video Generation | Code Available | 4 | 5 |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | Apr 27, 2023 | Visual Question Answering (VQA); Zero-Shot Video Question Answer | Code Available | 4 | 5 |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | Jun 5, 2023 | Language Modeling; Language Modelling | Code Available | 4 | 5 |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Nov 16, 2023 | Language Modeling; Language Modelling | Code Available | 4 | 5 |
| PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | Apr 25, 2024 | Dense Captioning; MVBench | Code Available | 4 | 5 |
| Tarsier: Recipes for Training and Evaluating Large Video Description Models | Jun 30, 2024 | Video Captioning; Video Description | Code Available | 4 | 5 |
| VideoChat: Chat-Centric Video Understanding | May 10, 2023 | Question Answering; Video-based Generative Performance Benchmarking | Code Available | 4 | 5 |
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | Dec 6, 2022 | Action Classification; Action Recognition | Code Available | 4 | 5 |
| LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token | Jan 7, 2025 | GPU; Visual Question Answering (VQA) | Code Available | 4 | 5 |
| Long Context Transfer from Language to Vision | Jun 24, 2024 | Language Modeling; Language Modelling | Code Available | 4 | 5 |
| Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | Nov 20, 2024 | GPU; MME | Code Available | 3 | 5 |
| VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | Jun 13, 2024 | Dense Video Captioning; MVBench | Code Available | 3 | 5 |
| Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams | Jun 12, 2024 | cross-modal alignment; Language Modelling | Code Available | 3 | 5 |
| VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | Mar 17, 2025 | Grounded Video Question Answering; Question Answering | Code Available | 3 | 5 |
| LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | Oct 22, 2024 | Token Reduction; Video Question Answering | Code Available | 3 | 5 |
| ViperGPT: Visual Inference via Python Execution for Reasoning | Mar 14, 2023 | Code Generation; Video Question Answering | Code Available | 3 | 5 |
| SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | Jul 22, 2024 | Language Modeling | Code Available | 3 | 5 |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Jun 8, 2023 | Question Answering; VCGBench-Diverse | Code Available | 3 | 5 |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | Mar 8, 2024 | 1 Image, 2*2 Stitching; Code Generation | Code Available | 3 | 5 |
| Video ReCap: Recursive Captioning of Hour-Long Videos | Feb 20, 2024 | EgoSchema; Video Captioning | Code Available | 3 | 5 |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | Nov 14, 2023 | Image-based Generative Performance Benchmarking; Language Modeling | Code Available | 2 | 5 |
| VideoAgent: Long-form Video Understanding with Large Language Model as Agent | Mar 15, 2024 | EgoSchema; Form | Code Available | 2 | 5 |
| Elysium: Exploring Object-level Perception in Videos via MLLM | Mar 25, 2024 | Object; Object Tracking | Code Available | 2 | 5 |
| vid-TLDR: Training Free Token merging for Light-weight Video Transformer | Mar 20, 2024 | Action Recognition; Computational Efficiency | Code Available | 2 | 5 |
| LinVT: Empower Your Image-level Large Language Model to Understand Videos | Dec 6, 2024 | Language Modeling; Language Modelling | Code Available | 2 | 5 |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | Nov 28, 2023 | Image Captioning; Question Answering | Code Available | 2 | 5 |
| VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | May 29, 2024 | EgoSchema; MME | Code Available | 2 | 5 |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Jul 31, 2023 | Multiple-choice; Question Answering | Code Available | 2 | 5 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Nov 28, 2023 | 3D Question Answering (3D-QA); Diagnostic | Code Available | 2 | 5 |
| Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | Jun 13, 2024 | Benchmarking; Question Answering | Code Available | 2 | 5 |
| PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | Nov 4, 2024 | Caption Generation; Multiple-choice | Code Available | 2 | 5 |
| An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | Mar 27, 2024 | Language Modeling; Language Modelling | Code Available | 2 | 5 |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Dec 4, 2023 | Dense Captioning; Highlight Detection | Code Available | 2 | 5 |
| TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | Oct 25, 2024 | EgoSchema; Hallucination | Code Available | 2 | 5 |
| Understanding Long Videos with Multimodal Language Models | Mar 25, 2024 | Action Recognition; Fine-grained Action Recognition | Code Available | 2 | 5 |
| Valley: Video Assistant with Large Language model Enhanced abilitY | Jun 12, 2023 | Action Recognition; Instruction Following | Code Available | 2 | 5 |
| CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | Mar 7, 2024 | Audio-visual Question Answering; Audio-Visual Question Answering (AVQA) | Code Available | 2 | 5 |
| Language Repository for Long Video Understanding | Mar 21, 2024 | EgoSchema; Question Answering | Code Available | 1 | 5 |
| Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | Dec 16, 2023 | Video Captioning; video narration captioning | Code Available | 1 | 5 |