| CogVLM2: Visual Language Models for Image and Video Understanding | Aug 29, 2024 | MM-Vet, MVBench | Code Available | 9 |
| Video-R1: Reinforcing Video Reasoning in MLLMs | Mar 27, 2025 | MVBench, Reinforcement Learning (RL) | Code Available | 4 |
| PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | Apr 25, 2024 | Dense Captioning, MVBench | Code Available | 4 |
| Flash-VStream: Efficient Real-Time Understanding for Long Video Streams | Jun 30, 2025 | cross-modal alignment, EgoSchema | Code Available | 3 |
| VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning | Apr 9, 2025 | MVBench, Object Tracking | Code Available | 3 |
| VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | Jun 13, 2024 | Dense Video Captioning, MVBench | Code Available | 3 |
| Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | Dec 12, 2024 | EgoSchema | Code Available | 3 |
| ST-LLM: Large Language Models Are Effective Temporal Learners | Mar 30, 2024 | MVBench, Reading Comprehension | Code Available | 2 |
| LLaVAction: evaluating and training multi-modal large language models for action recognition | Mar 24, 2025 | Action Recognition, Action Understanding | Code Available | 2 |
| Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model | Mar 27, 2025 | EgoSchema, Language Modeling | Code Available | 2 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Nov 28, 2023 | 3D Question Answering (3D-QA), Diagnostic | Code Available | 2 |
| Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos | Aug 26, 2024 | Large Language Model, MVBench | Code Available | 2 |
| TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | Nov 17, 2024 | MVBench, Video-based Generative Performance Benchmarking | Code Available | 1 |
| VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding | May 2, 2025 | Anomaly Detection, Common Sense Reasoning | Code Available | 1 |
| VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment | Apr 18, 2025 | MVBench | Unverified | 0 |
| VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges | Sep 2, 2024 | GPU, MVBench | Unverified | 0 |
| GAM-Agent: Game-Theoretic and Uncertainty-Aware Collaboration for Complex Visual Reasoning | May 29, 2025 | Multimodal Reasoning, MVBench | Unverified | 0 |
| VideoSAVi: Self-Aligned Video Language Models without Human Supervision | Dec 1, 2024 | EgoSchema, MVBench | Unverified | 0 |
| Enhancing Temporal Modeling of Video LLMs via Time Gating | Oct 8, 2024 | MVBench, Question Answering | Code Available | 0 |