| OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? | Jan 9, 2025 | BenchmarkingVideo Understanding | CodeCode Available | 2 | 5 |
| Foundation Models for Video Understanding: A Survey | May 6, 2024 | SurveyVideo Understanding | CodeCode Available | 2 | 5 |
| AIN: The Arabic INclusive Large Multimodal Model | Jan 31, 2025 | document understandingmodel | CodeCode Available | 2 | 5 |
| OmniVid: A Generative Framework for Universal Video Understanding | Mar 26, 2024 | Action RecognitionDecoder | CodeCode Available | 2 | 5 |
| ST-LLM: Large Language Models Are Effective Temporal Learners | Mar 30, 2024 | MVBenchReading Comprehension | CodeCode Available | 2 | 5 |
| StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding | Nov 6, 2024 | Image ComprehensionStreaming video understanding | CodeCode Available | 2 | 5 |
| A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future | Jul 18, 2023 | Knowledge Distillationobject-detection | CodeCode Available | 2 | 5 |
| Omni-Video: Democratizing Unified Video Understanding and Generation | Jul 8, 2025 | Video GenerationVideo Understanding | CodeCode Available | 2 | 5 |
| Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | Jun 13, 2024 | BenchmarkingQuestion Answering | CodeCode Available | 2 | 5 |
| Temporal Action Segmentation: An Analysis of Modern Techniques | Oct 19, 2022 | Action SegmentationSegmentation | CodeCode Available | 2 | 5 |
| Neptune: The Long Orbit to Benchmarking Long Video Understanding | Dec 12, 2024 | BenchmarkingMultimodal Reasoning | CodeCode Available | 2 | 5 |
| TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding | Jan 26, 2025 | Video Understanding | CodeCode Available | 2 | 5 |
| Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | Mar 31, 2025 | Logical ReasoningMultiple-choice | CodeCode Available | 2 | 5 |
| FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models | Dec 30, 2024 | Question AnsweringToken Reduction | CodeCode Available | 2 | 5 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Nov 28, 2023 | 3D Question Answering (3D-QA)Diagnostic | CodeCode Available | 2 | 5 |
| OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer | Jun 24, 2024 | AI AgentLarge Language Model | CodeCode Available | 2 | 5 |
| UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | Sep 22, 2022 | Action ClassificationAction Recognition | CodeCode Available | 2 | 5 |
| Beyond MOT: Semantic Multi-Object Tracking | Mar 8, 2024 | Multi-Object TrackingObject | CodeCode Available | 2 | 5 |
| One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory | May 29, 2025 | Contrastive LearningText Retrieval | CodeCode Available | 2 | 5 |
| VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI | Oct 15, 2024 | Question AnsweringVideo Question Answering | CodeCode Available | 2 | 5 |
| PyTorchVideo: A Deep Learning Library for Video Understanding | Nov 18, 2021 | Deep LearningSelf-Supervised Learning | CodeCode Available | 2 | 5 |
| Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos | Aug 26, 2024 | Large Language ModelMVBench | CodeCode Available | 2 | 5 |
| MMVU: Measuring Expert-Level Multi-Discipline Video Understanding | Jan 21, 2025 | Video Understanding | CodeCode Available | 2 | 5 |
| A Content-Driven Micro-Video Recommendation Dataset at Scale | Sep 27, 2023 | BenchmarkingRecommendation Systems | CodeCode Available | 2 | 5 |
| Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model | Mar 27, 2025 | EgoSchemaLanguage Modeling | CodeCode Available | 2 | 5 |