| LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | Oct 22, 2024 | Token Reduction, Video Question Answering | Code Available | 3 |
| VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | Jun 13, 2024 | Dense Video Captioning, MVBench | Code Available | 3 |
| LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture | Sep 4, 2024 | GPU, Mamba | Code Available | 3 |
| Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | Nov 20, 2024 | GPU, MME | Code Available | 3 |
| Valley2: Exploring Multimodal Models with Scalable Vision-Language Design | Jan 10, 2025 | Image Captioning, Language Modeling | Code Available | 3 |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Jun 8, 2023 | Question Answering, VCGBench-Diverse | Code Available | 3 |
| Towards Universal Soccer Video Understanding | Dec 2, 2024 | Action Classification, Sports Understanding | Code Available | 3 |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | Apr 8, 2024 | GPU, Multiple-choice | Code Available | 3 |
| VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning | Apr 9, 2025 | MVBench, Object Tracking | Code Available | 3 |
| Video ReCap: Recursive Captioning of Hour-Long Videos | Feb 20, 2024 | EgoSchema, Video Captioning | Code Available | 3 |
| Flash-VStream: Efficient Real-Time Understanding for Long Video Streams | Jun 30, 2025 | cross-modal alignment, EgoSchema | Code Available | 3 |
| Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams | Jun 12, 2024 | cross-modal alignment, Language Modelling | Code Available | 3 |
| Temporal Action Segmentation: An Analysis of Modern Techniques | Oct 19, 2022 | Action Segmentation, Segmentation | Code Available | 2 |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Dec 4, 2023 | Dense Captioning, Highlight Detection | Code Available | 2 |
| TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability | Nov 27, 2024 | Temporal Localization, Video Understanding | Code Available | 2 |
| SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding | Feb 15, 2025 | Question Answering, Streaming video understanding | Code Available | 2 |
| Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge | Jan 23, 2025 | Scheduling, Streaming video understanding | Code Available | 2 |
| ST-LLM: Large Language Models Are Effective Temporal Learners | Mar 30, 2024 | MVBench, Reading Comprehension | Code Available | 2 |
| Foundation Models for Video Understanding: A Survey | May 6, 2024 | Survey, Video Understanding | Code Available | 2 |
| StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding | Nov 6, 2024 | Image Comprehension, Streaming video understanding | Code Available | 2 |
| TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | Oct 25, 2024 | EgoSchema, Hallucination | Code Available | 2 |
| Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | Mar 31, 2025 | Logical Reasoning, Multiple-choice | Code Available | 2 |
| AIN: The Arabic INclusive Large Multimodal Model | Jan 31, 2025 | document understanding, model | Code Available | 2 |
| AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning | Dec 4, 2024 | Video Understanding | Code Available | 2 |
| Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency | Jun 2, 2025 | reinforcement-learning, Reinforcement Learning | Code Available | 2 |
| Re-thinking Temporal Search for Long-Form Video Understanding | Apr 3, 2025 | Computational Efficiency, Form | Code Available | 2 |
| QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design | May 22, 2025 | CPU, GPU | Code Available | 2 |
| E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding | Sep 26, 2024 | Question Answering, Video Understanding | Code Available | 2 |
| QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension | Mar 11, 2025 | AutoML, Decoder | Code Available | 2 |
| Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation | Apr 3, 2025 | Computational Efficiency, GPU | Code Available | 2 |
| SpaceR: Reinforcing MLLMs in Video Spatial Reasoning | Apr 2, 2025 | MME, Spatial Reasoning | Code Available | 2 |
| TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding | Jan 26, 2025 | Video Understanding | Code Available | 2 |
| OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? | Jan 9, 2025 | Benchmarking, Video Understanding | Code Available | 2 |
| Attention Mechanisms in Computer Vision: A Survey | Nov 15, 2021 | image-classification, Image Classification | Code Available | 2 |
| Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives | Nov 30, 2023 | Video Understanding | Code Available | 2 |
| One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory | May 29, 2025 | Contrastive Learning, Text Retrieval | Code Available | 2 |
| Omni-Video: Democratizing Unified Video Understanding and Generation | Jul 8, 2025 | Video Generation, Video Understanding | Code Available | 2 |
| Online Video Understanding: OVBench and VideoChat-Online | Dec 31, 2024 | Autonomous Driving, Question Answering | Code Available | 2 |
| OmniVid: A Generative Framework for Universal Video Understanding | Mar 26, 2024 | Action Recognition, Decoder | Code Available | 2 |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | Nov 22, 2023 | Benchmarking, Phrase Grounding | Code Available | 2 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Nov 28, 2023 | 3D Question Answering (3D-QA), Diagnostic | Code Available | 2 |
| ActionFormer: Localizing Moments of Actions with Transformers | Feb 16, 2022 | Action Localization, Action Recognition | Code Available | 2 |
| Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | Jun 13, 2024 | Benchmarking, Question Answering | Code Available | 2 |
| A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future | Jul 18, 2023 | Knowledge Distillation, object-detection | Code Available | 2 |
| Multi-granularity Correspondence Learning from Long-term Noisy Videos | Jan 30, 2024 | Action Segmentation, Long Video Retrieval (Background Removed) | Code Available | 2 |
| PruneVid: Visual Token Pruning for Efficient Video Large Language Models | Dec 20, 2024 | Video Understanding | Code Available | 2 |
| PyTorchVideo: A Deep Learning Library for Video Understanding | Nov 18, 2021 | Deep Learning, Self-Supervised Learning | Code Available | 2 |
| Query-Dependent Video Representation for Moment Retrieval and Highlight Detection | Mar 24, 2023 | Highlight Detection, Moment Retrieval | Code Available | 2 |
| AIM: Adapting Image Models for Efficient Video Action Recognition | Feb 6, 2023 | Action Classification, Action Recognition | Code Available | 2 |
| Neptune: The Long Orbit to Benchmarking Long Video Understanding | Dec 12, 2024 | Benchmarking, Multimodal Reasoning | Code Available | 2 |