| AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning | Dec 4, 2024 | Video Understanding | Code Available | 2 | 5 |
| E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding | Sep 26, 2024 | Question Answering, Video Understanding | Code Available | 2 | 5 |
| TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning | Apr 13, 2025 | Question Answering, reinforcement-learning | Code Available | 2 | 5 |
| SpaceR: Reinforcing MLLMs in Video Spatial Reasoning | Apr 2, 2025 | MME, Spatial Reasoning | Code Available | 2 | 5 |
| ST-LLM: Large Language Models Are Effective Temporal Learners | Mar 30, 2024 | MVBench, Reading Comprehension | Code Available | 2 | 5 |
| A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future | Jul 18, 2023 | Knowledge Distillation, object-detection | Code Available | 2 | 5 |
| StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding | Nov 6, 2024 | Image Comprehension, Streaming video understanding | Code Available | 2 | 5 |
| Is Space-Time Attention All You Need for Video Understanding? | Feb 9, 2021 | Action Classification, Action Recognition | Code Available | 2 | 5 |
| InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models | Dec 18, 2024 | Reasoning Segmentation, Segmentation | Code Available | 2 | 5 |
| Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge | Jan 23, 2025 | Scheduling, Streaming video understanding | Code Available | 2 | 5 |
| QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design | May 22, 2025 | CPU, GPU | Code Available | 2 | 5 |
| A Content-Driven Micro-Video Recommendation Dataset at Scale | Sep 27, 2023 | Benchmarking, Recommendation Systems | Code Available | 2 | 5 |
| QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension | Mar 11, 2025 | AutoML, Decoder | Code Available | 2 | 5 |
| PyTorchVideo: A Deep Learning Library for Video Understanding | Nov 18, 2021 | Deep Learning, Self-Supervised Learning | Code Available | 2 | 5 |
| AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding | Mar 16, 2025 | Video Understanding | Code Available | 2 | 5 |
| Query-Dependent Video Representation for Moment Retrieval and Highlight Detection | Mar 24, 2023 | Highlight Detection, Moment Retrieval | Code Available | 2 | 5 |
| video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models | Jun 18, 2025 | Audio captioning, Large Language Model | Code Available | 2 | 5 |
| Beyond MOT: Semantic Multi-Object Tracking | Mar 8, 2024 | Multi-Object Tracking, Object | Code Available | 2 | 5 |
| Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency | Jun 2, 2025 | reinforcement-learning, Reinforcement Learning | Code Available | 2 | 5 |
| PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | Nov 4, 2024 | Caption Generation, Multiple-choice | Code Available | 2 | 5 |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | Nov 22, 2023 | Benchmarking, Phrase Grounding | Code Available | 2 | 5 |
| PruneVid: Visual Token Pruning for Efficient Video Large Language Models | Dec 20, 2024 | Video Understanding | Code Available | 2 | 5 |
| One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory | May 29, 2025 | Contrastive Learning, Text Retrieval | Code Available | 2 | 5 |
| Adaptive Keyframe Sampling for Long Video Understanding | Jan 1, 2025 | Video Understanding | Code Available | 2 | 5 |
| Omni-Video: Democratizing Unified Video Understanding and Generation | Jul 8, 2025 | Video Generation, Video Understanding | Code Available | 2 | 5 |
| FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models | Dec 30, 2024 | Question Answering, Token Reduction | Code Available | 2 | 5 |
| OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? | Jan 9, 2025 | Benchmarking, Video Understanding | Code Available | 2 | 5 |
| Re-thinking Temporal Search for Long-Form Video Understanding | Apr 3, 2025 | Computational Efficiency, Form | Code Available | 2 | 5 |
| Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | Jun 13, 2024 | Benchmarking, Question Answering | Code Available | 2 | 5 |
| Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models | Oct 4, 2024 | Dense Video Captioning, Sentence | Code Available | 2 | 5 |
| Neptune: The Long Orbit to Benchmarking Long Video Understanding | Dec 12, 2024 | Benchmarking, Multimodal Reasoning | Code Available | 2 | 5 |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Jul 31, 2023 | Multiple-choice, Question Answering | Code Available | 2 | 5 |
| Multi-granularity Correspondence Learning from Long-term Noisy Videos | Jan 30, 2024 | Action Segmentation, Long Video Retrieval (Background Removed) | Code Available | 2 | 5 |
| MMVU: Measuring Expert-Level Multi-Discipline Video Understanding | Jan 21, 2025 | Video Understanding | Code Available | 2 | 5 |
| Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model | Mar 27, 2025 | EgoSchema, Language Modeling | Code Available | 2 | 5 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Nov 28, 2023 | 3D Question Answering (3D-QA), Diagnostic | Code Available | 2 | 5 |
| OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer | Jun 24, 2024 | AI Agent, Large Language Model | Code Available | 2 | 5 |
| LVBench: An Extreme Long Video Understanding Benchmark | Jun 12, 2024 | Decision Making, Video Understanding | Code Available | 2 | 5 |
| LongVLM: Efficient Long Video Understanding via Large Language Models | Apr 4, 2024 | Question Answering, Video Question Answering | Code Available | 2 | 5 |
| LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos | Nov 29, 2024 | Boundary Detection, Dense Video Captioning | Code Available | 2 | 5 |
| Dense Connector for MLLMs | May 22, 2024 | Video Understanding | Code Available | 2 | 5 |
| DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark | May 30, 2024 | DeepFake Detection, Mamba | Code Available | 2 | 5 |
| LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding | Jul 22, 2024 | Multiple-choice, Question Answering | Code Available | 2 | 5 |
| OmniVid: A Generative Framework for Universal Video Understanding | Mar 26, 2024 | Action Recognition, Decoder | Code Available | 2 | 5 |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | Nov 14, 2023 | Image-based Generative Performance Benchmarking, Language Modeling | Code Available | 2 | 5 |
| Online Video Understanding: OVBench and VideoChat-Online | Dec 31, 2024 | Autonomous Driving, Question Answering | Code Available | 2 | 5 |
| Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation | Apr 3, 2025 | Computational Efficiency, GPU | Code Available | 2 | 5 |
| TRACE: Temporal Grounding Video LLM via Causal Event Modeling | Oct 8, 2024 | Text Generation, Video Understanding | Code Available | 2 | 5 |
| Boosting Single Image Super-Resolution via Partial Channel Shifting | Jan 1, 2023 | Diversity, Image Super-Resolution | Code Available | 1 | 5 |
| Free Lunch for Surgical Video Understanding by Distilling Self-Supervisions | May 19, 2022 | Contrastive Learning, Self-Supervised Learning | Code Available | 1 | 5 |