| VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI | Oct 15, 2024 | Question Answering, Video Question Answering | Code Available | 2 |
| Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs | Oct 14, 2024 | Computational Efficiency, Question Answering | Code Available | 2 |
| TRACE: Temporal Grounding Video LLM via Causal Event Modeling | Oct 8, 2024 | Text Generation, Video Understanding | Code Available | 2 |
| Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models | Oct 4, 2024 | Dense Video Captioning, Sentence | Code Available | 2 |
| E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding | Sep 26, 2024 | Question Answering, Video Understanding | Code Available | 2 |
| Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos | Aug 26, 2024 | Large Language Model, MVBench | Code Available | 2 |
| LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding | Jul 22, 2024 | Multiple-choice, Question Answering | Code Available | 2 |
| Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision | Jul 8, 2024 | Action Quality Assessment, Descriptive | Code Available | 2 |
| OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer | Jun 24, 2024 | AI Agent, Large Language Model | Code Available | 2 |
| Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | Jun 13, 2024 | Benchmarking, Question Answering | Code Available | 2 |
| LVBench: An Extreme Long Video Understanding Benchmark | Jun 12, 2024 | Decision Making, Video Understanding | Code Available | 2 |
| Vript: A Video Is Worth Thousands of Words | Jun 10, 2024 | Video Captioning, Video Understanding | Code Available | 2 |
| DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark | May 30, 2024 | DeepFake Detection, Mamba | Code Available | 2 |
| VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | May 29, 2024 | EgoSchema, MME | Code Available | 2 |
| Dense Connector for MLLMs | May 22, 2024 | Video Understanding | Code Available | 2 |
| Vision Mamba: A Comprehensive Survey and Taxonomy | May 7, 2024 | Mamba, Medical Image Analysis | Code Available | 2 |
| Foundation Models for Video Understanding: A Survey | May 6, 2024 | Survey, Video Understanding | Code Available | 2 |
| Leveraging Temporal Contextualization for Video Action Recognition | Apr 15, 2024 | Action Recognition, Temporal Action Localization | Code Available | 2 |
| LongVLM: Efficient Long Video Understanding via Large Language Models | Apr 4, 2024 | Question Answering, Video Question Answering | Code Available | 2 |
| ST-LLM: Large Language Models Are Effective Temporal Learners | Mar 30, 2024 | MVBench, Reading Comprehension | Code Available | 2 |
| OmniVid: A Generative Framework for Universal Video Understanding | Mar 26, 2024 | Action Recognition, Decoder | Code Available | 2 |
| Understanding Long Videos with Multimodal Language Models | Mar 25, 2024 | Action Recognition, Fine-grained Action Recognition | Code Available | 2 |
| VideoAgent: Long-form Video Understanding with Large Language Model as Agent | Mar 15, 2024 | EgoSchema, Form | Code Available | 2 |
| Beyond MOT: Semantic Multi-Object Tracking | Mar 8, 2024 | Multi-Object Tracking, Object | Code Available | 2 |
| Video Annotator: A framework for efficiently building video classifiers using vision-language models and active learning | Feb 9, 2024 | Active Learning, Video Classification | Code Available | 2 |
| Multi-granularity Correspondence Learning from Long-term Noisy Videos | Jan 30, 2024 | Action Segmentation, Long Video Retrieval (Background Removed) | Code Available | 2 |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Dec 4, 2023 | Dense Captioning, Highlight Detection | Code Available | 2 |
| Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives | Nov 30, 2023 | Video Understanding | Code Available | 2 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Nov 28, 2023 | 3D Question Answering (3D-QA), Diagnostic | Code Available | 2 |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | Nov 22, 2023 | Benchmarking, Phrase Grounding | Code Available | 2 |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | Nov 14, 2023 | Image-based Generative Performance Benchmarking, Language Modeling | Code Available | 2 |
| A Content-Driven Micro-Video Recommendation Dataset at Scale | Sep 27, 2023 | Benchmarking, Recommendation Systems | Code Available | 2 |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Jul 31, 2023 | Multiple-choice, Question Answering | Code Available | 2 |
| A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future | Jul 18, 2023 | Knowledge Distillation, object-detection | Code Available | 2 |
| Valley: Video Assistant with Large Language model Enhanced abilitY | Jun 12, 2023 | Action Recognition, Instruction Following | Code Available | 2 |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | Jun 7, 2023 | Cross-Modal Retrieval, Language Modelling | Code Available | 2 |
| Query-Dependent Video Representation for Moment Retrieval and Highlight Detection | Mar 24, 2023 | Highlight Detection, Moment Retrieval | Code Available | 2 |
| AIM: Adapting Image Models for Efficient Video Action Recognition | Feb 6, 2023 | Action Classification, Action Recognition | Code Available | 2 |
| UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | Nov 17, 2022 | Video Understanding | Code Available | 2 |
| Temporal Action Segmentation: An Analysis of Modern Techniques | Oct 19, 2022 | Action Segmentation, Segmentation | Code Available | 2 |
| UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | Sep 22, 2022 | Action Classification, Action Recognition | Code Available | 2 |
| ActionFormer: Localizing Moments of Actions with Transformers | Feb 16, 2022 | Action Localization, Action Recognition | Code Available | 2 |
| PyTorchVideo: A Deep Learning Library for Video Understanding | Nov 18, 2021 | Deep Learning, Self-Supervised Learning | Code Available | 2 |
| Attention Mechanisms in Computer Vision: A Survey | Nov 15, 2021 | image-classification, Image Classification | Code Available | 2 |
| TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Device | Sep 27, 2021 | Video Recognition, Video Understanding | Code Available | 2 |
| Video Swin Transformer | Jun 24, 2021 | Action Classification, Action Recognition | Code Available | 2 |
| Is Space-Time Attention All You Need for Video Understanding? | Feb 9, 2021 | Action Classification, Action Recognition | Code Available | 2 |
| Video Instance Segmentation | May 12, 2019 | Instance Segmentation, Segmentation | Code Available | 2 |
| UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks | Jul 15, 2025 | Video Captioning, Video Understanding | Code Available | 1 |
| MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding | Jul 8, 2025 | Autonomous Driving, Video Understanding | Code Available | 1 |