| Multi-granularity Correspondence Learning from Long-term Noisy Videos | Jan 30, 2024 | Action Segmentation, Long Video Retrieval (Background Removed) | Code Available | 2 |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Dec 4, 2023 | Dense Captioning, Highlight Detection | Code Available | 2 |
| Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives | Nov 30, 2023 | Video Understanding | Code Available | 2 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Nov 28, 2023 | 3D Question Answering (3D-QA), Diagnostic | Code Available | 2 |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | Nov 22, 2023 | Benchmarking, Phrase Grounding | Code Available | 2 |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | Nov 14, 2023 | Image-based Generative Performance Benchmarking, Language Modeling | Code Available | 2 |
| A Content-Driven Micro-Video Recommendation Dataset at Scale | Sep 27, 2023 | Benchmarking, Recommendation Systems | Code Available | 2 |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Jul 31, 2023 | Multiple-choice, Question Answering | Code Available | 2 |
| A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future | Jul 18, 2023 | Knowledge Distillation, object-detection | Code Available | 2 |
| Valley: Video Assistant with Large Language model Enhanced abilitY | Jun 12, 2023 | Action Recognition, Instruction Following | Code Available | 2 |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | Jun 7, 2023 | Cross-Modal Retrieval, Language Modelling | Code Available | 2 |
| Query-Dependent Video Representation for Moment Retrieval and Highlight Detection | Mar 24, 2023 | Highlight Detection, Moment Retrieval | Code Available | 2 |
| AIM: Adapting Image Models for Efficient Video Action Recognition | Feb 6, 2023 | Action Classification, Action Recognition | Code Available | 2 |
| UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | Nov 17, 2022 | Video Understanding | Code Available | 2 |
| Temporal Action Segmentation: An Analysis of Modern Techniques | Oct 19, 2022 | Action Segmentation, Segmentation | Code Available | 2 |
| UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | Sep 22, 2022 | Action Classification, Action Recognition | Code Available | 2 |
| ActionFormer: Localizing Moments of Actions with Transformers | Feb 16, 2022 | Action Localization, Action Recognition | Code Available | 2 |
| PyTorchVideo: A Deep Learning Library for Video Understanding | Nov 18, 2021 | Deep Learning, Self-Supervised Learning | Code Available | 2 |
| Attention Mechanisms in Computer Vision: A Survey | Nov 15, 2021 | image-classification, Image Classification | Code Available | 2 |
| TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Device | Sep 27, 2021 | Video Recognition, Video Understanding | Code Available | 2 |
| Video Swin Transformer | Jun 24, 2021 | Action Classification, Action Recognition | Code Available | 2 |
| Is Space-Time Attention All You Need for Video Understanding? | Feb 9, 2021 | Action Classification, Action Recognition | Code Available | 2 |
| Video Instance Segmentation | May 12, 2019 | Instance Segmentation, Segmentation | Code Available | 2 |
| UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks | Jul 15, 2025 | Video Captioning, Video Understanding | Code Available | 1 |
| MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding | Jul 8, 2025 | Autonomous Driving, Video Understanding | Code Available | 1 |