| TSM: Temporal Shift Module for Efficient Video Understanding | Nov 20, 2018 | 3D Action Recognition, Action Classification | Code Available | 1 |
| Language Repository for Long Video Understanding | Mar 21, 2024 | EgoSchema, Question Answering | Code Available | 1 |
| Test of Time: Instilling Video-Language Models with a Sense of Time | Jan 5, 2023 | Video-Text Retrieval, Video Understanding | Code Available | 1 |
| EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval | Jul 23, 2024 | Re-Ranking, Retrieval | Code Available | 1 |
| Learning Optical Flow with Adaptive Graph Reasoning | Feb 8, 2022 | Motion Estimation, Optical Flow Estimation | Code Available | 1 |
| Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation | Oct 31, 2024 | Action Segmentation, Action Understanding | Code Available | 1 |
| An Empirical Study of End-to-End Temporal Action Detection | Apr 6, 2022 | Action Classification, Action Detection | Code Available | 1 |
| Language-Guided Audio-Visual Learning for Long-Term Sports Assessment | Jan 1, 2025 | audio-visual learning, Knowledge Graphs | Code Available | 1 |
| Learning Salient Boundary Feature for Anchor-free Temporal Action Localization | Mar 24, 2021 | Action Localization, Temporal Action Localization | Code Available | 1 |
| ∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Jan 31, 2025 | Question Answering, Video Question Answering | Code Available | 1 |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Dec 4, 2024 | Multimodal Large Language Model, Video Understanding | Code Available | 1 |
| EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens | Nov 19, 2022 | Action Recognition, Object State Change Classification | Code Available | 1 |
| Towards High-Quality Temporal Action Detection with Sparse Proposals | Sep 18, 2021 | Action Detection, Avg | Code Available | 1 |
| IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs | Apr 21, 2025 | Video Understanding | Code Available | 1 |
| Towards Smooth Video Composition | Dec 14, 2022 | Image Generation, single-image-generation | Code Available | 1 |
| Is Appearance Free Action Recognition Possible? | Jul 13, 2022 | Action Recognition, Optical Flow Estimation | Code Available | 1 |
| Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach | May 10, 2023 | Autonomous Vehicles, Monocular Visual Odometry | Code Available | 1 |
| InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Jun 28, 2024 | Multiple-choice, Video Understanding | Code Available | 1 |
| Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention | Jun 11, 2021 | Action Recognition, Sign Language Recognition | Code Available | 1 |
| HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization | Aug 12, 2024 | Action Localization, Temporal Action Localization | Code Available | 1 |
| CyberV: Cybernetics for Test-time Scaling in Video Understanding | Jun 9, 2025 | Video Understanding | Code Available | 1 |
| TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes | Feb 4, 2025 | Autonomous Driving, Multiple-choice | Code Available | 1 |
| Learning Self-Similarity in Space and Time as a Generalized Motion for Action Recognition | Jan 1, 2021 | Action Recognition, Video Understanding | Code Available | 1 |
| MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing | Nov 24, 2021 | audio-visual event localization, Video Understanding | Code Available | 1 |
| TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models | Oct 30, 2024 | Video Understanding | Code Available | 1 |