| Elaborative Rehearsal for Zero-shot Action Recognition | Aug 5, 2021 | Action Recognition, Few-Shot Learning | Code Available | 1 |
| Free Lunch for Surgical Video Understanding by Distilling Self-Supervisions | May 19, 2022 | Contrastive Learning, Self-Supervised Learning | Code Available | 1 |
| MMAD: Multi-label Micro-Action Detection in Videos | Jul 7, 2024 | Action Analysis, Action Detection | Code Available | 1 |
| Learning Transferable Spatiotemporal Representations from Natural Script Knowledge | Sep 30, 2022 | Descriptive, Representation Learning | Code Available | 1 |
| Spatial-Temporal Transformer for Dynamic Scene Graph Generation | Jul 26, 2021 | Decoder, Scene Graph Generation | Code Available | 1 |
| Spatio-temporal Prompting Network for Robust Video Feature Extraction | Feb 4, 2024 | Instance Segmentation, object-detection | Code Available | 1 |
| SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos | Apr 6, 2024 | Graph Generation, Relation | Code Available | 1 |
| From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities | Jan 10, 2025 | Human-Object Interaction Detection, Knowledge Distillation | Code Available | 1 |
| Learning the Predictability of the Future | Jun 19, 2021 | Representation Learning, Self-Supervised Action Recognition | Code Available | 1 |
| From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding | Sep 27, 2024 | Video Understanding, Visual Reasoning | Code Available | 1 |
| EgoTaskQA: Understanding Human Tasks in Egocentric Videos | Oct 8, 2022 | Action Localization, counterfactual | Code Available | 1 |
| STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding | Mar 20, 2025 | Video Understanding, Zero-shot Generalization | Code Available | 1 |
| EgoSurgery-Phase: A Dataset of Surgical Phase Recognition from Egocentric Open Surgery Videos | May 30, 2024 | Action Recognition, Surgical phase recognition | Code Available | 1 |
| Leveraging triplet loss for unsupervised action segmentation | Apr 13, 2023 | Action Segmentation, Clustering | Code Available | 1 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | Diagnostic, EgoSchema | Code Available | 1 |
| T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs | Nov 29, 2024 | Data Augmentation, Diversity | Code Available | 1 |
| Learning Temporally Causal Latent Processes from General Temporal Data | Oct 11, 2021 | Causal Discovery, Representation Learning | Code Available | 1 |
| Can An Image Classifier Suffice For Action Recognition? | Jun 26, 2021 | Action Recognition, image-classification | Code Available | 1 |
| Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation | Dec 16, 2021 | Contrastive Learning, Representation Learning | Code Available | 1 |
| Teaching VLMs to Localize Specific Objects from In-context Examples | Nov 20, 2024 | Object, Object Tracking | Code Available | 1 |
| Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | Feb 14, 2021 | Action Recognition, Temporal Action Localization | Code Available | 1 |
| Temporal Aggregate Representations for Long-Range Video Understanding | Jun 1, 2020 | Action Anticipation, Action Recognition | Code Available | 1 |
| Learning Temporally Latent Causal Processes from General Temporal Data | Sep 29, 2021 | Causal Discovery, Disentanglement | Code Available | 1 |
| TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | Oct 14, 2024 | 2k, Benchmarking | Code Available | 1 |
| Lightweight Network Architecture for Real-Time Action Recognition | May 21, 2019 | Action Recognition, CPU | Code Available | 1 |
| TSM: Temporal Shift Module for Efficient Video Understanding | Nov 20, 2018 | 3D Action Recognition, Action Classification | Code Available | 1 |
| Language Repository for Long Video Understanding | Mar 21, 2024 | EgoSchema, Question Answering | Code Available | 1 |
| Test of Time: Instilling Video-Language Models with a Sense of Time | Jan 5, 2023 | Video-Text Retrieval, Video Understanding | Code Available | 1 |
| EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval | Jul 23, 2024 | Re-Ranking, Retrieval | Code Available | 1 |
| Learning Optical Flow with Adaptive Graph Reasoning | Feb 8, 2022 | Motion Estimation, Optical Flow Estimation | Code Available | 1 |
| Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation | Oct 31, 2024 | Action Segmentation, Action Understanding | Code Available | 1 |
| An Empirical Study of End-to-End Temporal Action Detection | Apr 6, 2022 | Action Classification, Action Detection | Code Available | 1 |
| Language-Guided Audio-Visual Learning for Long-Term Sports Assessment | Jan 1, 2025 | audio-visual learning, Knowledge Graphs | Code Available | 1 |
| Learning Salient Boundary Feature for Anchor-free Temporal Action Localization | Mar 24, 2021 | Action Localization, Temporal Action Localization | Code Available | 1 |
| ∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Jan 31, 2025 | Question Answering, Video Question Answering | Code Available | 1 |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Dec 4, 2024 | Multimodal Large Language Model, Video Understanding | Code Available | 1 |
| EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens | Nov 19, 2022 | Action Recognition, Object State Change Classification | Code Available | 1 |
| Towards High-Quality Temporal Action Detection with Sparse Proposals | Sep 18, 2021 | Action Detection, Avg | Code Available | 1 |
| IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs | Apr 21, 2025 | Video Understanding | Code Available | 1 |
| Towards Smooth Video Composition | Dec 14, 2022 | Image Generation, single-image-generation | Code Available | 1 |
| Is Appearance Free Action Recognition Possible? | Jul 13, 2022 | Action Recognition, Optical Flow Estimation | Code Available | 1 |
| Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach | May 10, 2023 | Autonomous Vehicles, Monocular Visual Odometry | Code Available | 1 |
| InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Jun 28, 2024 | Multiple-choice, Video Understanding | Code Available | 1 |
| Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention | Jun 11, 2021 | Action Recognition, Sign Language Recognition | Code Available | 1 |
| HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization | Aug 12, 2024 | Action Localization, Temporal Action Localization | Code Available | 1 |
| CyberV: Cybernetics for Test-time Scaling in Video Understanding | Jun 9, 2025 | Video Understanding | Code Available | 1 |
| TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes | Feb 4, 2025 | Autonomous Driving, Multiple-choice | Code Available | 1 |
| Learning Self-Similarity in Space and Time as a Generalized Motion for Action Recognition | Jan 1, 2021 | Action Recognition, Video Understanding | Code Available | 1 |
| MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing | Nov 24, 2021 | audio-visual event localization, Video Understanding | Code Available | 1 |
| TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models | Oct 30, 2024 | Video Understanding | Code Available | 1 |