| Do Language Models Understand Time? | Dec 18, 2024 | Action Recognition, Anomaly Detection | Code Available | 1 | 5 |
| PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos | Dec 2, 2024 | Question Answering, Video Understanding | Code Available | 1 | 5 |
| Large Scale Holistic Video Understanding | Apr 25, 2019 | Action Classification, Action Recognition | Code Available | 1 | 5 |
| How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning? | Mar 27, 2022 | Self-Supervised Learning, Sensitivity | Code Available | 1 | 5 |
| Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation | Dec 16, 2021 | Contrastive Learning, Representation Learning | Code Available | 1 | 5 |
| Contrastive Masked Autoencoders for Self-Supervised Video Hashing | Nov 21, 2022 | Decoder, Retrieval | Code Available | 1 | 5 |
| LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | May 20, 2025 | Caption Generation, Retrieval | Code Available | 1 | 5 |
| Helping Hands: An Object-Aware Ego-Centric Video Recognition Model | Aug 15, 2023 | Decoder, Object | Code Available | 1 | 5 |
| Revisiting spatio-temporal layouts for compositional action recognition | Nov 2, 2021 | Action Classification, Action Detection | Code Available | 1 | 5 |
| PAN: Towards Fast Action Recognition via Learning Persistence of Appearance | Aug 8, 2020 | Action Recognition, Optical Flow Estimation | Code Available | 1 | 5 |
| Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives | Feb 4, 2025 | Video Understanding | Code Available | 1 | 5 |
| HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization | Aug 12, 2024 | Action Localization, Temporal Action Localization | Code Available | 1 | 5 |
| Event-Free Moving Object Segmentation from Moving Ego Vehicle | Apr 28, 2023 | Autonomous Driving, Benchmarking | Code Available | 1 | 5 |
| Panoramic Vision Transformer for Saliency Detection in 360° Videos | Sep 19, 2022 | Saliency Detection, Saliency Prediction | Code Available | 1 | 5 |
| Dual-path Adaptation from Image to Video Transformers | Mar 17, 2023 | Action Classification, Action Recognition | Code Available | 1 | 5 |
| MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer | Apr 29, 2023 | Decoder, Highlight Detection | Code Available | 1 | 5 |
| MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | Jan 13, 2025 | Causal Discovery, Causal Inference | Code Available | 1 | 5 |
| A Simple and Efficient Pipeline to Build an End-to-End Spatial-Temporal Action Detector | Jun 7, 2022 | Action Classification, Action Detection | Code Available | 1 | 5 |
| ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning | Jun 27, 2022 | Action Classification, Action Recognition | Code Available | 1 | 5 |
| MMAD: Multi-label Micro-Action Detection in Videos | Jul 7, 2024 | Action Analysis, Action Detection | Code Available | 1 | 5 |
| Grounded Question-Answering in Long Egocentric Videos | Dec 11, 2023 | Video Grounding, Video Question Answering | Code Available | 1 | 5 |
| InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Jun 28, 2024 | Multiple-choice, Video Understanding | Code Available | 1 | 5 |
| Panoptic Video Scene Graph Generation | Nov 28, 2023 | Graph Generation, Panoptic Scene Graph Generation | Code Available | 1 | 5 |
| PAVE: Patching and Adapting Video Large Language Models | Mar 25, 2025 | Audio-visual Question Answering, Multi-Task Learning | Code Available | 1 | 5 |
| Point Primitive Transformer for Long-Term 4D Point Cloud Video Understanding | Jul 30, 2022 | point cloud video understanding, Video Understanding | Code Available | 1 | 5 |