| STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding | Mar 20, 2025 | Video Understanding, Zero-shot Generalization | Code Available | 1 | 5 |
| EEV: A Large-Scale Dataset for Studying Evoked Expressions from Video | Jan 15, 2020 | Diversity, Recommendation Systems | Code Available | 1 | 5 |
| Learning Optical Flow with Adaptive Graph Reasoning | Feb 8, 2022 | Motion Estimation, Optical Flow Estimation | Code Available | 1 | 5 |
| Learning Salient Boundary Feature for Anchor-free Temporal Action Localization | Mar 24, 2021 | Action Localization, Temporal Action Localization | Code Available | 1 | 5 |
| Learning Self-Similarity in Space and Time as a Generalized Motion for Action Recognition | Jan 1, 2021 | Action Recognition, Video Understanding | Code Available | 1 | 5 |
| Compositional Video Understanding with Spatiotemporal Structure-based Transformers | Jan 1, 2024 | Video Understanding | Code Available | 1 | 5 |
| SPAct: Self-supervised Privacy Preservation for Action Recognition | Mar 29, 2022 | Action Classification, Action Recognition | Code Available | 1 | 5 |
| Language Repository for Long Video Understanding | Mar 21, 2024 | EgoSchema, Question Answering | Code Available | 1 | 5 |
| SoccerNet 2023 Challenges Results | Sep 12, 2023 | Action Spotting, Camera Calibration | Code Available | 1 | 5 |
| Language-Guided Audio-Visual Learning for Long-Term Sports Assessment | Jan 1, 2025 | audio-visual learning, Knowledge Graphs | Code Available | 1 | 5 |
| Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | Feb 14, 2021 | Action Recognition, Temporal Action Localization | Code Available | 1 | 5 |
| SoccerNet-v2: A Dataset and Benchmarks for Holistic Understanding of Broadcast Soccer Videos | Nov 26, 2020 | Action Spotting, Boundary Detection | Code Available | 1 | 5 |
| Learning Temporally Causal Latent Processes from General Temporal Data | Oct 11, 2021 | Causal Discovery, Representation Learning | Code Available | 1 | 5 |
| SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models | Dec 15, 2023 | Video Understanding | Code Available | 1 | 5 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | Jul 16, 2022 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | Apr 18, 2021 | Retrieval, Text Retrieval | Code Available | 1 | 5 |
| Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention | Jun 11, 2021 | Action Recognition, Sign Language Recognition | Code Available | 1 | 5 |
| IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs | Apr 21, 2025 | Video Understanding | Code Available | 1 | 5 |
| COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark | Aug 5, 2024 | Dense Video Captioning, Diversity | Code Available | 1 | 5 |
| Is Appearance Free Action Recognition Possible? | Jul 13, 2022 | Action Recognition, Optical Flow Estimation | Code Available | 1 | 5 |
| Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation | Oct 31, 2024 | Action Segmentation, Action Understanding | Code Available | 1 | 5 |
| SoccerNet 2022 Challenges Results | Oct 5, 2022 | Action Spotting, Camera Calibration | Code Available | 1 | 5 |
| Spatial-Temporal Transformer for Dynamic Scene Graph Generation | Jul 26, 2021 | Decoder, Scene Graph Generation | Code Available | 1 | 5 |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Dec 4, 2024 | Multimodal Large Language Model, Video Understanding | Code Available | 1 | 5 |
| SiLVR: A Simple Language-based Video Reasoning Framework | May 30, 2025 | Math, MME | Code Available | 1 | 5 |
| ∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Jan 31, 2025 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| BehAVE: Behaviour Alignment of Video Game Encodings | Feb 2, 2024 | Diversity, FPS Games | Code Available | 1 | 5 |
| Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding | Jul 11, 2024 | EEG, Language Modeling | Code Available | 1 | 5 |
| A Simple LLM Framework for Long-Range Video Question-Answering | Dec 28, 2023 | EgoSchema, Language Modelling | Code Available | 1 | 5 |
| InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Jun 28, 2024 | Multiple-choice, Video Understanding | Code Available | 1 | 5 |
| CEFHRI: A Communication Efficient Federated Learning Framework for Recognizing Industrial Human-Robot Interaction | Aug 29, 2023 | Federated Learning, image-classification | Code Available | 1 | 5 |
| A Dataset for Medical Instructional Video Classification and Question Answering | Jan 30, 2022 | Classification, Question Answering | Code Available | 1 | 5 |
| IntentVizor: Towards Generic Query Guided Interactive Video Summarization | Sep 30, 2021 | Video Summarization, Video Understanding | Code Available | 1 | 5 |
| Slot State Space Models | Jun 18, 2024 | Mamba, State Space Models | Code Available | 1 | 5 |
| SFMViT: SlowFast Meet ViT in Chaotic World | Apr 25, 2024 | Action Localization, Video Understanding | Code Available | 1 | 5 |
| How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation | Dec 12, 2023 | Anomaly Detection, Autonomous Driving | Code Available | 1 | 5 |
| Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation | Jun 13, 2025 | Anomaly Detection, Clustering | Code Available | 1 | 5 |
| CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning | Oct 10, 2019 | Diagnostic, Object | Code Available | 1 | 5 |
| CAST: Cross-Attention in Space and Time for Video Action Recognition | Nov 30, 2023 | Action Classification, Action Recognition | Code Available | 1 | 5 |
| Towards Visually Explaining Video Understanding Networks with Perturbation | May 1, 2020 | Video Understanding | Code Available | 1 | 5 |
| Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | Dec 16, 2023 | Video Captioning, video narration captioning | Code Available | 1 | 5 |
| ETAD: Training Action Detection End to End on a Laptop | May 14, 2022 | Action Detection, GPU | Code Available | 1 | 5 |
| Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives | Feb 4, 2025 | Video Understanding | Code Available | 1 | 5 |
| REVECA -- Rich Encoder-decoder framework for Video Event CAptioner | Jun 18, 2022 | Decoder, Semantic Segmentation | Code Available | 1 | 5 |
| EPIC Fields: Marrying 3D Geometry and Video Understanding | Jun 14, 2023 | 3D geometry, Neural Rendering | Code Available | 1 | 5 |
| Revisiting spatio-temporal layouts for compositional action recognition | Nov 2, 2021 | Action Classification, Action Detection | Code Available | 1 | 5 |
| Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis | Apr 12, 2024 | Dense Video Captioning, Transfer Learning | Code Available | 1 | 5 |
| How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning? | Mar 27, 2022 | Self-Supervised Learning, Sensitivity | Code Available | 1 | 5 |
| Helping Hands: An Object-Aware Ego-Centric Video Recognition Model | Aug 15, 2023 | Decoder, Object | Code Available | 1 | 5 |
| Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization | Aug 4, 2021 | Contrastive Learning, Representation Learning | Code Available | 1 | 5 |