| End-to-End Streaming Video Temporal Action Segmentation with Reinforce Learning | Sep 27, 2023 | Action Recognition, Action Segmentation | Code Available | 1 | 5 |
| End-to-End Referring Video Object Segmentation with Multimodal Transformers | Nov 29, 2021 | Inductive Bias, Instance Segmentation | Code Available | 1 | 5 |
| Language-Guided Audio-Visual Learning for Long-Term Sports Assessment | Jan 1, 2025 | audio-visual learning, Knowledge Graphs | Code Available | 1 | 5 |
| Learning Self-Similarity in Space and Time as a Generalized Motion for Action Recognition | Jan 1, 2021 | Action Recognition, Video Understanding | Code Available | 1 | 5 |
| Towards Event-oriented Long Video Understanding | Jun 20, 2024 | Video Understanding | Code Available | 1 | 5 |
| IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs | Apr 21, 2025 | Video Understanding | Code Available | 1 | 5 |
| CAMEL-Bench: A Comprehensive Arabic LMM Benchmark | Oct 24, 2024 | document understanding, Video Understanding | Code Available | 1 | 5 |
| Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition | Feb 14, 2021 | Action Recognition, Temporal Action Localization | Code Available | 1 | 5 |
| Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention | Jun 11, 2021 | Action Recognition, Sign Language Recognition | Code Available | 1 | 5 |
| An overview on the evaluated video retrieval tasks at TRECVID 2022 | Jun 22, 2023 | Ad-hoc video search, Retrieval | Code Available | 1 | 5 |
| TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? | Jun 21, 2021 | Action Classification, Image Classification | Code Available | 1 | 5 |
| Open-Vocabulary Video Relation Extraction | Dec 25, 2023 | Action Classification, Action Understanding | Code Available | 1 | 5 |
| InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges | Nov 17, 2022 | Future Hand Prediction, Moment Queries | Code Available | 1 | 5 |
| Panoptic Video Scene Graph Generation | Nov 28, 2023 | Graph Generation, Panoptic Scene Graph Generation | Code Available | 1 | 5 |
| A Comprehensive Study of Deep Video Action Recognition | Dec 11, 2020 | Action Recognition, Deep Learning | Code Available | 1 | 5 |
| PAN: Towards Fast Action Recognition via Learning Persistence of Appearance | Aug 8, 2020 | Action Recognition, Optical Flow Estimation | Code Available | 1 | 5 |
| Elaborative Rehearsal for Zero-shot Action Recognition | Aug 5, 2021 | Action Recognition, Few-Shot Learning | Code Available | 1 | 5 |
| IntentVizor: Towards Generic Query Guided Interactive Video Summarization | Sep 30, 2021 | Video Summarization, Video Understanding | Code Available | 1 | 5 |
| Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation | Dec 16, 2021 | Contrastive Learning, Representation Learning | Code Available | 1 | 5 |
| Is Appearance Free Action Recognition Possible? | Jul 13, 2022 | Action Recognition, Optical Flow Estimation | Code Available | 1 | 5 |
| Token Shift Transformer for Video Classification | Aug 5, 2021 | Classification, Computational Efficiency | Code Available | 1 | 5 |
| Towards High-Quality Temporal Action Detection with Sparse Proposals | Sep 18, 2021 | Action Detection, Avg | Code Available | 1 | 5 |
| ∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation | Jan 31, 2025 | Question Answering, Video Question Answering | Code Available | 1 | 5 |
| EgoTaskQA: Understanding Human Tasks in Egocentric Videos | Oct 8, 2022 | Action Localization, counterfactual | Code Available | 1 | 5 |
| EgoSurgery-Phase: A Dataset of Surgical Phase Recognition from Egocentric Open Surgery Videos | May 30, 2024 | Action Recognition, Surgical phase recognition | Code Available | 1 | 5 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | Diagnostic, EgoSchema | Code Available | 1 | 5 |
| InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Jun 28, 2024 | Multiple-choice, Video Understanding | Code Available | 1 | 5 |
| Can An Image Classifier Suffice For Action Recognition? | Jun 26, 2021 | Action Recognition, image-classification | Code Available | 1 | 5 |
| Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding | Jul 11, 2024 | EEG, Language Modeling | Code Available | 1 | 5 |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | Dec 4, 2024 | Multimodal Large Language Model, Video Understanding | Code Available | 1 | 5 |
| Crossover Learning for Fast Online Video Instance Segmentation | Apr 13, 2021 | Instance Segmentation, Semantic Segmentation | Code Available | 1 | 5 |
| TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition | Mar 28, 2023 | Action Recognition, Optical Flow Estimation | Code Available | 1 | 5 |
| EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval | Jul 23, 2024 | Re-Ranking, Retrieval | Code Available | 1 | 5 |
| How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning? | Mar 27, 2022 | Self-Supervised Learning, Sensitivity | Code Available | 1 | 5 |
| An Empirical Study of End-to-End Temporal Action Detection | Apr 6, 2022 | Action Classification, Action Detection | Code Available | 1 | 5 |
| How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation | Dec 12, 2023 | Anomaly Detection, Autonomous Driving | Code Available | 1 | 5 |
| Large Scale Holistic Video Understanding | Apr 25, 2019 | Action Classification, Action Recognition | Code Available | 1 | 5 |
| Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models | Jul 9, 2023 | Question Answering, TGIF-Frame | Code Available | 1 | 5 |
| EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens | Nov 19, 2022 | Action Recognition, Object State Change Classification | Code Available | 1 | 5 |
| SoccerNet 2022 Challenges Results | Oct 5, 2022 | Action Spotting, Camera Calibration | Code Available | 1 | 5 |
| Learning Temporally Causal Latent Processes from General Temporal Data | Oct 11, 2021 | Causal Discovery, Representation Learning | Code Available | 1 | 5 |
| Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models | Mar 20, 2025 | Multiple-choice, Video Understanding | Code Available | 1 | 5 |
| Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives | Feb 4, 2025 | Video Understanding | Code Available | 1 | 5 |
| MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer | Apr 29, 2023 | Decoder, Highlight Detection | Code Available | 1 | 5 |
| TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos | Mar 9, 2025 | Action Localization, Boundary Detection | Code Available | 1 | 5 |
| CyberV: Cybernetics for Test-time Scaling in Video Understanding | Jun 9, 2025 | Video Understanding | Code Available | 1 | 5 |
| Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | Nov 27, 2023 | Action Classification, Action Recognition | Code Available | 1 | 5 |
| TokenLearner: Adaptive Space-Time Tokenization for Videos | Dec 1, 2021 | Representation Learning, Video Recognition | Code Available | 1 | 5 |
| Towards Long-Form Video Understanding | Jun 21, 2021 | Action Recognition, Form | Code Available | 1 | 5 |
| VideoMamba: Spatio-Temporal Selective State Space Model | Jul 11, 2024 | Mamba, model | Code Available | 1 | 5 |