| DOAD: Decoupled One Stage Action Detection Network | Apr 1, 2023 | Action Detection, Action Recognition | Unverified | 0 |
| Procedure-Aware Pretraining for Instructional Video Understanding | Mar 31, 2023 | Video Understanding | Code Available | 1 |
| Whether and When does Endoscopy Domain Pretraining Make Sense? | Mar 30, 2023 | Action Triplet Detection, Surgical phase recognition | Code Available | 1 |
| Streaming Video Model | Mar 30, 2023 | Action Recognition, Decoder | Code Available | 1 |
| TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition | Mar 28, 2023 | Action Recognition, Optical Flow Estimation | Code Available | 1 |
| System-status-aware Adaptive Network for Online Streaming Video Understanding | Mar 28, 2023 | Streaming video understanding, Video Understanding | Unverified | 0 |
| Selective Structured State-Spaces for Long-Form Video Understanding | Mar 25, 2023 | Contrastive Learning, Form | Unverified | 0 |
| Query-Dependent Video Representation for Moment Retrieval and Highlight Detection | Mar 24, 2023 | Highlight Detection, Moment Retrieval | Code Available | 2 |
| Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos | Mar 22, 2023 | Representation Learning, Sentence | Code Available | 1 |
| Leaping Into Memories: Space-Time Deep Feature Synthesis | Mar 17, 2023 | Diversity, Video Understanding | Code Available | 0 |
| Dual-path Adaptation from Image to Video Transformers | Mar 17, 2023 | Action Classification, Action Recognition | Code Available | 1 |
| TemporalMaxer: Maximize Temporal Context with only Max Pooling for Temporal Action Localization | Mar 16, 2023 | Action Localization, Temporal Action Localization | Code Available | 1 |
| Localizing Moments in Long Video Via Multimodal Guidance | Feb 26, 2023 | Natural Language Moment Retrieval, Natural Language Visual Grounding | Code Available | 1 |
| Video4MRI: An Empirical Study on Brain Magnetic Resonance Image Analytics with CNN-based Video Classification Frameworks | Feb 24, 2023 | Classification, Data Augmentation | Unverified | 0 |
| MINOTAUR: Multi-task Video Grounding From Multimodal Queries | Feb 16, 2023 | Action Detection, Sentence | Code Available | 0 |
| AIM: Adapting Image Models for Efficient Video Action Recognition | Feb 6, 2023 | Action Classification, Action Recognition | Code Available | 2 |
| Semi-Parametric Video-Grounded Text Generation | Jan 27, 2023 | Language Modeling, Language Modelling | Unverified | 0 |
| Building Scalable Video Understanding Benchmarks through Sports | Jan 17, 2023 | Video Understanding | Unverified | 0 |
| STPrivacy: Spatio-Temporal Privacy-Preserving Action Recognition | Jan 8, 2023 | Action Recognition, Facial Expression Recognition (FER) | Unverified | 0 |
| Test of Time: Instilling Video-Language Models with a Sense of Time | Jan 5, 2023 | Video-Text Retrieval, Video Understanding | Code Available | 1 |
| EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding | Jan 5, 2023 | Video Understanding | Unverified | 0 |
| Multimodal High-order Relation Transformer for Scene Boundary Detection | Jan 1, 2023 | Boundary Detection, Decoder | Unverified | 0 |
| PIDRo: Parallel Isomeric Attention with Dynamic Routing for Text-Video Retrieval | Jan 1, 2023 | Representation Learning, Retrieval | Unverified | 0 |
| UniFormerV2: Unlocking the Potential of Image ViTs for Video Understanding | Jan 1, 2023 | Video Understanding | Unverified | 0 |
| Boosting Single Image Super-Resolution via Partial Channel Shifting | Jan 1, 2023 | Diversity, Image Super-Resolution | Code Available | 1 |
| Inverse Compositional Learning for Weakly-supervised Relation Grounding | Jan 1, 2023 | Relation, Video Understanding | Unverified | 0 |
| Self-Supervised Object Detection from Egocentric Videos | Jan 1, 2023 | Class-agnostic Object Detection, Object | Unverified | 0 |
| Relational Space-Time Query in Long-Form Videos | Jan 1, 2023 | Form, Video Understanding | Unverified | 0 |
| Modeling Video As Stochastic Processes for Fine-Grained Video Representation Learning | Jan 1, 2023 | Contrastive Learning, Representation Learning | Code Available | 1 |
| Few-Shot Referring Relationships in Videos | Jan 1, 2023 | Object, Relation Network | Code Available | 0 |
| Joint Engagement Classification using Video Augmentation Techniques for Multi-person Human-robot Interaction | Dec 28, 2022 | Data Augmentation, Face Swapping | Unverified | 0 |
| Inductive Attention for Video Action Anticipation | Dec 17, 2022 | Action Anticipation, Action Recognition | Unverified | 0 |
| Towards Smooth Video Composition | Dec 14, 2022 | Image Generation, single-image-generation | Code Available | 1 |
| Egocentric Video Task Translation | Dec 13, 2022 | Multi-Task Learning, Translation | Unverified | 0 |
| Contextual Explainable Video Representation: Human Perception-based Understanding | Dec 12, 2022 | Action Detection, Action Recognition | Code Available | 0 |
| PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data | Dec 8, 2022 | Action Recognition, Prompt Learning | Unverified | 0 |
| Transition Is a Process: Pair-to-Video Change Detection Networks for Very High Resolution Remote Sensing Images | Dec 7, 2022 | Building change detection for remote sensing images, Change Detection | Unverified | 0 |
| InternVideo: General Video Foundation Models via Generative and Discriminative Learning | Dec 6, 2022 | Action Classification, Action Recognition | Code Available | 4 |
| Spatio-Temporal Crop Aggregation for Video Representation Learning | Nov 30, 2022 | Action Classification, Dimensionality Reduction | Unverified | 0 |
| MOMA-LRG: Language-Refined Graphs for Multi-Object Multi-Actor Activity Parsing | Nov 28, 2022 | Activity Recognition, Few Shot Action Recognition | Code Available | 1 |
| Dynamic Appearance: A Video Representation for Action Recognition with Joint Training | Nov 23, 2022 | Action Recognition, Temporal Action Localization | Unverified | 0 |
| Contrastive Masked Autoencoders for Self-Supervised Video Hashing | Nov 21, 2022 | Decoder, Retrieval | Code Available | 1 |
| A Unified Model for Video Understanding and Knowledge Embedding with Heterogeneous Knowledge Graph Dataset | Nov 19, 2022 | Common Sense Reasoning, Graph Embedding | Unverified | 0 |
| EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens | Nov 19, 2022 | Action Recognition, Object State Change Classification | Code Available | 1 |
| Masked Autoencoders for Egocentric Video Understanding @ Ego4D Challenge 2022 | Nov 18, 2022 | Object State Change Classification, Temporal Localization | Code Available | 0 |
| InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges | Nov 17, 2022 | Future Hand Prediction, Moment Queries | Code Available | 1 |
| UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer | Nov 17, 2022 | Video Understanding | Code Available | 2 |
| Exploring State Change Capture of Heterogeneous Backbones @ Ego4D Hands and Objects Challenge 2022 | Nov 16, 2022 | Human-Object Interaction Detection, Object | Unverified | 0 |
| Grounded Video Situation Recognition | Oct 19, 2022 | Descriptive, Structured Prediction | Unverified | 0 |
| VTC: Improving Video-Text Retrieval with User Comments | Oct 19, 2022 | Representation Learning, Retrieval | Code Available | 1 |