| Spotting Temporally Precise, Fine-Grained Events in Video | Jul 20, 2022 | Action Detection, Action Spotting | Code Available | 1 |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | Jul 16, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| Is Appearance Free Action Recognition Possible? | Jul 13, 2022 | Action Recognition, Optical Flow Estimation | Code Available | 1 |
| Federated Self-supervised Learning for Video Understanding | Jul 5, 2022 | Action Recognition, Federated Learning | Code Available | 1 |
| ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning | Jun 27, 2022 | Action Classification, Action Recognition | Code Available | 1 |
| REVECA -- Rich Encoder-decoder framework for Video Event CAptioner | Jun 18, 2022 | Decoder, Semantic Segmentation | Code Available | 1 |
| Stand-Alone Inter-Frame Attention in Video Models | Jun 14, 2022 | Action Classification, Action Recognition | Code Available | 1 |
| A Simple and Efficient Pipeline to Build an End-to-End Spatial-Temporal Action Detector | Jun 7, 2022 | Action Classification, Action Detection | Code Available | 1 |
| From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering | May 30, 2022 | counterfactual, Descriptive | Code Available | 1 |
| Free Lunch for Surgical Video Understanding by Distilling Self-Supervisions | May 19, 2022 | Contrastive Learning, Self-Supervised Learning | Code Available | 1 |
| ETAD: Training Action Detection End to End on a Laptop | May 14, 2022 | Action Detection, GPU | Code Available | 1 |
| BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection | May 5, 2022 | Action Detection, object-detection | Code Available | 1 |
| A Multi-Person Video Dataset Annotation Method of Spatio-Temporally Actions | Apr 21, 2022 | Action Detection, Video Understanding | Code Available | 1 |
| Temporal Alignment Networks for Long-term Video | Apr 6, 2022 | Action Recognition, Action Segmentation | Code Available | 1 |
| An Empirical Study of End-to-End Temporal Action Detection | Apr 6, 2022 | Action Classification, Action Detection | Code Available | 1 |
| Long Movie Clip Classification with State-Space Video Models | Apr 4, 2022 | Classification, Decoder | Code Available | 1 |
| SPAct: Self-supervised Privacy Preservation for Action Recognition | Mar 29, 2022 | Action Classification, Action Recognition | Code Available | 1 |
| How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning? | Mar 27, 2022 | Self-Supervised Learning, Sensitivity | Code Available | 1 |
| Domain Knowledge-Informed Self-Supervised Representations for Workout Form Assessment | Feb 28, 2022 | 3D Action Recognition, Action Analysis | Code Available | 1 |
| Learning Optical Flow with Adaptive Graph Reasoning | Feb 8, 2022 | Motion Estimation, Optical Flow Estimation | Code Available | 1 |
| A Dataset for Medical Instructional Video Classification and Question Answering | Jan 30, 2022 | Classification, Question Answering | Code Available | 1 |
| Video Joint Modelling Based on Hierarchical Transformer for Co-summarization | Dec 27, 2021 | Retrieval, Supervised Video Summarization | Code Available | 1 |
| Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation | Dec 16, 2021 | Contrastive Learning, Representation Learning | Code Available | 1 |
| Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection | Dec 9, 2021 | Boundary Detection, Diversity | Code Available | 1 |
| Prompting Visual-Language Models for Efficient Video Understanding | Dec 8, 2021 | Action Recognition, Language Modelling | Code Available | 1 |
| TokenLearner: Adaptive Space-Time Tokenization for Videos | Dec 1, 2021 | Representation Learning, Video Recognition | Code Available | 1 |
| End-to-End Referring Video Object Segmentation with Multimodal Transformers | Nov 29, 2021 | Inductive Bias, Instance Segmentation | Code Available | 1 |
| SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning | Nov 25, 2021 | Caption Generation, Question Answering | Code Available | 1 |
| MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing | Nov 24, 2021 | audio-visual event localization, Video Understanding | Code Available | 1 |
| VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling | Nov 24, 2021 | Question Answering, Retrieval | Code Available | 1 |
| Revisiting spatio-temporal layouts for compositional action recognition | Nov 2, 2021 | Action Classification, Action Detection | Code Available | 1 |
| Relational Self-Attention: What's Missing in Attention for Video Understanding | Nov 2, 2021 | Action Recognition, Temporal Action Localization | Code Available | 1 |
| Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions | Oct 13, 2021 | Benchmarking, Computational Efficiency | Code Available | 1 |
| Object-Region Video Transformers | Oct 13, 2021 | Action Detection, Action Recognition | Code Available | 1 |
| Learning Temporally Causal Latent Processes from General Temporal Data | Oct 11, 2021 | Causal Discovery, Representation Learning | Code Available | 1 |
| IntentVizor: Towards Generic Query Guided Interactive Video Summarization | Sep 30, 2021 | Video Summarization, Video Understanding | Code Available | 1 |
| Learning Temporally Latent Causal Processes from General Temporal Data | Sep 29, 2021 | Causal Discovery, Disentanglement | Code Available | 1 |
| Towards High-Quality Temporal Action Detection with Sparse Proposals | Sep 18, 2021 | Action Detection, Avg | Code Available | 1 |
| Foreground-Action Consistency Network for Weakly Supervised Temporal Action Localization | Aug 14, 2021 | Action Localization, Multiple Instance Learning | Code Available | 1 |
| AutoVideo: An Automated Video Action Recognition System | Aug 9, 2021 | Action Recognition, AutoML | Code Available | 1 |
| Token Shift Transformer for Video Classification | Aug 5, 2021 | Classification, Computational Efficiency | Code Available | 1 |
| Elaborative Rehearsal for Zero-shot Action Recognition | Aug 5, 2021 | Action Recognition, Few-Shot Learning | Code Available | 1 |
| Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization | Aug 4, 2021 | Contrastive Learning, Representation Learning | Code Available | 1 |
| Spatial-Temporal Transformer for Dynamic Scene Graph Generation | Jul 26, 2021 | Decoder, Scene Graph Generation | Code Available | 1 |
| Disentangle Your Dense Object Detector | Jul 7, 2021 | Disentanglement, Object | Code Available | 1 |
| Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection | Jun 28, 2021 | Action Recognition, Action Spotting | Code Available | 1 |
| Can An Image Classifier Suffice For Action Recognition? | Jun 26, 2021 | Action Recognition, image-classification | Code Available | 1 |
| Towards Long-Form Video Understanding | Jun 21, 2021 | Action Recognition, Form | Code Available | 1 |
| TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? | Jun 21, 2021 | Action Classification, Image Classification | Code Available | 1 |
| VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning | Jun 21, 2021 | Action Classification, Action Recognition | Code Available | 1 |