| TokenLearner: Adaptive Space-Time Tokenization for Videos | Dec 1, 2021 | Representation Learning, Video Recognition | Code Available | 1 |
| End-to-End Referring Video Object Segmentation with Multimodal Transformers | Nov 29, 2021 | Inductive Bias, Instance Segmentation | Code Available | 1 |
| SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning | Nov 25, 2021 | Caption Generation, Question Answering | Code Available | 1 |
| MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing | Nov 24, 2021 | audio-visual event localization, Video Understanding | Code Available | 1 |
| VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling | Nov 24, 2021 | Question Answering, Retrieval | Code Available | 1 |
| Revisiting spatio-temporal layouts for compositional action recognition | Nov 2, 2021 | Action Classification, Action Detection | Code Available | 1 |
| Relational Self-Attention: What's Missing in Attention for Video Understanding | Nov 2, 2021 | Action Recognition, Temporal Action Localization | Code Available | 1 |
| Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions | Oct 13, 2021 | Benchmarking, Computational Efficiency | Code Available | 1 |
| Object-Region Video Transformers | Oct 13, 2021 | Action Detection, Action Recognition | Code Available | 1 |
| Learning Temporally Causal Latent Processes from General Temporal Data | Oct 11, 2021 | Causal Discovery, Representation Learning | Code Available | 1 |
| IntentVizor: Towards Generic Query Guided Interactive Video Summarization | Sep 30, 2021 | Video Summarization, Video Understanding | Code Available | 1 |
| Learning Temporally Latent Causal Processes from General Temporal Data | Sep 29, 2021 | Causal Discovery, Disentanglement | Code Available | 1 |
| Towards High-Quality Temporal Action Detection with Sparse Proposals | Sep 18, 2021 | Action Detection, Avg | Code Available | 1 |
| Foreground-Action Consistency Network for Weakly Supervised Temporal Action Localization | Aug 14, 2021 | Action Localization, Multiple Instance Learning | Code Available | 1 |
| AutoVideo: An Automated Video Action Recognition System | Aug 9, 2021 | Action Recognition, AutoML | Code Available | 1 |
| Token Shift Transformer for Video Classification | Aug 5, 2021 | Classification, Computational Efficiency | Code Available | 1 |
| Elaborative Rehearsal for Zero-shot Action Recognition | Aug 5, 2021 | Action Recognition, Few-Shot Learning | Code Available | 1 |
| Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization | Aug 4, 2021 | Contrastive Learning, Representation Learning | Code Available | 1 |
| Spatial-Temporal Transformer for Dynamic Scene Graph Generation | Jul 26, 2021 | Decoder, Scene Graph Generation | Code Available | 1 |
| Disentangle Your Dense Object Detector | Jul 7, 2021 | Disentanglement, Object | Code Available | 1 |
| Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection | Jun 28, 2021 | Action Recognition, Action Spotting | Code Available | 1 |
| Can An Image Classifier Suffice For Action Recognition? | Jun 26, 2021 | Action Recognition, image-classification | Code Available | 1 |
| Towards Long-Form Video Understanding | Jun 21, 2021 | Action Recognition, Form | Code Available | 1 |
| TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? | Jun 21, 2021 | Action Classification, Image Classification | Code Available | 1 |
| VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning | Jun 21, 2021 | Action Classification, Action Recognition | Code Available | 1 |