| Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection | Dec 9, 2021 | Boundary Detection, Diversity | Code Available | 1 |
| Auto-X3D: Ultra-Efficient Video Understanding via Finer-Grained Neural Architecture Search | Dec 9, 2021 | Neural Architecture Search, Video Recognition | Unverified | 0 |
| Prompting Visual-Language Models for Efficient Video Understanding | Dec 8, 2021 | Action Recognition, Language Modelling | Code Available | 1 |
| Suppressing Static Visual Cues via Normalizing Flows for Self-Supervised Video Representation Learning | Dec 7, 2021 | Contrastive Learning, Representation Learning | Code Available | 0 |
| Stacked Temporal Attention: Improving First-person Action Recognition by Emphasizing Discriminative Clips | Dec 2, 2021 | Action Recognition, Video Understanding | Unverified | 0 |
| TokenLearner: Adaptive Space-Time Tokenization for Videos | Dec 1, 2021 | Representation Learning, Video Recognition | Code Available | 1 |
| LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering | Nov 29, 2021 | Diversity, Question Answering | Unverified | 0 |
| UBoCo : Unsupervised Boundary Contrastive Learning for Generic Event Boundary Detection | Nov 29, 2021 | Boundary Detection, Contrastive Learning | Unverified | 0 |
| End-to-End Referring Video Object Segmentation with Multimodal Transformers | Nov 29, 2021 | Inductive Bias, Instance Segmentation | Code Available | 1 |
| SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning | Nov 25, 2021 | Caption Generation, Question Answering | Code Available | 1 |
| VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling | Nov 24, 2021 | Question Answering, Retrieval | Code Available | 1 |
| MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing | Nov 24, 2021 | audio-visual event localization, Video Understanding | Code Available | 1 |
| PyTorchVideo: A Deep Learning Library for Video Understanding | Nov 18, 2021 | Deep Learning, Self-Supervised Learning | Code Available | 2 |
| Fill-in-the-Blank: A Challenging Video Understanding Evaluation Framework | Nov 16, 2021 | Multiple-choice, Question Answering | Unverified | 0 |
| Occluded Video Instance Segmentation: Dataset and ICCV 2021 Challenge | Nov 15, 2021 | Instance Segmentation, Object Recognition | Unverified | 0 |
| Attention Mechanisms in Computer Vision: A Survey | Nov 15, 2021 | image-classification, Image Classification | Code Available | 2 |
| Relational Self-Attention: What's Missing in Attention for Video Understanding | Nov 2, 2021 | Action Recognition, Temporal Action Localization | Code Available | 1 |
| Revisiting spatio-temporal layouts for compositional action recognition | Nov 2, 2021 | Action Classification, Action Detection | Code Available | 1 |
| Re-ID-AR: Improved Person Re-identification in Video via Joint Weakly Supervised Action Recognition | Nov 1, 2021 | Action Recognition, Person Re-Identification | Code Available | 0 |
| Gradient Frequency Modulation for Visually Explaining Video Understanding Models | Nov 1, 2021 | Action Recognition, Temporal Action Localization | Unverified | 0 |
| Distantly Supervised Semantic Text Detection and Recognition for Broadcast Sports Videos Understanding | Oct 31, 2021 | Action Recognition, Text Detection | Unverified | 0 |
| Can't Fool Me: Adversarially Robust Transformer for Video Understanding | Oct 26, 2021 | image-classification, Image Classification | Unverified | 0 |
| Leveraging Local Temporal Information for Multimodal Scene Classification | Oct 26, 2021 | Classification, Scene Classification | Unverified | 0 |
| Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions | Oct 13, 2021 | Benchmarking, Computational Efficiency | Code Available | 1 |
| CLIP4Caption: CLIP for Video Caption | Oct 13, 2021 | Decoder, Sentence | Unverified | 0 |