| Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection | Dec 9, 2021 | Boundary Detection, Diversity | Code Available | 1 |
| Auto-X3D: Ultra-Efficient Video Understanding via Finer-Grained Neural Architecture Search | Dec 9, 2021 | Neural Architecture Search, Video Recognition | Unverified | 0 |
| Prompting Visual-Language Models for Efficient Video Understanding | Dec 8, 2021 | Action Recognition, Language Modelling | Code Available | 1 |
| Suppressing Static Visual Cues via Normalizing Flows for Self-Supervised Video Representation Learning | Dec 7, 2021 | Contrastive Learning, Representation Learning | Code Available | 0 |
| Stacked Temporal Attention: Improving First-person Action Recognition by Emphasizing Discriminative Clips | Dec 2, 2021 | Action Recognition, Video Understanding | Unverified | 0 |
| TokenLearner: Adaptive Space-Time Tokenization for Videos | Dec 1, 2021 | Representation Learning, Video Recognition | Code Available | 1 |
| LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering | Nov 29, 2021 | Diversity, Question Answering | Unverified | 0 |
| UBoCo: Unsupervised Boundary Contrastive Learning for Generic Event Boundary Detection | Nov 29, 2021 | Boundary Detection, Contrastive Learning | Unverified | 0 |
| End-to-End Referring Video Object Segmentation with Multimodal Transformers | Nov 29, 2021 | Inductive Bias, Instance Segmentation | Code Available | 1 |
| SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning | Nov 25, 2021 | Caption Generation, Question Answering | Code Available | 1 |
| VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling | Nov 24, 2021 | Question Answering, Retrieval | Code Available | 1 |
| MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing | Nov 24, 2021 | Audio-Visual Event Localization, Video Understanding | Code Available | 1 |
| PyTorchVideo: A Deep Learning Library for Video Understanding | Nov 18, 2021 | Deep Learning, Self-Supervised Learning | Code Available | 2 |
| Fill-in-the-Blank: A Challenging Video Understanding Evaluation Framework | Nov 16, 2021 | Multiple-choice, Question Answering | Unverified | 0 |
| Occluded Video Instance Segmentation: Dataset and ICCV 2021 Challenge | Nov 15, 2021 | Instance Segmentation, Object Recognition | Unverified | 0 |
| Attention Mechanisms in Computer Vision: A Survey | Nov 15, 2021 | Image Classification | Code Available | 2 |
| Relational Self-Attention: What's Missing in Attention for Video Understanding | Nov 2, 2021 | Action Recognition, Temporal Action Localization | Code Available | 1 |
| Revisiting spatio-temporal layouts for compositional action recognition | Nov 2, 2021 | Action Classification, Action Detection | Code Available | 1 |
| Re-ID-AR: Improved Person Re-identification in Video via Joint Weakly Supervised Action Recognition | Nov 1, 2021 | Action Recognition, Person Re-Identification | Code Available | 0 |
| Gradient Frequency Modulation for Visually Explaining Video Understanding Models | Nov 1, 2021 | Action Recognition, Temporal Action Localization | Unverified | 0 |
| Distantly Supervised Semantic Text Detection and Recognition for Broadcast Sports Videos Understanding | Oct 31, 2021 | Action Recognition, Text Detection | Unverified | 0 |
| Can't Fool Me: Adversarially Robust Transformer for Video Understanding | Oct 26, 2021 | Image Classification | Unverified | 0 |
| Leveraging Local Temporal Information for Multimodal Scene Classification | Oct 26, 2021 | Classification, Scene Classification | Unverified | 0 |
| Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions | Oct 13, 2021 | Benchmarking, Computational Efficiency | Code Available | 1 |
| CLIP4Caption: CLIP for Video Caption | Oct 13, 2021 | Decoder, Sentence | Unverified | 0 |
| NoisyActions2M: A Multimedia Dataset for Video Understanding from Noisy Labels | Oct 13, 2021 | Action Classification, Self-Supervised Learning | Code Available | 0 |
| Object-Region Video Transformers | Oct 13, 2021 | Action Detection, Action Recognition | Code Available | 1 |
| TAda! Temporally-Adaptive Convolutions for Video Understanding | Oct 12, 2021 | Action Classification, Action Recognition | Code Available | 0 |
| Learning Temporally Causal Latent Processes from General Temporal Data | Oct 11, 2021 | Causal Discovery, Representation Learning | Code Available | 1 |
| Toward a Human-Level Video Understanding Intelligence | Oct 8, 2021 | AI Agent, Video Understanding | Unverified | 0 |
| Efficient Modelling Across Time of Human Actions and Interactions | Oct 5, 2021 | Action Recognition, Video Understanding | Unverified | 0 |
| Spatio-Temporal Video Representation Learning for AI Based Video Playback Style Prediction | Oct 3, 2021 | Action Recognition, Representation Learning | Unverified | 0 |
| IntentVizor: Towards Generic Query Guided Interactive Video Summarization | Sep 30, 2021 | Video Summarization, Video Understanding | Code Available | 1 |
| Object Dynamics Distillation for Scene Decomposition and Representation | Sep 29, 2021 | Object, Predict Future Video Frames | Unverified | 0 |
| Learning Temporally Latent Causal Processes from General Temporal Data | Sep 29, 2021 | Causal Discovery, Disentanglement | Code Available | 1 |
| TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Device | Sep 27, 2021 | Video Recognition, Video Understanding | Code Available | 2 |
| Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark | Sep 23, 2021 | Video Understanding | Code Available | 0 |
| Towards High-Quality Temporal Action Detection with Sparse Proposals | Sep 18, 2021 | Action Detection, Avg | Code Available | 1 |
| A Multimodal Sentiment Dataset for Video Recommendation | Sep 17, 2021 | Multimodal Sentiment Analysis, Sentiment Analysis | Unverified | 0 |
| Overview of Tencent Multi-modal Ads Video Understanding Challenge | Sep 16, 2021 | Multi-Label Classification | Unverified | 0 |
| Multi-modal Representation Learning for Video Advertisement Content Structuring | Sep 4, 2021 | Representation Learning, Re-Ranking | Unverified | 0 |
| Spatio-Temporal Perturbations for Video Attribution | Sep 1, 2021 | Video Understanding | Code Available | 0 |
| LIGAR: Lightweight General-purpose Action Recognition | Aug 30, 2021 | Action Recognition, Gesture Recognition | Unverified | 0 |
| Identity-aware Graph Memory Network for Action Detection | Aug 26, 2021 | Action Detection, Graph Neural Network | Unverified | 0 |
| Foreground-Action Consistency Network for Weakly Supervised Temporal Action Localization | Aug 14, 2021 | Action Localization, Multiple Instance Learning | Code Available | 1 |
| AutoVideo: An Automated Video Action Recognition System | Aug 9, 2021 | Action Recognition, AutoML | Code Available | 1 |
| Learning an Augmented RGB Representation with Cross-Modal Knowledge Distillation for Action Detection | Aug 8, 2021 | Action Detection, Knowledge Distillation | Unverified | 0 |
| O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning | Aug 5, 2021 | Attribute, Caption Generation | Unverified | 0 |
| Elaborative Rehearsal for Zero-shot Action Recognition | Aug 5, 2021 | Action Recognition, Few-Shot Learning | Code Available | 1 |
| Token Shift Transformer for Video Classification | Aug 5, 2021 | Classification, Computational Efficiency | Code Available | 1 |