| Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization | Aug 4, 2021 | Contrastive Learning, Representation Learning | Code Available | 1 |
| Spatial-Temporal Transformer for Dynamic Scene Graph Generation | Jul 26, 2021 | Decoder, Scene Graph Generation | Code Available | 1 |
| CogME: A Cognition-Inspired Multi-Dimensional Evaluation Metric for Story Understanding | Jul 21, 2021 | Question Answering, Sentence | Unverified | 0 |
| Disentangle Your Dense Object Detector | Jul 7, 2021 | Disentanglement, Object | Code Available | 1 |
| Spatio-Temporal Context for Action Detection | Jun 29, 2021 | Action Detection, Video Understanding | Unverified | 0 |
| Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection | Jun 28, 2021 | Action Recognition, Action Spotting | Code Available | 1 |
| Can An Image Classifier Suffice For Action Recognition? | Jun 26, 2021 | Action Recognition, image-classification | Code Available | 1 |
| Video Swin Transformer | Jun 24, 2021 | Action Classification, Action Recognition | Code Available | 2 |
| TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? | Jun 21, 2021 | Action Classification, Image Classification | Code Available | 1 |
| Towards Long-Form Video Understanding | Jun 21, 2021 | Action Recognition, Form | Code Available | 1 |
| VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning | Jun 21, 2021 | Action Classification, Action Recognition | Code Available | 1 |
| NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | Jun 19, 2021 | Question Answering, Video Question Answering | Code Available | 1 |
| Learning the Predictability of the Future | Jun 19, 2021 | Representation Learning, Self-Supervised Action Recognition | Code Available | 1 |
| Discerning Generic Event Boundaries in Long-Form Wild Videos | Jun 18, 2021 | Boundary Detection, Form | Unverified | 0 |
| End-to-end Temporal Action Detection with Transformer | Jun 18, 2021 | Action Detection, Temporal Action Localization | Code Available | 1 |
| Long-Short Temporal Contrastive Learning of Video Transformers | Jun 17, 2021 | Action Recognition, Contrastive Learning | Unverified | 0 |
| C^3: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues | Jun 16, 2021 | Contrastive Learning, counterfactual | Unverified | 0 |
| Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention | Jun 11, 2021 | Action Recognition, Sign Language Recognition | Code Available | 1 |
| VT-SSum: A Benchmark Dataset for Video Transcript Segmentation and Summarization | Jun 10, 2021 | Articles, Segmentation | Code Available | 1 |
| Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition | Jun 9, 2021 | Action Recognition, Point Cloud Classification | Unverified | 0 |
| Learning Dynamics via Graph Neural Networks for Human Pose Estimation and Tracking | Jun 7, 2021 | Graph Neural Network, Multi-Person Pose Estimation | Unverified | 0 |
| Technical Report: Temporal Aggregate Representations | Jun 6, 2021 | Action Anticipation, Action Recognition | Code Available | 1 |
| Transformed ROIs for Capturing Visual Transformations in Videos | Jun 6, 2021 | Action Recognition, Video Understanding | Unverified | 0 |
| A Study On the Effects of Pre-processing On Spatio-temporal Action Recognition Using Spiking Neural Networks Trained with STDP | May 31, 2021 | Action Recognition, Spatio-temporal Action Recognition | Unverified | 0 |
| Highlight Timestamp Detection Model for Comedy Videos via Multimodal Sentiment Analysis | May 28, 2021 | Multimodal Sentiment Analysis, Object Recognition | Unverified | 0 |
| FineAction: A Fine-Grained Video Dataset for Temporal Action Localization | May 24, 2021 | Action Detection, Action Localization | Code Available | 1 |
| VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding | May 20, 2021 | Action Segmentation, Language Modeling | Unverified | 0 |
| NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | May 18, 2021 | Question Answering, Video Question Answering | Code Available | 1 |
| MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions | May 16, 2021 | Action Detection, Action Localization | Code Available | 1 |
| Relation-aware Hierarchical Attention Framework for Video Question Answering | May 13, 2021 | Question Answering, Relation | Code Available | 0 |
| Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions | May 10, 2021 | Contrastive Learning, Retrieval | Unverified | 0 |
| Stochastic Image-to-Video Synthesis using cINNs | May 10, 2021 | Diversity, Video Understanding | Code Available | 1 |
| FrameExit: Conditional Early Exiting for Efficient Video Recognition | Apr 27, 2021 | Video Recognition, Video Understanding | Code Available | 1 |
| Skimming and Scanning for Untrimmed Video Action Recognition | Apr 21, 2021 | Action Recognition, Temporal Action Localization | Unverified | 0 |
| Temporal Query Networks for Fine-grained Video Understanding | Apr 19, 2021 | Action Classification, Action Recognition | Unverified | 0 |
| Camera Calibration and Player Localization in SoccerNet-v2 and Investigation of their Representations for Action Spotting | Apr 19, 2021 | Action Spotting, Camera Calibration | Unverified | 0 |
| CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | Apr 18, 2021 | Retrieval, Text Retrieval | Code Available | 1 |
| Temporally smooth online action detection using cycle-consistent future anticipation | Apr 16, 2021 | Action Detection, Autonomous Driving | Code Available | 0 |
| Adaptive Intermediate Representations for Video Understanding | Apr 14, 2021 | Action Classification, Optical Flow Estimation | Unverified | 0 |
| Crossover Learning for Fast Online Video Instance Segmentation | Apr 13, 2021 | Instance Segmentation, Semantic Segmentation | Code Available | 1 |
| Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation | Apr 10, 2021 | Object, object-detection | Unverified | 0 |
| FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework | Apr 9, 2021 | Language Modelling, Multiple-choice | Code Available | 0 |
| TubeR: Tubelet Transformer for Video Action Detection | Apr 2, 2021 | Action Classification, Action Detection | Code Available | 1 |
| M3L: Language-based Video Editing via Multi-Modal Multi-Level Transformers | Apr 2, 2021 | Diagnostic, Video Editing | Unverified | 0 |
| Visual Semantic Role Labeling for Video Understanding | Apr 2, 2021 | Semantic Role Labeling, Video Recognition | Code Available | 1 |
| Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation | Mar 30, 2021 | Action Detection, Temporal Action Proposal Generation | Unverified | 0 |
| Unified Graph Structured Models for Video Understanding | Mar 29, 2021 | Action Detection, Graph Classification | Unverified | 0 |
| Low-Fidelity End-to-End Video Encoder Pre-training for Temporal Action Localization | Mar 28, 2021 | Action Classification, Action Localization | Unverified | 0 |
| Learning Salient Boundary Feature for Anchor-free Temporal Action Localization | Mar 24, 2021 | Action Localization, Temporal Action Localization | Code Available | 1 |
| Temporal Context Aggregation Network for Temporal Action Proposal Refinement | Mar 24, 2021 | Action Detection, Action Localization | Code Available | 1 |