| CVNets: High Performance Library for Computer Vision | Jun 4, 2022 | Video Understanding, Vocal Bursts Intensity Prediction | Code Available | 6 |
| Development of a MultiModal Annotation Framework and Dataset for Deep Video Understanding | Jun 1, 2022 | Knowledge Graphs, Video Understanding | Unverified | 0 |
| From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering | May 30, 2022 | counterfactual, Descriptive | Code Available | 1 |
| Free Lunch for Surgical Video Understanding by Distilling Self-Supervisions | May 19, 2022 | Contrastive Learning, Self-Supervised Learning | Code Available | 1 |
| ETAD: Training Action Detection End to End on a Laptop | May 14, 2022 | Action Detection, GPU | Code Available | 1 |
| BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection | May 5, 2022 | Action Detection, object-detection | Code Available | 1 |
| i-Code: An Integrative and Composable Multimodal Learning Framework | May 3, 2022 | Contrastive Learning, Video Understanding | Unverified | 0 |
| Overview of the MedVidQA 2022 Shared Task on Medical Video Question-Answering | May 1, 2022 | Question Answering, Video Classification | Unverified | 0 |
| Flamingo: a Visual Language Model for Few-Shot Learning | Apr 29, 2022 | Few-Shot Learning, Generative Visual Question Answering | Code Available | 4 |
| Causal Reasoning Meets Visual Representation Learning: A Prospective Study | Apr 26, 2022 | Benchmarking, Out-of-Distribution Generalization | Unverified | 0 |
| Contrastive Language-Action Pre-training for Temporal Localization | Apr 26, 2022 | Action Localization, Contrastive Learning | Unverified | 0 |
| Revealing Occlusions with 4D Neural Fields | Apr 22, 2022 | Video Understanding | Unverified | 0 |
| A Multi-Person Video Dataset Annotation Method of Spatio-Temporally Actions | Apr 21, 2022 | Action Detection, Video Understanding | Code Available | 1 |
| Less than Few: Self-Shot Video Instance Segmentation | Apr 19, 2022 | Few-Shot Learning, Instance Segmentation | Unverified | 0 |
| ActAR: Actor-Driven Pose Embeddings for Video Action Recognition | Apr 19, 2022 | Action Recognition, Optical Flow Estimation | Unverified | 0 |
| Adversarial Machine Learning Attacks Against Video Anomaly Detection Systems | Apr 7, 2022 | Anomaly Detection, BIG-bench Machine Learning | Unverified | 0 |
| MM-SEAL: A Large-scale Video Dataset of Multi-person Multi-grained Spatio-temporally Action Localization | Apr 6, 2022 | Action Localization, Action Recognition | Unverified | 0 |
| Temporal Alignment Networks for Long-term Video | Apr 6, 2022 | Action Recognition, Action Segmentation | Code Available | 1 |
| An Empirical Study of End-to-End Temporal Action Detection | Apr 6, 2022 | Action Classification, Action Detection | Code Available | 1 |
| Long Movie Clip Classification with State-Space Video Models | Apr 4, 2022 | Classification, Decoder | Code Available | 1 |
| PYSKL: a toolbox for skeleton-based video understanding | Apr 2, 2022 | Skeleton Based Action Recognition, Video Understanding | Unverified | 0 |
| SPAct: Self-supervised Privacy Preservation for Action Recognition | Mar 29, 2022 | Action Classification, Action Recognition | Code Available | 1 |
| How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning? | Mar 27, 2022 | Self-Supervised Learning, Sensitivity | Code Available | 1 |
| FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks | Mar 24, 2022 | Action Recognition, Retrieval | Code Available | 0 |
| VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training | Mar 23, 2022 | 4k, Action Classification | Code Available | 3 |
| On the Pitfalls of Batch Normalization for End-to-End Video Learning: A Study on Surgical Workflow Analysis | Mar 15, 2022 | Video Understanding | Code Available | 0 |
| Human Gaze Guided Attention for Surgical Activity Recognition | Mar 9, 2022 | Activity Recognition, Video Understanding | Unverified | 0 |
| Multi-Scale Self-Contrastive Learning with Hard Negative Mining for Weakly-Supervised Query-based Video Grounding | Mar 8, 2022 | Contrastive Learning, Sentence | Unverified | 0 |
| Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection | Mar 1, 2022 | Avg, Boundary Detection | Unverified | 0 |
| Domain Knowledge-Informed Self-Supervised Representations for Workout Form Assessment | Feb 28, 2022 | 3D Action Recognition, Action Analysis | Code Available | 1 |
| Concept Graph Neural Networks for Surgical Video Understanding | Feb 27, 2022 | Video Understanding | Unverified | 0 |
| Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations | Feb 21, 2022 | Answer Generation, Video Understanding | Unverified | 0 |
| ActionFormer: Localizing Moments of Actions with Transformers | Feb 16, 2022 | Action Localization, Action Recognition | Code Available | 2 |
| Learning Optical Flow with Adaptive Graph Reasoning | Feb 8, 2022 | Motion Estimation, Optical Flow Estimation | Code Available | 1 |
| A Coding Framework and Benchmark towards Low-Bitrate Video Understanding | Feb 6, 2022 | Video Compression, Video Understanding | Code Available | 0 |
| A Dataset for Medical Instructional Video Classification and Question Answering | Jan 30, 2022 | Classification, Question Answering | Code Available | 1 |
| Capturing Temporal Information in a Single Frame: Channel Sampling Strategies for Action Recognition | Jan 25, 2022 | Action Recognition, Optical Flow Estimation | Code Available | 0 |
| End-to-end Generative Pretraining for Multimodal Video Captioning | Jan 20, 2022 | Action Classification, Decoder | Unverified | 0 |
| Multiview Transformers for Video Recognition | Jan 12, 2022 | Action Classification, Action Recognition | Unverified | 0 |
| MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | Jan 7, 2022 | Action Classification, Navigate | Unverified | 0 |
| Memory-Guided Semantic Learning Network for Temporal Sentence Grounding | Jan 3, 2022 | Sentence, Temporal Sentence Grounding | Unverified | 0 |
| Recurring the Transformer for Video Action Recognition | Jan 1, 2022 | Action Recognition, GPU | Unverified | 0 |
| Improving Video Model Transfer With Dynamic Representation Learning | Jan 1, 2022 | Action Classification, Knowledge Distillation | Unverified | 0 |
| YouMVOS: An Actor-Centric Multi-Shot Video Object Segmentation Dataset | Jan 1, 2022 | Management, Segmentation | Unverified | 0 |
| UBoCo: Unsupervised Boundary Contrastive Learning for Generic Event Boundary Detection | Jan 1, 2022 | Boundary Detection, Contrastive Learning | Unverified | 0 |
| VRDFormer: End-to-End Video Visual Relation Detection With Transformers | Jan 1, 2022 | Object, Relation | Unverified | 0 |
| Video Joint Modelling Based on Hierarchical Transformer for Co-summarization | Dec 27, 2021 | Retrieval, Supervised Video Summarization | Code Available | 1 |
| Exploiting Long-Term Dependencies for Generating Dynamic Scene Graphs | Dec 18, 2021 | Graph Generation, Object | Code Available | 0 |
| Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation | Dec 16, 2021 | Contrastive Learning, Representation Learning | Code Available | 1 |
| Discrete neural representations for explainable anomaly detection | Dec 10, 2021 | Anomaly Detection, Object | Unverified | 0 |