| Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens | Jun 13, 2022 | Action Recognition, Video Understanding | Unverified | 0 |
| Efficient Annotation and Learning for 3D Hand Pose Estimation: A Survey | Jun 5, 2022 | 3D Hand Pose Estimation, Domain Adaptation | Unverified | 0 |
| Development of a MultiModal Annotation Framework and Dataset for Deep Video Understanding | Jun 1, 2022 | Knowledge Graphs, Video Understanding | Unverified | 0 |
| i-Code: An Integrative and Composable Multimodal Learning Framework | May 3, 2022 | Contrastive Learning, Video Understanding | Unverified | 0 |
| Overview of the MedVidQA 2022 Shared Task on Medical Video Question-Answering | May 1, 2022 | Question Answering, Video Classification | Unverified | 0 |
| Contrastive Language-Action Pre-training for Temporal Localization | Apr 26, 2022 | Action Localization, Contrastive Learning | Unverified | 0 |
| Causal Reasoning Meets Visual Representation Learning: A Prospective Study | Apr 26, 2022 | Benchmarking, Out-of-Distribution Generalization | Unverified | 0 |
| Revealing Occlusions with 4D Neural Fields | Apr 22, 2022 | Video Understanding | Unverified | 0 |
| Less than Few: Self-Shot Video Instance Segmentation | Apr 19, 2022 | Few-Shot Learning, Instance Segmentation | Unverified | 0 |
| ActAR: Actor-Driven Pose Embeddings for Video Action Recognition | Apr 19, 2022 | Action Recognition, Optical Flow Estimation | Unverified | 0 |
| Adversarial Machine Learning Attacks Against Video Anomaly Detection Systems | Apr 7, 2022 | Anomaly Detection, BIG-bench Machine Learning | Unverified | 0 |
| MM-SEAL: A Large-scale Video Dataset of Multi-person Multi-grained Spatio-temporally Action Localization | Apr 6, 2022 | Action Localization, Action Recognition | Unverified | 0 |
| PYSKL: a toolbox for skeleton-based video understanding | Apr 2, 2022 | Skeleton Based Action Recognition, Video Understanding | Unverified | 0 |
| FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks | Mar 24, 2022 | Action Recognition, Retrieval | Code Available | 0 |
| On the Pitfalls of Batch Normalization for End-to-End Video Learning: A Study on Surgical Workflow Analysis | Mar 15, 2022 | Video Understanding | Code Available | 0 |
| Human Gaze Guided Attention for Surgical Activity Recognition | Mar 9, 2022 | Activity Recognition, Video Understanding | Unverified | 0 |
| Multi-Scale Self-Contrastive Learning with Hard Negative Mining for Weakly-Supervised Query-based Video Grounding | Mar 8, 2022 | Contrastive Learning, Sentence | Unverified | 0 |
| Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection | Mar 1, 2022 | Avg, Boundary Detection | Unverified | 0 |
| Concept Graph Neural Networks for Surgical Video Understanding | Feb 27, 2022 | Video Understanding | Unverified | 0 |
| Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations | Feb 21, 2022 | Answer Generation, Video Understanding | Unverified | 0 |
| A Coding Framework and Benchmark towards Low-Bitrate Video Understanding | Feb 6, 2022 | Video Compression, Video Understanding | Code Available | 0 |
| Capturing Temporal Information in a Single Frame: Channel Sampling Strategies for Action Recognition | Jan 25, 2022 | Action Recognition, Optical Flow Estimation | Code Available | 0 |
| End-to-end Generative Pretraining for Multimodal Video Captioning | Jan 20, 2022 | Action Classification, Decoder | Unverified | 0 |
| Multiview Transformers for Video Recognition | Jan 12, 2022 | Action Classification, Action Recognition | Unverified | 0 |
| MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound | Jan 7, 2022 | Action Classification, Navigate | Unverified | 0 |
| Memory-Guided Semantic Learning Network for Temporal Sentence Grounding | Jan 3, 2022 | Sentence, Temporal Sentence Grounding | Unverified | 0 |
| VRDFormer: End-to-End Video Visual Relation Detection With Transformers | Jan 1, 2022 | Object, Relation | Unverified | 0 |
| YouMVOS: An Actor-Centric Multi-Shot Video Object Segmentation Dataset | Jan 1, 2022 | Management, Segmentation | Unverified | 0 |
| Improving Video Model Transfer With Dynamic Representation Learning | Jan 1, 2022 | Action Classification, Knowledge Distillation | Unverified | 0 |
| UBoCo: Unsupervised Boundary Contrastive Learning for Generic Event Boundary Detection | Jan 1, 2022 | Boundary Detection, Contrastive Learning | Unverified | 0 |
| Recurring the Transformer for Video Action Recognition | Jan 1, 2022 | Action Recognition, GPU | Unverified | 0 |
| Exploiting Long-Term Dependencies for Generating Dynamic Scene Graphs | Dec 18, 2021 | Graph Generation, Object | Code Available | 0 |
| Discrete neural representations for explainable anomaly detection | Dec 10, 2021 | Anomaly Detection, Object | Unverified | 0 |
| Auto-X3D: Ultra-Efficient Video Understanding via Finer-Grained Neural Architecture Search | Dec 9, 2021 | Neural Architecture Search, Video Recognition | Unverified | 0 |
| Suppressing Static Visual Cues via Normalizing Flows for Self-Supervised Video Representation Learning | Dec 7, 2021 | Contrastive Learning, Representation Learning | Code Available | 0 |
| Stacked Temporal Attention: Improving First-person Action Recognition by Emphasizing Discriminative Clips | Dec 2, 2021 | Action Recognition, Video Understanding | Unverified | 0 |
| LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering | Nov 29, 2021 | Diversity, Question Answering | Unverified | 0 |
| UBoCo: Unsupervised Boundary Contrastive Learning for Generic Event Boundary Detection | Nov 29, 2021 | Boundary Detection, Contrastive Learning | Unverified | 0 |
| Fill-in-the-Blank: A Challenging Video Understanding Evaluation Framework | Nov 16, 2021 | Multiple-choice, Question Answering | Unverified | 0 |
| Occluded Video Instance Segmentation: Dataset and ICCV 2021 Challenge | Nov 15, 2021 | Instance Segmentation, Object Recognition | Unverified | 0 |
| Re-ID-AR: Improved Person Re-identification in Video via Joint Weakly Supervised Action Recognition | Nov 1, 2021 | Action Recognition, Person Re-Identification | Code Available | 0 |
| Gradient Frequency Modulation for Visually Explaining Video Understanding Models | Nov 1, 2021 | Action Recognition, Temporal Action Localization | Unverified | 0 |
| Distantly Supervised Semantic Text Detection and Recognition for Broadcast Sports Videos Understanding | Oct 31, 2021 | Action Recognition, Text Detection | Unverified | 0 |
| Leveraging Local Temporal Information for Multimodal Scene Classification | Oct 26, 2021 | Classification, Scene Classification | Unverified | 0 |
| Can't Fool Me: Adversarially Robust Transformer for Video Understanding | Oct 26, 2021 | image-classification, Image Classification | Unverified | 0 |
| NoisyActions2M: A Multimedia Dataset for Video Understanding from Noisy Labels | Oct 13, 2021 | Action Classification, Self-Supervised Learning | Code Available | 0 |
| CLIP4Caption: CLIP for Video Caption | Oct 13, 2021 | Decoder, Sentence | Unverified | 0 |
| TAda! Temporally-Adaptive Convolutions for Video Understanding | Oct 12, 2021 | Action Classification, Action Recognition | Code Available | 0 |
| Toward a Human-Level Video Understanding Intelligence | Oct 8, 2021 | AI Agent, Video Understanding | Unverified | 0 |
| Efficient Modelling Across Time of Human Actions and Interactions | Oct 5, 2021 | Action Recognition, Video Understanding | Unverified | 0 |