| STPrivacy: Spatio-Temporal Privacy-Preserving Action Recognition | Jan 8, 2023 | Action Recognition, Facial Expression Recognition (FER) | Unverified | 0 |
| EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding | Jan 5, 2023 | Video Understanding | Unverified | 0 |
| PIDRo: Parallel Isomeric Attention with Dynamic Routing for Text-Video Retrieval | Jan 1, 2023 | Representation Learning, Retrieval | Unverified | 0 |
| Self-Supervised Object Detection from Egocentric Videos | Jan 1, 2023 | Class-agnostic Object Detection, Object | Unverified | 0 |
| Relational Space-Time Query in Long-Form Videos | Jan 1, 2023 | Form, Video Understanding | Unverified | 0 |
| Few-Shot Referring Relationships in Videos | Jan 1, 2023 | Object, Relation Network | Code Available | 0 |
| UniFormerV2: Unlocking the Potential of Image ViTs for Video Understanding | Jan 1, 2023 | Video Understanding | Unverified | 0 |
| Inverse Compositional Learning for Weakly-supervised Relation Grounding | Jan 1, 2023 | Relation, Video Understanding | Unverified | 0 |
| Multimodal High-order Relation Transformer for Scene Boundary Detection | Jan 1, 2023 | Boundary Detection, Decoder | Unverified | 0 |
| Joint Engagement Classification using Video Augmentation Techniques for Multi-person Human-robot Interaction | Dec 28, 2022 | Data Augmentation, Face Swapping | Unverified | 0 |
| Inductive Attention for Video Action Anticipation | Dec 17, 2022 | Action Anticipation, Action Recognition | Unverified | 0 |
| Egocentric Video Task Translation | Dec 13, 2022 | Multi-Task Learning, Translation | Unverified | 0 |
| Contextual Explainable Video Representation: Human Perception-based Understanding | Dec 12, 2022 | Action Detection, Action Recognition | Code Available | 0 |
| PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data | Dec 8, 2022 | Action Recognition, Prompt Learning | Unverified | 0 |
| Transition Is a Process: Pair-to-Video Change Detection Networks for Very High Resolution Remote Sensing Images | Dec 7, 2022 | Building change detection for remote sensing images, Change Detection | Unverified | 0 |
| Spatio-Temporal Crop Aggregation for Video Representation Learning | Nov 30, 2022 | Action Classification, Dimensionality Reduction | Unverified | 0 |
| Dynamic Appearance: A Video Representation for Action Recognition with Joint Training | Nov 23, 2022 | Action Recognition, Temporal Action Localization | Unverified | 0 |
| A Unified Model for Video Understanding and Knowledge Embedding with Heterogeneous Knowledge Graph Dataset | Nov 19, 2022 | Common Sense Reasoning, Graph Embedding | Unverified | 0 |
| Masked Autoencoders for Egocentric Video Understanding @ Ego4D Challenge 2022 | Nov 18, 2022 | Object State Change Classification, Temporal Localization | Code Available | 0 |
| Exploring State Change Capture of Heterogeneous Backbones @ Ego4D Hands and Objects Challenge 2022 | Nov 16, 2022 | Human-Object Interaction Detection, Object | Unverified | 0 |
| Grounded Video Situation Recognition | Oct 19, 2022 | Descriptive, Structured Prediction | Unverified | 0 |
| How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios | Oct 18, 2022 | Video Understanding | Code Available | 0 |
| Self-supervised video pretraining yields robust and more human-aligned visual representations | Oct 12, 2022 | Contrastive Learning, object-detection | Unverified | 0 |
| Students taught by multimodal teachers are superior action recognizers | Oct 9, 2022 | Action Recognition, Knowledge Distillation | Unverified | 0 |
| Compressed Vision for Efficient Video Understanding | Oct 6, 2022 | Video Compression, Video Understanding | Unverified | 0 |
| Learning to Focus on the Foreground for Temporal Sentence Grounding | Oct 1, 2022 | Sentence, Temporal Sentence Grounding | Unverified | 0 |
| In-the-Wild Video Question Answering | Oct 1, 2022 | Evidence Selection, Question Answering | Unverified | 0 |
| Speeding Up Action Recognition Using Dynamic Accumulation of Residuals in Compressed Domain | Sep 29, 2022 | Action Recognition, Video Understanding | Unverified | 0 |
| AVT: Audio-Video Transformer for Multimodal Action Recognition | Sep 22, 2022 | Action Recognition, Audio Classification | Unverified | 0 |
| WildQA: In-the-Wild Video Question Answering | Sep 14, 2022 | Evidence Selection, Question Answering | Unverified | 0 |
| Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions | Sep 7, 2022 | Image Generation, Text to Image Generation | Unverified | 0 |
| Visual Subtitle Feature Enhanced Video Outline Generation | Aug 24, 2022 | Articles, Headline Generation | Unverified | 0 |
| Identifying Auxiliary or Adversarial Tasks Using Necessary Condition Analysis for Adversarial Multi-task Video Understanding | Aug 22, 2022 | Action Recognition, Multi-Task Learning | Unverified | 0 |
| Motion Sensitive Contrastive Learning for Self-supervised Video Representation | Aug 12, 2022 | Contrastive Learning, Representation Learning | Unverified | 0 |
| Exploring Anchor-based Detection for Ego4D Natural Language Query | Aug 10, 2022 | Video Understanding | Unverified | 0 |
| SA-NET.v2: Real-time vehicle detection from oblique UAV images with use of uncertainty estimation in deep meta-learning | Aug 4, 2022 | Meta-Learning, Semantic Segmentation | Unverified | 0 |
| Two-Stream Transformer Architecture for Long Video Understanding | Aug 2, 2022 | Action Recognition, GPU | Unverified | 0 |
| BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation | Aug 1, 2022 | Object, Optical Flow Estimation | Unverified | 0 |
| EgoEnv: Human-centric environment representations from egocentric video | Jul 22, 2022 | Video Understanding | Unverified | 0 |
| Video Swin Transformers for Egocentric Video Understanding @ Ego4D Challenges 2022 | Jul 22, 2022 | Object, Object State Change Classification | Unverified | 0 |
| AE-Net: Adjoint Enhancement Network for Efficient Action Recognition in Video Understanding | Jul 21, 2022 | Action Recognition, Video Understanding | Unverified | 0 |
| An Efficient Spatio-Temporal Pyramid Transformer for Action Detection | Jul 21, 2022 | Action Detection, Video Understanding | Unverified | 0 |
| SVGraph: Learning Semantic Graphs from Instructional Videos | Jul 16, 2022 | Graph Learning, Video Understanding | Unverified | 0 |
| GraphVid: It Only Takes a Few Nodes to Understand a Video | Jul 4, 2022 | Superpixels, Video Understanding | Unverified | 0 |
| Multimodal Intent Discovery from Livestream Videos | Jul 1, 2022 | Intent Discovery, Video Summarization | Unverified | 0 |
| (Un)likelihood Training for Interpretable Embedding | Jul 1, 2022 | Ad-hoc video search, Decoder | Code Available | 0 |
| Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering | Jul 1, 2022 | Question Answering, Video Question Answering | Unverified | 0 |
| Submission to Generic Event Boundary Detection Challenge@CVPR 2022: Local Context Modeling and Global Boundary Decoding Approach | Jun 30, 2022 | Boundary Detection, Generic Event Boundary Detection | Code Available | 0 |
| Technical Report for CVPR 2022 LOVEU AQTC Challenge | Jun 29, 2022 | Video Understanding | Code Available | 0 |
| Multimodal Dialogue State Tracking | Jun 16, 2022 | Dialogue State Tracking, Video Understanding | Code Available | 0 |