
CrossVideoMAE: Contrastive Spatiotemporal and Semantic Representation Learning from Videos and Images with Masked Autoencoders

2025-02-08

Shihab Aaqil Ahamed, Malitha Gunawardhana, Liel David, Michael Sidorov, Daniel Harari, Muhammad Haris Khan


Abstract

Current video-based Masked Autoencoders (MAEs) primarily learn general spatiotemporal patterns from a visual perspective but often overlook nuanced semantic attributes, such as the specific interactions or sequences that define actions, which align more closely with human cognition of space-time correspondence. This can limit a model's ability to capture the essence of actions that are contextually rich and continuous. Humans map visual concepts, object view invariance, and semantic attributes available in static instances to comprehend natural dynamic scenes and videos. Existing MAEs for videos and static images rely on frame-by-frame continuity or on separate video and image datasets, which may lack the rich semantic attributes necessary to fully understand the learned concepts, compared with using videos together with their sampled frame images. To this end, we propose CrossVideoMAE, a self-supervised image-video contrastive MAE pre-training framework that effectively learns both video-level and frame-level spatiotemporal representations and semantic attributes. Our method integrates mutual spatiotemporal information from videos with spatial information from sampled frames in a feature-invariant space, while encouraging invariance to augmentations within the video domain. This objective is achieved by jointly embedding the features of visible tokens and combining feature correspondence within and across modalities, which is critical for acquiring rich, label-free guiding signals from both the video and frame-image modalities in a self-supervised manner. Extensive experiments demonstrate that our approach surpasses previous state-of-the-art methods, and ablation studies validate its effectiveness.
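The abstract describes two ingredients that can be sketched concretely: masking out most video tokens so the encoder sees only the visible subset, and a contrastive objective that pulls together paired video-level and frame-level embeddings. The minimal NumPy sketch below is illustrative only, not the paper's implementation: the random token mask, the mean-pooled "encoder", the first-frame embedding, and the symmetric InfoNCE loss are all simplifying assumptions standing in for the actual architecture and cross-modal correspondence terms.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_token_mask(num_tokens, mask_ratio, rng):
    """Boolean mask over tokens: True = masked (hidden), False = visible."""
    mask = np.zeros(num_tokens, dtype=bool)
    idx = rng.choice(num_tokens, size=int(num_tokens * mask_ratio), replace=False)
    mask[idx] = True
    return mask

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(video_emb, frame_emb, temperature=0.07):
    """Symmetric InfoNCE between paired video/frame embeddings (positives on the diagonal)."""
    logits = l2_normalize(video_emb) @ l2_normalize(frame_emb).T / temperature
    n = len(logits)
    def xent(lg):  # cross-entropy with the matching pair as the target class
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()
    return 0.5 * (xent(logits) + xent(logits.T))

# Toy batch: 4 clips, 196 tokens per clip, 64-d token features (arbitrary sizes).
B, N, D = 4, 196, 64
video_tokens = rng.standard_normal((B, N, D))
mask = random_token_mask(N, mask_ratio=0.9, rng=rng)
visible = video_tokens[:, ~mask, :]      # encoder would see only the visible tokens
video_emb = visible.mean(axis=1)         # stand-in for the video encoder output
frame_emb = video_tokens[:, 0, :]        # stand-in for a sampled-frame embedding
loss = info_nce(video_emb, frame_emb)
print(visible.shape, float(loss))
```

With a 90% mask ratio only 20 of the 196 tokens remain visible, which is what makes MAE-style pre-training cheap; the contrastive term then supplies the label-free cross-modal guiding signal the abstract refers to.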
