SOTAVerified

Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models

2022-07-15

Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, Yin Cui


Abstract

Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition. In this work, we extend this paradigm by leveraging motion and audio that naturally exist in video. We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification. In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram. We design a cross-modal fusion mechanism to aggregate complementary multimodal information. Experiments on Kinetics-700 and VGGSound show that introducing the flow or audio modality brings large performance gains over the pre-trained VLM and existing methods. Specifically, MOV greatly improves accuracy on base classes while generalizing better to novel classes. MOV achieves state-of-the-art results on the UCF and HMDB zero-shot video classification benchmarks, significantly outperforming both traditional zero-shot methods and recent methods based on VLMs. Code and models will be released.
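The pipeline the abstract describes — encode each modality (video, optical flow, audio spectrogram) with a pre-trained vision encoder, fuse the modalities, then classify by similarity to text embeddings of class names — can be sketched as follows. This is a toy illustration, not the paper's implementation: the random projections stand in for the VLM's vision and text encoders, and the attention-weighted fusion is a simplified stand-in for the paper's cross-modal fusion mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding dim (the paper uses CLIP ViT encoders)

def encode(x, W):
    """Stand-in for a pre-trained VLM vision encoder (hypothetical)."""
    v = x @ W
    return v / np.linalg.norm(v)

# Toy per-modality inputs: video frames, optical flow, audio spectrogram,
# each flattened to a feature vector. MOV feeds the real signals to a ViT.
video, flow, audio = (rng.normal(size=16) for _ in range(3))
W_v, W_f, W_a = (rng.normal(size=(16, D)) for _ in range(3))

z_video = encode(video, W_v)
z_flow = encode(flow, W_f)
z_audio = encode(audio, W_a)

# Cross-modal fusion: the video embedding attends over the auxiliary
# modalities and absorbs a weighted sum of them (simplified stand-in).
aux = np.stack([z_flow, z_audio])        # (2, D)
attn = np.exp(aux @ z_video)
attn /= attn.sum()                       # softmax attention weights
fused = z_video + attn @ aux
fused /= np.linalg.norm(fused)

# Open-vocabulary classification: cosine similarity against text
# embeddings of class names (random stand-ins for a CLIP text encoder).
class_names = ["playing guitar", "dog barking", "surfing"]
text_emb = rng.normal(size=(len(class_names), D))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

scores = text_emb @ fused
predicted = class_names[int(scores.argmax())]
print(predicted)
```

Because classification is a similarity search over text embeddings rather than a fixed softmax head, the class vocabulary can be changed at inference time — which is what makes the setup open-vocabulary.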

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| HMDB51 | MOV (ViT-L/14) | Top-1 Accuracy | 64.7 | — | Unverified |
| HMDB51 | MOV (ViT-B/16) | Top-1 Accuracy | 60.8 | — | Unverified |
| UCF101 | MOV (ViT-L/14) | Top-1 Accuracy | 87.1 | — | Unverified |
| UCF101 | MOV (ViT-B/16) | Top-1 Accuracy | 82.6 | — | Unverified |
