MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge

2023-03-15 · ICCV 2023 · Code Available

Wei Lin, Leonid Karlinsky, Nina Shvetsova, Horst Possegger, Mateusz Kozinski, Rameswar Panda, Rogerio Feris, Hilde Kuehne, Horst Bischof


Code

github.com/wlin-at/maxi
Official · In paper · PyTorch · ★ 30

Abstract

Large-scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities. This enables remarkable progress in zero-shot recognition, image generation and editing, and many other exciting tasks. However, VL models tend to over-represent objects while paying much less attention to verbs, and require additional tuning on video data for best zero-shot action recognition performance. While previous work relied on large-scale, fully annotated data, in this work we propose an unsupervised approach. We adapt a VL model for zero-shot and few-shot action recognition using a collection of unlabeled videos and an unpaired action dictionary. Based on that, we leverage Large Language Models and VL models to build a text bag for each unlabeled video via matching, text expansion and captioning. We use those bags in a Multiple Instance Learning setup to adapt an image-text backbone to video data. Although finetuned on unlabeled video data, our resulting models demonstrate high transferability to numerous unseen zero-shot downstream tasks, improving the base VL model performance by up to 14%, and even comparing favorably to fully supervised baselines in both zero-shot and few-shot video recognition transfer. The code will be released later at https://github.com/wlin-at/MAXI.
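
The matching and Multiple Instance Learning steps described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The function names (`match_top_k`, `mil_nce_loss`), the toy 2-D embeddings, the temperature value, and the choice of a MIL-NCE-style objective are all assumptions made for illustration, since the abstract does not specify the loss or the encoders. The text expansion and captioning steps are omitted here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def match_top_k(video_emb, text_embs, k):
    """Hypothetical matching step: indices of the k dictionary
    entries whose embeddings are most similar to the video."""
    order = sorted(
        range(len(text_embs)),
        key=lambda i: cosine(video_emb, text_embs[i]),
        reverse=True,
    )
    return order[:k]

def mil_nce_loss(video_emb, text_embs, bag, temperature=0.1):
    """MIL-NCE-style loss (an assumed objective): the bag as a whole
    is treated as the positive, so no single text must be correct.
    loss = -log( sum_{i in bag} exp(s_i / t) / sum_j exp(s_j / t) )"""
    logits = [cosine(video_emb, t) / temperature for t in text_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    positives = sum(exps[i] for i in bag)
    return -math.log(positives / sum(exps))

# Toy embeddings standing in for VL-model outputs (made-up values).
video = [1.0, 0.0]
dictionary = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]

bag = match_top_k(video, dictionary, k=2)
loss = mil_nce_loss(video, dictionary, bag)
print(bag, loss)
```

In this sketch a bag made of the best-matching texts gives a lower loss than a bag of poorly matching ones, which is the signal a finetuning loop would use to pull video and text representations together.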

Tasks

Action Recognition, Few-Shot Action Recognition, Image Generation, Multiple Instance Learning, Video Recognition, Zero-Shot Action Recognition, Zero-Shot Learning

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
Charades	MAXI	mAP	23.8	—	Unverified
HMDB51	MAXI	Top-1 Accuracy	52.3	—	Unverified
Kinetics	MAXI	Top-1 Accuracy	71.6	—	Unverified
UCF101	MAXI	Top-1 Accuracy	78.2	—	Unverified
