M&M Mix: A Multimodal Multiview Transformer Ensemble

2022-06-20

Xuehan Xiong, Anurag Arnab, Arsha Nagrani, Cordelia Schmid


Abstract

This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge. Our approach builds upon our recent work, Multiview Transformer for Video Recognition (MTV), and adapts it to multimodal inputs. Our final submission consists of an ensemble of Multimodal MTV (M&M) models with varying backbone sizes and input modalities. Our approach achieved 52.8% Top-1 accuracy on action classes on the test set, which is 4.1% higher than last year's winning entry.
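
The abstract does not say how the ensemble members are combined. As a rough illustration only, the sketch below assumes a common late-fusion scheme: each model's logits are turned into class probabilities, and the probabilities are averaged (optionally with weights) before taking the top-1 class. The function names, the weights, and the example logits are all hypothetical and are not taken from the report.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_model_logits, weights=None):
    """Average class probabilities across models; return (top-1 class, probs).

    Assumption: late fusion by (weighted) probability averaging. The report
    may use a different combination rule.
    """
    n_models = len(per_model_logits)
    if weights is None:
        # Default to a uniform average over ensemble members.
        weights = [1.0 / n_models] * n_models
    n_classes = len(per_model_logits[0])
    avg = [0.0] * n_classes
    for w, logits in zip(weights, per_model_logits):
        for c, p in enumerate(softmax(logits)):
            avg[c] += w * p
    top1 = max(range(n_classes), key=avg.__getitem__)
    return top1, avg


# Hypothetical logits from three ensemble members over four action classes.
logits = [
    [2.0, 1.0, 0.1, -1.0],
    [0.5, 2.5, 0.0, -0.5],
    [1.8, 1.9, 0.2, 0.0],
]
top1, probs = ensemble_predict(logits)
print(top1, [round(p, 3) for p in probs])
```

Averaging probabilities rather than raw logits keeps members with differently scaled outputs from dominating the vote. Non-uniform weights, for example tuned on a validation split, would be passed through the `weights` argument.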

Tasks

Action Recognition, Video Recognition

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
EPIC-KITCHENS-100	M&M (WTS 60M)	Action@1	53.6	—	Unverified
