
Perception Test: A Diagnostic Benchmark for Multimodal Models

2022-10-19 · DeepMind 2022 · Code Available

Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Skanda Koppula, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman and João Carreira


Abstract

We propose a novel multimodal benchmark – the Perception Test – that aims to extensively evaluate the perception and reasoning skills of multimodal models. The Perception Test introduces real-world videos designed to show perceptually interesting situations and defines multiple tasks that require understanding of memory, abstract patterns, physics, and semantics – across visual, audio, and text modalities. The benchmark consists of 11.6k videos, 23 s in length on average, filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels: object tracks, point tracks, temporal action segments, temporal sound segments, multiple-choice video question-answers, and grounded video question-answers. The benchmark probes pre-trained models for their transfer capabilities in a zero-shot, few-shot, or finetuning regime. Evaluation results are provided as a multi-dimensional diagnostic report, detailing models’ strengths and weaknesses across perception skills, computational tasks, and types of reasoning. Preliminary results comparing a human baseline to state-of-the-art video question answering models show a significant gap in performance (91.4% vs 36%), suggesting that perception is far from being solved. The training and validation splits of the benchmark are publicly available for download at https://github.com/deepmind/perception_test, under a CC-BY license, together with per-task baseline results. We hope that the Perception Test will inspire and contribute to progress towards more general perception models.
