VideoChat: Chat-Centric Video Understanding

2023-05-10

Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, LiMin Wang, Yu Qiao

Code

github.com/opengvlab/ask-anything
Official · In paper · PyTorch · ★ 3,337

Abstract

In this paper, we make an initial attempt at developing an end-to-end chat-centric video understanding system, named VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we build a video-centric instruction dataset composed of thousands of videos paired with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications, and it could serve as a simple prototype for future research on chat-centric video understanding. Our code and data are available at https://github.com/OpenGVLab/Ask-Anything

Tasks

Question Answering
Video-based Generative Performance Benchmarking
Video-based Generative Performance Benchmarking (Consistency)
Video-based Generative Performance Benchmarking (Contextual Understanding)
Video-based Generative Performance Benchmarking (Correctness of Information)
Video-based Generative Performance Benchmarking (Detail Orientation)
Video-based Generative Performance Benchmarking (Temporal Understanding)
Video Question Answering
Video Understanding
Zero-Shot Video Question Answer

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
NExT-QA (Open-ended VideoQA)	VideoChat	Accuracy	56.6	—	Unverified
