Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan
Code
- github.com/mbzuai-oryx/video-chatgpt — official (in paper), PyTorch, ★ 1,498
- github.com/qiujihao19/artemis — PyTorch, ★ 27
Abstract
Conversational agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts at image-based conversation models, this work addresses the under-explored field of video-based conversation by introducing Video-ChatGPT. It is a multimodal model that merges a video-adapted visual encoder with an LLM, and it is capable of understanding and generating detailed conversations about videos. We introduce a new dataset of 100,000 video-instruction pairs used to train Video-ChatGPT, acquired via a manual and semi-automated pipeline that is easily scalable and robust to label noise. We also develop a quantitative evaluation framework for video-based dialogue models to objectively analyze their strengths and weaknesses. Code: https://github.com/mbzuai-oryx/Video-ChatGPT.
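To make the "video-adapted visual encoder merged with an LLM" concrete, below is a minimal PyTorch sketch of the spatiotemporal pooling the paper describes: per-frame CLIP patch features are averaged over time and over space, and the concatenated tokens are linearly projected into the LLM's embedding space. The class name, dimensions, and tensor shapes are illustrative assumptions, not the official implementation (see the linked repo for that).

```python
import torch
import torch.nn as nn

class SpatioTemporalPooling(nn.Module):
    """Illustrative sketch: pool per-frame visual features along time and
    space, then project them into the LLM's token-embedding space."""

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Single linear adapter from visual feature dim to LLM embedding dim.
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T, N, D) = (frames, patch tokens per frame, feature dim)
        temporal = frame_feats.mean(dim=0)  # (N, D): each patch averaged over time
        spatial = frame_feats.mean(dim=1)   # (T, D): each frame averaged over patches
        video_tokens = torch.cat([temporal, spatial], dim=0)  # (N + T, D)
        return self.proj(video_tokens)      # (N + T, llm_dim) tokens fed to the LLM

# Example: 100 frames of ViT-style patch features (256 tokens, 1024-dim each).
feats = torch.randn(100, 256, 1024)
tokens = SpatioTemporalPooling()(feats)
print(tokens.shape)  # torch.Size([356, 4096])
```

The design choice worth noting is that pooling keeps the video representation at N + T tokens regardless of clip length beyond the sampled frames, so the frozen LLM sees a fixed, compact set of video tokens rather than every patch of every frame.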
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| NExT-QA (Open-ended VideoQA) | Video-ChatGPT | Accuracy | 54.6 | — | Unverified |