Video Instruction Tuning With Synthetic Data

2024-10-03

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li

Abstract

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
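The abstract names three task types in LLaVA-Video-178K: detailed captioning, open-ended QA, and multiple-choice QA. As a rough illustration only, the sketch below shows what instruction-following samples for these task types might look like in a LLaVA-style conversation format; the field names, file paths, and answer texts are hypothetical, not the released schema.

```python
# Hypothetical video instruction-tuning samples for the three task types
# named in the abstract. All field names and values are illustrative.
samples = [
    {
        "video": "example_clip.mp4",  # placeholder path, not a real file
        "task": "detailed_captioning",
        "conversations": [
            {"from": "human", "value": "<video>\nDescribe this video in detail."},
            {"from": "gpt", "value": "A person slices vegetables on a wooden board ..."},
        ],
    },
    {
        "video": "example_clip.mp4",
        "task": "open_ended_qa",
        "conversations": [
            {"from": "human", "value": "<video>\nWhat is the person holding?"},
            {"from": "gpt", "value": "A kitchen knife."},
        ],
    },
    {
        "video": "example_clip.mp4",
        "task": "multiple_choice_qa",
        "conversations": [
            {"from": "human",
             "value": "<video>\nWhat happens next?\nA. ...\nB. ...\nC. ...\nD. ..."},
            {"from": "gpt", "value": "B"},
        ],
    },
]

# Each sample pairs a video reference with a human/model conversation turn.
task_types = sorted({s["task"] for s in samples})
print(task_types)
```

Training on such samples, mixed with existing image instruction data, is the recipe the abstract describes for producing LLaVA-Video.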

Benchmark Results

| Dataset | Model           | Metric           | Claimed | Verified | Status     |
|---------|-----------------|------------------|---------|----------|------------|
| NExT-QA | LLaVA-Video     | Accuracy         | 83.2    |          | Unverified |
| TVBench | LLaVA-Video 72B | Average Accuracy | 50      |          | Unverified |
| TVBench | LLaVA-Video 7B  | Average Accuracy | 45.6    |          | Unverified |