LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li
Code:
- github.com/LLaVA-VL/LLaVA-NeXT (official, in paper, PyTorch, ★ 4,609)
- github.com/pwc-1/Paper-9/tree/main/llava_next (MindSpore, ★ 0)
- github.com/dinhvietcuong1996/icme25-inova (PyTorch, ★ 0)
Abstract
Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, and their application to multi-image scenarios remains less explored. Additionally, prior LMM research tackles different scenarios separately, making it impossible to generalize across scenarios with new emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave achieves leading results on multi-image, video, and 3D benchmarks while maintaining performance on single-image tasks. Moreover, our model exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT
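To illustrate the idea of a single interleaved template, the sketch below builds a LLaVA-style training sample in which every visual input (an extra image, a video frame, or a 3D view) becomes one `<image>` placeholder in the conversation. This is a minimal illustration assuming the common LLaVA conversation JSON layout, not the official M4-Instruct schema; the helper name and field values are hypothetical.

```python
# Minimal sketch of an interleaved-format sample, assuming the common
# LLaVA conversation layout ("conversations" with human/gpt turns and
# one "<image>" placeholder per visual input). Not the official schema.

def make_interleaved_sample(images, question, answer):
    """Build one training sample: one <image> token per visual input.

    Multi-image, multi-frame (video), multi-view (3D), and multi-patch
    (single-image) cases all reduce to this same interleaved template --
    only the meaning of the entries in `images` changes.
    """
    prompt = "".join("<image>\n" for _ in images) + question
    return {
        "images": list(images),  # file paths, frame ids, or view ids
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": answer},
        ],
    }

# Example: three video frames treated as three interleaved images.
sample = make_interleaved_sample(
    ["frame_000.jpg", "frame_001.jpg", "frame_002.jpg"],
    "Describe what changes across these frames.",
    "The object rotates clockwise.",
)
```

The key design point the paper emphasizes is that one data format serves all four scenarios, so capabilities learned in one setting can transfer to the others.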
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| NExT-QA | LLaVA-NeXT-Interleave (14B) | Accuracy | 79.1 | — | Unverified |
| NExT-QA | LLaVA-NeXT-Interleave (7B) | Accuracy | 78.2 | — | Unverified |
| NExT-QA | LLaVA-NeXT-Interleave (DPO) | Accuracy | 77.9 | — | Unverified |