LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li
Code
- github.com/evolvinglmms-lab/lmms-eval (official, cited in paper; PyTorch, ★ 3,924)
- github.com/MindSpore-scientific-2/code-14/tree/main/llava_next (MindSpore, ★ 0)
Abstract
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities and scenarios, yielding new emergent capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
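As a concrete illustration of the single-model claim, the sketch below runs an image query and a video query through one checkpoint. It is a minimal sketch, assuming the community Hugging Face port (the `llava-hf/llava-onevision-qwen2-7b-ov-hf` checkpoint id and transformers' `LlavaOnevisionForConditionalGeneration` / `AutoProcessor` classes), which is not part of this paper's own release; consult the model card before relying on it.

```python
# Minimal sketch: one LLaVA-OneVision checkpoint answering an image query and a
# video query. Checkpoint id and classes are assumptions based on the Hugging Face
# `transformers` port, not on this paper's original codebase.
import numpy as np
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed HF checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Image query: the chat template inserts the image placeholder tokens.
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is shown in this image?"},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
image = np.zeros((336, 336, 3), dtype=np.uint8)  # stand-in for a real image array
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0], skip_special_tokens=True))

# Video query with the same model: a clip is passed as a stack of frames,
# illustrating the image-to-video task transfer the abstract describes.
conversation = [
    {"role": "user", "content": [
        {"type": "video"},
        {"type": "text", "text": "Describe what happens in this clip."},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
video = np.zeros((8, 336, 336, 3), dtype=np.uint8)  # 8 stand-in frames
inputs = processor(videos=video, text=prompt, return_tensors="pt").to(model.device, torch.float16)
out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0], skip_special_tokens=True))
```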
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| Vinoground | LLaVA-OneVision-Qwen2-7B | Text Score | 41.6 | — | Unverified |
| Vinoground | LLaVA-OneVision-Qwen2-72B | Text Score | 48.4 | — | Unverified |
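The Vinoground rows above are claimed but not yet independently verified. Below is a hedged sketch of how one might attempt verification with the lmms-eval harness listed under Code. The model registry name (`llava_onevision`), checkpoint id, and task name (`vinoground`) are assumptions based on lmms-eval's lm-eval-harness-style CLI conventions; check the harness's task listing and `--help` output before running.

```python
# Hedged sketch of a reproduction run via the lmms-eval harness linked above.
# Registry name, checkpoint id, and task name are assumptions; verify against
# the installed lmms-eval version before relying on this invocation.
import subprocess

subprocess.run(
    [
        "python", "-m", "lmms_eval",
        "--model", "llava_onevision",  # assumed model registry name
        "--model_args", "pretrained=lmms-lab/llava-onevision-qwen2-7b-ov",  # assumed checkpoint id
        "--tasks", "vinoground",       # assumed task name
        "--batch_size", "1",
        "--log_samples",
        "--output_path", "./logs/",
    ],
    check=True,
)
```

If the run completes, the per-sample logs under `./logs/` can be compared against the claimed Text Score of 41.6 for the 7B model.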