LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li
Code
- github.com/evolvinglmms-lab/lmms-eval (official, cited in paper; PyTorch, ★ 3,924)
- github.com/MindSpore-scientific-2/code-14/tree/main/llava_next (MindSpore, ★ 0)
Abstract
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities and scenarios, yielding new emergent capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
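As a concrete illustration of the single-model claim, the sketch below runs an image query and a video query through one checkpoint. It is a minimal sketch, assuming the community Hugging Face port (the `llava-hf/llava-onevision-qwen2-7b-ov-hf` checkpoint id and transformers' `LlavaOnevisionForConditionalGeneration` / `AutoProcessor` classes), which is not part of this paper's own release; consult the model card before relying on it.

```python
# Minimal sketch: one LLaVA-OneVision checkpoint answering an image query and a
# video query. Checkpoint id and classes are assumptions based on the Hugging Face
# `transformers` port, not on this paper's original codebase.
import numpy as np
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed HF checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Image query: the chat template inserts the image placeholder tokens.
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is shown in this image?"},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
image = np.zeros((336, 336, 3), dtype=np.uint8)  # stand-in for a real image array
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0], skip_special_tokens=True))

# Video query with the same model: a clip is passed as a stack of frames,
# illustrating the image-to-video task transfer the abstract describes.
conversation = [
    {"role": "user", "content": [
        {"type": "video"},
        {"type": "text", "text": "Describe what happens in this clip."},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
video = np.zeros((8, 336, 336, 3), dtype=np.uint8)  # 8 stand-in frames
inputs = processor(videos=video, text=prompt, return_tensors="pt").to(model.device, torch.float16)
out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0], skip_special_tokens=True))
```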
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| Vinoground | LLaVA-OneVision-Qwen2-7B | Text Score | 41.6 | — | Unverified |
| Vinoground | LLaVA-OneVision-Qwen2-72B | Text Score | 48.4 | — | Unverified |
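The Vinoground rows above are claimed but not yet independently verified. Below is a hedged sketch of how one might attempt verification with the lmms-eval harness listed under Code. The model registry name (`llava_onevision`), checkpoint id, and task name (`vinoground`) are assumptions based on lmms-eval's lm-eval-harness-style CLI conventions; check the harness's task listing and `--help` output before running.

```python
# Hedged sketch of a reproduction run via the lmms-eval harness linked above.
# Registry name, checkpoint id, and task name are assumptions; verify against
# the installed lmms-eval version before relying on this invocation.
import subprocess

subprocess.run(
    [
        "python", "-m", "lmms_eval",
        "--model", "llava_onevision",  # assumed model registry name
        "--model_args", "pretrained=lmms-lab/llava-onevision-qwen2-7b-ov",  # assumed checkpoint id
        "--tasks", "vinoground",       # assumed task name
        "--batch_size", "1",
        "--log_samples",
        "--output_path", "./logs/",
    ],
    check=True,
)
```

If the run completes, the per-sample logs under `./logs/` can be compared against the claimed Text Score of 41.6 for the 7B model.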