Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee
Code
- github.com/huggingface/transformers (PyTorch, ★ 158,292)
- github.com/haotian-liu/LLaVA (PyTorch, ★ 24,603)
- github.com/LLaVA-VL/LLaVA-NeXT (PyTorch, ★ 4,609)
- github.com/skunkworksai/bakllava (PyTorch, ★ 719)
- github.com/sshh12/multi_token (PyTorch, ★ 190)
- github.com/x2fd/lvis-instruct4v (★ 134)
- github.com/linzhiqiu/clip-flant5 (PyTorch, ★ 30)
- github.com/albertotestoni/ndq_visual_objects (PyTorch, ★ 2)
- github.com/dinhvietcuong1996/icme25-inova (PyTorch, ★ 0)
Abstract
Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art results across 11 benchmarks. Our final 13B checkpoint uses merely ~1.2M publicly available training samples and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and models will be publicly available.
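The key architectural change the abstract mentions is swapping LLaVA's single linear projection for a two-layer MLP that maps CLIP visual features into the LLM's embedding space. A minimal NumPy sketch of that connector is below; the dimensions (1024-d CLIP-ViT-L features, 5120-d LLM hidden size for a 13B model, 576 patch tokens at 336px) and the GELU activation are illustrative assumptions, not a definitive reproduction of the paper's configuration.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_projector(visual_feats, w1, b1, w2, b2):
    """Two-layer MLP connector: project vision features into the LLM token space."""
    return gelu(visual_feats @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
d_vision, d_llm, n_tokens = 1024, 5120, 576  # illustrative sizes (CLIP-ViT-L/336, 13B LLM)

# Randomly initialized weights stand in for trained connector parameters.
w1 = rng.standard_normal((d_vision, d_llm)) * 0.02
b1 = np.zeros(d_llm)
w2 = rng.standard_normal((d_llm, d_llm)) * 0.02
b2 = np.zeros(d_llm)

feats = rng.standard_normal((n_tokens, d_vision))   # one image's patch features
tokens = mlp_projector(feats, w1, b1, w2, b2)       # pseudo "visual tokens" for the LLM
print(tokens.shape)  # (576, 5120)
```

The projected tokens would then be concatenated with the text embeddings before being fed to the language model; only the connector's shape is sketched here, not the training procedure.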
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| CHOCOLATE-FT | LLaVA-1.5-13B | Kendall's Tau-c | 0.21 | — | Unverified |
| CHOCOLATE-LLM | LLaVA-1.5-13B | Kendall's Tau-c | 0.06 | — | Unverified |
| CHOCOLATE-LVLM | LLaVA-1.5-13B | Kendall's Tau-c | 0.00 | — | Unverified |