Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee
Code
- github.com/huggingface/transformers (PyTorch, ★ 158,292)
- github.com/haotian-liu/LLaVA (PyTorch, ★ 24,603)
- github.com/LLaVA-VL/LLaVA-NeXT (PyTorch, ★ 4,609)
- github.com/skunkworksai/bakllava (PyTorch, ★ 719)
- github.com/sshh12/multi_token (PyTorch, ★ 190)
- github.com/x2fd/lvis-instruct4v (★ 134)
- github.com/linzhiqiu/clip-flant5 (PyTorch, ★ 30)
- github.com/albertotestoni/ndq_visual_objects (PyTorch, ★ 2)
- github.com/dinhvietcuong1996/icme25-inova (PyTorch, ★ 0)
Abstract
Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art results across 11 benchmarks. Our final 13B checkpoint uses merely ~1.2M publicly available training samples and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and models will be publicly available.
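The key architectural change the abstract mentions is swapping LLaVA's single linear projection for a two-layer MLP that maps CLIP visual features into the LLM's embedding space. A minimal NumPy sketch of that connector is below; the dimensions (1024-d CLIP-ViT-L features, 5120-d LLM hidden size for a 13B model, 576 patch tokens at 336px) and the GELU activation are illustrative assumptions, not a definitive reproduction of the paper's configuration.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_projector(visual_feats, w1, b1, w2, b2):
    """Two-layer MLP connector: project vision features into the LLM token space."""
    return gelu(visual_feats @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
d_vision, d_llm, n_tokens = 1024, 5120, 576  # illustrative sizes (CLIP-ViT-L/336, 13B LLM)

# Randomly initialized weights stand in for trained connector parameters.
w1 = rng.standard_normal((d_vision, d_llm)) * 0.02
b1 = np.zeros(d_llm)
w2 = rng.standard_normal((d_llm, d_llm)) * 0.02
b2 = np.zeros(d_llm)

feats = rng.standard_normal((n_tokens, d_vision))   # one image's patch features
tokens = mlp_projector(feats, w1, b1, w2, b2)       # pseudo "visual tokens" for the LLM
print(tokens.shape)  # (576, 5120)
```

The projected tokens would then be concatenated with the text embeddings before being fed to the language model; only the connector's shape is sketched here, not the training procedure.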
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| CHOCOLATE-FT | LLaVA-1.5-13B | Kendall's Tau-c | 0.21 | — | Unverified |
| CHOCOLATE-LLM | LLaVA-1.5-13B | Kendall's Tau-c | 0.06 | — | Unverified |
| CHOCOLATE-LVLM | LLaVA-1.5-13B | Kendall's Tau-c | 0.00 | — | Unverified |