Visual Instruction Tuning

2023-04-17 · NeurIPS 2023 · Code Available

Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

Code

github.com/haotian-liu/LLaVA
Official · pytorch · ★ 24,603
github.com/LLaVA-VL/LLaVA-NeXT
pytorch · ★ 4,609
github.com/computer-vision-in-the-wild/cvinw_readings
In paper · none · ★ 1,363
github.com/skunkworksai/bakllava
pytorch · ★ 719
github.com/tabtoyou/kollava
pytorch · ★ 296
github.com/camenduru/llava-colab
none · ★ 228
github.com/sshh12/multi_token
pytorch · ★ 190
github.com/sunsmarterjie/chatterbox
pytorch · ★ 61
github.com/ZhangYiqun018/StickerConv
pytorch · ★ 59
github.com/llava-annonymous/llava
In paper · pytorch · ★ 48

Abstract

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.
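
The abstract says only that LLaVA "connects a vision encoder and LLM". As a rough illustration of what such a connection can look like, the sketch below assumes a single linear projection that maps image features into the LLM's embedding space and places the resulting visual tokens ahead of the text tokens. The function names (`project`, `build_llm_input`), the dimensions, the token ordering, and the random values standing in for real encoder outputs are all hypothetical and are not taken from the official code base.

```python
import random

def project(features, weights):
    """Multiply an (n, d_in) list of rows by a (d_in, d_out) weight matrix."""
    return [
        [sum(x * w for x, w in zip(row, col)) for col in zip(*weights)]
        for row in features
    ]

def build_llm_input(image_features, text_embeddings, weights):
    """Project image features to the LLM width, then prepend them to the text.

    Assumption: the connector is one linear map and visual tokens come first.
    """
    visual_tokens = project(image_features, weights)
    return visual_tokens + text_embeddings

random.seed(0)
vision_dim, llm_dim = 8, 16        # hypothetical feature widths
num_patches, num_tokens = 4, 5     # hypothetical sequence lengths

# Random stand-ins for vision-encoder outputs and embedded instruction tokens.
image_features = [[random.gauss(0, 1) for _ in range(vision_dim)]
                  for _ in range(num_patches)]
text_embeddings = [[random.gauss(0, 1) for _ in range(llm_dim)]
                   for _ in range(num_tokens)]
# Connector weights; in a real model these would be learned during tuning.
weights = [[random.gauss(0, 0.02) for _ in range(llm_dim)]
           for _ in range(vision_dim)]

llm_input = build_llm_input(image_features, text_embeddings, weights)
print(len(llm_input), len(llm_input[0]))  # 9 16
```

The combined sequence has one row per image patch plus one per text token, all at the LLM's embedding width, so a language model could in principle consume it as a single input sequence.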

Tasks

1 Image, 2*2 Stitching
3D Question Answering (3D-QA)
Image Classification
Image Retrieval
Instruction Following
MMR total
Referring Expression Comprehension
Referring expression generation
Spatial Reasoning
Video Question Answering
visual instruction following
Visual Question Answering
Visual Reasoning

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
ColonINST-v1 (Seen)	LLaVA-v1 (w/ LoRA, w/ extra data)	Accuracy	89.61	—	Unverified
ColonINST-v1 (Seen)	LLaVA-v1 (w/ LoRA, w/o extra data)	Accuracy	87.86	—	Unverified
ColonINST-v1 (Unseen)	LLaVA-v1 (w/ LoRA, w/o extra data)	Accuracy	72.08	—	Unverified
ColonINST-v1 (Unseen)	LLaVA-v1 (w/ LoRA, w/o extra data)	Accuracy	68.11	—	Unverified
ColonINST-v1 (Unseen)	LLaVA-v1 (w/ LoRA, w/ extra data)	Accuracy	42.17	—	Unverified
