LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

2023-11-09 · Code Available

Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang, Jianfeng Gao, Chunyuan Li


Abstract

LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and can activate relevant tools based on the user's input to fulfill real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire the ability to use tools, covering visual understanding, generation, external knowledge retrieval, and their compositions. Empirical results show that LLaVA-Plus outperforms LLaVA on existing capabilities and exhibits new ones. It is distinct in that the image query is directly grounded and actively engaged throughout the entire human-AI interaction session, significantly improving tool-use performance and enabling new scenarios.
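The abstract describes a skill repository of pre-trained models from which the assistant activates a relevant tool based on the user's input. A minimal sketch of that dispatch pattern is shown below; the `Tool`/`SkillRepository` names, the registered tools, and the dispatch interface are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: a minimal "skill repository" dispatcher in the
# spirit of LLaVA-Plus. All names here are hypothetical placeholders; the
# real system routes tool calls emitted by a large multimodal model.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]  # stand-in for a vision/vision-language model


class SkillRepository:
    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def dispatch(self, tool_name: str, query: str) -> str:
        # In LLaVA-Plus the model itself decides which skill to invoke as
        # part of its instruction-following output; here we just look the
        # chosen tool up by name and run it on the query.
        if tool_name not in self._tools:
            raise KeyError(f"unknown tool: {tool_name}")
        return self._tools[tool_name].run(query)


repo = SkillRepository()
repo.register(Tool("detect", "object detection", lambda q: f"boxes for: {q}"))
repo.register(Tool("caption", "image captioning", lambda q: f"caption for: {q}"))

print(repo.dispatch("detect", "find all dogs"))
```

The key design point this illustrates is that tools are uniform, named entries behind one interface, so new skills (generation, retrieval, or compositions of tools) can be added without changing the dispatch logic.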

Benchmark Results

| Dataset     | Model            | Metric     | Claimed | Verified | Status     |
|-------------|------------------|------------|---------|----------|------------|
| Leaderboard | LLaVA-Plus (13B) | ELO Rating | 1,203   |          | Unverified |
