Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
Wanting Xu, Yang Liu, Langping He, Xucheng Huang, Ling Jiang
Code
- github.com/xiaoduoailab/xmodelvlm (official, in paper) — PyTorch, ★ 68
- github.com/xiaoduoailab/xmodellm — PyTorch, ★ 38
- github.com/MindCode-4/code-5/tree/main/xmod — MindSpore, ★ 0
- github.com/pwc-1/Paper-9/tree/main/5/xmod — MindSpore, ★ 0
Abstract
We introduce Xmodel-VLM, a cutting-edge multimodal vision language model designed for efficient deployment on consumer GPU servers. Our work directly addresses a pivotal industry issue: the prohibitive service costs that hinder the broad adoption of large-scale multimodal systems. Through rigorous training, we have developed a 1B-scale language model from the ground up, employing the LLaVA paradigm for modal alignment. The result, which we call Xmodel-VLM, is a lightweight yet powerful multimodal vision language model. Extensive testing across numerous classic multimodal benchmarks has revealed that, despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models. Our model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM.
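The LLaVA-style modal alignment mentioned in the abstract connects a frozen vision encoder to the language model through a small learned projector that maps image patch features into the LLM's token embedding space. The sketch below illustrates that alignment step in PyTorch; the dimensions, module names, and two-layer MLP design are illustrative assumptions, not the released Xmodel-VLM implementation.

```python
# Minimal sketch of LLaVA-style modal alignment: a small MLP projector maps
# vision-encoder patch features into the language model's embedding space,
# and the projected image tokens are prepended to the text token embeddings.
# All sizes below are hypothetical placeholders.
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Two-layer MLP mapping vision features to the LLM hidden size."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.mlp(patch_features)


if __name__ == "__main__":
    batch, num_patches, vision_dim, llm_dim = 2, 576, 1024, 2048
    projector = VisionProjector(vision_dim, llm_dim)

    # Stand-ins for frozen vision-encoder patch features and text embeddings.
    patch_features = torch.randn(batch, num_patches, vision_dim)
    text_embeds = torch.randn(batch, 32, llm_dim)

    image_tokens = projector(patch_features)            # (2, 576, 2048)
    llm_inputs = torch.cat([image_tokens, text_embeds], dim=1)
    print(llm_inputs.shape)                              # torch.Size([2, 608, 2048])
```

In the LLaVA recipe this projector is typically the only module trained during the alignment stage, with the vision encoder kept frozen, before subsequent instruction tuning of the language model.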
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| MM-Vet | Xmodel-VLM (Xmodel-LM 1.1B) | GPT-4 score | 21.8 | — | Unverified |