Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
Wanting Xu, Yang Liu, Langping He, Xucheng Huang, Ling Jiang
Code
- github.com/xiaoduoailab/xmodelvlm (official, in paper) — PyTorch, ★ 68
- github.com/xiaoduoailab/xmodellm — PyTorch, ★ 38
- github.com/MindCode-4/code-5/tree/main/xmod — MindSpore, ★ 0
- github.com/pwc-1/Paper-9/tree/main/5/xmod — MindSpore, ★ 0
Abstract
We introduce Xmodel-VLM, a cutting-edge multimodal vision language model designed for efficient deployment on consumer GPU servers. Our work directly addresses a pivotal industry issue: the prohibitive service costs that hinder the broad adoption of large-scale multimodal systems. Through rigorous training, we have developed a 1B-scale language model from the ground up, employing the LLaVA paradigm for modal alignment. The result, which we call Xmodel-VLM, is a lightweight yet powerful multimodal vision language model. Extensive testing across numerous classic multimodal benchmarks has revealed that, despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models. Our model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM.
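The LLaVA-style modal alignment mentioned in the abstract connects a frozen vision encoder to the language model through a small learned projector that maps image patch features into the LLM's token embedding space. The sketch below illustrates that alignment step in PyTorch; the dimensions, module names, and two-layer MLP design are illustrative assumptions, not the released Xmodel-VLM implementation.

```python
# Minimal sketch of LLaVA-style modal alignment: a small MLP projector maps
# vision-encoder patch features into the language model's embedding space,
# and the projected image tokens are prepended to the text token embeddings.
# All sizes below are hypothetical placeholders.
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Two-layer MLP mapping vision features to the LLM hidden size."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.mlp(patch_features)


if __name__ == "__main__":
    batch, num_patches, vision_dim, llm_dim = 2, 576, 1024, 2048
    projector = VisionProjector(vision_dim, llm_dim)

    # Stand-ins for frozen vision-encoder patch features and text embeddings.
    patch_features = torch.randn(batch, num_patches, vision_dim)
    text_embeds = torch.randn(batch, 32, llm_dim)

    image_tokens = projector(patch_features)            # (2, 576, 2048)
    llm_inputs = torch.cat([image_tokens, text_embeds], dim=1)
    print(llm_inputs.shape)                              # torch.Size([2, 608, 2048])
```

In the LLaVA recipe this projector is typically the only module trained during the alignment stage, with the vision encoder kept frozen, before subsequent instruction tuning of the language model.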
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| MM-Vet | Xmodel-VLM (Xmodel-LM 1.1B) | GPT-4 score | 21.8 | — | Unverified |