Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models

2024-03-10

Minjie Zhu, Yichen Zhu, Xin Liu, Ning Liu, Zhiyuan Xu, Chaomin Shen, Yaxin Peng, Zhicai Ou, Feifei Feng, Jian Tang

Code

github.com/zhuyiche/llava-phi
Official, in paper, PyTorch, ★ 401

Abstract

Multimodal Large Language Models (MLLMs) have showcased impressive skills in tasks related to visual understanding and reasoning. Yet, their widespread application faces obstacles due to the high computational demands during both the training and inference phases, restricting their use to a limited audience within the research and user communities. In this paper, we investigate the design aspects of Multimodal Small Language Models (MSLMs) and propose an efficient multimodal assistant named Mipha, which is designed to create synergy among various aspects: visual representation, language models, and optimization strategies. We show that without increasing the volume of training data, our Mipha-3B outperforms the state-of-the-art large MLLMs, especially LLaVA-1.5-13B, on multiple benchmarks. Through detailed discussion, we provide insights and guidelines for developing strong MSLMs that rival the capabilities of MLLMs. Our code is available at https://github.com/zhuyiche/llava-phi.

Tasks

Visual Question Answering

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
MM-Vet	Mipha-3B	GPT-4 score	32.1	—	Unverified
