TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning

2025-04-13

Xingjian Zhang, Siwei Wen, Wenjun Wu, Lei Huang

Code

github.com/zhangxj199/tinyllava-video-r1
Official, in paper, PyTorch, ★ 115

Abstract

Recently, improving the reasoning ability of large multimodal models (LMMs) through reinforcement learning has made great progress. However, most existing works are based on highly reasoning-intensive datasets such as mathematics and code, and researchers generally choose large-scale models as the foundation. We argue that exploring small-scale models' reasoning capabilities remains valuable for researchers with limited computational resources. Moreover, enabling models to explain their reasoning processes on general question-answering datasets is equally meaningful. Therefore, we present the small-scale video reasoning model TinyLLaVA-Video-R1. Based on TinyLLaVA-Video, a traceably trained video understanding model with no more than 4B parameters, it not only demonstrates significantly improved reasoning and thinking capabilities after using reinforcement learning on general Video-QA datasets, but also exhibits the emergent characteristic of "aha moments". Furthermore, we share a series of experimental findings, aiming to provide practical insights for future exploration of video reasoning (thinking) abilities in small-scale models. It is available at https://github.com/ZhangXJ199/TinyLLaVA-Video-R1.

Tasks

Question Answering, Reinforcement Learning, Video Understanding
