ViLA: Efficient Video-Language Alignment for Video Question Answering
Xijun Wang, Junbang Liang, Chun-Kai Wang, Kenan Deng, Yu Lou, Ming Lin, Shan Yang
Code
- github.com/xijun-cs/vila (official, in paper, PyTorch, ★ 13)
Abstract
In this work, we propose an efficient Video-Language Alignment (ViLA) network. Our ViLA model addresses both efficient frame sampling and effective cross-modal alignment in a unified way. In our ViLA network, we design a new learnable text-guided Frame-Prompter together with a new cross-modal distillation (QFormer-Distiller) module. Pre-trained large image-language models have shown promising results on problems such as visual question answering (VQA). However, how to efficiently and effectively sample video frames when adapting pre-trained large image-language models to video-language alignment remains a major challenge. Compared with prior work, our ViLA model demonstrates the capability of selecting key frames with critical contents, thus improving the video-language alignment accuracy while reducing the inference latency (+3.3% on NExT-QA Temporal with 3.0X speed up). Overall, our ViLA network outperforms the state-of-the-art methods on the video question-answering benchmarks: +4.6% on STAR Interaction and +2.2% on STAR average with 3.0X speed up, and our 2-frame model outperforms the 4-frame SeViLA on the VLEP dataset with 4.2X speed up. The code will be available at https://github.com/xijun-cs/ViLA.
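The idea of text-guided frame selection mentioned in the abstract can be illustrated with a toy sketch. This is not the paper's Frame-Prompter, which is learnable and whose details are not given here; the sketch swaps in a fixed cosine-similarity score followed by a hard top-k pick. The function names `cosine` and `select_frames` and the 2-dimensional toy features are hypothetical, chosen only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_frames(frame_feats, text_feat, k):
    """Return the indices of the k frames most similar to the text, in temporal order.

    Assumption: a fixed similarity score stands in for a learned, text-guided
    frame scorer; this is an illustrative simplification, not the ViLA module.
    """
    scores = [cosine(f, text_feat) for f in frame_feats]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # restore temporal order for the downstream model

# Toy features: four "frames" and one "question" embedding (made-up values).
frames = [[1.0, 0.0], [0.2, 1.0], [0.7, 0.7], [-1.0, 0.0]]
question = [1.0, 0.5]
selected = select_frames(frames, question, 2)
print(selected)
```

Passing fewer frames to the downstream image-language model is where an inference speed-up would come from in a design like this; how much accuracy is retained depends on the quality of the scorer, which the toy score above makes no attempt to model.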
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| NExT-QA | ViLA (3B) | Accuracy | 75.6 | — | Unverified |
| NExT-QA | ViLA (3B, 4 frames) | Accuracy | 74.4 | — | Unverified |
| NExT-QA (Efficient) | ViLA (3B, 4 frames) | 1:1 Accuracy | 74.4 | — | Unverified |
| STAR Benchmark | VLAP (4 frames) | Average Accuracy | 67.1 | — | Unverified |