
Pseudo-labeling with Keyword Refining for Few-Supervised Video Captioning

2024-11-06

Ping Li, Tao Wang, Xinkui Zhao, Xianghua Xu, Mingli Song


Abstract

Video captioning generates a sentence that describes the video content. Existing methods typically require a number of captions (e.g., 10 or 20) per video to train the model, which is quite costly. In this work, we explore the possibility of using only one or very few ground-truth sentences, and introduce a new task named few-supervised video captioning. Specifically, we propose a few-supervised video captioning framework that consists of a lexically constrained pseudo-labeling module and a keyword-refined captioning module. Unlike random sampling in natural language processing, which may cause invalid modifications (e.g., edit words), the former module guides the model to edit words using several actions (e.g., copy, replace, insert, and delete) via a pretrained token-level classifier, and then fine-tunes candidate sentences with a pretrained language model. Meanwhile, this module employs repetition-penalized sampling to encourage the model to yield concise pseudo-labeled sentences with less repetition, and selects the most relevant sentences using a pretrained video-text model. Moreover, to keep semantic consistency between pseudo-labeled sentences and video content, we develop a transformer-based keyword refiner with a video-keyword gated fusion strategy to place greater emphasis on relevant words. Extensive experiments on several benchmarks demonstrate the advantages of the proposed approach in both few-supervised and fully-supervised scenarios. The code implementation is available at https://github.com/mlvccn/PKG_VidCap
