Scaling Open-Vocabulary Action Detection
Zhen Hao Sia, Yogesh Singh Rawat
Code Available
- github.com/siatheindochinese/sia_act_placeholder (Official, ★ 2)
Abstract
In this work, we focus on scaling open-vocabulary action detection. Existing approaches to action detection are predominantly limited to closed-set scenarios and rely on complex, parameter-heavy architectures. Extending these models to the open-vocabulary setting poses two key challenges: (1) the lack of large-scale datasets with many action classes for robust training, and (2) the parameter-heavy adaptations needed to convert a pretrained vision-language contrastive model into a detector, which risk overfitting the additional non-pretrained parameters to the base action classes. First, we introduce an encoder-only multimodal model for video action detection, reducing reliance on parameter-heavy additions. Second, we introduce a simple weakly supervised training strategy that exploits an existing closed-set action detection dataset for pretraining. Finally, we depart from the ill-posed base-to-novel benchmark used by prior work in open-vocabulary action detection and devise a new benchmark that evaluates on existing closed-set action detection datasets without ever using them for training, reporting results that can serve as baselines for future work.
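The open-vocabulary step the abstract relies on — classifying detected actor regions against arbitrary class names via a pretrained vision-language contrastive model — can be sketched as cosine-similarity scoring in a shared embedding space. This is a minimal illustration, not the paper's architecture; `embed_text` is a hypothetical callable standing in for the contrastive model's text encoder.

```python
import numpy as np

def open_vocab_scores(region_feats, class_names, embed_text):
    """Score detected regions against an arbitrary action vocabulary.

    region_feats: (N, D) visual embeddings of detected actor regions.
    class_names:  list of C free-form action names.
    embed_text:   hypothetical callable mapping a list of strings to a
                  (C, D) array of text embeddings (stand-in for the
                  pretrained contrastive model's text encoder).
    Returns an (N, C) array of per-region class probabilities.
    """
    text_feats = embed_text(class_names)  # (C, D)
    # L2-normalise both sides so dot products become cosine similarities.
    v = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = v @ t.T  # (N, C) cosine similarities
    # Softmax over the (open) vocabulary gives per-region class scores.
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Because the vocabulary enters only through text embeddings, swapping in novel class names at test time requires no retraining — which is what makes avoiding extra non-pretrained parameters attractive.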
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| J-HMDB | SiA | Frame-mAP 0.5 | 88.5 | — | Unverified |
| MultiSports | SiA | Frame-mAP 0.5 | 28.8 | — | Unverified |
| UCF101-24 | SiA | Frame-mAP 0.5 | 88.5 | — | Unverified |
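The table's metric, Frame-mAP at IoU 0.5, averages per-class frame-level average precision: detections for a class are ranked by score across all frames and counted as true positives when they overlap an unmatched ground-truth box by at least 0.5 IoU. A minimal sketch of the per-class AP computation (not the official evaluation code for these datasets):

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def frame_ap(preds, gts, thr=0.5):
    """Frame-level AP for one action class at IoU threshold `thr`.

    preds: list of (frame_id, score, box) detections for the class.
    gts:   dict mapping frame_id -> list of ground-truth boxes.
    """
    matched = {f: [False] * len(b) for f, b in gts.items()}
    n_gt = sum(len(b) for b in gts.values())
    tp = []
    for f, _, box in sorted(preds, key=lambda p: -p[1]):
        # Greedily match against the best unmatched ground truth in the frame.
        best, best_j = 0.0, -1
        for j, g in enumerate(gts.get(f, [])):
            o = iou(box, g)
            if o > best and not matched[f][j]:
                best, best_j = o, j
        hit = best >= thr and best_j >= 0
        if hit:
            matched[f][best_j] = True
        tp.append(1.0 if hit else 0.0)
    if n_gt == 0 or not tp:
        return 0.0
    ctp = np.cumsum(tp)
    recall = ctp / n_gt
    precision = ctp / np.arange(1, len(tp) + 1)
    # VOC-style AP: take the precision envelope, sum over recall steps.
    for i in range(len(precision) - 1, 0, -1):
        precision[i - 1] = max(precision[i - 1], precision[i])
    prev_r, ap = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

Frame-mAP is then the mean of `frame_ap` over all action classes; published numbers use each dataset's official evaluation protocol, which may differ in matching details.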