Scaling Open-Vocabulary Action Detection
Zhen Hao Sia, Yogesh Singh Rawat
Code Available
- github.com/siatheindochinese/sia_act_placeholder (Official, ★ 2)
Abstract
In this work, we focus on scaling open-vocabulary action detection. Existing approaches to action detection are predominantly limited to closed-set scenarios and rely on complex, parameter-heavy architectures. Extending these models to the open-vocabulary setting poses two key challenges: (1) the lack of large-scale datasets with many action classes for robust training, and (2) the parameter-heavy adaptations needed to convert a pretrained vision-language contrastive model into a detector, which risk overfitting the additional non-pretrained parameters to the base action classes. First, we introduce an encoder-only multimodal model for video action detection, reducing reliance on parameter-heavy additions. Second, we introduce a simple weakly supervised training strategy that exploits an existing closed-set action detection dataset for pretraining. Finally, we depart from the ill-posed base-to-novel benchmark used by prior work in open-vocabulary action detection and devise a new benchmark that evaluates on existing closed-set action detection datasets without ever using them for training, reporting results that can serve as baselines for future work.
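The open-vocabulary step the abstract relies on — classifying detected actor regions against arbitrary class names via a pretrained vision-language contrastive model — can be sketched as cosine-similarity scoring in a shared embedding space. This is a minimal illustration, not the paper's architecture; `embed_text` is a hypothetical callable standing in for the contrastive model's text encoder.

```python
import numpy as np

def open_vocab_scores(region_feats, class_names, embed_text):
    """Score detected regions against an arbitrary action vocabulary.

    region_feats: (N, D) visual embeddings of detected actor regions.
    class_names:  list of C free-form action names.
    embed_text:   hypothetical callable mapping a list of strings to a
                  (C, D) array of text embeddings (stand-in for the
                  pretrained contrastive model's text encoder).
    Returns an (N, C) array of per-region class probabilities.
    """
    text_feats = embed_text(class_names)  # (C, D)
    # L2-normalise both sides so dot products become cosine similarities.
    v = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = v @ t.T  # (N, C) cosine similarities
    # Softmax over the (open) vocabulary gives per-region class scores.
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Because the vocabulary enters only through text embeddings, swapping in novel class names at test time requires no retraining — which is what makes avoiding extra non-pretrained parameters attractive.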
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| J-HMDB | SiA | Frame-mAP 0.5 | 88.5 | — | Unverified |
| MultiSports | SiA | Frame-mAP 0.5 | 28.8 | — | Unverified |
| UCF101-24 | SiA | Frame-mAP 0.5 | 88.5 | — | Unverified |
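The table's metric, Frame-mAP at IoU 0.5, averages per-class frame-level average precision: detections for a class are ranked by score across all frames and counted as true positives when they overlap an unmatched ground-truth box by at least 0.5 IoU. A minimal sketch of the per-class AP computation (not the official evaluation code for these datasets):

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def frame_ap(preds, gts, thr=0.5):
    """Frame-level AP for one action class at IoU threshold `thr`.

    preds: list of (frame_id, score, box) detections for the class.
    gts:   dict mapping frame_id -> list of ground-truth boxes.
    """
    matched = {f: [False] * len(b) for f, b in gts.items()}
    n_gt = sum(len(b) for b in gts.values())
    tp = []
    for f, _, box in sorted(preds, key=lambda p: -p[1]):
        # Greedily match against the best unmatched ground truth in the frame.
        best, best_j = 0.0, -1
        for j, g in enumerate(gts.get(f, [])):
            o = iou(box, g)
            if o > best and not matched[f][j]:
                best, best_j = o, j
        hit = best >= thr and best_j >= 0
        if hit:
            matched[f][best_j] = True
        tp.append(1.0 if hit else 0.0)
    if n_gt == 0 or not tp:
        return 0.0
    ctp = np.cumsum(tp)
    recall = ctp / n_gt
    precision = ctp / np.arange(1, len(tp) + 1)
    # VOC-style AP: take the precision envelope, sum over recall steps.
    for i in range(len(precision) - 1, 0, -1):
        precision[i - 1] = max(precision[i - 1], precision[i])
    prev_r, ap = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

Frame-mAP is then the mean of `frame_ap` over all action classes; published numbers use each dataset's official evaluation protocol, which may differ in matching details.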