Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting
Da Zhang, Bingyu Li, Feiyu Wang, Zhiyuan Zhao, Junyu Gao
Abstract
Zero-shot object counting (ZSOC) aims to enumerate objects of arbitrary categories specified by text descriptions without requiring visual exemplars. However, existing methods often treat counting as a coarse retrieval task, suffering from a lack of fine-grained quantity awareness. Furthermore, they frequently exhibit spatial insensitivity and degraded generalization due to feature space distortion during model adaptation. To address these challenges, we present QICA, a novel framework that synergizes quantity perception with robust spatial cost aggregation. Specifically, we introduce a Synergistic Prompting Strategy (SPS) that adapts vision and language encoders through numerically conditioned prompts, bridging the gap between semantic recognition and quantitative reasoning. To mitigate feature distortion, we propose a Cost Aggregation Decoder (CAD) that operates directly on vision-text similarity maps. By refining these maps through spatial aggregation, CAD prevents overfitting while preserving zero-shot transferability. Additionally, a multi-level quantity alignment loss (L_MQA) is employed to enforce numerical consistency across the entire pipeline. Extensive experiments on FSC-147 demonstrate competitive performance, while zero-shot evaluation on CARPK and ShanghaiTech-A validates superior generalization to unseen domains.
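To make the idea of operating on vision-text similarity maps concrete, the following is a minimal NumPy sketch, not the QICA implementation. It assumes patch embeddings of shape (H, W, D) and a single text embedding of shape (D,), builds a cosine-similarity map, and smooths it with a plain box filter as a stand-in for learned spatial aggregation. The function names `similarity_map` and `box_aggregate`, the shapes, and the box filter itself are all illustrative assumptions; the abstract does not specify how CAD aggregates.

```python
import numpy as np


def similarity_map(patch_feats, text_feat, eps=1e-8):
    """Cosine similarity between every patch embedding and one text embedding.

    patch_feats: array of shape (H, W, D), hypothetical vision-encoder output.
    text_feat:   array of shape (D,), hypothetical text-encoder output.
    Returns an (H, W) map with values in [-1, 1].
    """
    p = patch_feats / (np.linalg.norm(patch_feats, axis=-1, keepdims=True) + eps)
    t = text_feat / (np.linalg.norm(text_feat) + eps)
    return p @ t


def box_aggregate(cost, k=3):
    """Average each cell with its k x k neighbourhood (edge-padded).

    A fixed box filter used here only as a placeholder for a learned
    aggregation module; k is assumed to be odd.
    """
    pad = k // 2
    padded = np.pad(cost, pad, mode="edge")
    h, w = cost.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)
```

In this sketch the encoders are never touched after the similarity map is formed: everything downstream works on the (H, W) map alone, which is the property the abstract credits with limiting feature distortion.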