AttriVision: Advancing Generalization in Pedestrian Attribute Recognition using CLIP

2025-02-18 · WACV 2025

Mehran ADIBI SEDEH, Assia Benbihi, Romain MARTIN, Marianne Clausel, Cédric Pradalier


Abstract

Pedestrian Attribute Recognition (PAR) is a critical task in computer vision that identifies semantic attributes such as gender, age, clothing, and accessories from images of individuals. This task is essential in applications such as surveillance, smart city infrastructure, and security systems. Despite significant advances in deep learning, PAR remains challenging due to strong imbalances in the attribute classes and the need for robust generalization across different datasets and environments. In this work, we address these two limitations with AttriVision, a novel approach that builds on generic CLIP features to help PAR generalize better and introduces a new Focal CrossEntropy (FCE) loss function to handle the inherent class imbalance in PAR datasets. FCE improves the model’s robustness by giving more weight to difficult-to-classify samples. Our method also transfers to other attribute recognition tasks, such as vehicle attributes, without any architectural modifications, which makes AttriVision a versatile tool for attribute recognition. We validate our approach on the Unified Pedestrian Attribute Recognition (UPAR) dataset, which integrates data from several sources including PA100K, PETA, RAPv2, and Market1501. AttriVision achieves new state-of-the-art results on UPAR, with a mean accuracy of 89.4% and an F1 score of 91.9%. These results demonstrate the model’s effectiveness in handling real-world variability, including differences in image sensors, viewing conditions, and person densities, making it well suited to a wide range of real-world applications.
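The abstract does not give the FCE formula, so the sketch below is not the authors' loss. It only illustrates the general idea of giving more weight to difficult-to-classify samples, using the standard focal weighting of binary cross-entropy as a stand-in. The function name `focal_bce` and the parameters `gamma` and `eps` are assumptions made for this illustration.

```python
import math

def focal_bce(probs, targets, gamma=2.0, eps=1e-7):
    """Mean focal binary cross-entropy over per-attribute predictions.

    Illustrative stand-in only: this is the standard focal weighting,
    NOT the FCE loss defined in the paper.
    probs   -- predicted probabilities that each attribute is present
    targets -- ground-truth labels, 1 (present) or 0 (absent)
    gamma   -- focusing strength; gamma=0 gives plain cross-entropy
    """
    total = 0.0
    for p, y in zip(probs, targets):
        # Clamp to avoid log(0).
        p = min(max(p, eps), 1.0 - eps)
        # p_t is the probability assigned to the true label.
        p_t = p if y == 1 else 1.0 - p
        # Confident correct predictions (p_t near 1) are down-weighted,
        # so hard samples dominate the average.
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)
```

With `gamma=2.0`, a positive sample predicted at 0.9 keeps 1% of its cross-entropy term, while one predicted at 0.1 keeps 81%, so rare or hard attributes contribute relatively more to the loss.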

Tasks

Attribute, Pedestrian Attribute Recognition
