Beyond Accuracy: Metrics that Uncover What Makes a 'Good' Visual Descriptor
Ethan Lin, Linxi Zhao, Atharva Sehgal, Jennifer J. Sun
Code: github.com/ethan-y-lin/beyond_accuracy (official, PyTorch)
Abstract
Text-based visual descriptors, ranging from simple class names to more descriptive phrases, are widely used in visual concept discovery and image classification with vision-language models (VLMs). Their effectiveness, however, depends on a complex interplay of factors, including semantic clarity, presence in the VLM's pre-training data, and how well the descriptors serve as a meaningful representation space. In this work, we systematically analyze descriptor quality along two key dimensions: (1) representational capacity and (2) relationship with VLM pre-training data. We evaluate a spectrum of descriptor generation methods, from zero-shot LLM-generated prompts to iteratively refined descriptors. Motivated by ideas from representation alignment and language understanding, we introduce two alignment-based metrics, Global Alignment and CLIP Similarity, that move beyond accuracy. These metrics shed light on how different descriptor generation strategies interact with foundation model properties, offering new ways to study descriptor effectiveness.
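For readers unfamiliar with descriptor-based classification, the following is a minimal sketch of the general idea: each class is represented by several text descriptors, and an image is assigned to the class whose descriptors are most similar to it on average. It is an illustration under stated assumptions, not the paper's method and not a definition of Global Alignment or CLIP Similarity, which the abstract does not specify. The function names (`cosine`, `classify`) and the toy 3-dimensional vectors are hypothetical stand-ins for real VLM text and image embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def classify(image_emb, descriptor_embs_by_class):
    """Score each class by the mean cosine similarity between the image
    embedding and that class's descriptor embeddings.

    Returns the best-scoring class and the per-class scores.
    """
    scores = {
        cls: sum(cosine(image_emb, d) for d in embs) / len(embs)
        for cls, embs in descriptor_embs_by_class.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy embeddings; in practice these would come from a VLM's
# text encoder (descriptors) and image encoder (the query image).
descriptor_embs = {
    "sparrow": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "heron": [[0.0, 1.0, 0.2], [0.1, 0.9, 0.3]],
}
image_emb = [0.8, 0.3, 0.0]

predicted, scores = classify(image_emb, descriptor_embs)
print(predicted, scores)
```

Averaging over descriptors is only one possible aggregation; accuracy computed this way is the baseline evaluation that the abstract proposes to complement with alignment-based metrics.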