Taxonomy-Aware Evaluation of Vision-Language Models

2025-01-01 · CVPR 2025

Vésteinn Snæbjarnarson, Kevin Du, Niklas Stoehr, Serge Belongie, Ryan Cotterell, Nico Lang, Stella Frank


Abstract

When a vision-language model (VLM) is prompted to identify an entity depicted in an image, it may answer "I see a conifer," rather than the specific label "Norway spruce." This raises two issues for evaluation: First, the unconstrained generated text needs to be mapped to the evaluation label space (i.e., "conifer"). Second, a useful classification measure should give partial credit to less specific, but not incorrect, answers ("Norway spruce" being a type of "conifer"). To meet these requirements, we propose a framework for evaluating unconstrained text predictions, such as those generated by a vision-language model, against a taxonomy. Specifically, we propose the use of hierarchical precision and recall measures to assess the correctness and specificity of predictions with respect to a taxonomy. Experimentally, we first show that existing text similarity measures do not capture taxonomic similarity well. We then develop and compare different methods to map textual VLM predictions onto a taxonomy. This allows us to compute hierarchical similarity measures between the generated text and the ground-truth labels. Finally, we analyze modern VLMs on fine-grained visual classification tasks based on our proposed taxonomic evaluation scheme.
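The hierarchical precision and recall the abstract refers to are commonly defined via ancestor-set overlap: augment both the prediction and the ground-truth label with all their ancestors in the taxonomy, then compare the two sets. The sketch below illustrates this standard formulation on a toy taxonomy; the taxonomy, labels, and helper names are illustrative assumptions, not the paper's implementation.

```python
# Toy taxonomy as a child -> parent map (illustrative assumption,
# not the taxonomy used in the paper).
PARENT = {
    "Norway spruce": "spruce",
    "spruce": "conifer",
    "conifer": "tree",
    "oak": "tree",
    "tree": "plant",
}

def ancestors(node: str) -> set[str]:
    """Return the set containing `node` and all of its ancestors."""
    out = {node}
    while node in PARENT:
        node = PARENT[node]
        out.add(node)
    return out

def hierarchical_pr(pred: str, gold: str) -> tuple[float, float]:
    """Ancestor-set hierarchical precision and recall.

    Precision: fraction of the prediction's ancestor set that is correct.
    Recall: fraction of the gold ancestor set that the prediction covers.
    """
    p, g = ancestors(pred), ancestors(gold)
    overlap = len(p & g)
    return overlap / len(p), overlap / len(g)

# Predicting "conifer" for a "Norway spruce" image: nothing in the
# prediction's ancestor set is wrong (precision 1.0), but the answer
# is less specific than the gold label (recall < 1.0).
hp, hr = hierarchical_pr("conifer", "Norway spruce")
```

Under this definition, a correct-but-generic answer like "conifer" earns perfect hierarchical precision but reduced recall, which is exactly the "partial credit" behavior the abstract calls for.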
