Training on Plausible Counterfactuals Removes Spurious Correlations

2025-05-22

Shpresim Sadiku, Kartikeya Chitranshi, Hiroshi Kera, Sebastian Pokutta

Abstract

Plausible counterfactual explanations (p-CFEs) are perturbations that minimally modify inputs to change classifier decisions while remaining plausible under the data distribution. In this study, we demonstrate that classifiers can be trained on p-CFEs labeled with the induced (incorrect) target classes and still classify unperturbed inputs with their original labels. While previous studies have shown that such learning is possible with adversarial perturbations, we extend this paradigm to p-CFEs. Interestingly, our experiments reveal that learning from p-CFEs is even more effective: the resulting classifiers not only achieve high in-distribution accuracy but also exhibit significantly reduced bias with respect to spurious correlations.
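The training setup described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a 2D two-class Gaussian dataset, a logistic-regression classifier, and a simple plausibility proxy that pulls each counterfactual toward the mean of its target class. The helper names (`train_logreg`, `make_counterfactuals`) and the weights `lam` and `mu` are hypothetical, and no particular accuracy is implied.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.1, steps=500):
    """Fit logistic-regression parameters (w, b) by plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        grad = sigmoid(X @ w + b) - y          # d(cross-entropy)/d(logit)
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def make_counterfactuals(X, y, w, b, class_means,
                         lam=0.1, mu=0.1, lr=0.1, steps=200):
    """Perturb each input toward the opposite class (binary labels only).

    Assumed objective per sample (illustrative, not from the paper):
      CE(f(x'), target) + lam * ||x' - x||^2 + mu * ||x' - mean_target||^2
    The last term is a crude stand-in for plausibility under the data.
    """
    target = 1 - y
    X_cf = X.copy()
    for _ in range(steps):
        p = sigmoid(X_cf @ w + b)
        g_cls = (p - target)[:, None] * w[None, :]    # push decision to target
        g_prox = 2.0 * lam * (X_cf - X)                # stay close to the input
        g_plaus = 2.0 * mu * (X_cf - class_means[target])  # stay near target data
        X_cf -= lr * (g_cls + g_prox + g_plaus)
    return X_cf, target

# Toy data: two Gaussian blobs with labels 0 and 1.
rng = np.random.default_rng(0)
n = 200
X = np.vstack([rng.normal(-1.0, 1.0, (n, 2)), rng.normal(1.0, 1.0, (n, 2))])
y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

# 1) Train a base classifier on the clean data.
w, b = train_logreg(X, y)

# 2) Generate counterfactuals and label them with the induced target class.
means = np.stack([X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)])
X_cf, y_cf = make_counterfactuals(X, y, w, b, means)

# 3) Train a fresh classifier only on (counterfactual, target-label) pairs.
w2, b2 = train_logreg(X_cf, y_cf)

# 4) Evaluate that classifier on the unperturbed inputs and original labels.
pred = (sigmoid(X @ w2 + b2) > 0.5).astype(int)
acc = float((pred == y).mean())
print(f"accuracy on unperturbed inputs: {acc:.3f}")
```

The sketch only shows the data flow: the second classifier never sees an original input or an original label during training, and is scored on both afterwards. Whether it recovers the original labels depends on the counterfactual generator and the data, which is what the paper investigates.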

Tasks

counterfactual
