
[Re] Explaining Groups of Points in Low-Dimensional Representations

2021-01-31 · RC · Code Available

Rajeev Verma, Jim Wagemans, Paras Dahal, Auke Elfrink

Abstract

Scope of Reproducibility. This report covers our reproduction of the paper 'Explaining Groups of Points in Low-Dimensional Representations' [1] by Plumb et al. The paper proposes a method, Transitive Global Translations (TGT), for explaining the differences between clusters in low-dimensional representations of high-dimensional data. The authors claim that their method outperforms the Difference Between the Means (DBM) method, is consistent in explaining differences with few features, and matches real patterns in data. We verify these claims by reproducing their experiments and testing their method on new data. We also investigate the use of more complex transformations to explain differences between clusters.

Methodology. We reproduce the original experiments using the authors' source code. We also replicate their findings by re-implementing the authors' method in PyTorch [2] and evaluating it on two of the datasets used in the paper and two new ones. Furthermore, we compare TGT with our own extension of TGT, which uses a larger class of transformations.

Results. Using the authors' code, we obtained results that mostly match those reported. TGT generally outperforms DBM, especially when explanations use few features. TGT is consistent across sparsity levels in the features to which it attributes cluster differences. TGT matches real patterns in data. Extending the types of functions used for explanations did not improve performance significantly, suggesting that translations make for adequate explanations. However, the scaling extension shows promise in recovering the original signal on the modified synthetic data.

What was easy. The easiest part was running the existing code with the pre-trained model files. The original authors had set up their code base in an organized manner with clear instructions.

What was difficult. The first difficulty we encountered was finding the right environment, because the source code depends on deprecated functionality. The clustering method the authors used had to be re-implemented before we could use it in our replication. Another difficulty was the selection of clusters. The authors did not provide a consistent method for selecting clusters in a latent space representation. When retraining the provided models, we obtained a latent space representation different from the one in the original experiments, so the clusters had to be selected manually. The metrics used to evaluate the explanations also depend on the clustering. This means that there is some variability in the exact verification of reproducibility.

Communication with original authors. We asked the original authors for clarification on how to choose the ϵ hyper-parameter. However, it became apparent that we had misread the paper, and the procedure is indeed adequately reported there.
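As an illustration of the DBM baseline named in the abstract, the following is a minimal sketch, not the authors' or the reproducers' code. It computes the difference between two cluster means in the original feature space and, as an assumption about how sparsity is imposed, keeps only the k entries with the largest magnitude. The function name `dbm_explanation`, the parameter `k`, and the synthetic data are all hypothetical.

```python
import numpy as np


def dbm_explanation(source, target, k=None):
    """Translation from the mean of `source` to the mean of `target`.

    Both inputs are arrays of shape (n_points, n_features).
    If `k` is given, only the k largest-magnitude entries are kept
    (an assumed way of making the explanation sparse).
    """
    delta = target.mean(axis=0) - source.mean(axis=0)
    if k is not None and k < delta.size:
        # Indices of the k features with the largest absolute difference.
        keep = np.argsort(np.abs(delta))[::-1][:k]
        sparse = np.zeros_like(delta)
        sparse[keep] = delta[keep]
        delta = sparse
    return delta


# Synthetic example: two clusters that differ mainly in feature 2.
rng = np.random.default_rng(0)
source = rng.normal(0.0, 0.1, size=(50, 5))
target = rng.normal(0.0, 0.1, size=(60, 5))
target[:, 2] += 3.0

delta = dbm_explanation(source, target, k=1)
print(delta)
```

With `k=1`, the sketch attributes the whole difference to a single feature, which mirrors the few-feature setting in which the abstract reports that TGT compares favorably against DBM.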
