Synthetic sampling from small datasets: A modified mega-trend diffusion approach using k-nearest neighbors

2021-11-14 · Knowledge-Based Systems 2021 · Code Available

Jayanth Sivakumar, Karthik Ramamurthy, Menaka Radhakrishnan, Daehan Won

Abstract

Data generation techniques have been an emerging trend in machine learning over the last decade. Despite the wide availability of data, small datasets remain a problem for decision-making. Synthetic data generation is a promising remedy for the small dataset problem. In addition, previous methodologies address data generation for only one kind of task, either supervised or unsupervised. This research proposes a modified Mega-Trend Diffusion (MTD) approach, k-Nearest Neighbor Mega-Trend Diffusion (kNNMTD), to address these challenges. The method identifies the closest subsamples using the k-Nearest Neighbors (kNN) algorithm and applies MTD to the subsample neighbors to estimate the domain ranges. The proposed methodology can generate data for any data-driven task. kNNMTD is compared with baseline MTD, CTGAN, and the synthetic minority oversampling technique (SMOTE) for classification tasks, and with SMOTE for regression (SmoteR) for regression tasks. The method is validated on benchmark datasets, simulated datasets, and a case study. Pairwise correlation difference (PCD) is used to measure the similarity between real and synthetic datasets. kNNMTD outperforms baseline MTD and CTGAN on all the datasets, with statistically significant differences. On some of the benchmark datasets, kNNMTD also achieves low average PCD values and statistically significant differences against SMOTE and SmoteR. In the case study, kNNMTD generates data with the lowest PCD values among the compared methods for both classification (1.2077) and ordinal regression (1.6017) tasks.
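The two technical ideas in the abstract, estimating a domain range from a sample's nearest neighbors and scoring synthetic data with PCD, can be illustrated with a minimal sketch. This is not the authors' implementation. The range formula follows the commonly cited MTD formulation and should be checked against the paper. PCD is assumed here to be the Frobenius norm of the difference between the Pearson correlation matrices of the real and synthetic data. The function names (`mtd_bounds`, `knn_mtd_range`, `pcd`) are hypothetical, and the neighbor search is reduced to a single numeric feature for brevity.

```python
import math


def mtd_bounds(values):
    """Estimate a diffused range [a, b] around a small set of numbers (MTD-style)."""
    n = len(values)
    lo, hi = min(values), max(values)
    center = (lo + hi) / 2.0
    n_low = sum(1 for v in values if v < center)
    n_up = sum(1 for v in values if v > center)
    if n < 2 or n_low == 0 or n_up == 0:
        return lo, hi  # degenerate case: nothing to diffuse
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    skew_low = n_low / (n_low + n_up)
    skew_up = n_up / (n_low + n_up)
    spread = -2.0 * var * math.log(1e-20)  # assumed diffusion constant
    a = center - skew_low * math.sqrt(spread / n_low)
    b = center + skew_up * math.sqrt(spread / n_up)
    # The estimated range is never narrower than the observed one.
    return min(a, lo), max(b, hi)


def knn_mtd_range(column, index, k):
    """Apply mtd_bounds to the k values closest to column[index] (1-D simplification)."""
    target = column[index]
    neighbors = sorted(column, key=lambda v: abs(v - target))[:k]
    return mtd_bounds(neighbors)


def pearson_matrix(rows):
    """Pearson correlation matrix of row-major data; columns must not be constant."""
    cols = list(zip(*rows))
    n = len(rows)
    devs = [[v - sum(c) / n for v in c] for c in cols]
    norms = [math.sqrt(sum(d * d for d in ds)) for ds in devs]
    return [
        [sum(x * y for x, y in zip(di, dj)) / (ni * nj) for dj, nj in zip(devs, norms)]
        for di, ni in zip(devs, norms)
    ]


def pcd(real, synthetic):
    """Frobenius norm of the difference between the two correlation matrices."""
    cr, cs = pearson_matrix(real), pearson_matrix(synthetic)
    return math.sqrt(sum((x - y) ** 2 for rr, rs in zip(cr, cs) for x, y in zip(rr, rs)))
```

Under this definition, a lower PCD means the synthetic data preserves the pairwise correlations of the real data more closely, and comparing a dataset with itself gives a PCD of zero.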
