A Theoretical Analysis of Soft-Label vs Hard-Label Training in Neural Networks
Saptarshi Mandal, Xiaojun Lin, R. Srikant
Abstract
Knowledge distillation, where a small student model learns from a pre-trained large teacher model, has achieved substantial empirical success since the seminal work of Hinton et al. (2015). Despite prior theoretical studies exploring the benefits of knowledge distillation, an important question remains unanswered: why does soft-label training from the teacher require significantly fewer neurons than directly training a small neural network with hard labels? To address this, we first present motivating experimental results using simple neural network models on a binary classification problem. These results demonstrate that soft-label training consistently outperforms hard-label training in accuracy, with the performance gap becoming more pronounced as the dataset becomes increasingly difficult to classify. We then substantiate these observations with a theoretical contribution based on two-layer neural network models. Specifically, we show that soft-label training using gradient descent requires only O(1/γ²) neurons to achieve a classification loss, averaged over epochs, smaller than some ε > 0, where γ is the separation margin of the limiting kernel. In contrast, hard-label training requires O((1/γ⁴) ln(1/ε)) neurons, as derived from an adapted version of the gradient descent analysis in Ji and Telgarsky (2020). This implies that when γ is small, i.e., when the dataset is challenging to classify, the neuron requirement for soft-label training can be significantly lower than that for hard-label training. Finally, we present experimental results on deep neural networks, further validating these theoretical findings.
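The motivating experiment described above, comparing soft-label and hard-label training of a small two-layer network on a binary classification problem, can be sketched as follows. This is a minimal illustration, not the paper's actual setup: the data, teacher, network width, and learning rate are all illustrative assumptions. The "teacher" soft labels are taken to be sigmoid-scaled true scores, which retain margin/confidence information that the thresholded hard labels discard, and the output layer is held fixed while only the hidden layer is trained, as in typical two-layer NTK-style analyses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D binary classification data (illustrative, not the paper's dataset).
n = 200
X = rng.normal(size=(n, 2))
score = X[:, 0] + 0.5 * X[:, 1]                 # true linear score
y_hard = (score > 0).astype(float)              # hard labels in {0, 1}
# Hypothetical teacher soft labels: sigmoid of the scaled score, which
# preserves how far each point sits from the decision boundary.
y_soft = 1.0 / (1.0 + np.exp(-2.0 * score))

def train_two_layer(X, y, m=16, lr=0.2, epochs=1000):
    """Full-batch GD on squared loss for a two-layer ReLU net.

    Only the hidden weights W are trained; the output weights `a` are
    fixed at random +/- 1/sqrt(m), mirroring common two-layer analyses.
    """
    d = X.shape[1]
    W = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))
    a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)
    for _ in range(epochs):
        H = np.maximum(X @ W.T, 0.0)            # (n, m) ReLU activations
        pred = H @ a                            # (n,) network outputs
        grad_pred = (pred - y) / len(y)         # d(mean sq. loss)/d(pred)
        # Backpropagate to the hidden layer: dpred_i/dW_j = a_j * 1[.] * x_i
        G = (grad_pred[:, None] * (H > 0) * a[None, :]).T @ X
        W -= lr * G
    return np.maximum(X @ W.T, 0.0) @ a

pred_soft = train_two_layer(X, y_soft)
pred_hard = train_two_layer(X, y_hard)
acc_soft = np.mean((pred_soft > 0.5) == y_hard)
acc_hard = np.mean((pred_hard > 0.5) == y_hard)
print(f"soft-label acc: {acc_soft:.2f}, hard-label acc: {acc_hard:.2f}")
```

On this easy, nearly separable toy problem both variants fit well; the paper's point is that as the margin γ shrinks (harder datasets), the width needed by hard-label training grows much faster than for soft-label training.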