Loss Functions and Operators Generated by f-Divergences
Vincent Roulet, Tianlin Liu, Nino Vieillard, Michael E. Sander, Mathieu Blondel
Abstract
The logistic loss (a.k.a. cross-entropy loss) is one of the most popular loss functions used for multiclass classification. It is also the loss function of choice for next-token prediction in language modeling. It is associated with the Kullback–Leibler (KL) divergence and the softargmax operator. In this work, we propose to construct new convex loss functions based on f-divergences. Our loss functions generalize the logistic loss in two directions: i) by replacing the KL divergence with f-divergences and ii) by allowing non-uniform reference measures. We instantiate our framework for numerous f-divergences, recovering existing losses and creating new ones. By analogy with the logistic loss, the loss function generated by an f-divergence is associated with an operator, which we dub the f-softargmax. We derive a novel parallelizable bisection algorithm for computing the f-softargmax associated with any f-divergence. On the empirical side, one of the goals of this paper is to determine the effectiveness of loss functions beyond the classical cross-entropy in a language model setting, including on pre-training, post-training (SFT), and distillation. We show that the loss function generated by the α-divergence (which is equivalent to Tsallis α-negentropy in the case of unit reference measures) with α = 1.5 performs well across several tasks.
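To make the f-softargmax and its bisection concrete, here is a minimal sketch based only on the description above, assuming the operator solves max over the simplex of ⟨p, s⟩ − Σᵢ qᵢ f(pᵢ/qᵢ) for scores s and reference measure q. Under that assumption, stationarity gives pᵢ = qᵢ · max(0, (f′)⁻¹(sᵢ − τ)), and the dual variable τ is found by bisecting on the normalization constraint Σᵢ pᵢ = 1. The function name `f_softargmax`, the bracketing bounds, and the iteration count are illustrative choices, not the paper's implementation.

```python
import numpy as np

def f_softargmax(scores, q, f_prime_inv, lo=-50.0, hi=50.0, n_iters=60):
    """Hypothetical sketch of an f-softargmax computed by bisection.

    Solves  max_{p in simplex}  <p, s> - sum_i q_i f(p_i / q_i),
    whose stationarity condition is  p_i = q_i * max(0, (f')^{-1}(s_i - tau)),
    with the dual variable tau chosen so that sum_i p_i = 1.
    """
    s = np.asarray(scores, dtype=float)
    q = np.asarray(q, dtype=float)

    def primal(tau):
        # Candidate solution for a given dual variable; clipping enforces p >= 0.
        return q * np.maximum(0.0, f_prime_inv(s - tau))

    # sum_i p_i(tau) is non-increasing in tau, so plain bisection applies.
    for _ in range(n_iters):
        mid = 0.5 * (lo + hi)
        if primal(mid).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    return primal(0.5 * (lo + hi))

# KL generator f(t) = t log t - t + 1  =>  (f')^{-1} = exp: recovers softargmax.
s = np.array([1.0, 2.0, 3.0])
q = np.ones(3) / 3
print(f_softargmax(s, q, np.exp))  # ~ softmax(s) for uniform q

# alpha-divergence with alpha = 1.5  =>  (f')^{-1}(u) = max(0, 1 + u/2)^2.
print(f_softargmax(s, q, lambda u: np.maximum(0.0, 1.0 + 0.5 * u) ** 2))
```

Two sanity checks: with the KL generator, the sketch reduces to the usual softargmax (τ converges to the log-partition function), and with the α = 1.5 inverse map the clipping at zero can zero out low-scoring coordinates, yielding sparse outputs. Each bisection step is a single vectorized pass over the coordinates and can be batched across examples, which is one plausible reading of the "parallelizable" claim in the abstract.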