Sign Operator for Coping with Heavy-Tailed Noise in Non-Convex Optimization: High Probability Bounds Under (L_0, L_1)-Smoothness
Nikita Kornilov, Philip Zmushko, Andrei Semenov, Mark Ikonnikov, Alexander Gasnikov, Alexander Beznosikov
Abstract
In recent years, non-convex optimization problems have increasingly been described by the generalized $(L_0, L_1)$-smoothness assumption rather than the standard one. Meanwhile, the severely corrupted data used in these problems has increased the demand for methods capable of handling heavy-tailed noises, i.e., noises with a bounded $\kappa$-th moment. Motivated by these real-world trends and challenges, we explore sign-based methods in this setup and demonstrate their effectiveness in comparison with other popular solutions such as clipping and normalization. In theory, we prove the first known high-probability convergence bounds under $(L_0, L_1)$-smoothness and heavy-tailed noises with mild parameter dependencies. In the case of standard smoothness, these bounds are novel for sign-based methods as well. In particular, SignSGD with batching achieves sample complexity $O\left(\left(\Delta L_0 d \varepsilon^{-2} + \Delta L_1 d \varepsilon^{-3/2}\right)\left[1 + \left(\sigma/\varepsilon\right)^{\frac{\kappa}{\kappa-1}}\right]\right)$, $\kappa \in (1,2]$, where $d$ is the dimension, $\Delta$ is the initial function gap, and $\sigma$ bounds the noise moment. Under the assumption of symmetric noises, SignSGD with Majority Voting can robustly work on the whole range $\kappa \in (0,2]$ with complexity $O\left(\left(\Delta L_0 d \varepsilon^{-2} + \Delta L_1 d \varepsilon^{-3/2}\right)\left[\frac{1}{\kappa^2} + \frac{\sigma^2}{\varepsilon^2}\right]\right)$. We also obtain results for parameter-agnostic setups, Polyak-Łojasiewicz functions, and momentum-based methods (in expectation). Our theoretical findings are supported by the superior performance of sign-based methods in training Large Language Models compared to clipping and normalization.
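For intuition about the two estimators named in the abstract, the sketch below contrasts SignSGD with batching (step along the sign of an averaged minibatch gradient) against SignSGD with coordinate-wise Majority Voting (step along the sign of summed per-sample signs) on a toy quadratic with symmetric heavy-tailed noise. This is a minimal illustrative sketch, not code from the paper: the step size, batch sizes, horizon, and test problem are our assumptions.

```python
# Illustrative sketch only: toy SignSGD with batching and a majority-vote
# variant on a synthetic problem with heavy-tailed noise. All parameter
# choices here are assumptions for demonstration, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
d = 10  # dimension

def grad(x):
    # True gradient of f(x) = 0.5 * ||x||^2.
    return x

def stoch_grad(x):
    # Gradient corrupted by symmetric Student-t noise (df = 1.5), whose
    # kappa-th moment is bounded only for kappa < 1.5 (heavy-tailed regime).
    return grad(x) + rng.standard_t(df=1.5, size=d)

def signsgd_batching(x, gamma=1e-2, B=32, T=1000):
    # Average B stochastic gradients, then move along the sign of the mean.
    for _ in range(T):
        g = np.mean([stoch_grad(x) for _ in range(B)], axis=0)
        x = x - gamma * np.sign(g)
    return x

def signsgd_majority_vote(x, gamma=1e-2, M=32, T=1000):
    # Take the sign of each of M stochastic gradients and follow the
    # coordinate-wise majority vote, i.e., the sign of the summed signs.
    for _ in range(T):
        votes = np.sum([np.sign(stoch_grad(x)) for _ in range(M)], axis=0)
        x = x - gamma * np.sign(votes)
    return x

x0 = rng.normal(size=d)
print("batching, final grad norm:", np.linalg.norm(grad(signsgd_batching(x0))))
print("majority vote, final grad norm:", np.linalg.norm(grad(signsgd_majority_vote(x0))))
```

The design difference matches the abstract's split: averaging before taking the sign (batching) requires a bounded $\kappa$-th moment with $\kappa > 1$ for the mean to concentrate, whereas voting on per-sample signs only needs the noise to be symmetric, which is why the Majority Voting variant can tolerate the whole range $\kappa \in (0,2]$.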