Regularized Gradient Clipping Provably Trains Wide and Deep Neural Networks

Matteo Tucat, Anirbit Mukherjee*, Omar Rivasplata

*Corresponding author for this work

Research output: Preprint / Working paper › Preprint


Abstract

We instantiate a novel regularized form of the gradient clipping algorithm and prove that it converges to global minima of the loss surface of deep neural networks, provided the layers are sufficiently wide. The algorithm presented here constitutes a first-of-its-kind example of a step-size schedule for gradient descent that provably minimizes the training loss of deep neural nets.

We also present empirical evidence that our theoretically founded regularized gradient clipping algorithm is competitive with state-of-the-art deep learning heuristics. The modification we make to standard gradient clipping is designed to leverage the PL* condition, a variant of the Polyak-Łojasiewicz inequality that was recently shown to hold for neural networks of any depth within a neighbourhood of the initialisation.
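As an illustration only (not taken from the paper itself), the sketch below shows one plausible way to regularize the standard gradient clipping step size: the usual clipped factor min(1, gamma / ||grad||) is lower-bounded by a constant delta > 0 so the step size never collapses to zero on large gradients. The function name, the exact rule, and all parameter values are assumptions for illustration; see the paper for the actual algorithm and its analysis.

```python
import numpy as np

def regularized_gclip_step_size(grad_norm, eta=0.1, gamma=1.0, delta=0.05):
    """Illustrative step size for a regularized form of gradient clipping.

    Standard gradient clipping uses eta * min(1, gamma / ||grad||); the
    regularization assumed here lower-bounds that factor by delta, keeping
    the step size bounded away from zero even for very large gradients.
    """
    return eta * max(min(1.0, gamma / max(grad_norm, 1e-12)), delta)

# Toy usage: gradient descent on the quadratic loss L(w) = 0.5 * ||w||^2.
w = np.array([5.0, -3.0])
for t in range(100):
    grad = w                                        # gradient of 0.5 * ||w||^2
    step = regularized_gclip_step_size(np.linalg.norm(grad))
    w = w - step * grad
print("final iterate:", w)
```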
Original language: English
Number of pages: 17
Publication status: In preparation - 12 Apr 2024

Keywords

  • optimization algorithms
  • deep learning
  • stochastic optimization
