
Non-convergence of stochastic gradient descent in the training of deep neural networks. (English) Zbl 1494.65044

Summary: Deep neural networks have been trained successfully with stochastic gradient descent in a wide range of application areas. However, there is no rigorous mathematical explanation of why this works so well. The training of neural networks with stochastic gradient descent involves four discretization parameters: (i) the network architecture; (ii) the amount of training data; (iii) the number of gradient steps; and (iv) the number of randomly initialized gradient trajectories. While it can be shown that the approximation error converges to zero if all four parameters are sent to infinity in the right order, we demonstrate in this paper that stochastic gradient descent fails to converge for ReLU networks if their depth is much larger than their width and the number of random initializations does not increase to infinity fast enough.
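The following is a minimal PyTorch sketch, not taken from the paper, of the training scheme described in the summary: several independently initialized deep-and-narrow ReLU networks are trained with stochastic gradient descent and the best empirical risk over the restarts is reported. All concrete choices (the helper make_net, depth 50, width 2, the target function, learning rate, and step counts) are illustrative assumptions.

```python
# Illustrative sketch (assumptions, not the paper's experiment): SGD training of
# deep, narrow ReLU networks with several random initializations.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Simple 1D regression data; the target function is an arbitrary illustrative choice.
x = torch.rand(256, 1)
y = x ** 2

def make_net(depth=50, width=2):
    # Depth much larger than width, mirroring the regime studied in the paper.
    layers = [nn.Linear(1, width), nn.ReLU()]
    for _ in range(depth - 2):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers += [nn.Linear(width, 1)]
    return nn.Sequential(*layers)

best_loss = float("inf")
for restart in range(5):           # (iv) number of random initializations
    net = make_net()
    opt = torch.optim.SGD(net.parameters(), lr=1e-2)
    for step in range(200):        # (iii) number of gradient steps
        opt.zero_grad()
        loss = ((net(x) - y) ** 2).mean()   # empirical risk on the training sample
        loss.backward()
        opt.step()
    best_loss = min(best_loss, loss.item())
    print(f"restart {restart}: final loss {loss.item():.4f}")

print(f"best loss over restarts: {best_loss:.4f}")
# In this depth >> width regime many restarts collapse to a (nearly) constant
# output, so the best loss over a fixed number of restarts typically stays
# bounded away from zero, illustrating the non-convergence phenomenon.
```

The sketch only illustrates the role of the discretization parameters; the paper's result concerns the limit in which the number of random initializations does not grow fast enough relative to the architecture.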

MSC:

65K10 Numerical optimization and variational techniques
68T07 Artificial neural networks and deep learning
