A proof of convergence for gradient descent in the training of artificial neural networks for constant target functions. (English) Zbl 1502.65037

Summary: Gradient descent (GD) optimization algorithms are the standard tools used to train artificial neural networks (ANNs). However, even for the most basic variant of GD optimization algorithms, the plain-vanilla GD method, it remains an open problem to this day to prove or disprove the conjecture that GD converges in the training of ANNs. In this article we solve this problem in the special situation where the target function under consideration is a constant function. More specifically, in the case of constant target functions we prove that, in the training of rectified fully-connected feedforward ANNs with one hidden layer, the risk of the GD method does indeed converge to zero. A key contribution of this work is to explicitly specify a Lyapunov function for the gradient flow system of the ANN parameters. This Lyapunov function is the central tool in our convergence proof.
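As an illustrative aside, not taken from the paper, the following minimal sketch shows the setting the summary describes: plain-vanilla (full-batch) gradient descent applied to a one-hidden-layer ReLU network trained toward a constant target function, with the empirical squared risk observed to decay. The network width, learning rate, target value, input grid, and random initialization are all illustrative assumptions, and the empirical L2 risk on a grid stands in for the paper's risk functional.

```python
# Illustrative sketch (assumptions, not the paper's construction): plain-vanilla
# gradient descent for a one-hidden-layer ReLU network with a constant target.
import numpy as np

rng = np.random.default_rng(0)

width = 32                      # number of hidden ReLU neurons (assumed)
target = 1.0                    # constant target function f(x) = 1 (assumed)
x = np.linspace(0.0, 1.0, 200)  # grid approximating the input distribution

# Parameters: inner weights w, inner biases b, outer weights v.
w = rng.normal(size=width)
b = rng.normal(size=width)
v = rng.normal(size=width) / np.sqrt(width)

def risk(w, b, v):
    """Empirical L2 risk of the ReLU network against the constant target."""
    hidden = np.maximum(w[None, :] * x[:, None] + b[None, :], 0.0)  # ReLU layer
    pred = hidden @ v
    return np.mean((pred - target) ** 2)

lr = 0.05  # learning rate (assumed)
for step in range(5001):
    hidden_pre = w[None, :] * x[:, None] + b[None, :]
    active = (hidden_pre > 0.0).astype(float)        # ReLU on/off pattern
    hidden = np.maximum(hidden_pre, 0.0)
    residual = hidden @ v - target                   # prediction error at each x
    # Gradients of the empirical risk with respect to v, w, b.
    grad_v = 2.0 * np.mean(residual[:, None] * hidden, axis=0)
    grad_w = 2.0 * np.mean(residual[:, None] * active * v[None, :] * x[:, None], axis=0)
    grad_b = 2.0 * np.mean(residual[:, None] * active * v[None, :], axis=0)
    w -= lr * grad_w
    b -= lr * grad_b
    v -= lr * grad_v
    if step % 1000 == 0:
        print(f"step {step:5d}  risk {risk(w, b, v):.3e}")
```

In this sketch the risk is typically driven toward zero, which is the behaviour the paper establishes rigorously (via a Lyapunov function for the associated gradient flow) in the constant-target case; the sketch itself proves nothing.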

MSC:

65K10 Numerical optimization and variational techniques
41A30 Approximation by other special function classes
68T07 Artificial neural networks and deep learning
