
Stochastic generalized gradient methods for training nonconvex nonsmooth neural networks. (English) Zbl 1494.65047

Cybern. Syst. Anal. 57, No. 5, 714-729 (2021) and Kibern. Sist. Anal. 57, No. 5, 54-71 (2021).
Summary: The paper observes a similarity between the stochastic optimal control of discrete dynamical systems and the training of multilayer neural networks. It focuses on contemporary deep networks with nonconvex nonsmooth loss and activation functions. The machine learning problems are treated as nonconvex nonsmooth stochastic optimization problems. As a model of nonsmooth nonconvex dependences, the so-called generalized-differentiable functions are used. The backpropagation method for calculating stochastic generalized gradients of the learning quality functional for such systems is substantiated on the basis of the Hamilton-Pontryagin formalism. Stochastic generalized gradient learning algorithms are extended to the training of nonconvex nonsmooth neural networks. The performance of a stochastic generalized gradient algorithm is illustrated on a linear multiclass classification problem.
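The paper's own algorithm is not reproduced in this review; as a rough illustration of the class of methods it studies, the following is a minimal sketch of stochastic (generalized) subgradient descent with Robbins-Monro diminishing step sizes, applied to a nonsmooth linear multiclass classifier (here a multiclass hinge loss). All function names and parameters are our own assumptions, not the paper's.

```python
import numpy as np

def multiclass_hinge_subgradient(W, x, y):
    """A subgradient of the nonsmooth multiclass hinge loss
    max_j (1[j != y] + w_j.x - w_y.x) at a single sample (x, y).
    Generic illustration only, not the paper's specific construction."""
    scores = W @ x
    margins = scores - scores[y] + 1.0
    margins[y] = 0.0                 # no margin penalty for the true class
    j = int(np.argmax(margins))
    G = np.zeros_like(W)
    if margins[j] > 0.0:             # loss is positive: subgradient is nonzero
        G[j] += x
        G[y] -= x
    return G

def sgd_train(X, y, n_classes, epochs=50, step0=0.5, seed=0):
    """Stochastic subgradient descent with diminishing steps
    a_k = step0 / sqrt(k + 1), in the spirit of Robbins-Monro."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((n_classes, d))
    k = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            g = multiclass_hinge_subgradient(W, X[i], y[i])
            W -= step0 / np.sqrt(k + 1) * g
            k += 1
    return W
```

The loss here is convex but nonsmooth; for the nonconvex nonsmooth networks treated in the paper, the same iteration is used with stochastic generalized gradients computed by backpropagation.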

MSC:

65K05 Numerical mathematical programming methods
68T07 Artificial neural networks and deep learning
90C15 Stochastic programming
90C26 Nonconvex programming, global optimization

References:

[1] Rumelhart, DE; Hinton, GE; Williams, RJ, Learning representations by back-propagating errors, Nature, 323, 533-536 (1986) · Zbl 1369.68284 · doi:10.1038/323533a0
[2] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Muller, “Efficient BackProp,” in: G. Montavon, G. B. Orr, and K.-R. Muller (eds.), “Neural Networks: Tricks of the Trade,” LNCS, Vol. 7700, Springer-Verlag, Berlin-Heidelberg (2012), pp. 9-48.
[3] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, Vol. 521, 436-444 (2015). doi:10.1038/nature14539.
[4] P. Flach, Machine Learning. The Art and Science of Algorithms that Make Sense of Data, Cambridge University Press (2012). · Zbl 1267.68010
[5] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press (2016). · Zbl 1373.68009
[6] Nikolenko, S.; Kadurin, A.; Arhangelskaia, E., Deep Learning [in Russian] (2018), St. Petersburg: Piter, St. Petersburg
[7] K. V. Vorontsov, Machine Learning: A Year Course, URL: http://www.machinelearning.ru/wiki/ (Last accessed: 07.11.2019).
[8] Schmidhuber, J., Deep learning in neural networks: An overview, Neural Networks, 61, 85-117 (2015) · doi:10.1016/j.neunet.2014.09.003
[9] Bottou, L.; Curtis, F. E.; Nocedal, J., Optimization methods for large-scale machine learning, SIAM Rev., 60, 2, 223-311 (2018) · Zbl 1397.65085 · doi:10.1137/16M1080173
[10] C. Zhang, Q. Liao, A. Rakhlin, B. Miranda, N. Golowich, and T. Poggio, “Theory of deep learning IIb: Optimization properties of SGD,” CBMM Memo, No. 072, Center for Brains, Minds, and Machines, McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA (2018). arXiv:1801.02254v1[cs.LG] 7 Jan 2018.
[11] D. Newton, F. Yousefian, and R. Pasupathy, “Stochastic gradient descent: Recent trends,” INFORMS TutORials in Operations Research (2018), pp. 193-220. doi:10.1287/educ.2018.0191.
[12] Robbins, H.; Monro, S., A stochastic approximation method, The Annals of Mathematical Statistics, 22, 3, 400-407 (1951) · Zbl 0054.05901 · doi:10.1214/aoms/1177729586
[13] Ermoliev, YM, Methods of Stochastic Programming [in Russian] (1976), Moscow: Nauka, Moscow
[14] Nemirovski, A.; Juditsky, A.; Lan, G.; Shapiro, A., Robust stochastic approximation approach to stochastic programming, SIAM J. on Optimization, 19, 4, 1574-1609 (2009) · Zbl 1189.90109 · doi:10.1137/070704277
[15] Shapiro, A.; Dentcheva, D.; Ruszczyński, A., Lectures on Stochastic Programming: Modeling and Theory (2009), Philadelphia: SIAM, Philadelphia · Zbl 1183.90005 · doi:10.1137/1.9780898718751
[16] D. Davis, D. Drusvyatskiy, S. Kakade, and J. D. Lee, “Stochastic subgradient method converges on tame functions,” Found. Comput. Math., Vol. 20, Iss. 1, 119-154 (2020). doi:10.1007/s10208-018-09409-5. · Zbl 1433.65141
[17] R. Zhu, D. Niu, and Z. Li, “Asynchronous stochastic proximal methods for nonconvex nonsmooth optimization,” arXiv:1802.08880v3 [cs.LG] 15 Sep 2018.
[18] Z. Li and J. Li, “A simple proximal stochastic gradient method for nonsmooth nonconvex optimization,” in: S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems (2018), pp. 5564-5574.
[19] S. Majewski, B. Miasojedow, and E. Moulines, “Analysis of nonsmooth stochastic approximation: The differential inclusion approach,” arXiv:1805.01916v1 [math.OC], May 4 (2018).
[20] V. Kungurtsev, M. Egan, B. Chatterjee, and D. Alistarh, “Asynchronous stochastic subgradient methods for general nonsmooth nonconvex optimization,” arXiv:1905.11845, May 28 (2019).
[21] D. Davis and D. Drusvyatskiy, “Stochastic model-based minimization of weakly convex functions,” SIAM J. Optim., Vol. 29, Iss.1, 207-239 (2019). doi:10.1137/18M1178244. · Zbl 1415.65136
[22] F. H. Clarke, “Optimization and nonsmooth analysis,” in: Classics in Applied Mathematics, Vol. 5, Society for Industrial and Applied Mathematics (SIAM), Philadelphia (1990). · Zbl 0696.49002
[23] Norkin, VI, Generalized-differentiable functions, Cybernetics, 16, 1, 10-12 (1980) · doi:10.1007/BF01099354
[24] Mikhalevich, VS; Gupal, AM; Norkin, VI, Methods of Nonconvex Optimization [in Russian] (1987), Moscow: Nauka, Moscow · Zbl 0635.90054
[25] Nurminskii, EA, Minimization of nondifferentiable functions in the presence of noise, Cybernetics, 10, 4, 619-621 (1974) · doi:10.1007/BF01071541
[26] Nurminski, EA, Numerical Methods for Solving Stochastic Minimax Problems [in Russian] (1979), Kyiv: Naukova Dumka, Kyiv · Zbl 0452.65041
[27] Ermoliev, YM; Norkin, VI, Stochastic generalized gradient method for solving nonconvex nonsmooth stochastic optimization problems, Cybern. Syst. Analysis, 34, 2, 196-215 (1998) · Zbl 0930.90074 · doi:10.1007/BF02742069
[28] V. I. Norkin, Stochastic Methods for Solving Nonconvex Stochastic Optimization Problems and their Applications [in Ukrainian], Extended Abstract, PhD Thesis, Kyiv (1998). http://library.nuft.edu.ua/ebook/file/01.05.01
[29] Ermoliev, YM; Norkin, VI, Solution of nonconvex nonsmooth stochastic optimization problems, Cybern. Syst. Analysis, 39, 5, 701-715 (2003) · Zbl 1066.90071 · doi:10.1023/B:CASA.0000012091.84864.65
[30] J. Burke, A. Lewis, and M. Overton, “A robust gradient sampling algorithm for nonsmooth nonconvex optimization,” SIAM J. Opt., Vol. 15, Iss. 3, 751-779 (2005). · Zbl 1078.65048
[31] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall (2006). · Zbl 0934.68076
[32] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the best multi-stage architecture for object recognition?” in: Proc. IEEE 12th Intern. Conf. on Computer Vision (ICCV), Sept. 29-Oct. 2, 2009, Kyoto, Japan, Kyoto (2009), pp. 2146-2153. doi:10.1109/iccv.2009.5459469.
[33] Norkin, VI, Nonlocal minimization algorithms of nondifferentiable functions, Cybernetics, 14, 5, 704-707 (1978) · Zbl 0433.65030 · doi:10.1007/BF01069307
[34] Mifflin, R., An algorithm for constrained optimization with semi-smooth functions, Math. Oper. Res., 2, 2, 191-207 (1977) · Zbl 0395.90069 · doi:10.1287/moor.2.2.191
[35] J. Bolte, A. Daniilidis, and A. Lewis, “Tame functions are semismooth,” Math. Program., Vol. 117, Iss. 1-2, 5-19 (2009). doi:10.1007/s10107-007-0166-9. · Zbl 1158.49030
[36] Norkin, VI, Stochastic generalized-differentiable functions in the problem of nonconvex nonsmooth stochastic optimization, Cybernetics, 22, 6, 804-809 (1986) · Zbl 0693.90074 · doi:10.1007/BF01068698
[37] Norkin, VI, Generalized gradients in problems of dynamic optimization, optimal control, and machine learning, Cybern. Syst. Analysis, 56, 2, 243-258 (2020) · Zbl 1464.49016 · doi:10.1007/s10559-020-00240-x
[38] Pontryagin, LS; Boltyanskii, VG; Gamkrelidze, RV; Mishchenko, EF, The Mathematical Theory of Optimal Processes (1962), New York-London: Interscience Publishers, New York-London · Zbl 0102.32001
[39] Boltyanskii, VG, Optimal Control of Discrete Systems [in Russian] (1973), Moscow: Nauka, Moscow
[40] Yu. M. Ermoliev and V. I. Norkin, “Stochastic generalized gradient method with application to insurance risk management,” Interim Report IR-97-021, Int. Inst. for Appl. Syst. Anal., Laxenburg, Austria (1997). URL: http://pure.iiasa.ac.at/id/eprint/5270/1/IR-97-021.pdf.
[41] Gel’fand, IM; Tzeitlin, ML, The principle of the nonlocal search in automatic optimization systems, Dokl. Akad. Nauk SSSR, 137, 2, 295-298 (1961)
[42] Nesterov, Y., A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Math. Dokl., 27, 2, 372-376 (1983) · Zbl 0535.90071
[43] Urjas’ev, SP, Step control for direct stochastic-programming methods, Cybernetics, 16, 6, 886-890 (1980) · Zbl 0475.90066 · doi:10.1007/BF01069063
[44] S. P. Urjas’ev, “Stochastic quasi-gradient algorithms with adaptively controlled parameters,” Working Paper WP-86-32, Int. Inst. for Appl. Syst. Anal., Laxenburg, Austria (1986). URL: http://pure.iiasa.ac.at/id/eprint/2827/1/WP-86-032.pdf.
[45] S. P. Urjas’ev, Adaptive Algorithms of Stochastic Optimization and Game Theory [in Russian], Nauka, Moscow (1990). · Zbl 0709.90073
[46] Gupal, AM, Stochastic Methods for Solution of Nonsmooth Extremum Problems [in Russian] (1979), Kyiv: Naukova Dumka, Kyiv
[47] O. N. Granichin and B. T. Polyak, Randomized Algorithms of Optimization and Estimation under Almost Arbitrary Noise [in Russian], Nauka, Moscow (2003).
[48] Vapnik, VN, Statistical Learning Theory (1998), New York: Wiley & Sons, New York · Zbl 0935.62007
[49] M. Shlezinger and V. Glavach, Ten Lectures on Statistical and Structural Pattern Recognition [in Russian], Naukova Dumka, Kyiv (2004).
[50] G.-X. Yuan, C.-H. Ho, and C.-J. Lin, “Recent advances of large-scale linear classification,” Proc. IEEE, Vol. 100, Iss. 9, 2584-2603 (2012). doi:10.1109/JPROC.2012.2188013.
[51] Laptin, YP; Zhuravlev, YI; Vinogradov, AP, Empirical risk minimization and problems of constructing linear classifiers, Cybern. Syst. Analysis, 47, 4, 640-648 (2011) · Zbl 1306.90102 · doi:10.1007/s10559-011-9344-0
[52] Zhuravlev, YI; Laptin, YP; Vinogradov, AP; Zhurbenko, NG; Lykhovyd, OP; Berezovskyi, OA, Linear classifiers and selection of informative features, Pattern Recognition and Image Analysis, 27, 3, 426-432 (2017) · doi:10.1134/S1054661817030336
[53] Zhuravlev, YI; Laptin, YP; Vinogradov, AP, A comparison of some approaches to classification problems, and possibilities to construct optimal solutions efficiently, Pattern Recognition and Image Analysis, 24, 2, 189-195 (2014) · doi:10.1134/S1054661814020175