
Hyperfast second-order local solvers for efficient statistically preconditioned distributed optimization. (English) Zbl 1530.90061

Summary: Statistical preconditioning enables fast methods for distributed large-scale empirical risk minimization problems. In this approach, multiple worker nodes compute gradients in parallel, which are then used by the central node to update the parameters by solving an auxiliary (preconditioned) smaller-scale optimization problem. The recently proposed Statistically Preconditioned Accelerated Gradient (SPAG) method [H. Hendrikx et al., “Statistically preconditioned accelerated gradient method for distributed optimization”, Proc. Mach. Learn. Res. (PMLR) 119, 4203–4227 (2020)] has complexity bounds superior to those of other such algorithms but requires the exact solution of a computationally intensive auxiliary optimization problem at every iteration. In this paper, we propose an Inexact SPAG (InSPAG) method and explicitly characterize the accuracy to which the corresponding auxiliary subproblem needs to be solved in order to guarantee the same convergence rate as the exact method. We build our results by first developing an inexact adaptive accelerated Bregman proximal gradient method for general optimization problems under relative smoothness and strong convexity assumptions, which may be of independent interest. Moreover, we explore the properties of the auxiliary problem in the InSPAG algorithm under the assumptions of Lipschitz third-order derivatives and strong convexity. For this problem class, we develop a linearly convergent Hyperfast second-order method and estimate the total complexity of the InSPAG method with the Hyperfast auxiliary problem solver. Finally, we illustrate the proposed method’s practical efficiency by performing large-scale numerical experiments on logistic regression models. To the best of our knowledge, these are the first empirical results on implementing high-order methods on large-scale problems: we work with data where the dimension is of the order of 3 million and the number of samples is 700 million.
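To make the algorithmic idea concrete, the following minimal Python/NumPy sketch shows one statistically preconditioned, inexactly solved step for l2-regularized logistic regression: workers return gradients of their local losses, and the center approximately solves a Bregman proximal subproblem whose reference function is built from its own data shard. This is not the authors' implementation; all names (logistic_grad, bregman_prox_step), the step sizes (eta, sigma, lr), and the inner plain-gradient loop (standing in for the Hyperfast second-order solver, with acceleration and adaptivity omitted) are illustrative assumptions.

import numpy as np

def logistic_grad(w, X, y, lam):
    # Gradient of (1/n) * sum_i log(1 + exp(-y_i * x_i^T w)) + (lam/2) * ||w||^2.
    z = y * (X @ w)
    s = -y / (1.0 + np.exp(z))
    return X.T @ s / X.shape[0] + lam * w

def aggregated_grad(w, shards, lam):
    # Workers compute gradients on their shards (in parallel in practice);
    # the central node averages them.
    return np.mean([logistic_grad(w, X, y, lam) for X, y in shards], axis=0)

def bregman_prox_step(w, g, X0, y0, lam, sigma, eta=1.0, inner_iters=50, lr=0.3):
    # Approximately solve the preconditioned auxiliary problem
    #     min_v  <g, v> + (1/eta) * D_phi(v, w),
    # where phi(v) = loss on the center's shard (X0, y0) + (sigma/2) * ||v||^2
    # and D_phi is its Bregman divergence.  The plain gradient loop below is a
    # stand-in for the Hyperfast second-order solver; its accuracy is governed
    # by inner_iters, which is exactly the inexactness InSPAG must tolerate.
    grad_phi_w = logistic_grad(w, X0, y0, lam) + sigma * w
    v = w.copy()
    for _ in range(inner_iters):
        grad_phi_v = logistic_grad(v, X0, y0, lam) + sigma * v
        v -= lr * (g + (grad_phi_v - grad_phi_w) / eta)
    return v

# Toy usage: four workers; the center reuses the first shard as preconditioner.
rng = np.random.default_rng(0)
d, n_workers, n_local, lam = 20, 4, 200, 1e-3
w_true = rng.normal(size=d)
shards = []
for _ in range(n_workers):
    X = rng.normal(size=(n_local, d))
    y = np.sign(X @ w_true + 0.1 * rng.normal(size=n_local))
    shards.append((X, y))
X0, y0 = shards[0]
w = np.zeros(d)
for k in range(50):
    g = aggregated_grad(w, shards, lam)          # one communication round
    w = bregman_prox_step(w, g, X0, y0, lam, sigma=0.1)

The point of the sketch is the split of work: each outer iteration costs one round of gradient communication, while the auxiliary subproblem is solved locally and only approximately, which is where the paper's accuracy characterization and Hyperfast inner solver enter.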

MSC:

90C15 Stochastic programming
90C30 Nonlinear programming
65K05 Numerical mathematical programming methods
90C06 Large-scale problems in mathematical programming

References:

[1] Hendrikx, H.; Xiao, L.; Bubeck, S.; Bach, F.; Massoulie, L., Statistically preconditioned accelerated gradient method for distributed optimization, (Proceedings of the 37th International Conference on Machine Learning. Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, PMLR, vol. 119 (2020)), 4203-4227
[2] Shamir, O.; Srebro, N.; Zhang, T., Communication-efficient distributed optimization using an approximate Newton-type method, (Proceedings of the 31st International Conference on Machine Learning. Proceedings of the 31st International Conference on Machine Learning, Proceedings of Machine Learning Research, PMLR, Beijing, China, vol. 32 (2014)), 1000-1008
[3] Yuan, X.-T.; Li, P., On convergence of distributed approximate Newton methods: globalization, sharper bounds and beyond, J. Mach. Learn. Res., 21, 206, 1-51 (2020) · Zbl 1529.68273
[4] Wang, S.; Roosta, F.; Xu, P.; Mahoney, M. W., GIANT: globally improved approximate Newton method for distributed optimization, (Advances in Neural Information Processing Systems (2018)), 2332-2342
[5] Hendrikx, H.; Bach, F.; Massoulié, L., An optimal algorithm for decentralized finite-sum optimization, SIAM J. Optim., 31, 4, 2753-2783 (2021) · Zbl 1532.90073
[6] Yang, T., Trading computation for communication: distributed stochastic dual coordinate ascent, (Advances in Neural Information Processing Systems (2013)), 629-637
[7] Li, M.; Andersen, D. G.; Park, J. W.; Smola, A. J.; Ahmed, A.; Josifovski, V.; Long, J.; Shekita, E. J.; Su, B.-Y., Scaling distributed machine learning with the parameter server, (11th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 14) (2014)), 583-598
[8] Dean, J.; Ghemawat, S., MapReduce: simplified data processing on large clusters, Commun. ACM, 51, 1, 107-113 (2008)
[9] Lan, G.; Lee, S.; Zhou, Y., Communication-efficient algorithms for decentralized and stochastic optimization, Math. Program., 1-48 (2018)
[10] Nesterov, Y., Lectures on Convex Optimization, vol. 137 (2018), Springer · Zbl 1427.90003
[11] Reddi, S. J.; Konečnỳ, J.; Richtárik, P.; Póczós, B.; Smola, A., AIDE: fast and communication efficient distributed optimization, arXiv preprint
[12] Zhang, Y.; Lin, X., DiSCO: distributed optimization for self-concordant empirical loss, (Proceedings of the 32nd International Conference on Machine Learning. Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, PMLR, Lille, France, vol. 37 (2015)), 362-370
[13] Lin, H.; Mairal, J.; Harchaoui, Z., A universal catalyst for first-order optimization, (Proceedings of the 28th International Conference on Neural Information Processing Systems, vol. 2. Proceedings of the 28th International Conference on Neural Information Processing Systems, vol. 2, NIPS’15 (2015), MIT Press: MIT Press Cambridge, MA, USA), 3384-3392
[14] Dragomir, R.-A.; Taylor, A.; d’Aspremont, A.; Bolte, J., Optimal complexity and certification of Bregman first-order methods, Math. Program., 194, 1, 41-83 (2022) · Zbl 1494.90076
[15] Arjevani, Y.; Shamir, O., Communication complexity of distributed convex learning and optimization, (Advances in Neural Information Processing Systems, vol. 28 (2015), Curran Associates, Inc.), 1756-1764
[16] Sun, Y.; Scutari, G.; Daneshmand, A., Distributed optimization based on gradient tracking revisited: enhancing convergence rate via surrogation, SIAM J. Optim., 32, 2, 354-385 (2022) · Zbl 07510407
[17] Bullins, B., Highly smooth minimization of non-smooth problems, (Proceedings of Thirty Third Conference on Learning Theory. Proceedings of Thirty Third Conference on Learning Theory, Proceedings of Machine Learning Research, PMLR, vol. 125 (2020)), 988-1030
[18] Birgin, E. G.; Gardenghi, J. L.; Martínez, J. M.; Santos, S. A.; Toint, P. L., Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models, Math. Program., 163, 1, 359-368 (2017) · Zbl 1365.90236
[19] Carmon, Y.; Duchi, J. C.; Hinder, O.; Sidford, A., Lower bounds for finding stationary points I, Math. Program., 184, 1, 71-120 (2020) · Zbl 1451.90128
[20] Cartis, C.; Gould, N. I.; Toint, P. L., Universal regularization methods: varying the power, the smoothness and the accuracy, SIAM J. Optim., 29, 1, 595-615 (2019) · Zbl 1436.90136
[21] Baes, M., Estimate Sequence Methods: Extensions and Approximations (2009), Institute for Operations Research, ETH: Institute for Operations Research, ETH Zürich, Switzerland
[22] Nesterov, Y., Implementable tensor methods in unconstrained convex optimization, Math. Program., 1-27 (2019)
[23] Gasnikov, A.; Dvurechensky, P.; Gorbunov, E.; Vorontsova, E.; Selikhanovych, D.; Uribe, C. A.; Jiang, B.; Wang, H.; Zhang, S.; Bubeck, S.; Jiang, Q.; Lee, Y. T.; Li, Y.; Sidford, A., Near optimal methods for minimizing convex functions with Lipschitz p-th derivatives, (Proceedings of the Thirty-Second Conference on Learning Theory. Proceedings of the Thirty-Second Conference on Learning Theory, Proceedings of Machine Learning Research, PMLR, Phoenix, USA, vol. 99 (2019)), 1392-1393
[24] Nesterov, Y., Superfast second-order methods for unconstrained convex optimization, J. Optim. Theory Appl., 191, 1, 1-30 (2021) · Zbl 1480.90195
[25] Nesterov, Y., Inexact high-order proximal-point methods with auxiliary search procedure, SIAM J. Optim., 31, 4, 2807-2828 (2021) · Zbl 1481.90255
[26] Kamzolov, D.; Gasnikov, A., Near-optimal hyperfast second-order method for convex optimization and its sliding, arXiv preprint
[27] Cormen, T. H.; Leiserson, C. E.; Rivest, R. L.; Stein, C., Introduction to Algorithms (2009), MIT Press · Zbl 1187.68679
[28] Huang, J.; Smith, T. M.; Henry, G. M.; van de Geijn, R. A., Strassen’s algorithm reloaded, (SC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (2016), IEEE), 690-701
[29] Bauschke, H. H.; Bolte, J.; Teboulle, M., A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications, Math. Oper. Res., 42, 2, 330-348 (2016) · Zbl 1364.90251
[30] Lu, H.; Freund, R. M.; Nesterov, Y., Relatively smooth convex optimization by first-order methods, and applications, SIAM J. Optim., 28, 1, 333-354 (2018) · Zbl 1392.90090
[31] Stonyakin, F.; Tyurin, A.; Gasnikov, A.; Dvurechensky, P.; Agafonov, A.; Dvinskikh, D.; Alkousa, M.; Pasechnyuk, D.; Artamonov, S.; Piskunova, V., Inexact model: a framework for optimization and variational inequalities, Optim. Methods Softw., 36, 6, 1155-1201 (2021) · Zbl 1489.65089
[32] Ben-Tal, A.; Nemirovski, A., Lectures on Modern Convex Optimization (Lecture Notes), Personal web-page of A. Nemirovski (2020)
[33] Devolder, O.; Glineur, F.; Nesterov, Y., First-order methods of smooth convex optimization with inexact oracle, Math. Program., 146, 1, 37-75 (2014) · Zbl 1317.90196
[34] Dvurechensky, P.; Gasnikov, A., Stochastic intermediate gradient method for convex problems with stochastic inexact oracle, J. Optim. Theory Appl., 171, 1, 121-145 (2016) · Zbl 1351.90150
[35] Beck, A.; Teboulle, M., A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2, 1, 183-202 (2009) · Zbl 1175.94009
[36] Nesterov, Y., Gradient methods for minimizing composite functions, Math. Program., 140, 1, 125-161 (2013), first appeared in 2007 as CORE discussion paper 2007/76 · Zbl 1287.90067
[37] Hanzely, F.; Richtárik, P.; Xiao, L., Accelerated Bregman proximal gradient methods for relatively smooth convex optimization, Comput. Optim. Appl., 79, 2, 405-440 (2021) · Zbl 1473.90114
[38] Florea, M. I., Exact gradient methods with memory, Optim. Methods Softw., 1-28 (2022)
[39] Bauschke, H. H.; Bolte, J.; Teboulle, M., A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications, Math. Oper. Res., 42, 2, 330-348 (2017) · Zbl 1364.90251
[40] Scaman, K.; Bach, F.; Bubeck, S.; Lee, Y. T.; Massoulié, L., Optimal algorithms for smooth and strongly convex distributed optimization in networks, (Proceedings of the 34th International Conference on Machine Learning. Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, PMLR, vol. 70 (2017)), 3027-3036
[41] Gasnikov, A. V.; Nesterov, Y. E., Universal method for stochastic composite optimization problems, Comput. Math. Math. Phys., 58, 1, 48-64 (2018) · Zbl 1457.90099
[42] Nesterov, Y., Lectures on Convex Optimization, Springer Optimization and Its Applications, vol. 137 (2018), Springer International Publishing · Zbl 1427.90003
[43] Dvurechensky, P.; Gasnikov, A.; Kroshnin, A., Computational optimal transport: complexity by accelerated gradient descent is better than by Sinkhorn’s algorithm, (Proceedings of the 35th International Conference on Machine Learning. Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 80 (2018)), 1367-1376
[44] Dvurechensky, P.; Dvinskikh, D.; Gasnikov, A.; Uribe, C. A.; Nedić, A., Decentralize and randomize: faster algorithm for Wasserstein barycenters, (Advances in Neural Information Processing Systems, vol. 31. Advances in Neural Information Processing Systems, vol. 31, NIPS’18 (2018), Curran Associates, Inc.), 10783-10793
[45] Dvurechensky, P.; Shtern, S.; Staudigl, M., First-order methods for convex optimization, EURO J. Comput. Optim., 9, Article 100015 pp. (2021) · Zbl 1516.90048
[46] Lin, Q.; Xiao, L., An adaptive accelerated proximal gradient method and its homotopy continuation for sparse optimization, (Proceedings of the 31st International Conference on Machine Learning. Proceedings of the 31st International Conference on Machine Learning, Proceedings of Machine Learning Research, PMLR, Beijing, China, vol. 32 (2014)), 73-81
[47] Monteiro, R. D.; Svaiter, B. F., An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods, SIAM J. Optim., 23, 2, 1092-1125 (2013) · Zbl 1298.90071
[48] Nesterov, Y., Smooth minimization of non-smooth functions, Math. Program., 103, 1, 127-152 (2005) · Zbl 1079.90102
[49] Lan, G., First-Order and Stochastic Optimization Methods for Machine Learning (2020), Springer · Zbl 1442.68003
[50] Doikov, N.; Nesterov, Y., Contracting proximal methods for smooth convex optimization, SIAM J. Optim., 30, 4, 3146-3169 (2020) · Zbl 1454.90052
[51] Nesterov, Y., Inexact basic tensor methods for some classes of convex optimization problems, Optim. Methods Softw., 1-29 (2020)
[52] Gasnikov, A., Universal gradient descent, arXiv preprint
[53] Doikov, N.; Nesterov, Y., Inexact tensor methods with dynamic accuracies, (Proceedings of the 37th International Conference on Machine Learning. Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, PMLR, vol. 119 (2020)), 2577-2586
[54] Agafonov, A.; Kamzolov, D.; Dvurechensky, P.; Gasnikov, A., Inexact tensor methods and their application to stochastic convex optimization · Zbl 07895156
[55] Kamzolov, D.; Gasnikov, A.; Dvurechensky, P., Optimal combination of tensor optimization methods, (Optimization and Applications (2020), Springer International Publishing: Springer International Publishing Cham), 166-183 · Zbl 1511.90388
[56] Lewis, D. D.; Yang, Y.; Rose, T. G.; Li, F., RCV1: a new benchmark collection for text categorization research, J. Mach. Learn. Res., 5, 361-397 (2004)
[57] Apache Spark 2.4.5 (2020)
[58] PyTorch 1.5.0 (2020)
[59] Kamzolov, D., Near-optimal hyperfast second-order method for convex optimization, (Mathematical Optimization Theory and Operations Research (2020), Springer International Publishing: Springer International Publishing Cham), 167-178 · Zbl 1460.90133
[60] Kingma, D. P.; Ba, J., Adam: a method for stochastic optimization, arXiv preprint
[61] Shalev-Shwartz, S., SDCA without duality, regularization, and individual convexity, (Proceedings of the 33rd International Conference on Machine Learning. Proceedings of the 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research, PMLR, New York, New York, USA, vol. 48 (2016)), 747-754
[62] Shamir, O.; Srebro, N.; Zhang, T., Communication-efficient distributed optimization using an approximate Newton-type method, (International Conference on Machine Learning (2014)), 1000-1008