
Sample size selection in optimization methods for machine learning. (English) Zbl 1252.49044

Summary: This paper presents a methodology for using varying sample sizes in batch-type optimization methods for large-scale machine learning problems. The first part of the paper deals with the delicate issue of dynamic sample selection in the evaluation of the function and gradient. We propose a criterion for increasing the sample size based on variance estimates obtained during the computation of a batch gradient. We establish an \(O(1/\epsilon)\) complexity bound on the total cost of a gradient method. The second part of the paper describes a practical Newton method that uses a smaller sample to compute Hessian-vector products than to evaluate the function and the gradient, and that also employs a dynamic sampling technique. In the third part, the focus shifts to \(L_1\)-regularized problems designed to produce sparse solutions. We propose a Newton-like method that consists of two phases: a (minimalistic) gradient projection phase that identifies zero variables, and a subspace phase that applies a subsampled Hessian Newton iteration in the free variables. Numerical tests on speech recognition problems illustrate the performance of the algorithms.
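The variance-based sampling criterion of the first part lends itself to a short illustration. The following is a minimal sketch in Python/NumPy, not the authors' implementation: the per-example gradient oracle grad_fn, the threshold theta, the fixed step length, and the rule for choosing the new sample size are all illustrative assumptions in the spirit of the test described above (enlarge the sample when the estimated variance of the sampled gradient dominates its squared norm).

    import numpy as np

    def sampled_gradient(grad_fn, w, sample_idx):
        # Per-example gradients on the current sample: returns the batch
        # gradient and an unbiased componentwise variance estimate.
        grads = np.stack([grad_fn(w, i) for i in sample_idx])  # shape (|S|, d)
        return grads.mean(axis=0), grads.var(axis=0, ddof=1)

    def gradient_step_with_dynamic_sampling(grad_fn, w, n, sample_size,
                                            theta=0.5, step=0.1, rng=None):
        # One steepest-descent step; the sample is enlarged for the next
        # iteration when the estimated variance of the sampled gradient is
        # large relative to the squared gradient norm.
        rng = rng if rng is not None else np.random.default_rng()
        idx = rng.choice(n, size=sample_size, replace=False)
        g, var = sampled_gradient(grad_fn, w, idx)
        if var.sum() / sample_size > theta**2 * np.dot(g, g):
            # Smallest sample size that would pass the test with the current
            # variance estimate, capped at the full data set.
            sample_size = min(n, int(np.ceil(var.sum() / (theta**2 * np.dot(g, g)))))
        return w - step * g, sample_size

    # Example oracle: per-example gradient of a least-squares loss
    # 0.5*(X[i] @ w - y[i])**2, i.e. grad_fn = lambda w, i: (X[i] @ w - y[i]) * X[i]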

MSC:

49M15 Newton-type methods
49M37 Numerical methods based on nonlinear programming
65K05 Numerical mathematical programming methods
68T05 Learning and adaptive systems in artificial intelligence
90C30 Nonlinear programming

Software:

AMLET; TRON; HOGWILD

References:

[1] Agarwal, A., Duchi, J.: Distributed delayed stochastic optimization. arXiv preprint arXiv:1104.5525 (2011)
[2] Andrew, G., Gao, J.: Scalable training of \(L_1\)-regularized log-linear models. In: Proceedings of the 24th International Conference on Machine Learning, pp. 33–40. ACM (2007)
[3] Bastin F., Cirillo C., Toint P.L.: An adaptive Monte Carlo algorithm for computing mixed logit estimators. Comput. Manag. Sci. 3(1), 55–79 (2006) · Zbl 1136.62086 · doi:10.1007/s10287-005-0044-y
[4] Beck A., Teboulle M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009) · Zbl 1175.94009 · doi:10.1137/080716542
[5] Bertsekas D.P.: On the Goldstein-Levitin-Poljak gradient projection method. IEEE Trans. Autom. Control AC-21, 174–184 (1976) · Zbl 0326.49025 · doi:10.1109/TAC.1976.1101194
[6] Bottou L., Bousquet O.: The tradeoffs of large scale learning. In: Platt, J., Koller, D., Singer, Y., Roweis, S. (eds) Advances in Neural Information Processing Systems, vol. 20, pp. 161–168. MIT Press, Cambridge, MA (2008)
[7] Byrd, R., Chin, G.M., Neveitt, W., Nocedal, J.: On the use of stochastic Hessian information in unconstrained optimization. SIAM J. Optim. 21(3), 977–995 (2011) · Zbl 1245.65062
[8] Conn A.R., Gould N.I.M., Toint P.L.: A globally convergent augmented Lagrangian algorithm for optimization with general constraints and simple bounds. SIAM J. Numer. Anal. 28(2), 545–572 (1991) · Zbl 0724.65067 · doi:10.1137/0728030
[9] Dai Y., Fletcher R.: Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming. Numer. Math. 100(1), 21–47 (2005) · Zbl 1068.65073 · doi:10.1007/s00211-004-0569-y
[10] Dekel, O., Gilad-Bachrach, R., Shamir, O., Xiao, L.: Optimal distributed online prediction using mini-batches. arXiv preprint arXiv:1012.1367 (2010) · Zbl 1283.68404
[11] Deng G., Ferris M.C.: Variable-number sample-path optimization. Math. Program. 117(1–2), 81–109 (2009) · Zbl 1165.90013 · doi:10.1007/s10107-007-0164-y
[12] Donoho D.: De-noising by soft-thresholding. IEEE Trans. Inf. Theory 41(3), 613–627 (1995) · Zbl 0820.62002 · doi:10.1109/18.382009
[13] Duchi, J., Shalev-Shwartz, S., Singer, Y., Tewari, A.: Composite objective mirror descent. In: Proceedings of the Twenty-Third Annual Conference on Computational Learning Theory (2010)
[14] Duchi J., Singer Y.: Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. 10, 2899–2934 (2009) · Zbl 1235.62151
[15] Figueiredo M., Nowak R., Wright S.: Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE J. Sel. Top. Signal Process. 1(4), 586–597 (2007) · doi:10.1109/JSTSP.2007.910281
[16] Freund J.E.: Mathematical Statistics. Prentice Hall, Englewood Cliffs, NJ (1962) · Zbl 0142.15001
[17] Friedlander, M., Schmidt, M.: Hybrid deterministic-stochastic methods for data fitting. arXiv preprint arXiv:1104.2373 (2011)
[18] Hager W.W., Zhang H.: A new active set algorithm for box constrained optimization. SIAM J. Optim. 17(2), 526–557 (2007) · Zbl 1165.90570
[19] Homem-de-Mello T.: Variable-sample methods for stochastic optimization. ACM Trans. Model. Comput. Simul. 13(2), 108–133 (2003) · doi:10.1145/858481.858483
[20] Kleywegt A.J., Shapiro A., Homem-de-Mello T.: The sample average approximation method for stochastic discrete optimization. SIAM J. Optim. 12(2), 479–502 (2001) · Zbl 0991.90090 · doi:10.1137/S1052623499363220
[21] Lin C., Moré J.: Newton’s method for large bound-constrained optimization problems. SIAM J. Optim. 9(4), 1100–1127 (1999) · Zbl 0957.65064 · doi:10.1137/S1052623498345075
[22] Martens, J.: Deep learning via Hessian-free optimization. In: Proceedings of the 27th International Conference on Machine Learning (ICML) (2010)
[23] Nesterov Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120(1), 221–259 (2009) · Zbl 1191.90038 · doi:10.1007/s10107-007-0149-x
[24] Niu, F., Recht, B., Ré, C., Wright, S.: Hogwild!: a lock-free approach to parallelizing stochastic gradient descent. arXiv preprint arXiv:1106.5730 (2011)
[25] Polyak B., Juditsky A.: Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30(4), 838–855 (1992) · Zbl 0762.62022 · doi:10.1137/0330046
[26] Polyak B.T.: The conjugate gradient method in extremal problems. USSR Comput. Math. Math. Phys. 9, 94–112 (1969) · Zbl 0229.49023 · doi:10.1016/0041-5553(69)90035-4
[27] Robbins H., Monro S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951) · Zbl 0054.05901 · doi:10.1214/aoms/1177729586
[28] Shapiro A., Homem-de-Mello T.: A simulation-based approach to two-stage stochastic programming with recourse. Math. Program. 81, 301–325 (1998) · Zbl 0919.90120
[29] Shapiro A., Homem-de-Mello T.: On the rate of convergence of optimal solutions of Monte Carlo approximations of stochastic programs. SIAM J. Optim. 11(1), 70–86 (2000) · Zbl 0999.90023 · doi:10.1137/S1052623498349541
[30] Shapiro A., Wardi Y.: Convergence of stochastic algorithms. Math. Oper. Res. 21(3), 615–628 (1996) · Zbl 0868.90114 · doi:10.1287/moor.21.3.615
[31] Vishwanathan, S., Schraudolph, N., Schmidt, M., Murphy, K.: Accelerated training of conditional random fields with stochastic gradient methods. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 969–976. ACM (2006)
[32] Wright, S.: Accelerated block-coordinate relaxation for regularized optimization. Tech. rep., Computer Science Department, University of Wisconsin (2010) · Zbl 1357.49105
[33] Wright S., Nowak R., Figueiredo M.: Sparse reconstruction by separable approximation. Signal Process. IEEE Trans. 57(7), 2479–2493 (2009) · Zbl 1391.94442 · doi:10.1109/TSP.2009.2016892
[34] Xiao L.: Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res. 11, 2543–2596 (2010) · Zbl 1242.62011