{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,10]],"date-time":"2025-12-10T09:04:21Z","timestamp":1765357461828,"version":"build-2065373602"},"reference-count":49,"publisher":"MDPI AG","issue":"10","license":[{"start":{"date-parts":[[2023,10,20]],"date-time":"2023-10-20T00:00:00Z","timestamp":1697760000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Algorithms"],"abstract":"<jats:p>While first-order methods are popular for solving optimization problems arising in deep learning, they come with some acute deficiencies. To overcome these shortcomings, there has been recent interest in introducing second-order information through quasi-Newton methods that are able to construct Hessian approximations using only gradient information. In this work, we study the performance of stochastic quasi-Newton algorithms for training deep neural networks. We consider two well-known quasi-Newton updates, the limited-memory Broyden\u2013Fletcher\u2013Goldfarb\u2013Shanno (BFGS) and the symmetric rank one (SR1). This study fills a gap concerning the real performance of both updates in the minibatch setting and analyzes whether more efficient training can be obtained when using the more robust BFGS update or the cheaper SR1 formula, which\u2014allowing for indefinite Hessian approximations\u2014can potentially help to better navigate the pathological saddle points present in the non-convex loss functions found in deep learning. We present and discuss the results of an extensive experimental study that includes many aspects affecting performance, like batch normalization, the network architecture, the limited memory parameter or the batch size. 
Our results show that stochastic quasi-Newton algorithms are efficient and, in some instances, able to outperform the well-known first-order Adam optimizer, run with the optimal combination of its numerous hyperparameters, and the stochastic second-order trust-region STORM algorithm.<\/jats:p>","DOI":"10.3390\/a16100490","type":"journal-article","created":{"date-parts":[[2023,10,20]],"date-time":"2023-10-20T07:25:22Z","timestamp":1697786722000},"page":"490","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Deep Neural Networks Training by Stochastic Quasi-Newton Trust-Region Methods"],"prefix":"10.3390","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2937-9654","authenticated-orcid":false,"given":"Mahsa","family":"Yousefi","sequence":"first","affiliation":[{"name":"Department of Mathematics and Geosciences, University of Trieste, 34127 Trieste, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4826-1114","authenticated-orcid":false,"given":"\u00c1ngeles","family":"Mart\u00ednez","sequence":"additional","affiliation":[{"name":"Department of Mathematics and Geosciences, University of Trieste, 34127 Trieste, Italy"}]}],"member":"1968","published-online":{"date-parts":[[2023,10,20]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"400","DOI":"10.1214\/aoms\/1177729586","article-title":"A stochastic approximation method","volume":"22","author":"Robbins","year":"1951","journal-title":"Ann. Math. Stat."},{"key":"ref_2","first-page":"217","article-title":"Large-scale online learning","volume":"16","author":"Bottou","year":"2004","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_3","unstructured":"Defazio, A., Bach, F., and Lacoste-Julien, S. (2014, January 8\u201313). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. 
Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_4","first-page":"315","article-title":"Accelerating stochastic gradient descent using predictive variance reduction","volume":"26","author":"Johnson","year":"2013","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"83","DOI":"10.1007\/s10107-016-1030-6","article-title":"Minimizing finite sums with the stochastic average gradient","volume":"162","author":"Schmidt","year":"2017","journal-title":"Math. Program."},{"key":"ref_6","first-page":"2121","article-title":"Adaptive subgradient methods for online learning and stochastic optimization","volume":"12","author":"Duchi","year":"2011","journal-title":"J. Mach. Learn. Res."},{"key":"ref_7","unstructured":"Kingma, D.P., and Ba, J. (2015, January 7\u20139). Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015\u2014Conference Track Proceedings, San Diego, CA, USA."},{"key":"ref_8","unstructured":"Ziyin, L., Li, B., and Ueda, M. (2021). SGD May Never Escape Saddle Points. arXiv."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Kylasa, S., Roosta, F., Mahoney, M.W., and Grama, A. (2019, January 2\u20134). GPU accelerated sub-sampled Newton\u2019s method for convex classification problems. Proceedings of the 2019 SIAM International Conference on Data Mining, SIAM, Calgary, AB, Canada.","DOI":"10.1137\/1.9781611975673.79"},{"key":"ref_10","unstructured":"Nocedal, J., and Wright, S. (2006). Numerical Optimization, Springer Science & Business Media."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"223","DOI":"10.1137\/16M1080173","article-title":"Optimization methods for large-scale machine learning","volume":"60","author":"Bottou","year":"2018","journal-title":"SIAM Rev."},{"key":"ref_12","unstructured":"Martens, J. (2010, January 21\u201324). 
Deep learning via Hessian-Free optimization. Proceedings of the ICML, Haifa, Israel."},{"key":"ref_13","unstructured":"Martens, J., and Sutskever, I. (2012). Neural Networks: Tricks of the Trade, Springer."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"545","DOI":"10.1093\/imanum\/dry009","article-title":"Exact and inexact subsampled Newton methods for optimization","volume":"39","author":"Bollapragada","year":"2019","journal-title":"IMA J. Numer. Anal."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Xu, P., Roosta, F., and Mahoney, M.W. (2020, January 7\u20139). Second-order optimization for non-convex machine learning: An empirical study. Proceedings of the 2020 SIAM International Conference on Data Mining, SIAM, Cincinnati, OH, USA.","DOI":"10.1137\/1.9781611976236.23"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"626","DOI":"10.1137\/0720042","article-title":"The conjugate gradient method and trust-regions in large-scale optimization","volume":"20","author":"Steihaug","year":"1983","journal-title":"SIAM J. Numer. Anal."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Jahani, M., Nazari, M., Rusakov, S., Berahas, A.S., and Tak\u00e1\u010d, M. (2020, January 19\u201323). Scaling up Quasi-Newton algorithms: Communication efficient distributed SR1. Proceedings of the Machine Learning, Optimization, and Data Science: 6th International Conference, LOD 2020, Siena, Italy. Revised Selected Papers, Part I 6.","DOI":"10.1007\/978-3-030-64583-0_5"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1668","DOI":"10.1080\/10556788.2021.1977806","article-title":"Quasi-Newton methods for machine learning: Forget the past, just sample","volume":"37","author":"Berahas","year":"2022","journal-title":"Optim. Methods Softw."},{"key":"ref_19","unstructured":"Schraudolph, N.N., Yu, J., and G\u00fcnter, S. (2007, January 21\u201324). A stochastic Quasi-Newton method for online convex optimization. 
Proceedings of the Artificial Intelligence and Statistics, PMLR, San Juan, PR, USA."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"1008","DOI":"10.1137\/140954362","article-title":"A stochastic Quasi-Newton method for large-scale optimization","volume":"26","author":"Byrd","year":"2016","journal-title":"SIAM J. Optim."},{"key":"ref_21","unstructured":"Moritz, P., Nishihara, R., and Jordan, M. (2016, January 9\u201311). A linearly-convergent stochastic L-BFGS algorithm. Proceedings of the Artificial Intelligence and Statistics, PMLR, Cadiz, Spain."},{"key":"ref_22","unstructured":"Gower, R., Goldfarb, D., and Richt\u00e1rik, P. (2016, January 9\u201311). Stochastic block BFGS: Squeezing more curvature out of data. Proceedings of the International Conference on Machine Learning, PMLR, Cadiz, Spain."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"6089","DOI":"10.1109\/TSP.2014.2357775","article-title":"RES: Regularized stochastic BFGS algorithm","volume":"62","author":"Mokhtari","year":"2014","journal-title":"IEEE Trans. Signal Process."},{"key":"ref_24","first-page":"3151","article-title":"Global convergence of online limited memory BFGS","volume":"16","author":"Mokhtari","year":"2015","journal-title":"J. Mach. Learn. Res."},{"key":"ref_25","unstructured":"Lucchi, A., McWilliams, B., and Hofmann, T. (2015). A variance reduced stochastic Newton method. arXiv."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"927","DOI":"10.1137\/15M1053141","article-title":"Stochastic Quasi-Newton methods for nonconvex stochastic optimization","volume":"27","author":"Wang","year":"2017","journal-title":"SIAM J. Optim."},{"key":"ref_27","first-page":"1055","article-title":"A multi-batch L-BFGS method for machine learning","volume":"29","author":"Berahas","year":"2016","journal-title":"Adv. Neural Inf. Process. 
Syst."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"191","DOI":"10.1080\/10556788.2019.1658107","article-title":"A robust multi-batch L-BFGS method for machine learning","volume":"35","author":"Berahas","year":"2020","journal-title":"Optim. Methods Softw."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"460","DOI":"10.1080\/10556788.2019.1624747","article-title":"Trust-region algorithms for training responses: Machine learning methods using indefinite Hessian approximations","volume":"35","author":"Erway","year":"2020","journal-title":"Optim. Methods Softw."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Rafati, J., and Marcia, R.F. (2018, January 17\u201320). Improving L-BFGS initialization for trust-region methods in deep learning. Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA.","DOI":"10.1109\/ICMLA.2018.00081"},{"key":"ref_31","unstructured":"Bollapragada, R., Nocedal, J., Mudigere, D., Shi, H.J., and Tang, P.T.P. (2018, January 10\u201315). A progressive batching L-BFGS method for machine learning. Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"92","DOI":"10.1287\/ijoo.2019.0016","article-title":"Convergence rate analysis of a stochastic trust-region method via supermartingales","volume":"1","author":"Blanchet","year":"2019","journal-title":"INFORMS J. Optim."},{"key":"ref_33","first-page":"2386","article-title":"Practical Quasi-Newton methods for training deep neural networks","volume":"33","author":"Goldfarb","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Conn, A.R., Gould, N.I., and Toint, P.L. (2000). Trust-Region Methods, SIAM. 
Available online: https:\/\/epubs.siam.org\/doi\/book\/10.1137\/1.9780898719857.","DOI":"10.1137\/1.9780898719857"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"186","DOI":"10.1137\/0902016","article-title":"Computing optimal locally constrained steps","volume":"2","author":"Gay","year":"1981","journal-title":"SIAM J. Sci. Stat. Comput."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"553","DOI":"10.1137\/0904038","article-title":"Computing a trust-region step","volume":"4","author":"Sorensen","year":"1983","journal-title":"SIAM J. Sci. Stat. Comput."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"101","DOI":"10.1007\/s12532-016-0109-7","article-title":"On efficiently combining limited-memory and trust-region techniques","volume":"9","author":"Burdakov","year":"2017","journal-title":"Math. Program. Comput."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"245","DOI":"10.1007\/s10589-016-9868-3","article-title":"On solving L-SR1 trust-region subproblems","volume":"66","author":"Brust","year":"2017","journal-title":"Comput. Optim. Appl."},{"key":"ref_39","unstructured":"Wang, X., and Yuan, Y.X. (2019). Stochastic trust-region methods with trust-region radius depending on probabilistic models. arXiv."},{"key":"ref_40","unstructured":"Krejic, N., Jerinkic, N.K., Mart\u00ednez, A., and Yousefi, M. (2023). A non-monotone extra-gradient trust-region method with noisy oracles. arXiv."},{"key":"ref_41","unstructured":"LeCun, Y. (2020, November 01). The MNIST Database of Handwritten Digits. Available online: https:\/\/www.kaggle.com\/datasets\/hojjatk\/mnist-dataset."},{"key":"ref_42","unstructured":"Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv."},{"key":"ref_43","unstructured":"Krizhevsky, A., and Hinton, G. (2020, November 01). Learning Multiple Layers of Features from Tiny Images. 
Available online: https:\/\/www.cs.toronto.edu\/~kriz\/cifar.html."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"2278","DOI":"10.1109\/5.726791","article-title":"Gradient-based learning applied to document recognition","volume":"86","author":"LeCun","year":"1998","journal-title":"Proc. IEEE"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_46","unstructured":"Ioffe, S., and Szegedy, C. (2015, January 7\u20139). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the International Conference on Machine Learning, PMLR, Lille, France."},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"447","DOI":"10.1007\/s10107-017-1141-8","article-title":"Stochastic optimization using a trust-region method and random models","volume":"169","author":"Chen","year":"2018","journal-title":"Math. Program."},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Adhikari, L., DeGuchy, O., Erway, J.B., Lockhart, S., and Marcia, R.F. (2017, January 6\u20139). Limited-memory trust-region methods for sparse relaxation. Proceedings of the Wavelets and Sparsity XVII. International Society for Optical Engineering, San Diego, CA, USA.","DOI":"10.1117\/12.2271369"},{"key":"ref_49","unstructured":"Golub, G.H., and Van Loan, C.F. (2013). Matrix Computations, Johns Hopkins University Press. 
[4th ed.]."}],"container-title":["Algorithms"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-4893\/16\/10\/490\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T21:09:02Z","timestamp":1760130542000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-4893\/16\/10\/490"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,20]]},"references-count":49,"journal-issue":{"issue":"10","published-online":{"date-parts":[[2023,10]]}},"alternative-id":["a16100490"],"URL":"https:\/\/doi.org\/10.3390\/a16100490","relation":{},"ISSN":["1999-4893"],"issn-type":[{"type":"electronic","value":"1999-4893"}],"subject":[],"published":{"date-parts":[[2023,10,20]]}}}