{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,25]],"date-time":"2025-11-25T20:35:26Z","timestamp":1764102926965},"reference-count":27,"publisher":"MIT Press","issue":"7","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Neural Computation"],"published-print":{"date-parts":[[2018,7]]},"abstract":"<jats:p>We present a comprehensive framework of search methods, such as simulated annealing and batch training, for solving nonconvex optimization problems. These methods search a wider range by gradually decreasing the randomness added to the standard gradient descent method. The formulation that we define on the basis of this framework can be directly applied to neural network training. This produces an effective approach that gradually increases batch size during training. We also explain why large batch training degrades generalization performance, which previous studies have not clarified.<\/jats:p>","DOI":"10.1162\/neco_a_01089","type":"journal-article","created":{"date-parts":[[2018,4,13]],"date-time":"2018-04-13T20:22:15Z","timestamp":1523650935000},"page":"2005-2023","source":"Crossref","is-referenced-by-count":10,"title":["Why Does Large Batch Training Result in Poor Generalization? A Comprehensive Explanation and a Better Strategy from the Viewpoint of Stochastic Optimization"],"prefix":"10.1162","volume":"30","author":[{"given":"Tomoumi","family":"Takase","sequence":"first","affiliation":[{"name":"Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Hokkaido 060-0814, Japan"}]},{"given":"Satoshi","family":"Oyama","sequence":"additional","affiliation":[{"name":"Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Hokkaido 060-0814, Japan"}]},{"given":"Masahito","family":"Kurihara","sequence":"additional","affiliation":[{"name":"Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Hokkaido 060-0814, Japan"}]}],"member":"281","reference":[{"key":"B1","doi-asserted-by":"crossref","DOI":"10.1093\/oso\/9780195131581.001.0001","volume-title":"Swarm intelligence: From natural to artificial systems","author":"Bonabeau E.","year":"1999"},{"key":"B2","doi-asserted-by":"publisher","DOI":"10.1007\/BF00940812"},{"key":"B3","author":"Dinh L.","year":"2017","journal-title":"Sharp minima can generalize for deep nets"},{"key":"B4","first-page":"2121","volume":"12","author":"Duchi J.","year":"2011","journal-title":"Journal of Machine Learning Research"},{"key":"B5","first-page":"249","author":"Glorot X.","year":"2010","journal-title":"Proceedings of Artificial Intelligence and Statistics Conference"},{"key":"B6","first-page":"315","author":"Glorot X.","year":"2011","journal-title":"Proceedings of Artificial Intelligence and Statistics Conference"},{"key":"B7","doi-asserted-by":"publisher","DOI":"10.1016\/0305-0548(86)90048-1"},{"key":"B8","first-page":"448","author":"Ioffe S.","year":"2015","journal-title":"Proceedings of the 32nd International Conference on Machine Learning"},{"key":"B9","doi-asserted-by":"publisher","DOI":"10.1038\/nature10012"},{"key":"B10","volume-title":"Advances in neural information processing systems","volume":"26","author":"Johnson R.","year":"2013"},{"key":"B11","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.58.5355"},{"key":"B12","volume-title":"Advances in neural information processing systems","volume":"29","author":"Kawaguchi K.","year":"2016"},{"key":"B13","author":"Keskar N. S.","year":"2017","journal-title":"Proceedings of the International Conference on Learning Representations"},{"key":"B14","author":"Kingma D.","year":"2015","journal-title":"Proceedings of the International Conference on Learning Representations"},{"key":"B15","doi-asserted-by":"publisher","DOI":"10.1126\/science.220.4598.671"},{"key":"B16","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},{"key":"B17","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-49430-8_2"},{"key":"B18","author":"Lu H.","year":"2017","journal-title":"Depth creates no bad local minima"},{"key":"B19","doi-asserted-by":"publisher","DOI":"10.1007\/BF01582166"},{"key":"B20","author":"Mishkin D.","year":"2015","journal-title":"All you need is a good init"},{"key":"B21","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2015.12.114"},{"key":"B22","volume-title":"Advances in neural information processing systems","volume":"25","author":"Roux N. L.","year":"2012"},{"key":"B23","author":"Smith S. L.","year":"2017","journal-title":"Don't decay the learning rate, increase the batch size"},{"key":"B24","first-page":"1929","volume":"15","author":"Srivastava N.","year":"2014","journal-title":"Journal of Machine Learning Research"},{"key":"B25","first-page":"1139","author":"Sutskever I.","year":"2013","journal-title":"Proceedings of the 30th International Conference on Machine Learning"},{"key":"B26","author":"Tieleman T.","year":"2012","journal-title":"COURSERA: Neural networks for machine learning"},{"key":"B27","author":"Zeiler M. D.","year":"2012","journal-title":"An adaptive learning rate method"}],"container-title":["Neural Computation"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mitpressjournals.org\/doi\/pdf\/10.1162\/neco_a_01089","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,8,19]],"date-time":"2022-08-19T15:41:44Z","timestamp":1660923704000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/neco\/article\/30\/7\/2005-2023\/8403"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,7]]},"references-count":27,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2018,7]]}},"alternative-id":["10.1162\/neco_a_01089"],"URL":"https:\/\/doi.org\/10.1162\/neco_a_01089","relation":{},"ISSN":["0899-7667","1530-888X"],"issn-type":[{"value":"0899-7667","type":"print"},{"value":"1530-888X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,7]]}}}