{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T22:22:39Z","timestamp":1769638959504,"version":"3.49.0"},"reference-count":46,"publisher":"MIT Press - Journals","issue":"6","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Neural Computation"],"published-print":{"date-parts":[[2018,6]]},"abstract":"<jats:p> Deep learning involves a difficult nonconvex optimization problem with a large number of weights between any two adjacent layers of a deep structure. To handle large data sets or complicated networks, distributed training is needed, but the calculation of function, gradient, and Hessian is expensive. In particular, the communication and the synchronization cost may become a bottleneck. In this letter, we focus on situations where the model is distributedly stored and propose a novel distributed Newton method for training deep neural networks. By variable and feature-wise data partitions and some careful designs, we are able to explicitly use the Jacobian matrix for matrix-vector products in the Newton method. Some techniques are incorporated to reduce the running time as well as memory consumption. First, to reduce the communication cost, we propose a diagonalization method such that an approximate Newton direction can be obtained without communication between machines. Second, we consider subsampled Gauss-Newton matrices for reducing the running time as well as the communication cost. Third, to reduce the synchronization cost, we terminate the process of finding an approximate Newton direction even though some nodes have not finished their tasks. Details of some implementation issues in distributed environments are thoroughly investigated. Experiments demonstrate that the proposed method is effective for the distributed training of deep neural networks. Compared with stochastic gradient methods, it is more robust and may give better test accuracy. <\/jats:p>","DOI":"10.1162\/neco_a_01088","type":"journal-article","created":{"date-parts":[[2018,4,13]],"date-time":"2018-04-13T20:22:15Z","timestamp":1523650935000},"page":"1673-1724","source":"Crossref","is-referenced-by-count":16,"title":["Distributed Newton Methods for Deep Neural Networks"],"prefix":"10.1162","volume":"30","author":[{"given":"Chien-Chih","family":"Wang","sequence":"first","affiliation":[{"name":"Department of Computer Science, National Taiwan University, Taipei 10617, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kent Loong","family":"Tan","sequence":"additional","affiliation":[{"name":"Department of Computer Science, National Taiwan University, Taipei 10617, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chun-Ting","family":"Chen","sequence":"additional","affiliation":[{"name":"Department of Computer Science, National Taiwan University, Taipei 10617, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yu-Hsiang","family":"Lin","sequence":"additional","affiliation":[{"name":"Department of Physics, National Taiwan University, Taipei 10617, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"S. Sathiya","family":"Keerthi","sequence":"additional","affiliation":[{"name":"Criteo Research, Palo Alto, CA 94301, U.S.A."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Dhruv","family":"Mahajan","sequence":"additional","affiliation":[{"name":"Facebook Research, Menlo Park, CA 94025, U.S.A."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"S.","family":"Sundararajan","sequence":"additional","affiliation":[{"name":"Microsoft Research India, Bangalore, Karnataka 56001, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chih-Jen","family":"Lin","sequence":"additional","affiliation":[{"name":"Department of Computer Science, National Taiwan University, Taipei 10617, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"281","reference":[{"key":"B1","volume-title":"Proceedings of the Fifth Turkish Artificial Intelligence and Artificial Neural Networks Symposium","author":"Alimoglu F.","year":"1996"},{"key":"B2","doi-asserted-by":"publisher","DOI":"10.1038\/ncomms5308"},{"key":"B3","doi-asserted-by":"publisher","DOI":"10.1109\/SHPCC.1994.296665"},{"key":"B4","doi-asserted-by":"publisher","DOI":"10.1109\/72.279181"},{"key":"B5","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-40994-3_6"},{"key":"B6","doi-asserted-by":"publisher","DOI":"10.1145\/130385.130401"},{"issue":"8","key":"B7","first-page":"1","volume":"91","author":"Bottou L.","year":"1991","journal-title":"Proceedings of Neuro-N\u0131mes"},{"key":"B8","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-7908-2604-3_16"},{"key":"B9","doi-asserted-by":"publisher","DOI":"10.1137\/10079923X"},{"key":"B10","doi-asserted-by":"publisher","DOI":"10.1145\/1961189.1961199"},{"key":"B11","author":"Chapelle O.","year":"2011","journal-title":"NIPS Workshop on Deep Learning and Unsupervised Feature Learning"},{"key":"B12","doi-asserted-by":"publisher","DOI":"10.1162\/NECO_a_00052"},{"key":"B13","volume-title":"Advances in neural information processing systems","volume":"25","author":"Dean J.","year":"2012"},{"key":"B14","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2004.03.020"},{"key":"B15","first-page":"249","author":"Glorot X.","year":"2010","journal-title":"Proceedings of the 13th International Conference on Artificial Intelligence and Statistics"},{"key":"B16","author":"Goodfellow I. J.","year":"2013","journal-title":"Pylearn2: A machine learning research library"},{"key":"B17","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.123"},{"key":"B18","author":"He X.","year":"2016","journal-title":"Large scale distributed Hessian-free optimization for deep neural network"},{"key":"B19","doi-asserted-by":"publisher","DOI":"10.1109\/MSP.2012.2205597"},{"key":"B20","doi-asserted-by":"publisher","DOI":"10.1109\/34.291440"},{"key":"B21","author":"Kiros R.","year":"2013","journal-title":"Training neural networks with stochastic Hessian-free optimization"},{"key":"B22","first-page":"1097","volume-title":"Advances in neural information processing systems","volume":"25","author":"Krizhevsky A.","year":"2012"},{"key":"B23","first-page":"265","author":"Le Q. V.","year":"2011","journal-title":"Proceedings of the 28th International Conference on Machine Learning"},{"key":"B24","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},{"key":"B25","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-49430-8_2"},{"key":"B26","author":"Li P.","year":"2010","journal-title":"An empirical evaluation of four algorithms for multi-class classification: Mart, abc-mart, robust logitboost, and abc-logitboost"},{"key":"B27","volume-title":"UCI machine learning repository","author":"Lichman M.","year":"2013"},{"issue":"91","key":"B28","first-page":"1","volume":"18","author":"Mahajan D.","year":"2017","journal-title":"Journal of Machine Learning Research"},{"key":"B29","author":"Martens J.","year":"2010","journal-title":"Proceedings of the 27th International Conference on Machine Learning"},{"key":"B30","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-35289-8_27"},{"key":"B31","author":"Michie D.","year":"1994","journal-title":"Machine learning, neural and statistical classification"},{"key":"B32","author":"Moritz P.","year":"2015","journal-title":"SparkNet: Training deep networks in Spark"},{"key":"B33","author":"Netzer Y.","year":"2011","journal-title":"NIPS Workshop on Deep Learning and Unsupervised Feature Learning"},{"key":"B34","first-page":"2422","volume-title":"Advances in Neural Information Processing Systems","volume":"28","author":"Neyshabur B.","year":"2015"},{"key":"B35","volume-title":"Proceedings of Computational Intelligence Workshop","author":"Paschke F.","year":"2013"},{"key":"B36","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1994.6.1.147"},{"key":"B37","doi-asserted-by":"publisher","DOI":"10.1007\/s10586-007-0012-0"},{"key":"B38","doi-asserted-by":"publisher","DOI":"10.1016\/0041-5553(64)90137-5"},{"key":"B39","doi-asserted-by":"publisher","DOI":"10.1162\/08997660260028683"},{"key":"B40","author":"Simonyan K.","year":"2014","journal-title":"Very deep convolutional networks for large-scale image recognition"},{"key":"B41","first-page":"1139","author":"Sutskever I.","year":"2013","journal-title":"Proceedings of the 30th International Conference on Machine Learning"},{"key":"B42","first-page":"2722","author":"Taylor G.","year":"2016","journal-title":"Proceedings of the Thirty-Third International Conference on Machine Learning"},{"key":"B43","doi-asserted-by":"publisher","DOI":"10.1177\/1094342005051521"},{"key":"B44","first-page":"1058","author":"Wan L.","year":"2013","journal-title":"Proceedings of the 30th International Conference on Machine Learning"},{"key":"B45","doi-asserted-by":"publisher","DOI":"10.1162\/NECO_a_00751"},{"key":"B46","first-page":"2595","volume-title":"Advances in neural information processing systems","volume":"23","author":"Zinkevich M.","year":"2010"}],"container-title":["Neural Computation"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mitpressjournals.org\/doi\/pdf\/10.1162\/neco_a_01088","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,3,12]],"date-time":"2021-03-12T21:42:30Z","timestamp":1615585350000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/neco\/article\/30\/6\/1673-1724\/8398"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,6]]},"references-count":46,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2018,6]]}},"alternative-id":["10.1162\/neco_a_01088"],"URL":"https:\/\/doi.org\/10.1162\/neco_a_01088","relation":{},"ISSN":["0899-7667","1530-888X"],"issn-type":[{"value":"0899-7667","type":"print"},{"value":"1530-888X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,6]]}}}