{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:28:02Z","timestamp":1750220882604,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":33,"publisher":"ACM","license":[{"start":{"date-parts":[[2019,8,5]],"date-time":"2019-08-05T00:00:00Z","timestamp":1564963200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2019,8,5]]},"DOI":"10.1145\/3339186.3339203","type":"proceedings-article","created":{"date-parts":[[2019,7,22]],"date-time":"2019-07-22T12:18:25Z","timestamp":1563797905000},"page":"1-9","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Reducing global reductions in large-scale distributed training"],"prefix":"10.1145","author":[{"given":"Guojing","family":"Cong","sequence":"first","affiliation":[{"name":"IBM TJ Watson Research Center, Yorktown Heights, USA"}]},{"given":"Chih-Chieh","family":"Yang","sequence":"additional","affiliation":[{"name":"IBM TJ Watson Research Center, Yorktown Heights, USA"}]},{"given":"Fan","family":"Zhou","sequence":"additional","affiliation":[{"name":"Baidu USA, Sunnyvale, USA"}]}],"member":"320","published-online":{"date-parts":[[2019,8,5]]},"reference":[{"unstructured":"T. Akiba S. Suzuki and K. Fukuda. 2017. Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes. CoRR abs\/1711.04325 (2017). arXiv:1711.04325 http:\/\/arxiv.org\/abs\/1711.04325","key":"e_1_3_2_1_1_1"},{"doi-asserted-by":"crossref","unstructured":"L. 
Bottou. 1998. Online Learning and Stochastic Approximations.","key":"e_1_3_2_1_2_1","DOI":"10.1017\/CBO9780511569920.003"},{"key":"e_1_3_2_1_3_1","volume-title":"On a stochastic approximation method. The Annals of Mathematical Statistics","author":"Chung K.L.","year":"1954","unstructured":"K.L. Chung. 1954. On a stochastic approximation method. The Annals of Mathematical Statistics (1954), 463--483."},{"unstructured":"J. Dean G. Corrado R. Monga and et al. 2012. Large Scale Distributed Deep Networks. In Advances in Neural Information Processing Systems 25 F. Pereira C. J. C. Burges L. Bottou and K. Q. Weinberger (Eds.). Curran Associates Inc. 1223--1231.","key":"e_1_3_2_1_4_1"},{"key":"e_1_3_2_1_5_1","first-page":"165","article-title":"Optimal distributed online prediction using mini-batches","author":"Dekel O.","year":"2012","unstructured":"O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. 2012. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research 13, Jan (2012), 165--202.","journal-title":"Journal of Machine Learning Research 13"},
{"unstructured":"J. Duchi E. Hazan and Y. Singer. 2011. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res. 12 (July 2011) 2121--2159. http:\/\/dl.acm.org\/citation.cfm?id=1953048.2021068","key":"e_1_3_2_1_6_1"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_7_1","DOI":"10.1137\/120880811"},{"key":"e_1_3_2_1_8_1","volume-title":"SGD: Training ImageNet in 1 Hour. CoRR abs\/1706.02677","author":"Goyal P.","year":"2017","unstructured":"P. Goyal, P. Doll\u00e1r, R.B. Girshick, and et al. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. CoRR abs\/1706.02677 (2017). arXiv:1706.02677 http:\/\/arxiv.org\/abs\/1706.02677"},{"unstructured":"K. He X. Zhang S. Ren and J. Sun. 2015. Deep Residual Learning for Image Recognition. CoRR abs\/1512.03385 (2015). arXiv:1512.03385 http:\/\/arxiv.org\/abs\/1512.03385","key":"e_1_3_2_1_9_1"},{"unstructured":"A.G. Howard M. Zhu B. Chen and et al. 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. CoRR abs\/1704.04861 (2017). arXiv:1704.04861 http:\/\/arxiv.org\/abs\/1704.04861","key":"e_1_3_2_1_10_1"},
{"unstructured":"X. Jia S. Song W. He and et al. 2018. Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes. ArXiv e-prints (July 2018). arXiv:1807.11205","key":"e_1_3_2_1_11_1"},{"unstructured":"N. Shirish Keskar and R. Socher. 2017. Improving Generalization Performance by Switching from Adam to SGD. CoRR abs\/1712.07628 (2017). arXiv:1712.07628 http:\/\/arxiv.org\/abs\/1712.07628","key":"e_1_3_2_1_12_1"},{"key":"e_1_3_2_1_13_1","volume-title":"Adam: A Method for Stochastic Optimization. CoRR abs\/1412.6980","author":"Kingma D.P.","year":"2015","unstructured":"D.P. Kingma and J. Ba. 2015. Adam: A Method for Stochastic Optimization. CoRR abs\/1412.6980 (2015)."},{"unstructured":"A. Krizhevsky. 2009. Learning Multiple Layers of Features from Tiny Images. Master's thesis. http:\/\/www.cs.toronto.edu\/~{}kriz\/learning-features-2009-TR.pdf","key":"e_1_3_2_1_14_1"},{"unstructured":"X. Lian Y. Huang Y. Li and J. Liu. 2015. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems. 2737--2745.","key":"e_1_3_2_1_15_1"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_16_1","DOI":"10.1137\/070704277"},{"unstructured":"A. Paszke S. Gross S. Chintala and et al. 2017. Automatic differentiation in PyTorch. (2017).","key":"e_1_3_2_1_17_1"},
{"key":"e_1_3_2_1_18_1","volume-title":"Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems. 693--701.","author":"Recht B.","year":"2011","unstructured":"B. Recht, C. Re, S. Wright, and F. Niu. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems. 693--701."},{"doi-asserted-by":"crossref","unstructured":"H. Robbins and S. Monro. 1951. A stochastic approximation method. The annals of mathematical statistics (1951) 400--407.","key":"e_1_3_2_1_19_1","DOI":"10.1214\/aoms\/1177729586"},{"doi-asserted-by":"crossref","unstructured":"H. Robbins and D. Siegmund. 1971. A convergence theorem for non negative almost supermartingales and some applications. In Optimizing methods in statistics. Elsevier 233--257.","key":"e_1_3_2_1_20_1","DOI":"10.1016\/B978-0-12-604550-5.50015-8"},{"unstructured":"O. Russakovsky J. Deng H. Su and et al. 2014. ImageNet Large Scale Visual Recognition Challenge. CoRR abs\/1409.0575 (2014). arXiv:1409.0575 http:\/\/arxiv.org\/abs\/1409.0575","key":"e_1_3_2_1_21_1"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_22_1","DOI":"10.1214\/aoms\/1177706619"},{"unstructured":"O. Shamir and T. Zhang. 2013. 
Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes. In ICML (1). 71--79.","key":"e_1_3_2_1_23_1"},{"unstructured":"K. Simonyan and A. Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs\/1409.1556 (2014). http:\/\/arxiv.org\/abs\/1409.1556","key":"e_1_3_2_1_24_1"},{"unstructured":"C. Szegedy W. Liu Y. Jia and et al. 2014. Going Deeper with Convolutions. CoRR abs\/1409.4842 (2014). arXiv:1409.4842 http:\/\/arxiv.org\/abs\/1409.4842","key":"e_1_3_2_1_25_1"},{"unstructured":"A.C. Wilson R. Roelofs M. Stern and et al. 2017. The Marginal Value of Adaptive Gradient Methods in Machine Learning. ArXiv e-prints (May 2017). arXiv:stat.ML\/1705.08292","key":"e_1_3_2_1_26_1"},{"unstructured":"Y. Wu M. Schuster Z. Chen and et al. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. CoRR abs\/1609.08144 (2016). arXiv:1609.08144 http:\/\/arxiv.org\/abs\/1609.08144","key":"e_1_3_2_1_27_1"},{"key":"e_1_3_2_1_28_1","volume-title":"Image Classification at Supercomputer Scale. 
CoRR abs\/1811.06992","author":"Ying C.","year":"2018","unstructured":"C. Ying, S. Kumar, D. Chen, and et al. 2018. Image Classification at Supercomputer Scale. CoRR abs\/1811.06992 (2018). arXiv:1811.06992 http:\/\/arxiv.org\/abs\/1811.06992"},{"unstructured":"Y. You I. Gitman and B. Ginsburg. 2017. Scaling SGD Batch Size to 32K for ImageNet Training. CoRR abs\/1708.03888 (2017). arXiv:1708.03888 http:\/\/arxiv.org\/abs\/1708.03888","key":"e_1_3_2_1_29_1"},{"unstructured":"C. Zhang S. Bengio M. Hardt and et al. 2017. Understanding deep learning requires rethinking generalization. https:\/\/arxiv.org\/abs\/1611.03530","key":"e_1_3_2_1_30_1"},{"unstructured":"Z. Zhang L. Ma Z. Li and C. Wu. 2017. Normalized Direction-preserving Adam. CoRR abs\/1709.04546 (2017). arXiv:1709.04546 http:\/\/arxiv.org\/abs\/1709.04546","key":"e_1_3_2_1_31_1"},
{"volume-title":"Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18","author":"Zhou F.","unstructured":"F. Zhou and G. Cong. 2018. On the Convergence Properties of a K-step Averaging Stochastic Gradient Descent Algorithm for Nonconvex Optimization. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18. International Joint Conferences on Artificial Intelligence Organization, 3219--3227.","key":"e_1_3_2_1_32_1"},{"unstructured":"M. Zinkevich M. Weimer A. J. Smola and L. Li. 2011. Parallelized Stochastic Gradient Descent. In Advances in Neural Information Processing Systems 23 (NIPS-10). 2595--2603. http:\/\/research.microsoft.com\/apps\/pubs\/default.aspx?id=178845","key":"e_1_3_2_1_33_1"}],"event":{"sponsor":["University of Tsukuba University of Tsukuba"],"acronym":"ICPP 2019","name":"ICPP 2019: Workshops","location":"Kyoto Japan"},"container-title":["Workshop Proceedings of the 48th International Conference on Parallel 
Processing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3339186.3339203","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3339186.3339203","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:44:17Z","timestamp":1750203857000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3339186.3339203"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,8,5]]},"references-count":33,"alternative-id":["10.1145\/3339186.3339203","10.1145\/3339186"],"URL":"https:\/\/doi.org\/10.1145\/3339186.3339203","relation":{},"subject":[],"published":{"date-parts":[[2019,8,5]]},"assertion":[{"value":"2019-08-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}