{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,10,30]],"date-time":"2024-10-30T20:52:43Z","timestamp":1730321563162,"version":"3.28.0"},"publisher-location":"New York, NY, USA","reference-count":42,"publisher":"ACM","license":[{"start":{"date-parts":[[2019,8,5]],"date-time":"2019-08-05T00:00:00Z","timestamp":1564963200000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2019,8,5]]},"DOI":"10.1145\/3339186.3339202","type":"proceedings-article","created":{"date-parts":[[2019,7,22]],"date-time":"2019-07-22T12:18:25Z","timestamp":1563797905000},"page":"1-8","update-policy":"http:\/\/dx.doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Performance Optimizations and Analysis of Distributed Deep Learning with Approximated Second-Order Optimization Method"],"prefix":"10.1145","author":[{"given":"Yohei","family":"Tsuji","sequence":"first","affiliation":[{"name":"Tokyo Institute of Technology, Tokyo, Japan"}]},{"given":"Kazuki","family":"Osawa","sequence":"additional","affiliation":[{"name":"Tokyo Institute of Technology, Tokyo, Japan"}]},{"given":"Yuichiro","family":"Ueno","sequence":"additional","affiliation":[{"name":"Tokyo Institute of Technology, Tokyo, Japan"}]},{"given":"Akira","family":"Naruse","sequence":"additional","affiliation":[{"name":"NVIDIA, Tokyo, Japan"}]},{"given":"Rio","family":"Yokota","sequence":"additional","affiliation":[{"name":"Global Scientific Information and Computing Center, Tokyo Institute of Technology, AIST-Tokyo Tech RWBC-OIL, AIST, Tokyo, Japan"}]},{"given":"Satoshi","family":"Matsuoka","sequence":"additional","affiliation":[{"name":"RIKEN Center for Computational Science, Tokyo Institute of Technology, Kobe, 
Japan"}]}],"member":"320","published-online":{"date-parts":[[2019,8,5]]},"reference":[{"volume-title":"Proceedings of Workshop on Machine Learning Systems in The 31st Annual Conference on Neural Information Processing Systems.","year":"2017","author":"Akiba Takuya","key":"e_1_3_2_1_1_1","unstructured":"Takuya Akiba, Keisuke Fukuda, and Shuji Suzuki. 2017. ChainerMN: Scalable Distributed Deep Learning Framework. In Proceedings of Workshop on Machine Learning Systems in The 31st Annual Conference on Neural Information Processing Systems."},{"volume-title":"Differential-geometrical methods in statistics","author":"Amari S.","key":"e_1_3_2_1_2_1","unstructured":"S. Amari. 1985. Differential-geometrical methods in statistics. Springer-Verlag."},{"key":"e_1_3_2_1_3_1","volume-title":"Proceedings of the 34th International Conference on Machine Learning","volume":"70","author":"Botev Aleksandar","year":"2017","unstructured":"Aleksandar Botev, Hippolyt Ritter, and David Barber. 2017. Practical Gauss-Newton Optimisation for Deep Learning. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70. 557--565."},{"key":"e_1_3_2_1_4_1","unstructured":"Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. 2016. An Analysis of Deep Neural Network Models for Practical Applications. (2016). arXiv:1605.07678"},{"key":"e_1_3_2_1_5_1","unstructured":"Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient Primitives for Deep Learning. (2014). arXiv:1410.0759"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.5555\/1953048.2021068"},{"key":"e_1_3_2_1_7_1","unstructured":"Priya Goyal, Piotr Doll\u00e1r, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate Large Minibatch SGD: Training ImageNet in 1 Hour. (2017). arXiv:1706.02677"},{"key":"e_1_3_2_1_8_1","volume-title":"Proceedings of the 33rd International Conference on International Conference on Machine Learning","volume":"48","author":"Grosse Roger","year":"2016","unstructured":"Roger Grosse and James Martens. 2016. A Kronecker-factored Approximate Fisher Matrix for Convolution Layers. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, Vol. 48. 573--582."},{"volume-title":"Asymmetric Valleys: Beyond Sharp and Flat Local Minima.","year":"2019","author":"He Haowei","key":"e_1_3_2_1_9_1","unstructured":"Haowei He, Gao Huang, and Yang Yuan. 2019. Asymmetric Valleys: Beyond Sharp and Flat Local Minima. (2019). arXiv:1902.00744"},{"key":"e_1_3_2_1_10_1","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. (2015). arXiv:1512.03385"},{"key":"e_1_3_2_1_11_1","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity Mappings in Deep Residual Networks. (2016). arXiv:1603.05027"},{"volume-title":"Weinberger","year":"2016","author":"Huang Gao","key":"e_1_3_2_1_12_1","unstructured":"Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. 2016. Densely Connected Convolutional Networks. (2016). arXiv:1608.06993"},{"key":"e_1_3_2_1_13_1","unstructured":"Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2016. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. (2016). arXiv:1609.04836"},{"volume-title":"Kingma and Jimmy Ba","year":"2014","author":"Diederik","key":"e_1_3_2_1_14_1","unstructured":"Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. (2014). arXiv:1412.6980"},{"volume-title":"Use Local SGD.","year":"2018","author":"Lin Tao","key":"e_1_3_2_1_15_1","unstructured":"Tao Lin, Sebastian U. Stich, and Martin Jaggi. 2018. Don't Use Large Mini-Batches, Use Local SGD. (2018). arXiv:1808.07217"},{"key":"e_1_3_2_1_16_1","unstructured":"Yao Lu, Mehrtash Harandi, Richard I. Hartley, and Razvan Pascanu. 2018. Block Mean Approximation for Efficient Second Order Optimization. (2018). arXiv:1804.05484"},{"volume-title":"Erwin Laure, Ivy Bo Peng, and Jeffrey S. Vetter.","year":"2018","author":"Markidis Stefano","key":"e_1_3_2_1_17_1","unstructured":"Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S. Vetter. 2018. NVIDIA Tensor Core Programmability, Performance & Precision. (2018). arXiv:1803.04014"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.5555\/3104322.3104416"},{"key":"e_1_3_2_1_19_1","volume-title":"Proceedings of the 32nd International Conference on Machine Learning","volume":"37","author":"Martens James","year":"2015","unstructured":"James Martens and Roger Grosse. 2015. Optimizing Neural Networks with Kronecker-factored Approximate Curvature. In Proceedings of the 32nd International Conference on Machine Learning, Vol. 37. 2408--2417."},{"volume-title":"An Empirical Model of Large-Batch Training. CoRR abs\/1812.06162","year":"2018","author":"McCandlish Sam","key":"e_1_3_2_1_20_1","unstructured":"Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. 2018. An Empirical Model of Large-Batch Training. CoRR abs\/1812.06162 (2018)."},{"volume-title":"Proceedings of the 16th International Conference on Neural Information Processing Systems. 209--216","author":"Mizutani Eiji","key":"e_1_3_2_1_21_1","unstructured":"Eiji Mizutani and James W. Demmel. 2003. Iterative Scaled Trust-region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian-vector Multiply. In Proceedings of the 16th International Conference on Neural Information Processing Systems. 209--216."},{"key":"e_1_3_2_1_22_1","unstructured":"NVIDIA. 2017. NVIDIA DALI documentation. https:\/\/docs.nvidia.com\/deeplearning\/sdk\/dali-developer-guide\/docs\/index.html"},{"key":"e_1_3_2_1_23_1","unstructured":"NVIDIA. 2017. NVIDIA TESLA V100 GPU ARCHITECTURE. https:\/\/images.nvidia.com\/content\/volta-architecture\/pdf\/volta-architecture-whitepaper.pdf"},{"key":"e_1_3_2_1_24_1","unstructured":"NVIDIA. 2018. cuDNN Developer Guide: Deep Learning SDK Documentation. https:\/\/docs.nvidia.com\/deeplearning\/sdk\/cudnn-developer-guide\/index.html"},{"key":"e_1_3_2_1_25_1","unstructured":"NVIDIA. 2018. NVIDIA Collective Communications Library (NCCL) | NVIDIA Developer. https:\/\/developer.nvidia.com\/nccl"},{"key":"e_1_3_2_1_26_1","unstructured":"NVIDIA. 2019. Training with Mixed Precision. https:\/\/images.nvidia.com\/content\/volta-architecture\/pdf\/volta-architecture-whitepaper.pdf"},{"key":"e_1_3_2_1_27_1","unstructured":"Yann Ollivier. 2017. True Asymptotic Natural Gradient Optimization. (2017). arXiv:1712.08449"},{"key":"e_1_3_2_1_28_1","unstructured":"Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. 2018. Second-order Optimization Method for Large Mini-batch: Training ResNet-50 on ImageNet in 35 Epochs. (2018). arXiv:1811.12019"},{"key":"e_1_3_2_1_29_1","unstructured":"Razvan Pascanu and Yoshua Bengio. 2014. Revisiting Natural Gradient for Deep Networks. (2014). arXiv:1301.3584"},{"volume-title":"Paleo: A Performance Model for Deep Neural Networks.","year":"2017","author":"Qi Hang","key":"e_1_3_2_1_30_1","unstructured":"Hang Qi, Evan R. Sparks, and Ameet Talwalkar. 2017. Paleo: A Performance Model for Deep Neural Networks. (2017)."},{"volume-title":"Proceedings of the 20th International Conference on Neural Information Processing Systems. 849--856","year":"2008","author":"Roux Nicolas L.","key":"e_1_3_2_1_31_1","unstructured":"Nicolas L. Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. 2008. Topmoumoute Online Natural Gradient Algorithm. In Proceedings of the 20th International Conference on Neural Information Processing Systems. 849--856."},{"volume-title":"Proceedings of the 27th International Conference on Machine Learning.","author":"Roux Nicolas Le","key":"e_1_3_2_1_32_1","unstructured":"Nicolas Le Roux and Andrew W. Fitzgibbon. 2010. A fast natural Newton method. In Proceedings of the 27th International Conference on Machine Learning."},{"volume-title":"Interspeech","year":"2014","author":"Seide Frank","key":"e_1_3_2_1_33_1","unstructured":"Frank Seide and Hao Fu. 2014. 1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs. In Interspeech 2014."},{"volume-title":"Dahl","year":"2018","author":"Shallue Christopher J.","key":"e_1_3_2_1_34_1","unstructured":"Christopher J. Shallue, Jaehoon Lee, Joseph M. Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. 2018. Measuring the Effects of Data Parallelism on Neural Network Training. (2018). arXiv:1811.03600"},{"volume-title":"Proceedings of the 6th International Conference on Learning Representations.","author":"Smith Samuel L.","key":"e_1_3_2_1_35_1","unstructured":"Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. 2018. Don't Decay the Learning Rate, Increase the Batch Size. In Proceedings of the 6th International Conference on Learning Representations."},{"volume-title":"Proceedings of Workshop on Machine Learning Systems in The 29th Annual Conference on Neural Information Processing Systems.","year":"2015","author":"Tokui Seiya","key":"e_1_3_2_1_36_1","unstructured":"Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. 2015. Chainer: a Next-Generation Open Source Framework for Deep Learning. In Proceedings of Workshop on Machine Learning Systems in The 29th Annual Conference on Neural Information Processing Systems."},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/CCGRID.2019.00057"},{"volume-title":"Proceedings of the 15th International Conference on Artificial Intelligence and Statistics.","year":"2012","author":"Vinyals Oriol","key":"e_1_3_2_1_38_1","unstructured":"Oriol Vinyals and Daniel Povey. 2012. Krylov Subspace Descent for Deep Learning. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics."},{"volume-title":"IEEE Hot Chips 20 Symposium. 1--71","author":"Williams S.","key":"e_1_3_2_1_39_1","unstructured":"S. Williams, D. Patterson, L. Oliker, J. Shalf, and K. Yelick. 2008. The roofline model: A pedagogical tool for program analysis and optimization. In IEEE Hot Chips 20 Symposium. 1--71."},{"key":"e_1_3_2_1_40_1","unstructured":"Saining Xie, Ross B. Girshick, Piotr Doll\u00e1r, Zhuowen Tu, and Kaiming He. 2016. Aggregated Residual Transformations for Deep Neural Networks. (2016). arXiv:1611.05431"},{"key":"e_1_3_2_1_41_1","unstructured":"Chris Ying, Sameer Kumar, Dehao Chen, Tao Wang, and Youlong Cheng. 2018. Image Classification at Supercomputer Scale. (2018). arXiv:1811.06992"},{"key":"e_1_3_2_1_42_1","unstructured":"Yang You, Igor Gitman, and Boris Ginsburg. 2017. Scaling SGD Batch Size to 32K for ImageNet Training. (2017). arXiv:1708.03888"}],"event":{"name":"ICPP 2019: Workshops","sponsor":["University of Tsukuba"],"location":"Kyoto, Japan","acronym":"ICPP 2019"},"container-title":["Workshop Proceedings of the 48th International Conference on Parallel Processing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3339186.3339202","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,15]],"date-time":"2023-01-15T00:33:35Z","timestamp":1673742815000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3339186.3339202"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,8,5]]},"references-count":42,"alternative-id":["10.1145\/3339186.3339202","10.1145\/3339186"],"URL":"http:\/\/dx.doi.org\/10.1145\/3339186.3339202","relation":{},"subject":[],"published":{"date-parts":[[2019,8,5]]},"assertion":[{"value":"2019-08-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}