{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,11]],"date-time":"2026-04-11T09:14:58Z","timestamp":1775898898353,"version":"3.50.1"},"reference-count":280,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2019,8,30]],"date-time":"2019-08-30T00:00:00Z","timestamp":1567123200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100011264","name":"FP7 People: Marie-Curie Actions","doi-asserted-by":"crossref","award":["COFUND"],"award-info":[{"award-number":["COFUND"]}],"id":[{"id":"10.13039\/100011264","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100010663","name":"H2020 European Research Council","doi-asserted-by":"publisher","award":["678880"],"award-info":[{"award-number":["678880"]}],"id":[{"id":"10.13039\/100010663","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2020,7,31]]},"abstract":"<jats:p>Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge and techniques range from distributed algorithms to low-level circuit design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for its parallelization. We present trends in DNN architectures and the resulting implications on parallelization strategies. We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning. We discuss asynchronous stochastic optimization, distributed system architectures, communication schemes, and neural architecture search. 
Based on those approaches, we extrapolate potential directions for parallelism in deep learning.<\/jats:p>","DOI":"10.1145\/3320060","type":"journal-article","created":{"date-parts":[[2019,9,3]],"date-time":"2019-09-03T12:47:00Z","timestamp":1567514820000},"page":"1-43","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":414,"title":["Demystifying Parallel and Distributed Deep Learning"],"prefix":"10.1145","volume":"52","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3657-6568","authenticated-orcid":false,"given":"Tal","family":"Ben-Nun","sequence":"first","affiliation":[{"name":"ETH Zurich, Z\u00fcrich, Switzerland"}]},{"given":"Torsten","family":"Hoefler","sequence":"additional","affiliation":[{"name":"ETH Zurich, Z\u00fcrich, Switzerland"}]}],"member":"320","published-online":{"date-parts":[[2019,8,30]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"M. Abadi et al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Retrieved from http:\/\/www.tensorflow.org."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","unstructured":"A. Agarwal and J. C. Duchi. 2011. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems 24. MIT Press 873--881.","DOI":"10.5555\/2986459.2986557"},{"key":"e_1_2_1_3_1","doi-asserted-by":"crossref","unstructured":"A. F. Aji and K. Heafield. 2017. Sparse communication for distributed gradient descent. arxiv:1704.05021","DOI":"10.18653\/v1\/D17-1045"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2015.2474396"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.5555\/3294771.3294934"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.5555\/3045390.3045410"},{"key":"e_1_2_1_7_1","unstructured":"J. Appleyard T. Kocisk\u00fd and P. Blunsom. 2016. Optimizing performance of recurrent neural networks on GPUs. 
arxiv:1604.01946"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/277651.277678"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3018743.3018769"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","unstructured":"J. Ba and R. Caruana. 2014. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems 27. MIT Press 2654--2662.","DOI":"10.5555\/2969033.2969123"},{"key":"e_1_2_1_11_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201917)","author":"Ba J.","unstructured":"J. Ba, R. Grosse, and J. Martens. 2017. Distributed second-order optimization using kronecker-factored approximations. In Proceedings of the International Conference on Learning Representations (ICLR\u201917)."},{"key":"e_1_2_1_12_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201917)","author":"Baker B.","unstructured":"B. Baker, O. Gupta, N. Naik, and R. Raskar. 2017. Designing neural network architectures using reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR\u201917)."},{"key":"e_1_2_1_13_1","unstructured":"B. Baker O. Gupta R. Raskar and N. Naik. 2017. Practical neural network performance prediction for early stopping. arxiv:1705.10823."},{"key":"e_1_2_1_14_1","volume-title":"Sandia Report SAND2018-12790","author":"Barrett B. W.","year":"2018","unstructured":"B. W. Barrett et al. 2018. The Portals 4.2 Network Programming Interface. Sandia Report SAND2018-12790. Technical Report."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2015.30"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2807591.2807611"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-39593-2_1"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","unstructured":"Y. Bengio P. Lamblin D. Popovici and H. Larochelle. 2007. 
Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19. MIT Press 153--160.","DOI":"10.5555\/2976456.2976476"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/72.279181"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","unstructured":"P. Blanchard E. M. El Mhamdi R. Guerraoui and J. Stainer. 2017. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems 30. MIT Press 119--129.","DOI":"10.5555\/3294771.3294783"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/324133.324234"},{"key":"e_1_2_1_22_1","unstructured":"M. Bojarski D. Del Testa D. Dworakowski B. Firner B. Flepp P. Goyal L. D. Jackel M. Monfort U. Muller J. Zhang X. Zhang J. Zhao and K. Zieba. 2016. End to end learning for self-driving cars. arxiv:1604.07316"},{"key":"e_1_2_1_23_1","unstructured":"L. Bottou F. E. Curtis and J. Nocedal. 2016. Optimization methods for large-scale machine learning. arxiv:1606.04838"},{"key":"e_1_2_1_24_1","volume-title":"Proceedings of the IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies","volume":"3","author":"Boyd S.","unstructured":"S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. 2005. Gossip algorithms: Design, analysis and applications. In Proceedings of the IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies, vol. 3. 1653--1664."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1561\/2200000016"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/321812.321815"},{"key":"e_1_2_1_27_1","volume-title":"SMASH: One-shot model architecture search through HyperNetworks. arxiv:1708.05344.","author":"Brock A.","year":"2017","unstructured":"A. Brock, T. Lim, J. M. Ritchie, and N. Weston. 2017. SMASH: One-shot model architecture search through HyperNetworks. 
arxiv:1708.05344."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1137\/140954362"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.5555\/1285358.1285359"},{"key":"e_1_2_1_30_1","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201916)","author":"Chan W.","unstructured":"W. Chan, N. Jaitly, Q. Le, and O. Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201916). 4960--4964."},{"key":"e_1_2_1_31_1","volume-title":"Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition.","author":"Chellapilla K.","unstructured":"K. Chellapilla, S. Puri, and P. Simard. 2006. High performance convolutional neural networks for document processing. In Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition."},{"key":"e_1_2_1_32_1","doi-asserted-by":"crossref","unstructured":"C.-Y. Chen J. Choi D. Brand A. Agrawal W. Zhang and K. Gopalakrishnan. 2017. AdaComp : Adaptive residual gradient compression for data-parallel distributed training. arxiv:1712.02679.","DOI":"10.1609\/aaai.v32i1.11728"},{"key":"e_1_2_1_33_1","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201916)","author":"Chen K.","unstructured":"K. Chen and Q. Huo. 2016. Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201916). 5880--5884."},{"key":"e_1_2_1_34_1","volume-title":"TVM: End-to-end optimization stack for deep learning. arxiv:1802.04799.","author":"Chen T.","year":"2018","unstructured":"T. Chen et al. 2018. 
TVM: End-to-end optimization stack for deep learning. arxiv:1802.04799."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/2541940.2541967"},{"key":"e_1_2_1_36_1","unstructured":"T. Chen B. Xu C. Zhang and C. Guestrin. 2016. Training deep nets with sublinear memory cost. arxiv:1604.06174"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","unstructured":"Y. Chen J. Li H. Xiao X. Jin S. Yan and J. Feng. 2017. Dual path networks. In Advances in Neural Information Processing Systems 30. MIT Press 4470--4478.","DOI":"10.5555\/3294996.3295200"},{"key":"e_1_2_1_38_1","unstructured":"S. Chetlur et al. 2014. cuDNN: Efficient primitives for deep learning. arxiv:1410.0759."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.5555\/2685048.2685094"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1179"},{"key":"e_1_2_1_41_1","volume-title":"Xception: Deep learning with depthwise separable convolutions. arxiv:1610.02357","author":"Chollet F.","year":"2016","unstructured":"F. Chollet. 2016. Xception: Deep learning with depthwise separable convolutions. arxiv:1610.02357"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","unstructured":"C. Chu S. K. Kim Y. Lin Y. Yu G. Bradski K. Olukotun and A. Y. Ng. 2007. Map-reduce for machine learning on multicore. In Advances in Neural Information Processing Systems 19. MIT Press 281--288.","DOI":"10.5555\/2976456.2976492"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2016.2626289"},{"key":"e_1_2_1_44_1","volume-title":"Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention. 411--418","author":"Cire\u015fan D. C.","unstructured":"D. C. Cire\u015fan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. 2013. Mitosis detection in breast cancer histology images with deep neural networks. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention. 
411--418."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.5555\/3042817.3043086"},{"key":"e_1_2_1_46_1","volume-title":"Deep SimNets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201916)","author":"Cohen N.","unstructured":"N. Cohen, O. Sharir, and A. Shashua. 2016. Deep SimNets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201916). 4782--4791."},{"key":"e_1_2_1_47_1","volume-title":"Proceedings of the 29th Annual Conference on Learning Theory","volume":"49","author":"Cohen N.","unstructured":"N. Cohen, O. Sharir, and A. Shashua. 2016. On the expressive power of deep learning: A tensor analysis. In Proceedings of the 29th Annual Conference on Learning Theory, vol. 49. 698--728."},{"key":"e_1_2_1_48_1","volume-title":"Proceedings of the Workshop on Big Learning: Algorithms, Systems, and Tools for Learning at Scale (BigLearn\u201911)","author":"Collobert R.","unstructured":"R. Collobert, K. Kavukcuoglu, and C. Farabet. 2011. Torch7: A matlab-like environment for machine learning. In Proceedings of the Workshop on Big Learning: Algorithms, Systems, and Tools for Learning at Scale (BigLearn\u201911)."},{"key":"e_1_2_1_49_1","volume-title":"Proceedings of the International Conference on Artificial Neural Networks (ICANN\u201914)","author":"Cong J.","unstructured":"J. Cong and B. Xiao. 2014. Minimizing computation in convolutional neural networks. In Proceedings of the International Conference on Artificial Neural Networks (ICANN\u201914). 281--290."},{"key":"e_1_2_1_50_1","unstructured":"M. Courbariaux and Y. Bengio. 2016. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or &minus;1. 
arxiv:1602.02830"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969442.2969588"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/2901318.2901323"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","unstructured":"X. Cui W. Zhang Z. T\u00fcske and M. Picheny. 2018. Evolutionary stochastic gradient descent for optimization of deep neural networks. In Advances in Neural Information Processing Systems 31. MIT Press 6048--6058.","DOI":"10.5555\/3327345.3327504"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/155332.155333"},{"key":"e_1_2_1_55_1","unstructured":"J. Daily et al. 2018. GossipGraD: Scalable deep learning using gossip communication-based asynchronous gradient descent. arxiv:1803.05880."},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969442.2969538"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.5555\/2999134.2999271"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/1327452.1327492"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.5555\/2188385.2188391"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","unstructured":"O. Delalleau and Y. Bengio. 2011. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems 24. MIT Press 666--674.","DOI":"10.5555\/2986459.2986534"},{"key":"e_1_2_1_61_1","unstructured":"J. Demmel and G. Dinh. 2018. Communication-optimal convolutional neural nets. arxiv:1802.06905."},{"key":"e_1_2_1_62_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201909)","author":"Deng J.","unstructured":"J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201909)."},{"key":"e_1_2_1_63_1","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201912)","author":"Deng L.","unstructured":"L. Deng, D. Yu, and J. Platt. 2012. Scalable stacking and learning for building deep architectures. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201912). 2133--2136."},{"key":"e_1_2_1_64_1","unstructured":"T. Dettmers. 2015. 8-bit approximations for parallelism in deep learning. arxiv:1511.04561."},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.5555\/3045390.3045604"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.5555\/648054.743935"},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1016\/0743-7315(86)90020-1"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2019.00031"},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.5555\/3018874.3018875"},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.5555\/1953048.2021068"},{"key":"e_1_2_1_71_1","unstructured":"V. Dumoulin and F. Visin. 2016. A guide to convolution arithmetic for deep learning. arxiv:1603.07285."},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1207\/s15516709cog1402_1"},{"key":"e_1_2_1_73_1","unstructured":"T. Elsken J.-H. Metzen and F. Hutter. 2017. Simple and efficient architecture search for convolutional neural networks. arxiv:1711.04528."},{"key":"e_1_2_1_74_1","unstructured":"L. Ericson and R. Mbuvha. 2017. On the performance of network parallel training in artificial neural networks. arxiv:1701.05130."},{"key":"e_1_2_1_75_1","volume-title":"Proceedings of the 3rd International Conference on Algorithms and Architectures for Parallel Processing. 659--666","author":"Farber P.","unstructured":"P. Farber and K. Asanovic. 1997. 
Parallel neural network training on multi-spert. In Proceedings of the 3rd International Conference on Algorithms and Architectures for Parallel Processing. 659--666."},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1093\/comjnl\/30.5.413"},{"key":"e_1_2_1_77_1","unstructured":"K. Frans J. Ho X. Chen P. Abbeel and J. Schulman. 2017. Meta learning shared hierarchies. arxiv:1710.09767."},{"key":"e_1_2_1_78_1","unstructured":"M. P. Friedlander and M. W. Schmidt. 2011. Hybrid deterministic-stochastic methods for data fitting. arxiv:1104.2373."},{"key":"e_1_2_1_79_1","unstructured":"A. Gaunt M. Johnson M. Riechert D. Tarlow R. Tomioka D. Vytiniotis and S. Webster. 2017. AMPNet: Asynchronous model-parallel training for dynamic neural networks. arxiv:1705.09786."},{"key":"e_1_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.1145\/3210377.3210394"},{"key":"e_1_2_1_81_1","volume-title":"Proceedings of the 13th International Conference on Artificial Intelligence and Statistics","volume":"9","author":"Glorot X.","unstructured":"X. Glorot and Y. Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, vol. 9. 249--256."},{"key":"e_1_2_1_82_1","volume-title":"Foundations of Genetic Algorithms","volume":"1","author":"Goldberg D. E.","unstructured":"D. E. Goldberg and K. Deb. 1991. A comparative analysis of selection schemes used in genetic algorithms. Foundations of Genetic Algorithms, vol. 1. Elsevier, 69--93."},{"key":"e_1_2_1_83_1","unstructured":"Google. 2017. TensorFlow XLA Overview. Retrieved from https:\/\/www.tensorflow.org\/performance\/xla."},{"key":"e_1_2_1_84_1","volume-title":"SGD: Training ImageNet in 1 Hour. arxiv:1706.02677.","author":"Goyal P.","year":"2017","unstructured":"P. Goyal, P. Doll\u00e1r, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. 2017. 
Accurate, large minibatch SGD: Training ImageNet in 1 Hour. arxiv:1706.02677."},{"key":"e_1_2_1_85_1","doi-asserted-by":"crossref","unstructured":"A. Graves et al. 2016. Hybrid computing using a neural network with dynamic external memory. Nature 538 7626 (2016) 471--476.","DOI":"10.1038\/nature20101"},{"key":"e_1_2_1_86_1","doi-asserted-by":"publisher","unstructured":"W. Gropp T. Hoefler R. Thakur and E. Lusk. 2014. Using Advanced MPI: Modern Features of the Message-Passing Interface. MIT Press.","DOI":"10.5555\/2717108"},{"key":"e_1_2_1_87_1","doi-asserted-by":"publisher","unstructured":"A. Gruslys R. Munos I. Danihelka M. Lanctot and A. Graves. 2016. Memory-efficient backpropagation through time. In Advances in Neural Information Processing Systems 29. MIT Press 4125--4133.","DOI":"10.5555\/3157382.3157559"},{"key":"e_1_2_1_88_1","doi-asserted-by":"publisher","DOI":"10.5555\/3045118.3045303"},{"key":"e_1_2_1_89_1","volume-title":"Proceedings of the IEEE 16th International Conference on Data Mining (ICDM\u201916)","author":"Gupta S.","unstructured":"S. Gupta, W. Zhang, and F. Wang. 2016. Model accuracy and runtime tradeoff in distributed deep learning: A systematic study. In Proceedings of the IEEE 16th International Conference on Data Mining (ICDM\u201916). 171--180."},{"key":"e_1_2_1_90_1","volume-title":"Omnivore: An optimizer for multi-device deep learning on CPUs and GPUs. arxiv:1606.04487.","author":"Hadjis S.","year":"2016","unstructured":"S. Hadjis, C. Zhang, I. Mitliagkas, and C. R\u00e9. 2016. Omnivore: An optimizer for multi-device deep learning on CPUs and GPUs. arxiv:1606.04487."},{"key":"e_1_2_1_91_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201916)","author":"Han S.","year":"2016","unstructured":"S. Han, H. Mao, and W. J. Dally. 2016. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. 
In Proceedings of the International Conference on Learning Representations (ICLR\u201916) (2016)."},{"key":"e_1_2_1_92_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201918)","author":"Hazan E.","unstructured":"E. Hazan, A. Klivans, and Y. Yuan. 2018. Hyperparameter optimization: A spectral approach. In Proceedings of the International Conference on Learning Representations (ICLR\u201918)."},{"key":"e_1_2_1_93_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.123"},{"key":"e_1_2_1_94_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201916)","author":"He K.","unstructured":"K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201916). 770--778."},{"key":"e_1_2_1_95_1","volume-title":"Proceedings of the AAAI Workshops.","author":"He X.","unstructured":"X. He, D. Mudigere, M. Smelyanskiy, and M. Takac. 2017. Distributed hessian-free optimization for deep neural network. In Proceedings of the AAAI Workshops."},{"key":"e_1_2_1_96_1","unstructured":"G. Hinton. 2012. Neural Networks for Machine Learning Lecture 6a: Overview of Mini-batch Gradient Descent."},{"key":"e_1_2_1_97_1","volume-title":"Proceedings of the NIPS Deep Learning and Representation Learning Workshop.","author":"Hinton G.","unstructured":"G. Hinton, O. Vinyals, and J. Dean. 2015. Distilling the knowledge in a neural network. 
In Proceedings of the NIPS Deep Learning and Representation Learning Workshop."},{"key":"e_1_2_1_98_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.2006.18.7.1527"},{"key":"e_1_2_1_99_1","doi-asserted-by":"publisher","DOI":"10.5555\/2999611.2999748"},{"key":"e_1_2_1_100_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_2_1_101_1","volume-title":"Proceedings of the 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS\u201917)","author":"Hoefler T.","unstructured":"T. Hoefler, A. Barak, A. Shiloh, and Z. Drezner. 2017. Corrected gossip algorithms for fast reliable broadcast on unreliable systems. In Proceedings of the 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS\u201917)."},{"key":"e_1_2_1_102_1","doi-asserted-by":"publisher","DOI":"10.1145\/1362622.1362692"},{"key":"e_1_2_1_103_1","doi-asserted-by":"publisher","DOI":"10.14529\/jsfi140204"},{"key":"e_1_2_1_104_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2012.86"},{"key":"e_1_2_1_105_1","doi-asserted-by":"publisher","DOI":"10.1145\/1995896.1995909"},{"key":"e_1_2_1_106_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2009.5160935"},{"key":"e_1_2_1_107_1","doi-asserted-by":"publisher","unstructured":"E. Hoffer I. Hubara and D. Soudry. 2017. Train longer generalize better: Closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems 30. MIT Press 1729--1739.","DOI":"10.5555\/3294771.3294936"},{"key":"e_1_2_1_108_1","unstructured":"A. G. Howard M. Zhu B. Chen D. Kalenichenko W. Wang T. Weyand M. Andreetto and H. Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. 
arxiv:1704.04861."},{"key":"e_1_2_1_109_1","doi-asserted-by":"publisher","DOI":"10.5555\/3154630.3154682"},{"key":"e_1_2_1_110_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.","author":"Huang G.","unstructured":"G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_2_1_111_1","unstructured":"Y. Huang et al. 2018. GPipe: Efficient training of giant neural networks using pipeline parallelism. arxiv:1811.06965."},{"key":"e_1_2_1_112_1","unstructured":"I. Hubara M. Courbariaux D. Soudry R. El-Yaniv and Y. Bengio. 2016. Quantized neural networks: Training neural networks with low precision weights and activations. arxiv:1609.07061."},{"key":"e_1_2_1_113_1","doi-asserted-by":"publisher","DOI":"10.1109\/JRPROC.1952.273898"},{"key":"e_1_2_1_114_1","unstructured":"F. N. Iandola M. W. Moskewicz K. Ashraf S. Han W. J. Dally and K. Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and &lt;1MB model size. arxiv:1602.07360."},{"key":"e_1_2_1_115_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201916)","author":"Iandola F. N.","unstructured":"F. N. Iandola, M. W. Moskewicz, K. Ashraf, and K. Keutzer. 2016. FireCaffe: Near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201916)."},{"key":"e_1_2_1_116_1","unstructured":"IBM. 2019. Engineering and Scientific Subroutine Library (ESSL). Version 6.2 Guide and Reference. Retrieved from https:\/\/www.ibm.com\/support\/knowledgecenter\/SSFHY8_6.2\/reference\/essl_reference_pdf.pdf."},{"key":"e_1_2_1_117_1","unstructured":"P. Ienne. 1993. Architectures for Neuro-Computers: Review and Performance Evaluation. Technical Report. 
EPFL Lausanne Switzerland."},{"key":"e_1_2_1_118_1","unstructured":"D. J. Im H. Ma C. D. Kim and G. W. Taylor. 2016. Generative adversarial parallelization. arxiv:1612.04021."},{"key":"e_1_2_1_119_1","volume-title":"Intel Math Kernel Library. Reference Manual","unstructured":"Intel. 2009. Intel Math Kernel Library. Reference Manual. Intel Corporation."},{"key":"e_1_2_1_120_1","unstructured":"Intel. 2017. MKL-DNN. Retrieved from https:\/\/01.org\/mkl-dnn."},{"key":"e_1_2_1_121_1","doi-asserted-by":"publisher","DOI":"10.5555\/3045118.3045167"},{"key":"e_1_2_1_122_1","unstructured":"M. Jaderberg et al. 2017. Population-based training of neural networks. arxiv:1711.09846."},{"key":"e_1_2_1_123_1","unstructured":"X. Jia et al. 2018. Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes. arxiv:1807.11205."},{"key":"e_1_2_1_124_1","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654889"},{"key":"e_1_2_1_125_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3035933"},{"key":"e_1_2_1_126_1","volume-title":"InProceedings of the ML Systems Workshop at NIPS.","author":"Jin P. H.","unstructured":"P. H. Jin, Q. Yuan, F. N. Iandola, and K. Keutzer. 2016. How to scale distributed deep learning? InProceedings of the ML Systems Workshop at NIPS."},{"key":"e_1_2_1_127_1","unstructured":"M. Johnson et al. 2016. Google\u2019s multilingual neural machine translation system: Enabling zero-shot translation. arxiv:1611.04558."},{"key":"e_1_2_1_128_1","doi-asserted-by":"publisher","unstructured":"R. Johnson and T. Zhang. 2013. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26. MIT Press 315--323.","DOI":"10.5555\/2999611.2999647"},{"key":"e_1_2_1_129_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},{"key":"e_1_2_1_130_1","unstructured":"L. Kaiser A. N. Gomez N. Shazeer A. Vaswani N. Parmar L. Jones and J. Uszkoreit. 2017. 
One model to learn them all. arxiv:1706.05137."},{"key":"e_1_2_1_131_1","doi-asserted-by":"publisher","unstructured":"K. Kandasamy W. Neiswanger J. Schneider B. Poczos and E. P. Xing. 2018. Neural architecture search with bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems 31. MIT Press 2016--2025.","DOI":"10.5555\/3326943.3327130"},{"key":"e_1_2_1_132_1","unstructured":"T. Karras T. Aila S. Laine and J. Lehtinen. 2017. Progressive growing of GANs for improved quality stability and variation. arxiv:1710.10196."},{"key":"e_1_2_1_133_1","doi-asserted-by":"publisher","DOI":"10.1145\/2834892.2834893"},{"key":"e_1_2_1_134_1","unstructured":"H. Kim et al. 2016. DeepSpark: Spark-based deep learning supporting asynchronous updates and caffe compatibility. arxiv:1602.08191."},{"key":"e_1_2_1_135_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201916)","author":"Kim Y.-D.","unstructured":"Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. 2016. Compression of deep convolutional neural networks for fast and low power mobile applications. In Proceedings of the International Conference on Learning Representations (ICLR\u201916)."},{"key":"e_1_2_1_136_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201915)","author":"Kingma D. P.","unstructured":"D. P. Kingma and J. Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR\u201915)."},{"key":"e_1_2_1_137_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR).","author":"Klein A.","unstructured":"A. Klein, S. Falkner, J. T. Springenberg, and F. Hutter. 2016. Learning curve prediction with Bayesian neural networks. 
In Proceedings of the International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_1_138_1","doi-asserted-by":"publisher","DOI":"10.5555\/3294771.3294937"},{"key":"e_1_2_1_139_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201918)","author":"Krishnan S.","unstructured":"S. Krishnan, Y. Xiao, and R. A. Saurous. 2018. Neumann optimizer: A practical optimization algorithm for deep neural networks. In Proceedings of the International Conference on Learning Representations (ICLR\u201918)."},{"key":"e_1_2_1_140_1","volume-title":"Learning Multiple Layers of Features from Tiny Images. Master\u2019s thesis","author":"Krizhevsky A.","unstructured":"A. Krizhevsky. 2009. Learning Multiple Layers of Features from Tiny Images. Master\u2019s thesis, University of Toronto, Canada."},{"key":"e_1_2_1_141_1","unstructured":"A. Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. arxiv:1404.5997."},{"key":"e_1_2_1_142_1","doi-asserted-by":"publisher","unstructured":"A. Krizhevsky I. Sutskever and G. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25. MIT Press 1097--1105.","DOI":"10.5555\/2999134.2999257"},{"key":"e_1_2_1_143_1","doi-asserted-by":"publisher","DOI":"10.1145\/3126908.3126916"},{"key":"e_1_2_1_144_1","unstructured":"G. Lacey G. W. Taylor and S. Areibi. 2016. Deep learning on FPGAs: Past present and future. arxiv:1602.04283."},{"key":"e_1_2_1_145_1","doi-asserted-by":"publisher","DOI":"10.1145\/357172.357176"},{"key":"e_1_2_1_146_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201916)","author":"Lavin A.","unstructured":"A. Lavin and S. Gray. 2016. Fast algorithms for convolutional neural networks. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201916)."},{"key":"e_1_2_1_147_1","doi-asserted-by":"publisher","DOI":"10.5555\/3104482.3104516"},{"key":"e_1_2_1_148_1","doi-asserted-by":"publisher","DOI":"10.5555\/3042573.3042641"},{"key":"e_1_2_1_149_1","doi-asserted-by":"crossref","unstructured":"Y. LeCun Y. Bengio and G. Hinton. 2015. Deep learning. Nature 521 7553 (2015) 436--444.","DOI":"10.1038\/nature14539"},{"key":"e_1_2_1_150_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1989.1.4.541"},{"key":"e_1_2_1_151_1","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},{"key":"e_1_2_1_152_1","doi-asserted-by":"publisher","unstructured":"H. Lee P. Pham Y. Largman and A. Y. Ng. 2009. Unsupervised feature learning for audio classification using convolutional deep belief networks. In Advances in Neural Information Processing Systems 22. MIT Press 1096--1104.","DOI":"10.5555\/2984093.2984217"},{"key":"e_1_2_1_153_1","unstructured":"S. Lee S. Purushwalkam M. Cogswell D. J. Crandall and D. Batra. 2015. Why M heads are better than one: Training a diverse ensemble of deep networks. arxiv:1511.06314."},{"key":"e_1_2_1_154_1","doi-asserted-by":"publisher","DOI":"10.5555\/3014904.3014977"},{"key":"e_1_2_1_155_1","doi-asserted-by":"crossref","unstructured":"D. Li X. Wang and D. Kong. 2017. DeepRebirth: Accelerating deep neural network execution on mobile devices. arxiv:1708.04728.","DOI":"10.1609\/aaai.v32i1.11876"},{"key":"e_1_2_1_156_1","unstructured":"F. Li and B. Liu. 2016. Ternary weight networks. arxiv:1605.04711."},{"key":"e_1_2_1_157_1","doi-asserted-by":"publisher","DOI":"10.5555\/2685048.2685095"},{"key":"e_1_2_1_158_1","unstructured":"T. Li J. Zhong J. Liu W. Wu and C. Zhang. 2017. Ease.ml: Towards multi-tenant resource sharing for machine learning workloads. arxiv:1708.07308."},{"key":"e_1_2_1_159_1","doi-asserted-by":"publisher","unstructured":"X. Lian et al. 2017. 
Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems 30. MIT Press 5336--5346.","DOI":"10.5555\/3295222.3295285"},{"key":"e_1_2_1_160_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969442.2969545"},{"key":"e_1_2_1_161_1","volume-title":"Proceedings of the 35th International Conference on Machine Learning (ICML\u201918)","author":"Lian X.","unstructured":"X. Lian, W. Zhang, C. Zhang, and J. Liu. 2018. Asynchronous decentralized parallel stochastic gradient descent. In Proceedings of the 35th International Conference on Machine Learning (ICML\u201918). 3043--3052."},{"key":"e_1_2_1_162_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201914)","author":"Lin M.","unstructured":"M. Lin, Q. Chen, and S. Yan. 2014. Network in network. In Proceedings of the International Conference on Learning Representations (ICLR\u201914)."},{"key":"e_1_2_1_163_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201918)","author":"Lin Y.","unstructured":"Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally. 2018. Deep gradient compression: Reducing the communication bandwidth for distributed training. In Proceedings of the International Conference on Learning Representations (ICLR\u201918)."},{"key":"e_1_2_1_164_1","doi-asserted-by":"crossref","unstructured":"C. Liu B. Zoph J. Shlens W. Hua L.-J. Li L. Fei-Fei A. Yuille J. Huang and K. Murphy. 2017. Progressive neural architecture search. arxiv:1712.00559.","DOI":"10.1007\/978-3-030-01246-5_2"},{"key":"e_1_2_1_165_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201918)","author":"Liu H.","unstructured":"H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu. 2018. Hierarchical representations for efficient architecture search. 
In Proceedings of the International Conference on Learning Representations (ICLR\u201918)."},{"key":"e_1_2_1_166_1","volume-title":"DARTS: Differentiable architecture search. arxiv:1806.09055.","author":"Liu H.","year":"2018","unstructured":"H. Liu, K. Simonyan, and Y. Yang. 2018. DARTS: Differentiable architecture search. arxiv:1806.09055."},{"key":"e_1_2_1_167_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201918)","author":"Liu X.","unstructured":"X. Liu, J. Pool, S. Han, and W. J. Dally. 2018. Efficient sparse-winograd convolutional neural networks. In Proceedings of the International Conference on Learning Representations (ICLR\u201918)."},{"key":"e_1_2_1_168_1","doi-asserted-by":"publisher","DOI":"10.1145\/3067695.3084211"},{"key":"e_1_2_1_169_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201917)","author":"Loshchilov I.","unstructured":"I. Loshchilov and F. Hutter. 2017. SGDR: Stochastic gradient descent with warm restarts. In Proceedings of the International Conference on Learning Representations (ICLR\u201917)."},{"key":"e_1_2_1_170_1","doi-asserted-by":"publisher","DOI":"10.5555\/3327757.3327879"},{"key":"e_1_2_1_171_1","doi-asserted-by":"publisher","DOI":"10.5555\/3104322.3104416"},{"key":"e_1_2_1_172_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201914)","author":"Mathieu M.","unstructured":"M. Mathieu, M. Henaff, and Y. LeCun. 2014. Fast training of convolutional networks through FFTs. In Proceedings of the International Conference on Learning Representations (ICLR\u201914)."},{"key":"e_1_2_1_173_1","volume-title":"MPI: A Message-Passing Interface Standard Version 3.1.","year":"2015","unstructured":"Message Passing Interface Forum. 2015. MPI: A Message-Passing Interface Standard Version 3.1. 
Retrieved from https:\/\/www.mpi-forum.org\/docs\/mpi-3.1\/mpi31-report.pdf."},{"key":"e_1_2_1_174_1","volume-title":"Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTERSPEECH\u201914)","author":"Miao Y.","unstructured":"Y. Miao, H. Zhang, and F. Metze. 2014. Distributed learning of multilingual DNN feature extractors using GPUs. In Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTERSPEECH\u201914). 830--834."},{"key":"e_1_2_1_175_1","doi-asserted-by":"crossref","unstructured":"R. Miikkulainen et al. 2017. Evolving deep neural networks. arxiv:1703.00548.","DOI":"10.1145\/3067695.3067716"},{"key":"e_1_2_1_176_1","unstructured":"H. Mikami et al. 2018. ImageNet\/ResNet-50 training in 224 seconds. arxiv:1811.05233."},{"key":"e_1_2_1_177_1","volume-title":"Proceedings of the 19th International Conference on Artificial Intelligence and Statistics","volume":"51","author":"Moritz P.","unstructured":"P. Moritz, R. Nishihara, and M. Jordan. 2016. A linearly-convergent stochastic L-BFGS algorithm. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, vol. 51. 249--258."},{"key":"e_1_2_1_178_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201916)","author":"Moritz P.","unstructured":"P. Moritz, R. Nishihara, I. Stoica, and M. I. Jordan. 2016. SparkNet: Training deep networks in spark. In Proceedings of the International Conference on Learning Representations (ICLR\u201916)."},{"key":"e_1_2_1_179_1","volume-title":"Proceedings of the IEEE International Conference on Neural Networks","volume":"6","author":"Muller U. A.","unstructured":"U. A. Muller and A. Gunzinger. 1994. Neural net simulation on parallel computers. In Proceedings of the IEEE International Conference on Neural Networks, vol. 6. 3961--3966."},{"key":"e_1_2_1_180_1","unstructured":"R. Negrinho and G. Gordon. 2017. 
DeepArchitect: Automatically designing and training deep architectures. arxiv:1704.08792."},{"key":"e_1_2_1_181_1","doi-asserted-by":"publisher","DOI":"10.1137\/070704277"},{"key":"e_1_2_1_182_1","first-page":"543","article-title":"A method of solving a convex programming problem with convergence rate O(1\/k)<sup>2<\/sup>","volume":"269","author":"Nesterov Y.","year":"1983","unstructured":"Y. Nesterov. 1983. A method of solving a convex programming problem with convergence rate O(1\/k)<sup>2<\/sup>. Soviet Math. Doklady 269 (1983), 543--547.","journal-title":"Soviet Math. Doklady"},{"key":"e_1_2_1_183_1","unstructured":"Netlib. 2019. Basic Linear Algebra Subprograms (BLAS). Retrieved from http:\/\/www.netlib.org\/blas."},{"key":"e_1_2_1_184_1","doi-asserted-by":"publisher","unstructured":"J. Ngiam Z. Chen D. Chia P. W. Koh Q. V. Le and A. Y. Ng. 2010. Tiled convolutional neural networks. In Advances in Neural Information Processing Systems 23. MIT Press 1279--1287.","DOI":"10.5555\/2997189.2997332"},{"key":"e_1_2_1_185_1","unstructured":"J. Nocedal and S. Wright. 2006. Numerical Optimization. Springer."},{"key":"e_1_2_1_186_1","volume-title":"Proceedings of the NIPS Workshop on Distributed Machine Learning and Matrix Computations.","author":"Noel C.","unstructured":"C. Noel and S. Osindero. 2014. Dogwild!\u2014Distributed hogwild for CPU & GPU. In Proceedings of the NIPS Workshop on Distributed Machine Learning and Matrix Computations."},{"key":"e_1_2_1_187_1","doi-asserted-by":"publisher","DOI":"10.1145\/3020078.3021740"},{"key":"e_1_2_1_188_1","unstructured":"NVIDIA. 2017. Programming Tensor Cores in CUDA 9. Retrieved from https:\/\/devblogs.nvidia.com\/programming-tensor-cores-cuda-9."},{"key":"e_1_2_1_189_1","unstructured":"NVIDIA. 2019. CUBLAS Library Documentation. Retrieved from http:\/\/docs.nvidia.com\/cuda\/cublas."},{"key":"e_1_2_1_190_1","unstructured":"C. Olah. 2015. Understanding LSTM Networks. 
Retrieved from http:\/\/colah.github.io\/posts\/2015-08-Understanding-LSTMs."},{"key":"e_1_2_1_191_1","unstructured":"K. Osawa et al. 2018. Second-order optimization method for large mini-batch: Training ResNet-50 on ImageNet in 35 epochs. arxiv:1811.12019."},{"key":"e_1_2_1_192_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W18-6301"},{"key":"e_1_2_1_193_1","doi-asserted-by":"publisher","DOI":"10.1109\/BigData.2016.7840590"},{"key":"e_1_2_1_194_1","volume-title":"Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER\u201918)","author":"Oyama Y.","unstructured":"Y. Oyama, T. Ben-Nun, T. Hoefler, and S. Matsuoka. 2018. Accelerating deep learning frameworks with micro-batches. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER\u201918)."},{"key":"e_1_2_1_195_1","unstructured":"PaddlePaddle. 2017. Elastic Deep Learning. Retrieved from https:\/\/github.com\/PaddlePaddle\/cloud\/tree\/develop\/doc\/edl."},{"key":"e_1_2_1_196_1","unstructured":"T. Paine et al. 2013. GPU asynchronous stochastic gradient descent to speed up neural network training. arxiv:1312.6186."},{"key":"e_1_2_1_197_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2009.191"},{"key":"e_1_2_1_198_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2008.09.002"},{"key":"e_1_2_1_199_1","unstructured":"F. Petroski Such et al. 2017. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arxiv:1712.06567."},{"key":"e_1_2_1_200_1","unstructured":"H. Pham M. Y. Guan B. Zoph Q. V. Le and J. Dean. 2018. Efficient neural architecture search via parameter sharing. arxiv:1802.03268."},{"key":"e_1_2_1_201_1","doi-asserted-by":"publisher","DOI":"10.1137\/0330046"},{"key":"e_1_2_1_202_1","unstructured":"D. Povey X. Zhang and S. Khudanpur. 2014. Parallel training of deep neural networks with natural gradient and parameter averaging. 
arxiv:1410.7455."},{"key":"e_1_2_1_203_1","doi-asserted-by":"crossref","unstructured":"R. Puri et al. 2018. Large scale language modeling: Converging on 40GB of text in four hours. arxiv:1808.01371.","DOI":"10.1109\/CAHPC.2018.8645935"},{"key":"e_1_2_1_204_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201917)","author":"Qi H.","unstructured":"H. Qi, E. R. Sparks, and A. Talwalkar. 2017. Paleo: A performance model for deep neural networks. In Proceedings of the International Conference on Learning Representations (ICLR\u201917)."},{"key":"e_1_2_1_205_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0893-6080(98)00116-6"},{"key":"e_1_2_1_206_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-24685-5_1"},{"key":"e_1_2_1_207_1","unstructured":"A. Rahimi and B. Recht. 2017. Reflections on random kitchen sinks. Retrieved from http:\/\/www.argmin.net\/2017\/12\/05\/kitchen-sinks NIPS Test of Time Award Talk."},{"key":"e_1_2_1_208_1","doi-asserted-by":"publisher","DOI":"10.1145\/1553374.1553486"},{"key":"e_1_2_1_209_1","doi-asserted-by":"publisher","DOI":"10.5555\/1689499.1689510"},{"key":"e_1_2_1_210_1","doi-asserted-by":"crossref","unstructured":"M. Rastegari V. Ordonez J. Redmon and A. Farhadi. 2016. XNOR-Net: ImageNet classification using binary convolutional neural networks. arxiv:1603.05279.","DOI":"10.1007\/978-3-319-46493-0_32"},{"key":"e_1_2_1_211_1","unstructured":"E. Real A. Aggarwal Y. Huang and Q. V. Le. 2018. Regularized evolution for image classifier architecture search. arxiv:1802.01548."},{"key":"e_1_2_1_212_1","doi-asserted-by":"publisher","DOI":"10.5555\/3305890.3305981"},{"key":"e_1_2_1_213_1","doi-asserted-by":"publisher","DOI":"10.5555\/2986459.2986537"},{"key":"e_1_2_1_214_1","doi-asserted-by":"crossref","unstructured":"C. Renggli D. Alistarh and T. Hoefler. 2018. SparCML: High-performance sparse communication for machine learning. 
arxiv:1802.08021.","DOI":"10.1145\/3295500.3356222"},{"key":"e_1_2_1_215_1","doi-asserted-by":"publisher","DOI":"10.1214\/aoms\/1177729586"},{"key":"e_1_2_1_216_1","doi-asserted-by":"publisher","unstructured":"T. Salimans and D. P. Kingma. 2016. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems 29. MIT Press 901--909.","DOI":"10.5555\/3157096.3157197"},{"key":"e_1_2_1_217_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2014-274"},{"key":"e_1_2_1_218_1","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201914)","author":"Seide F.","unstructured":"F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 2014. On parallelizability of stochastic gradient descent for speech DNNs. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201914). 235--239."},{"key":"e_1_2_1_219_1","doi-asserted-by":"publisher","unstructured":"S. Shalev-Shwartz and S. Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.","DOI":"10.5555\/2621980"},{"key":"e_1_2_1_220_1","unstructured":"C. J. Shallue et al. 2018. Measuring the effects of data parallelism on neural network training. arxiv:1811.03600."},{"key":"e_1_2_1_221_1","doi-asserted-by":"publisher","DOI":"10.5555\/3157096.3157102"},{"key":"e_1_2_1_222_1","doi-asserted-by":"publisher","DOI":"10.1145\/2810103.2813687"},{"key":"e_1_2_1_223_1","doi-asserted-by":"crossref","unstructured":"D. Silver J. Schrittwieser K. Simonyan I. Antonoglou A. Huang A. Guez T. Hubert L. Baker M. Lai A. Bolton et al. 2017. Mastering the game of go without human knowledge. Nature 550 7676 (2017) 354.","DOI":"10.1038\/nature24270"},{"key":"e_1_2_1_224_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201915)","author":"Simonyan K.","unstructured":"K. 
Simonyan and A. Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. Proceedings of the International Conference on Learning Representations (ICLR\u201915)."},{"key":"e_1_2_1_225_1","unstructured":"A. J. R. Simpson. 2015. Instant learning: Parallel deep neural networks and convolutional bootstrapping. arxiv:1505.05972."},{"key":"e_1_2_1_226_1","unstructured":"S. L. Smith P. Kindermans and Q. V. Le. 2017. Don\u2019t decay the learning rate increase the batch size. arxiv:1711.00489."},{"key":"e_1_2_1_227_1","doi-asserted-by":"publisher","DOI":"10.5555\/2999325.2999464"},{"key":"e_1_2_1_228_1","unstructured":"E. Solomonik and T. Hoefler. 2015. Sparse Tensor Algebra as a Parallel Programming Model. arxiv:1512.00066."},{"key":"e_1_2_1_229_1","volume-title":"Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA\u201917)","author":"Song M.","unstructured":"M. Song, Y. Hu, H. Chen, and T. Li. 2017. Towards pervasive and user satisfactory CNN across GPU microarchitectures. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA\u201917). 1--12."},{"key":"e_1_2_1_230_1","doi-asserted-by":"publisher","DOI":"10.1109\/78.205723"},{"key":"e_1_2_1_231_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF02165411"},{"key":"e_1_2_1_232_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2015-354"},{"key":"e_1_2_1_233_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2017.2761740"},{"key":"e_1_2_1_234_1","volume-title":"Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR\u201915)","author":"Szegedy C.","unstructured":"C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR\u201915)."},{"key":"e_1_2_1_235_1","unstructured":"G. Taylor R. Burmeister Z. Xu B. 
Singh A. Patel and T. Goldstein. 2016. Training neural networks without gradients: A scalable ADMM approach. arxiv:1605.02026"},{"key":"e_1_2_1_236_1","doi-asserted-by":"publisher","DOI":"10.1145\/2908080.2908105"},{"key":"e_1_2_1_237_1","doi-asserted-by":"publisher","DOI":"10.1109\/TAC.1986.1104412"},{"key":"e_1_2_1_238_1","doi-asserted-by":"publisher","DOI":"10.1145\/2834892.2834897"},{"key":"e_1_2_1_239_1","volume-title":"Proceedings of the Deep Learning and Unsupervised Feature Learning Workshop (NIPS\u201911)","author":"Vanhoucke V.","unstructured":"V. Vanhoucke, A. Senior, and M. Z. Mao. 2011. Improving the speed of neural networks on CPUs. In Proceedings of the Deep Learning and Unsupervised Feature Learning Workshop (NIPS\u201911)."},{"key":"e_1_2_1_240_1","unstructured":"N. Vasilache et al. 2018. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. arxiv:1802.04730."},{"key":"e_1_2_1_241_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201915)","author":"Vasilache N.","unstructured":"N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun. 2015. Fast convolutional nets with fbfft: A GPU performance evaluation. In Proceedings of the International Conference on Learning Representations (ICLR\u201915)."},{"key":"e_1_2_1_242_1","doi-asserted-by":"crossref","unstructured":"A. Vasudevan A. Anderson and D. Gregg. 2017. Parallel multi channel convolution using general matrix multiplication. arxiv:1704.04428.","DOI":"10.1109\/ASAP.2017.7995254"},{"key":"e_1_2_1_243_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV.2015.71"},{"key":"e_1_2_1_244_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-017-1994-x"},{"key":"e_1_2_1_245_1","doi-asserted-by":"publisher","unstructured":"W. Wen C. Xu F. Yan C. Wu Y. Wang Y. Chen and H. Li. 2017. TernGrad: Ternary gradients to reduce communication in distributed deep learning. 
In Advances in Neural Information Processing Systems 30. MIT Press 1509--1519.","DOI":"10.5555\/3294771.3294915"},{"key":"e_1_2_1_246_1","doi-asserted-by":"publisher","DOI":"10.1109\/5.58337"},{"key":"e_1_2_1_247_1","doi-asserted-by":"publisher","DOI":"10.5555\/1096474"},{"key":"e_1_2_1_248_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF00992696"},{"key":"e_1_2_1_249_1","volume-title":"Arithmetic Complexity of Computations","author":"Winograd S.","unstructured":"S. Winograd. 1980. Arithmetic Complexity of Computations. Society for Industrial and Applied Mathematics."},{"key":"e_1_2_1_250_1","volume-title":"Genetic CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV\u201917)","author":"Xie L.","unstructured":"L. Xie and A. Yuille. 2017. Genetic CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV\u201917). 1388--1397."},{"key":"e_1_2_1_251_1","doi-asserted-by":"publisher","DOI":"10.5555\/3020948.3021030"},{"key":"e_1_2_1_252_1","doi-asserted-by":"publisher","DOI":"10.1109\/TBDATA.2015.2472014"},{"key":"e_1_2_1_253_1","unstructured":"K. Xu et al. 2015. Show attend and tell: Neural image caption generation with visual attention. arxiv:1502.03044."},{"key":"e_1_2_1_254_1","unstructured":"O. Yadan K. Adams Y. Taigman and M. Ranzato. 2013. Multi-GPU training of ConvNets. arxiv:1312.5853."},{"key":"e_1_2_1_255_1","doi-asserted-by":"publisher","DOI":"10.1145\/2783258.2783270"},{"key":"e_1_2_1_256_1","unstructured":"C. Ying et al. 2018. Image classification at supercomputer scale. arxiv:1811.06992."},{"key":"e_1_2_1_257_1","doi-asserted-by":"crossref","unstructured":"Y. You et al. 2019. Large-batch training for LSTM and beyond. arxiv:1901.08256.","DOI":"10.1145\/3295500.3356137"},{"key":"e_1_2_1_258_1","doi-asserted-by":"publisher","DOI":"10.1145\/3126908.3126912"},{"key":"e_1_2_1_259_1","unstructured":"Y. You I. Gitman and B. Ginsburg. 2017. Large batch training of convolutional networks. 
arxiv:1708.03888."},{"key":"e_1_2_1_260_1","unstructured":"Y. You Z. Zhang C. Hsieh and J. Demmel. 2017. 100-epoch ImageNet training with AlexNet in 24 minutes. arxiv:1709.05011"},{"key":"e_1_2_1_261_1","doi-asserted-by":"publisher","DOI":"10.1145\/3146347.3146355"},{"key":"e_1_2_1_262_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201916)","author":"Yu F.","unstructured":"F. Yu and V. Koltun. 2016. Multi-scale context aggregation by dilated convolutions. In Proceedings of the International Conference on Learning Representations (ICLR\u201916)."},{"key":"e_1_2_1_263_1","volume-title":"Proceedings of the IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS\u201916)","author":"Yu Y.","unstructured":"Y. Yu, J. Jiang, and X. Chi. 2016. Using supercomputer to speed up neural network training. In Proceedings of the IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS\u201916). 942--947."},{"key":"e_1_2_1_264_1","volume-title":"Poseidon: A system architecture for efficient GPU-based deep learning on multiple machines. arxiv:1512.06216","author":"Zhang H.","year":"2015","unstructured":"H. Zhang et al. 2015. Poseidon: A system architecture for efficient GPU-based deep learning on multiple machines. arxiv:1512.06216"},{"key":"e_1_2_1_265_1","doi-asserted-by":"publisher","DOI":"10.5555\/3154690.3154708"},{"key":"e_1_2_1_266_1","unstructured":"J. Zhang I. Mitliagkas and C. R\u00e9. 2017. YellowFin and the art of momentum tuning. arxiv:1706.03471"},{"key":"e_1_2_1_267_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2014.2319813"},{"key":"e_1_2_1_268_1","doi-asserted-by":"publisher","unstructured":"S. Zhang et al. 2015. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems 28. 
MIT Press 685--693.","DOI":"10.5555\/2969239.2969316"},{"key":"e_1_2_1_269_1","doi-asserted-by":"publisher","DOI":"10.5555\/3195638.3195662"},{"key":"e_1_2_1_270_1","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 6660--6663","author":"Zhang S.","unstructured":"S. Zhang, C. Zhang, Z. You, R. Zheng, and B. Xu. 2013. Asynchronous stochastic gradient descent for DNN training. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 6660--6663."},{"key":"e_1_2_1_271_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2017.161"},{"key":"e_1_2_1_272_1","doi-asserted-by":"publisher","DOI":"10.5555\/3060832.3060950"},{"key":"e_1_2_1_273_1","doi-asserted-by":"publisher","unstructured":"X. Zhang M. McKenna J. P. Mesirov and D. L. Waltz. 1990. An efficient implementation of the back-propagation algorithm on the connection machine CM-2. In Advances in Neural Information Processing Systems 2. MIT Press 801--809.","DOI":"10.5555\/109230.109324"},{"key":"e_1_2_1_274_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2014.36"},{"key":"e_1_2_1_275_1","unstructured":"Z. Zhong J. Yan and C.-L. Liu. 2017. Practical network blocks design with Q-Learning. arxiv:1708.05552"},{"key":"e_1_2_1_276_1","unstructured":"S. Zhou et al. 2016. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arxiv:1606.06160"},{"key":"e_1_2_1_277_1","doi-asserted-by":"publisher","DOI":"10.5555\/2997046.2997185"},{"key":"e_1_2_1_278_1","doi-asserted-by":"publisher","DOI":"10.5555\/3014904.3015002"},{"key":"e_1_2_1_279_1","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR\u201917)","author":"Zoph B.","unstructured":"B. Zoph and Q. V. Le. 2017. Neural architecture search with reinforcement learning. 
In Proceedings of the International Conference on Learning Representations (ICLR\u201917)."},{"key":"e_1_2_1_280_1","doi-asserted-by":"crossref","unstructured":"B. Zoph V. Vasudevan J. Shlens and Q. V. Le. 2017. Learning transferable architectures for scalable image recognition. arxiv:1707.07012","DOI":"10.1109\/CVPR.2018.00907"}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3320060","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3320060","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,25]],"date-time":"2025-06-25T13:27:22Z","timestamp":1750858042000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3320060"}},"subtitle":["An In-depth Concurrency Analysis"],"short-title":[],"issued":{"date-parts":[[2019,8,30]]},"references-count":280,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2020,7,31]]}},"alternative-id":["10.1145\/3320060"],"URL":"https:\/\/doi.org\/10.1145\/3320060","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,8,30]]},"assertion":[{"value":"2018-08-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-03-01","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-08-30","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}