{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,3]],"date-time":"2025-11-03T13:40:36Z","timestamp":1762177236939},"reference-count":73,"publisher":"Association for Computing Machinery (ACM)","issue":"11","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2019,7]]},"abstract":"<jats:p>Deep learning models are trained on servers with many GPUs, and training must scale with the number of GPUs. Systems such as TensorFlow and Caffe2 train models with parallel synchronous stochastic gradient descent: they process a batch of training data at a time, partitioned across GPUs, and average the resulting partial gradients to obtain an updated global model. To fully utilise all GPUs, systems must increase the batch size, which hinders statistical efficiency. Users tune hyper-parameters such as the learning rate to compensate for this, which is complex and model-specific.<\/jats:p>\n          <jats:p>\n            We describe Crossbow, a new single-server multi-GPU system for training deep learning models that enables users to freely choose their preferred batch size---however small---while scaling to multiple GPUs. Crossbow uses many parallel model replicas and avoids reduced statistical efficiency through a new synchronous training method. We introduce SMA, a synchronous variant of model averaging in which replicas\n            <jats:italic>independently<\/jats:italic>\n            explore the solution space with gradient descent, but adjust their search\n            <jats:italic>synchronously<\/jats:italic>\n            based on the trajectory of a globally-consistent average model. Crossbow achieves high hardware efficiency with small batch sizes by potentially training multiple model replicas per GPU, automatically tuning the number of replicas to maximise throughput. our experiments show that Crossbow improves the training time of deep learning models on an 8-GPU server by 1.3--4X compared to TensorFlow.\n          <\/jats:p>","DOI":"10.14778\/3342263.3342276","type":"journal-article","created":{"date-parts":[[2019,9,18]],"date-time":"2019-09-18T18:36:11Z","timestamp":1568831771000},"page":"1399-1412","source":"Crossref","is-referenced-by-count":43,"title":["Crossbow"],"prefix":"10.14778","volume":"12","author":[{"given":"Alexandros","family":"Koliousis","sequence":"first","affiliation":[{"name":"Imperial College London"}]},{"given":"Pijika","family":"Watcharapichat","sequence":"additional","affiliation":[{"name":"Microsoft Research"}]},{"given":"Matthias","family":"Weidlich","sequence":"additional","affiliation":[{"name":"Humboldt-Universit\u00e4t zu Berlin"}]},{"given":"Luo","family":"Mai","sequence":"additional","affiliation":[{"name":"Imperial College London"}]},{"given":"Paolo","family":"Costa","sequence":"additional","affiliation":[{"name":"Microsoft Research"}]},{"given":"Peter","family":"Pietzuch","sequence":"additional","affiliation":[{"name":"Imperial College London"}]}],"member":"320","published-online":{"date-parts":[[2019,7]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI)","author":"Abadi M.","year":"2016","unstructured":"M. Abadi , P. Barham , J. Chen , Z. Chen , A. Davis , J. Dean , M. Devin , S. Ghemawat , G. Irving , M. Isard , TensorFlow: A System for Large-Scale Machine Learning. 
{"key":"e_1_2_1_1_1","volume-title":"TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI)","author":"Abadi M.","year":"2016","unstructured":"M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016."},
{"key":"e_1_2_1_2_1","unstructured":"Amazon EC2 Instance Types 2017. https:\/\/aws.amazon.com\/ec2\/instance-types\/."},
{"key":"e_1_2_1_3_1","volume-title":"Feb.","author":"Arik S.","year":"2017","unstructured":"S. \u00d6. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, J. Raiman, S. Sengupta, and M. Shoeybi. Deep Voice: Real-time Neural Text-to-Speech. arXiv:1702.07825 {cs.CL}, Feb. 2017."},
{"key":"e_1_2_1_4_1","first-page":"9","volume-title":"On-line Learning in Neural Networks","author":"Bottou L.","year":"1998","unstructured":"L. Bottou. On-line Learning and Stochastic Approximations. In D. Saad, editor, On-line Learning in Neural Networks, pages 9--42. Cambridge University Press, New York, NY, USA, 1998."},
{"issue":"2","key":"e_1_2_1_5_1","first-page":"223","volume":"60","author":"Bottou L.","year":"2018","unstructured":"L. Bottou, F. Curtis, and J. Nocedal. Optimization Methods for Large-Scale Machine Learning. SIAM Review, 60(2):223--311, 2018.","journal-title":"SIAM Review"},
{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1561\/2200000016"},
{"key":"e_1_2_1_7_1","volume-title":"28th International Conference on Neural Information Processing Systems (NIPS)","author":"Chaturapruek S.","year":"2015","unstructured":"S. Chaturapruek, J. C. Duchi, and C. R\u00e9. Asynchronous Stochastic Convex Optimization: The Noise Is in the Noise and SGD Don't Care. In 28th International Conference on Neural Information Processing Systems (NIPS), 2015."},
{"key":"e_1_2_1_8_1","volume-title":"Apr.","author":"Chen J.","year":"2016","unstructured":"J. Chen, R. Monga, S. Bengio, and R. J\u00f3zefowicz. Revisiting Distributed Synchronous SGD. arXiv:1604.00981 {cs.LG}, Apr. 2016."},
{"key":"e_1_2_1_9_1","volume-title":"Dec.","author":"Chen T.","year":"2015","unstructured":"T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. arXiv:1512.01274 {cs.DC}, Dec. 2015."},
{"key":"e_1_2_1_10_1","volume-title":"The Loss Surfaces of Multilayer Networks. In 18th International Conference on Artificial Intelligence and Statistics (AISTATS)","author":"Choromanska A.","year":"2015","unstructured":"A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The Loss Surfaces of Multilayer Networks. In 18th International Conference on Artificial Intelligence and Statistics (AISTATS), 2015."},
{"key":"e_1_2_1_11_1","volume-title":"2014 USENIX Annual Technical Conference (ATC)","author":"Cui H.","year":"2014","unstructured":"H. Cui, J. Cipar, Q. Ho, J. K. Kim, S. Lee, A. Kumar, J. Wei, W. Dai, G. R. Ganger, P. B. Gibbons, G. A. Gibson, and E. P. Xing. Exploiting Bounded Staleness to Speed Up Big Data Analytics. In 2014 USENIX Annual Technical Conference (ATC), 2014."},
{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/2901318.2901323"},
{"key":"e_1_2_1_13_1","volume-title":"Large Scale Distributed Deep Networks. In 25th International Conference on Neural Information Processing Systems (NIPS)","author":"Dean J.","year":"2012","unstructured":"J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large Scale Distributed Deep Networks. In 25th International Conference on Neural Information Processing Systems (NIPS), 2012."},
{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2018.112130030"},
{"key":"e_1_2_1_15_1","volume-title":"Averaged Least-Mean-Squares: Bias-Variance Trade-offs and Optimal Sampling Distributions. In 18th International Conference on Artificial Intelligence and Statistics (AISTATS)","author":"Defossez A.","year":"2015","unstructured":"A. Defossez and F. Bach. Averaged Least-Mean-Squares: Bias-Variance Trade-offs and Optimal Sampling Distributions. In 18th International Conference on Artificial Intelligence and Statistics (AISTATS), 2015."},
{"key":"e_1_2_1_16_1","author":"Goyal P.","year":"2017","unstructured":"P. Goyal, P. Doll\u00e1r, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677 {cs.CV}, June 2017."},
{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/2987550.2987554"},
{"key":"e_1_2_1_18_1","volume-title":"Dec.","author":"He K.","year":"2015","unstructured":"K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. arXiv:1512.03385 {cs.CV}, Dec. 2015."},
{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSP.2012.2205597"},
{"key":"e_1_2_1_20_1","volume-title":"26th International Conference on Neural Information Processing Systems (NIPS)","author":"Ho Q.","year":"2013","unstructured":"Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server. In 26th International Conference on Neural Information Processing Systems (NIPS), 2013."},
{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.1.1"},
{"key":"e_1_2_1_22_1","volume-title":"Generalize Better: Closing the Generalization Gap in Large Batch Training of Neural Networks. In 30th International Conference on Neural Information Processing Systems (NIPS)","author":"Hoffer E.","year":"2017","unstructured":"E. Hoffer, I. Hubara, and D. Soudry. Train Longer, Generalize Better: Closing the Generalization Gap in Large Batch Training of Neural Networks. In 30th International Conference on Neural Information Processing Systems (NIPS), 2017."},
{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3187009.3177734"},
{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00070"},
{"issue":"223","key":"e_1_2_1_25_1","first-page":"1","article-title":"Parallelizing Stochastic Gradient Descent for Least Squares Regression: Mini-batching, Averaging, and Model Misspecification","volume":"18","author":"Jain P.","year":"2018","unstructured":"P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford. Parallelizing Stochastic Gradient Descent for Least Squares Regression: Mini-batching, Averaging, and Model Misspecification. Journal of Machine Learning Research, 18(223):1--42, 2018.","journal-title":"Journal of Machine Learning Research"},
{"key":"e_1_2_1_26_1","volume-title":"Nov.","author":"Jastrzebski S.","year":"2017","unstructured":"S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. J. Storkey. Three Factors Influencing Minima in SGD. arXiv:1711.04623 {cs.LG}, Nov. 2017."},
{"key":"e_1_2_1_27_1","volume-title":"July","author":"Jia X.","year":"2018","unstructured":"X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu, T. Chen, G. Hu, S. Shi, and X. Chu. Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes. arXiv:1807.11205 {cs.LG}, July 2018."},
{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654889"},
{"key":"e_1_2_1_29_1","volume-title":"Exploring Hidden Dimensions in Accelerating Convolutional Neural Networks. In 35th International Conference on Machine Learning (ICML)","author":"Jia Z.","year":"2018","unstructured":"Z. Jia, S. Lin, C. R. Qi, and A. Aiken. Exploring Hidden Dimensions in Accelerating Convolutional Neural Networks. In 35th International Conference on Machine Learning (ICML), 2018."},
{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3035933"},
{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00065"},
{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},
{"key":"e_1_2_1_33_1","volume-title":"Sept.","author":"Keskar N. S.","year":"2016","unstructured":"N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv:1609.04836 {cs.LG}, Sept. 2016."},
{"key":"e_1_2_1_34_1","volume-title":"Apr.","author":"Krizhevsky A.","year":"2014","unstructured":"A. Krizhevsky. One Weird Trick for Parallelizing Convolutional Neural Networks. arXiv:1404.5997 {cs.NE}, Apr. 2014."},
{"key":"e_1_2_1_35_1","volume-title":"ImageNet Classification with Deep Convolutional Neural Networks. In 25th International Conference on Neural Information Processing Systems (NIPS)","author":"Krizhevsky A.","year":"2012","unstructured":"A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In 25th International Conference on Neural Information Processing Systems (NIPS), 2012."},
{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},
{"key":"e_1_2_1_37_1","first-page":"9","volume-title":"Efficient BackProp","author":"LeCun Y.","year":"1998","unstructured":"Y. LeCun, L. Bottou, G. B. Orr, and K. R. M\u00fcller. Efficient BackProp, pages 9--50. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998."},
{"key":"e_1_2_1_38_1","volume-title":"Scaling Distributed Machine Learning with the Parameter Server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI)","author":"Li M.","year":"2014","unstructured":"M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling Distributed Machine Learning with the Parameter Server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014."},
{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/2623330.2623612"},
{"key":"e_1_2_1_40_1","volume-title":"Asynchronous Decentralized Parallel Stochastic Gradient Descent. In 35th International Conference on Machine Learning (ICML)","author":"Lian X.","year":"2018","unstructured":"X. Lian, W. Zhang, C. Zhang, and J. Liu. Asynchronous Decentralized Parallel Stochastic Gradient Descent. In 35th International Conference on Machine Learning (ICML), 2018."},
{"key":"e_1_2_1_41_1","volume-title":"Apr.","author":"Masters D.","year":"2018","unstructured":"D. Masters and C. Luschi. Revisiting Small Batch Training for Deep Neural Networks. arXiv:1804.07612 {cs.LG}, Apr. 2018."},
{"key":"e_1_2_1_42_1","volume-title":"Ray: A Distributed Framework for Emerging AI Applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI)","author":"Moritz P.","year":"2018","unstructured":"P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, and I. Stoica. Ray: A Distributed Framework for Emerging AI Applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018."},
{"key":"e_1_2_1_43_1","volume-title":"Accelerating Model Search with Model Batching. In 1st Conference on Systems and Machine Learning (SysML), SysML '18","author":"Narayanan D.","year":"2018","unstructured":"D. Narayanan, K. Santhanam, and M. Zaharia. Accelerating Model Search with Model Batching. In 1st Conference on Systems and Machine Learning (SysML), SysML '18, 2018."},
{"key":"e_1_2_1_44_1","first-page":"372","article-title":"A Method of Solving a Convex Programming Problem with Convergence Rate O(1\/k<sup>2<\/sup>)","volume":"27","author":"Nesterov Y.","year":"1983","unstructured":"Y. Nesterov. A Method of Solving a Convex Programming Problem with Convergence Rate O(1\/k<sup>2<\/sup>). Soviet Mathematics Doklady, 27:372--376, 1983.","journal-title":"Soviet Mathematics Doklady"},
{"key":"e_1_2_1_45_1","volume-title":"Distributed Machine Learning and Matrix Computations NIPS 2014 Workshop","author":"Noel C.","year":"2014","unstructured":"C. Noel and S. Osindero. Dogwild! - Distributed Hogwild for CPU and GPU. Distributed Machine Learning and Matrix Computations NIPS 2014 Workshop, 2014."},
{"key":"e_1_2_1_46_1","unstructured":"NVIDIA Collective Communications Library (NCCL) 2018. https:\/\/developer.nvidia.com\/nccl."},
{"key":"e_1_2_1_47_1","unstructured":"NVLink Fabric Multi-GPU Processing 2018. https:\/\/www.nvidia.com\/en-us\/data-center\/nvlink\/."},
{"key":"e_1_2_1_48_1","unstructured":"Octoputer 4U 10-GPU Server with Single Root Complex for GPU-Direct 2018. https:\/\/www.microway.com\/product\/octoputer-4u-10-gpu-server-single-root-complex\/."},
{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1016\/0041-5553(64)90137-5"},
{"key":"e_1_2_1_50_1","volume-title":"Jan.","author":"Polyak B.","year":"1990","unstructured":"B. Polyak. New Stochastic Approximation Type Procedures. Avtomatika i Telemekhanika, 7(7):98--107, Jan. 1990."},
{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1137\/0330046"},
{"key":"e_1_2_1_52_1","unstructured":"PyTorch 2018. https:\/\/pytorch.org."},
{"key":"e_1_2_1_53_1","volume-title":"Litz: Elastic Framework for High-Performance Distributed Machine Learning. In 2018 USENIX Annual Technical Conference (ATC)","author":"Qiao A.","year":"2018","unstructured":"A. Qiao, A. Aghayev, W. Yu, H. Chen, Q. Ho, G. A. Gibson, and E. P. Xing. Litz: Elastic Framework for High-Performance Distributed Machine Learning. In 2018 USENIX Annual Technical Conference (ATC), 2018."},
{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.14778\/3115404.3115405"},
{"key":"e_1_2_1_55_1","volume-title":"24th International Conference on Neural Information Processing Systems (NIPS)","author":"Recht B.","year":"2011","unstructured":"B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. In 24th International Conference on Neural Information Processing Systems (NIPS), 2011."},
{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1214\/aoms\/1177729586"},
{"key":"e_1_2_1_57_1","doi-asserted-by":"crossref","first-page":"318","DOI":"10.7551\/mitpress\/5236.001.0001","volume-title":"Parallel Distributed Processing: Explorations in the Microstructure of Cognition","author":"Rumelhart D. E.","year":"1986","unstructured":"D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning Internal Representations by Error Propagation. In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, pages 318--362. MIT Press, Cambridge, MA, USA, 1986."},
{"key":"e_1_2_1_58_1","volume-title":"Technical Report 781, School of Operations Research and Industrial Engineering","author":"Ruppert D.","year":"1988","unstructured":"D. Ruppert. Efficient Estimators from a Slowly Convergent Robbins-Monro Process. Technical Report 781, School of Operations Research and Industrial Engineering, Cornell University, Ithaca, New York 14853-7501, Feb. 1988."},
{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2945397"},
{"key":"e_1_2_1_60_1","volume-title":"Feb.","author":"Sergeev A.","year":"2018","unstructured":"A. Sergeev and M. D. Balso. Horovod: Fast and Easy Distributed Deep Learning in TensorFlow. arXiv:1802.05799 {cs.LG}, Feb. 2018."},
{"key":"e_1_2_1_61_1","volume-title":"Nov.","author":"Shallue C. J.","year":"2018","unstructured":"C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl. Measuring the Effects of Data Parallelism on Neural Network Training. arXiv:1811.03600 {cs.LG}, Nov. 2018."},
{"key":"e_1_2_1_62_1","volume-title":"Increase the Batch Size. arXiv:1711.00489 {cs.LG}","author":"Smith S. L.","year":"2017","unstructured":"S. L. Smith, P. Kindermans, and Q. V. Le. Don't Decay the Learning Rate, Increase the Batch Size. arXiv:1711.00489 {cs.LG}, Nov. 2017."},
{"key":"e_1_2_1_63_1","volume-title":"On the Importance of Initialization and Momentum in Deep Learning. In 30th International Conference on Machine Learning (ICML)","author":"Sutskever I.","year":"2013","unstructured":"I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the Importance of Initialization and Momentum in Deep Learning. In 30th International Conference on Machine Learning (ICML), 2013."},
{"key":"e_1_2_1_64_1","unstructured":"TensorFlow Benchmarks 2018. https:\/\/github.com\/tensorflow\/benchmarks."},
{"key":"e_1_2_1_65_1","unstructured":"VGG16 models for CIFAR-10 and CIFAR-100 using Keras 2018. https:\/\/github.com\/geifmany\/cifar-vgg."},
{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178487.3178491"},
{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1145\/2987550.2987586"},
{"key":"e_1_2_1_68_1","volume-title":"Jan.","author":"Xiong W.","year":"2017","unstructured":"W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig. The Microsoft 2016 Conversational Speech Recognition System. arXiv:1609.03528 {cs.CL}, Jan. 2017."},
{"key":"e_1_2_1_69_1","volume-title":"Dec.","author":"Xu W.","year":"2011","unstructured":"W. Xu. Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent. arXiv:1107.2490 {cs.LG}, Dec. 2011."},
{"key":"e_1_2_1_70_1","volume-title":"Sept.","author":"You Y.","year":"2017","unstructured":"Y. You, I. Gitman, and B. Ginsburg. Large Batch Training of Convolutional Networks. arXiv:1708.03888 {cs.CV}, Sept. 2017."},
{"key":"e_1_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732977.2733001"},
{"key":"e_1_2_1_72_1","volume-title":"June","author":"Zhang J.","year":"2017","unstructured":"J. Zhang and I. Mitliagkas. YellowFin and the Art of Momentum Tuning. arXiv:1706.03471 {stat.ML}, June 2017."},
{"key":"e_1_2_1_73_1","volume-title":"28th International Conference on Neural Information Processing Systems (NIPS)","author":"Zhang S.","year":"2015","unstructured":"S. Zhang, A. E. Choromanska, and Y. LeCun. Deep learning with Elastic Averaging SGD. In 28th International Conference on Neural Information Processing Systems (NIPS), 2015."}
],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3342263.3342276","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T09:57:34Z","timestamp":1672221454000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3342263.3342276"}},"subtitle":["scaling deep learning with small batch sizes on multi-GPU servers"],"short-title":[],"issued":{"date-parts":[[2019,7]]},"references-count":73,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2019,7]]}},"alternative-id":["10.14778\/3342263.3342276"],"URL":"https:\/\/doi.org\/10.14778\/3342263.3342276","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2019,7]]}}}
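
The abstract describes SMA only at a high level: replicas explore the loss surface independently with gradient descent, then adjust their search synchronously against a globally-consistent average model. The following is a minimal NumPy sketch of that idea under stated assumptions; the function name sma_step, the coupling strength alpha, the learning rate lr, and the symmetric pull between replicas and the average model are illustrative choices, not Crossbow's actual update rule (the paper defines SMA precisely, including momentum on the average model).

```python
import numpy as np

def sma_step(replicas, central, grads, lr=0.1, alpha=0.1):
    """One SMA-style step, as sketched in the abstract (hypothetical names).

    Each replica first takes an independent SGD step on its own small batch;
    then every replica is pulled toward the globally-consistent average
    model, and the average model moves along the accumulated corrections.
    """
    for r, g in zip(replicas, grads):
        r -= lr * g                 # independent exploration per replica
    for r in replicas:
        c = alpha * (r - central)   # synchronous correction toward the average
        r -= c                      # replica moves toward the average model
        central += c                # average model follows the replicas' trajectory
    return replicas, central

# Toy usage: two replicas of a 3-parameter model; random noise stands in
# for real per-batch gradients.
rng = np.random.default_rng(0)
replicas = [rng.standard_normal(3) for _ in range(2)]
central = np.mean(replicas, axis=0)
for _ in range(5):
    grads = [rng.standard_normal(3) for _ in replicas]
    replicas, central = sma_step(replicas, central, grads)
print(central)
```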