{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,21]],"date-time":"2026-04-21T20:53:39Z","timestamp":1776804819433,"version":"3.51.2"},"reference-count":54,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2019,7,25]],"date-time":"2019-07-25T00:00:00Z","timestamp":1564012800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["SIGOPS Oper. Syst. Rev."],"published-print":{"date-parts":[[2019,7,25]]},"abstract":"<jats:p>Deep learning (DL) systems expose many tuning parameters (\"hyper-parameters\") that affect the performance and accuracy of trained models. Increasingly users struggle to configure hyper-parameters, and a substantial portion of time is spent tuning them empirically. We argue that future DL systems should be designed to help manage hyper-parameters. We describe how a distributed DL system can (i) remove the impact of hyper-parameters on both performance and accuracy, thus making it easier to decide on a good setting, and (ii) support more powerful dynamic policies for adapting hyper-parameters, which take monitored training metrics into account. We report results from prototype implementations that show the practicality of DL system designs that are hyper-parameter-friendly.<\/jats:p>","DOI":"10.1145\/3352020.3352029","type":"journal-article","created":{"date-parts":[[2019,7,26]],"date-time":"2019-07-26T13:17:18Z","timestamp":1564147038000},"page":"52-58","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":21,"title":["Taming Hyper-parameters in Deep Learning Systems"],"prefix":"10.1145","volume":"53","author":[{"given":"Luo","family":"Mai","sequence":"first","affiliation":[{"name":"Imperial College London, London, England UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Alexandros","family":"Koliousis","sequence":"additional","affiliation":[{"name":"Imperial College London, London, England UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Guo","family":"Li","sequence":"additional","affiliation":[{"name":"Imperial College London, London, England UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Andrei-Octavian","family":"Brabete","sequence":"additional","affiliation":[{"name":"Imperial College London, London, England UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Peter","family":"Pietzuch","sequence":"additional","affiliation":[{"name":"Imperial College London, London, England UK"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2019,7,25]]},"reference":[{"key":"e_1_2_1_1_1","first-page":"265","volume-title":"12th USENIX Symposium on Operating Systems Design and Implementation (OSDI'16)","author":"ABADI M.","year":"2016","unstructured":"ABADI , M. , BARHAM , P. , CHEN , J. , CHEN , Z. , DAVIS , A. , DEAN , J. , DEVIN , M. , GHEMAWAT , S. , IRVING , G. , ISARD , M. , KUDLUR , M. , LEVENBERG , J. , MONGA , R. , MOORE , S. , MURRAY , D. G. , STEINER , B. , TUCKER , P. , VASUDEVAN , V. , WARDEN , P. , WICKE , M. , YU , Y. , AND ZHENG , X. Tensorflow : A system for large-scale machine learning . In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI'16) ( 2016 ), pp. 265 -- 283 . ABADI, M., BARHAM, P., CHEN, J., CHEN, Z., DAVIS, A., DEAN, J., DEVIN, M., GHEMAWAT, S., IRVING, G., ISARD, M., KUDLUR, M., LEVENBERG, J., MONGA, R., MOORE, S., MURRAY, D. G., STEINER, B., TUCKER, P., VASUDEVAN, V., WARDEN, P., WICKE, M., YU, Y., AND ZHENG, X. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI'16) (2016), pp. 265--283."},{"key":"e_1_2_1_2_1","volume-title":"Can SGD learn recurrent neural networks with provable generalization? CoRR abs\/1902.01028","author":"ALLEN-ZHU Z.","year":"2019","unstructured":"ALLEN-ZHU , Z. , AND LI , Y. Can SGD learn recurrent neural networks with provable generalization? CoRR abs\/1902.01028 ( 2019 ). ALLEN-ZHU, Z., AND LI, Y. Can SGD learn recurrent neural networks with provable generalization? CoRR abs\/1902.01028 (2019)."},{"key":"e_1_2_1_3_1","volume-title":"Learning and generalization in overparameterized neural networks, going beyond two layers. CoRR abs\/1811.04918","author":"ALLEN-ZHU Z.","year":"2018","unstructured":"ALLEN-ZHU , Z. , LI , Y. , AND LIANG , Y. Learning and generalization in overparameterized neural networks, going beyond two layers. CoRR abs\/1811.04918 ( 2018 ). ALLEN-ZHU, Z., LI, Y., AND LIANG, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. CoRR abs\/1811.04918 (2018)."},{"key":"e_1_2_1_4_1","volume-title":"Amazon EC2 P3 Instance Product Details. https: \/\/aws.amazon.com\/ec2\/instance-types\/p3\/","author":"AMAZON.","year":"2019","unstructured":"AMAZON. Amazon EC2 P3 Instance Product Details. https: \/\/aws.amazon.com\/ec2\/instance-types\/p3\/ , 2019 . Online ; accessed: 2019-05--17. AMAZON. Amazon EC2 P3 Instance Product Details. https: \/\/aws.amazon.com\/ec2\/instance-types\/p3\/, 2019. Online; accessed: 2019-05--17."},{"key":"e_1_2_1_5_1","volume-title":"Amazon Spot Instance Prices. https:\/\/aws. amazon.com\/ec2\/spot\/pricing\/","author":"AMAZON.","year":"2019","unstructured":"AMAZON. Amazon Spot Instance Prices. https:\/\/aws. amazon.com\/ec2\/spot\/pricing\/ , 2019 . Online ; accessed: 2019-05--17. AMAZON. Amazon Spot Instance Prices. https:\/\/aws. amazon.com\/ec2\/spot\/pricing\/, 2019. Online; accessed: 2019-05--17."},{"key":"e_1_2_1_6_1","volume-title":"Deep voice: Realtime neural text-to-speech. CoRR abs\/1702.07825","author":"ARIK S.","year":"2017","unstructured":"ARIK , S. \u00a8O ., CHRZANOWSKI , M. , COATES , A. , DIAMOS , G. , GIBIANSKY , A. , KANG , Y. , LI , X. , MILLER , J. , RAIMAN , J. , SENGUPTA , S. , AND SHOEYBI , M. Deep voice: Realtime neural text-to-speech. CoRR abs\/1702.07825 ( 2017 ). ARIK, S. \u00a8O., CHRZANOWSKI, M., COATES, A., DIAMOS, G., GIBIANSKY, A., KANG, Y., LI, X., MILLER, J., RAIMAN, J., SENGUPTA, S., AND SHOEYBI, M. Deep voice: Realtime neural text-to-speech. CoRR abs\/1702.07825 (2017)."},{"key":"e_1_2_1_7_1","first-page":"3084","volume-title":"Proceedings of the 26th Interna- tional Conference on Neural Information Processing Systems -","volume":"2","author":"BA L. J.","year":"2013","unstructured":"BA , L. J. , AND FREY , B. Adaptive dropout for training deep neural networks . In Proceedings of the 26th Interna- tional Conference on Neural Information Processing Systems - Volume 2 (USA, 2013 ), NIPS'13, Curran Associates Inc. , pp. 3084 -- 3092 . BA, L. J., AND FREY, B. Adaptive dropout for training deep neural networks. In Proceedings of the 26th Interna- tional Conference on Neural Information Processing Systems - Volume 2 (USA, 2013), NIPS'13, Curran Associates Inc., pp. 3084--3092."},{"key":"e_1_2_1_8_1","volume-title":"Online learning rate adaptation with hypergradient descent. CoRR abs\/1703.04782","author":"BAYDIN A. G.","year":"2017","unstructured":"BAYDIN , A. G. , CORNISH , R. , MART\u00b4I NEZ-RUBIO , D. , SCHMIDT , M. , AND WOOD , F. D. Online learning rate adaptation with hypergradient descent. CoRR abs\/1703.04782 ( 2017 ). BAYDIN, A. G., CORNISH, R., MART\u00b4I NEZ-RUBIO, D., SCHMIDT, M., AND WOOD, F. D. Online learning rate adaptation with hypergradient descent. CoRR abs\/1703.04782 (2017)."},{"key":"e_1_2_1_9_1","doi-asserted-by":"crossref","unstructured":"BOTTOU L. On-line learning and stochastic approximations. In On-line Learning in Neural Networks D. Saad Ed. 1998.   BOTTOU L. On-line learning and stochastic approximations. In On-line Learning in Neural Networks D. Saad Ed. 1998.","DOI":"10.1017\/CBO9780511569920.003"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1137\/16M1080173"},{"key":"e_1_2_1_11_1","volume-title":"MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR abs\/1512.01274","author":"CHEN T.","year":"2015","unstructured":"CHEN , T. , LI , M. , LI , Y. , LIN , M. , WANG , N. , WANG , M. , XIAO , T. , XU , B. , ZHANG , C. , AND ZHANG , Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR abs\/1512.01274 ( 2015 ). CHEN, T., LI, M., LI, Y., LIN, M., WANG, N., WANG, M., XIAO, T., XU, B., ZHANG, C., AND ZHANG, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR abs\/1512.01274 (2015)."},{"key":"e_1_2_1_12_1","first-page":"1223","volume-title":"Advances in Neural Information Processing Systems 25","author":"DEAN J.","year":"2012","unstructured":"DEAN , J. , CORRADO , G. , MONGA , R. , CHEN , K. , DEVIN , M. , MAO , M. , AURELIO RANZATO , M. , SENIOR , A. , TUCKER , P. , YANG , K. , LE , Q. V. , AND NG , A. Y. Large Scale Distributed Deep Networks . In Advances in Neural Information Processing Systems 25 , F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds. Curran Associates, Inc. , 2012 , pp. 1223 -- 1231 . DEAN, J., CORRADO, G., MONGA, R., CHEN, K., DEVIN, M., MAO, M., AURELIO RANZATO, M., SENIOR, A., TUCKER, P., YANG, K., LE, Q. V., AND NG, A. Y. Large Scale Distributed Deep Networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1223--1231."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2018.112130030"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_2_1_15_1","volume-title":"BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs\/1810.04805","author":"DEVLIN J.","year":"2018","unstructured":"DEVLIN , J. , CHANG , M. , LEE , K. , AND TOUTANOVA , K. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs\/1810.04805 ( 2018 ). DEVLIN, J., CHANG, M., LEE, K., AND TOUTANOVA, K. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs\/1810.04805 (2018)."},{"key":"e_1_2_1_16_1","volume-title":"Journal of Machine Learning Research","author":"DUCHI J.","year":"2011","unstructured":"DUCHI , J. , HAZAN , E. , AND SINGER , Y. Adaptive subgradient methods for online learning and stochastic optimization . Journal of Machine Learning Research 12, Jul ( 2011 ), 2121-- 2159. DUCHI, J., HAZAN, E., AND SINGER, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, Jul (2011), 2121-- 2159."},{"key":"e_1_2_1_17_1","volume-title":"Neural architecture search: A survey. arXiv preprint arXiv:1808.05377","author":"ELSKEN T.","year":"2018","unstructured":"ELSKEN , T. , METZEN , J. H. , AND HUTTER , F. Neural architecture search: A survey. arXiv preprint arXiv:1808.05377 ( 2018 ). ELSKEN, T., METZEN, J. H., AND HUTTER, F. Neural architecture search: A survey. arXiv preprint arXiv:1808.05377 (2018)."},{"key":"e_1_2_1_18_1","first-page":"2755","volume-title":"Proceedings of the 28th International Conference on Neural Information Processing Systems -","volume":"2","author":"FEURER M.","year":"2015","unstructured":"FEURER , M. , KLEIN , A. , EGGENSPERGER , K. , SPRINGENBERG , J. T. , BLUM , M. , AND HUTTER , F. Efficient and robust automated machine learning . In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2 (Cambridge, MA, USA, 2015 ), NIPS'15, MIT Press , pp. 2755 -- 2763 . FEURER, M., KLEIN, A., EGGENSPERGER, K., SPRINGENBERG, J. T., BLUM, M., AND HUTTER, F. Efficient and robust automated machine learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2 (Cambridge, MA, USA, 2015), NIPS'15, MIT Press, pp. 2755--2763."},{"key":"e_1_2_1_19_1","first-page":"2672","volume-title":"Advances in neu- ral information processing systems","author":"GOODFELLOW I.","year":"2014","unstructured":"GOODFELLOW , I. , POUGET-ABADIE , J. , MIRZA , M. , XU , B. , WARDE-FARLEY , D. , OZAIR , S. , COURVILLE , A. , AND BENGIO , Y. Generative adversarial nets . In Advances in neu- ral information processing systems ( 2014 ), pp. 2672 -- 2680 . GOODFELLOW, I., POUGET-ABADIE, J., MIRZA, M., XU, B., WARDE-FARLEY, D., OZAIR, S., COURVILLE, A., AND BENGIO, Y. Generative adversarial nets. In Advances in neu- ral information processing systems (2014), pp. 2672--2680."},{"key":"e_1_2_1_20_1","volume-title":"Large Minibatch SGD: Training ImageNet in 1 Hour. CoRR abs\/1706.02677","author":"GOYAL P.","year":"2017","unstructured":"GOYAL , P. , DOLL\u00b4AR , P. , GIRSHICK , R. B. , NOORDHUIS , P. , WESOLOWSKI , L. , KYROLA , A. , TULLOCH , A. , JIA , Y. , AND HE , K. Accurate , Large Minibatch SGD: Training ImageNet in 1 Hour. CoRR abs\/1706.02677 ( 2017 ). GOYAL, P., DOLL\u00b4AR, P., GIRSHICK, R. B., NOORDHUIS, P., WESOLOWSKI, L., KYROLA, A., TULLOCH, A., JIA, Y., AND HE, K. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. CoRR abs\/1706.02677 (2017)."},{"key":"e_1_2_1_21_1","first-page":"1737","volume-title":"International Conference on Machine Learning","author":"GUPTA S.","year":"2015","unstructured":"GUPTA , S. , AGRAWAL , A. , GOPALAKRISHNAN , K. , AND NARAYANAN , P. Deep learning with limited numerical precision . In International Conference on Machine Learning ( 2015 ), pp. 1737 -- 1746 . GUPTA, S., AGRAWAL, A., GOPALAKRISHNAN, K., AND NARAYANAN, P. Deep learning with limited numerical precision. In International Conference on Machine Learning (2015), pp. 1737--1746."},{"key":"e_1_2_1_22_1","volume-title":"Deep residual learning for image recognition. CoRR abs\/1512.03385","author":"HE K.","year":"2015","unstructured":"HE , K. , ZHANG , X. , REN , S. , AND SUN , J. Deep residual learning for image recognition. CoRR abs\/1512.03385 ( 2015 ). HE, K., ZHANG, X., REN, S., AND SUN, J. Deep residual learning for image recognition. CoRR abs\/1512.03385 (2015)."},{"key":"e_1_2_1_23_1","first-page":"1731","volume-title":"Advances in Neural Informa- tion Processing Systems 30","author":"HOFFER E.","year":"2017","unstructured":"HOFFER , E. , HUBARA , I. , AND SOUDRY , D. Train longer, generalize better: closing the generalization gap in large batch training of neural networks . In Advances in Neural Informa- tion Processing Systems 30 , I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc. , 2017 , pp. 1731 -- 1741 . HOFFER, E., HUBARA, I., AND SOUDRY, D. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Informa- tion Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 1731--1741."},{"key":"e_1_2_1_24_1","volume-title":"Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. CoRR abs\/1807.11205","author":"JIA X.","year":"2018","unstructured":"JIA , X. , SONG , S. , HE , W. , WANG , Y. , RONG , H. , ZHOU , F. , XIE , L. , GUO , Z. , YANG , Y. , YU , L. , CHEN , T. , HU , G. , SHI , S. , AND CHU , X. Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. CoRR abs\/1807.11205 ( 2018 ). JIA, X., SONG, S., HE, W., WANG, Y., RONG, H., ZHOU, F., XIE, L., GUO, Z., YANG, Y., YU, L., CHEN, T., HU, G., SHI, S., AND CHU, X. Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. CoRR abs\/1807.11205 (2018)."},{"key":"e_1_2_1_25_1","volume-title":"Efficient neural architecture search with network morphism. CoRR abs\/1806.10282","author":"JIN H.","year":"2018","unstructured":"JIN , H. , SONG , Q. , AND HU , X. Efficient neural architecture search with network morphism. CoRR abs\/1806.10282 ( 2018 ). JIN, H., SONG, Q., AND HU, X. Efficient neural architecture search with network morphism. CoRR abs\/1806.10282 (2018)."},{"key":"e_1_2_1_26_1","volume-title":"On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. CoRR abs\/1609.04836","author":"KESKAR N. S.","year":"2016","unstructured":"KESKAR , N. S. , MUDIGERE , D. , NOCEDAL , J. , SMELYANSKIY , M. , AND TANG , P. T. P. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. CoRR abs\/1609.04836 ( 2016 ). KESKAR, N. S., MUDIGERE, D., NOCEDAL, J., SMELYANSKIY, M., AND TANG, P. T. P. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. CoRR abs\/1609.04836 (2016)."},{"key":"e_1_2_1_27_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980","author":"KINGMA D. P.","year":"2014","unstructured":"KINGMA , D. P. , AND BA , J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 ( 2014 ). KINGMA, D. P., AND BA, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)."},{"key":"e_1_2_1_28_1","volume-title":"CROSSBOW: scaling deep learning with small batch sizes on multi-gpu servers. CoRR abs\/1901.02244","author":"KOLIOUSIS A.","year":"2019","unstructured":"KOLIOUSIS , A. , WATCHARAPICHAT , P. , WEIDLICH , M. , MAI , L. , COSTA , P. , AND PIETZUCH , P. R. CROSSBOW: scaling deep learning with small batch sizes on multi-gpu servers. CoRR abs\/1901.02244 ( 2019 ). KOLIOUSIS, A., WATCHARAPICHAT, P., WEIDLICH, M., MAI, L., COSTA, P., AND PIETZUCH, P. R. CROSSBOW: scaling deep learning with small batch sizes on multi-gpu servers. CoRR abs\/1901.02244 (2019)."},{"key":"e_1_2_1_29_1","volume-title":"Convolutional deep belief networks on cifar-10","author":"KRIZHEVSKY A.","year":"2010","unstructured":"KRIZHEVSKY , A. Convolutional deep belief networks on cifar-10 , 2010 . KRIZHEVSKY, A. Convolutional deep belief networks on cifar-10, 2010."},{"key":"e_1_2_1_30_1","first-page":"9","article-title":"Springer Berlin Heidelberg, Berlin","author":"LECUN Y. A.","year":"2012","unstructured":"LECUN , Y. A. , BOTTOU , L. , ORR , G. B. , AND M\u00a8U LLER , K.-R. Efficient BackProp . Springer Berlin Heidelberg, Berlin , Heidelberg , 2012 , pp. 9 -- 48 . LECUN, Y. A., BOTTOU, L., ORR, G. B., AND M\u00a8U LLER, K.-R. Efficient BackProp. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 9--48.","journal-title":"Heidelberg"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.5555\/2685048.2685095"},{"key":"e_1_2_1_32_1","volume-title":"Deep gradient compression: Reducing the communication bandwidth for distributed training. CoRR abs\/1712.01887","author":"LIN Y.","year":"2017","unstructured":"LIN , Y. , HAN , S. , MAO , H. , WANG , Y. , AND DALLY , W. J. Deep gradient compression: Reducing the communication bandwidth for distributed training. CoRR abs\/1712.01887 ( 2017 ). LIN, Y., HAN, S., MAO, H., WANG, Y., AND DALLY, W. J. Deep gradient compression: Reducing the communication bandwidth for distributed training. CoRR abs\/1712.01887 (2017)."},{"key":"e_1_2_1_33_1","volume-title":"Optimizing neural networks with kronecker-factored approximate curvature. CoRR abs\/1503.05671","author":"MARTENS J.","year":"2015","unstructured":"MARTENS , J. , AND GROSSE , R. B. Optimizing neural networks with kronecker-factored approximate curvature. CoRR abs\/1503.05671 ( 2015 ). MARTENS, J., AND GROSSE, R. B. Optimizing neural networks with kronecker-factored approximate curvature. CoRR abs\/1503.05671 (2015)."},{"key":"e_1_2_1_34_1","volume-title":"Revisiting small batch training for deep neural networks. CoRR abs\/1804.07612","author":"MASTERS D.","year":"2018","unstructured":"MASTERS , D. , AND LUSCHI , C. Revisiting small batch training for deep neural networks. CoRR abs\/1804.07612 ( 2018 ). MASTERS, D., AND LUSCHI, C. Revisiting small batch training for deep neural networks. CoRR abs\/1804.07612 (2018)."},{"key":"e_1_2_1_35_1","volume-title":"An empirical model of large-batch training. arXiv preprint arXiv:1812.06162","author":"MCCANDLISH S.","year":"2018","unstructured":"MCCANDLISH , S. , KAPLAN , J. , AMODEI , D. , AND TEAM , O. D. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162 ( 2018 ). MCCANDLISH, S., KAPLAN, J., AMODEI, D., AND TEAM, O. D. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162 (2018)."},{"key":"e_1_2_1_36_1","volume-title":"Convergence analysis of distributed stochastic gradient descent with shuffling. arXiv preprint arXiv:1709.10432","author":"MENG Q.","year":"2017","unstructured":"MENG , Q. , CHEN , W. , WANG , Y. , MA , Z.-M. , AND LIU , T.-Y. Convergence analysis of distributed stochastic gradient descent with shuffling. arXiv preprint arXiv:1709.10432 ( 2017 ). MENG, Q., CHEN, W., WANG, Y., MA, Z.-M., AND LIU, T.-Y. Convergence analysis of distributed stochastic gradient descent with shuffling. arXiv preprint arXiv:1709.10432 (2017)."},{"key":"e_1_2_1_37_1","unstructured":"NVIDIA COLLECTIVE COMMUNICATIONS LIBRARY (NCCL) 2018. https:\/\/developer.nvidia.com\/nccl.  NVIDIA COLLECTIVE COMMUNICATIONS LIBRARY (NCCL) 2018. https:\/\/developer.nvidia.com\/nccl."},{"key":"e_1_2_1_38_1","unstructured":"NVLINK FABRIC 2018. https:\/\/www.nvidia. com\/en-us\/data-center\/nvlink\/.  NVLINK FABRIC 2018. https:\/\/www.nvidia. com\/en-us\/data-center\/nvlink\/."},{"key":"e_1_2_1_39_1","first-page":"1","volume-title":"CIDR","volume":"4","author":"PAVLO A.","year":"2017","unstructured":"PAVLO , A. , ANGULO , G. , ARULRAJ , J. , LIN , H. , LIN , J. , MA , L. , MENON , P. , MOWRY , T. C. , PERRON , M. , QUAH , I. , ET AL . Self-driving database management systems . In CIDR ( 2017 ), vol. 4 , p. 1 . PAVLO, A., ANGULO, G., ARULRAJ, J., LIN, H., LIN, J., MA, L., MENON, P., MOWRY, T. C., PERRON, M., QUAH, I., ET AL. Self-driving database management systems. In CIDR (2017), vol. 4, p. 1."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1016\/0041-5553(64)90137-5"},{"key":"e_1_2_1_41_1","volume-title":"New stochastic approximation type procedures. Avtomatica i Telemekhanika 7, 7 (01","author":"POLYAK B.","year":"1990","unstructured":"POLYAK , B. New stochastic approximation type procedures. Avtomatica i Telemekhanika 7, 7 (01 1990 ), 98--107. POLYAK, B. New stochastic approximation type procedures. Avtomatica i Telemekhanika 7, 7 (01 1990), 98--107."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1137\/0330046"},{"key":"e_1_2_1_43_1","volume-title":"The Prometheus monitoring system and time series database. https:\/\/github.com\/prometheus\/ prometheus","author":"PROMETHEUS.","year":"2019","unstructured":"PROMETHEUS. The Prometheus monitoring system and time series database. https:\/\/github.com\/prometheus\/ prometheus , 2019 . Online ; accessed: 2019-05--18. PROMETHEUS. The Prometheus monitoring system and time series database. https:\/\/github.com\/prometheus\/ prometheus, 2019. Online; accessed: 2019-05--18."},{"key":"e_1_2_1_44_1","unstructured":"PYTORCH 2018. https:\/\/pytorch.org.  PYTORCH 2018. https:\/\/pytorch.org."},{"key":"e_1_2_1_45_1","volume-title":"100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250","author":"RAJPURKAR P.","year":"2016","unstructured":"RAJPURKAR , P. , ZHANG , J. , LOPYREV , K. , AND LIANG , P. Squad : 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 ( 2016 ). RAJPURKAR, P., ZHANG, J., LOPYREV, K., AND LIANG, P. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016)."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1214\/aoms\/1177729586"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2945397"},{"key":"e_1_2_1_49_1","volume-title":"Don't decay the learning rate, increase the batch size. CoRR abs\/1711.00489","author":"SMITH S. L.","year":"2017","unstructured":"SMITH , S. L. , KINDERMANS , P. , AND LE , Q. V. Don't decay the learning rate, increase the batch size. CoRR abs\/1711.00489 ( 2017 ). SMITH, S. L., KINDERMANS, P., AND LE, Q. V. Don't decay the learning rate, increase the batch size. CoRR abs\/1711.00489 (2017)."},{"key":"e_1_2_1_50_1","first-page":"2951","volume-title":"Proceedings of the 25th International Conference on Neural Information Processing Systems -","volume":"2","author":"SNOEK J.","year":"2012","unstructured":"SNOEK , J. , LAROCHELLE , H. , AND ADAMS , R. P. Practical bayesian optimization of machine learning algorithms . In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2 (USA, 2012 ), NIPS'12, Curran Associates Inc. , pp. 2951 -- 2959 . SNOEK, J., LAROCHELLE, H., AND ADAMS, R. P. Practical bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2 (USA, 2012), NIPS'12, Curran Associates Inc., pp. 2951--2959."},{"key":"e_1_2_1_51_1","volume-title":"TensorFlow's Visualization Toolkit. https: \/\/github.com\/tensorflow\/tensorboard","author":"TENSORFLOW.","year":"2019","unstructured":"TENSORFLOW. TensorFlow's Visualization Toolkit. https: \/\/github.com\/tensorflow\/tensorboard , 2019 . Online ; accessed: 2019-05--18. TENSORFLOW. TensorFlow's Visualization Toolkit. https: \/\/github.com\/tensorflow\/tensorboard, 2019. Online; accessed: 2019-05--18."},{"key":"e_1_2_1_52_1","unstructured":"TENSORFLOW BENCHMARKS 2019. https:\/\/github. com\/tensorflow\/benchmarks.  TENSORFLOW BENCHMARKS 2019. https:\/\/github. com\/tensorflow\/benchmarks."},{"key":"e_1_2_1_53_1","volume-title":"Variance-based gradient compression for efficient distributed deep learning. CoRR abs\/1802.06058","author":"TSUZUKU Y.","year":"2018","unstructured":"TSUZUKU , Y. , IMACHI , H. , AND AKIBA , T. Variance-based gradient compression for efficient distributed deep learning. CoRR abs\/1802.06058 ( 2018 ). TSUZUKU, Y., IMACHI, H., AND AKIBA, T. Variance-based gradient compression for efficient distributed deep learning. CoRR abs\/1802.06058 (2018)."},{"key":"e_1_2_1_54_1","volume-title":"Reducing BERT pre-training time from 3 days to 76 minutes. CoRR abs\/1904.00962","author":"YOU Y.","year":"2019","unstructured":"YOU , Y. , LI , J. , HSEU , J. , SONG , X. , DEMMEL , J. , AND HSIEH , C. Reducing BERT pre-training time from 3 days to 76 minutes. CoRR abs\/1904.00962 ( 2019 ). YOU, Y., LI, J., HSEU, J., SONG, X., DEMMEL, J., AND HSIEH, C. Reducing BERT pre-training time from 3 days to 76 minutes. CoRR abs\/1904.00962 (2019)."},{"key":"e_1_2_1_55_1","volume-title":"Scaling SGD batch size to 32K for ImageNet training. CoRR abs\/1708.03888","author":"ZHANG J.","year":"2017","unstructured":"ZHANG , J. , AND MITLIAGKAS , I. Scaling SGD batch size to 32K for ImageNet training. CoRR abs\/1708.03888 ( 2017 ). {56} ZHANG, J. , AND MITLIAGKAS, I. YellowFin and the art of momentum tuning. CoRR abs\/1706.03471 (2017). ZHANG, J., AND MITLIAGKAS, I. Scaling SGD batch size to 32K for ImageNet training. CoRR abs\/1708.03888 (2017). {56} ZHANG, J., AND MITLIAGKAS, I. YellowFin and the art of momentum tuning. CoRR abs\/1706.03471 (2017)."}],"container-title":["ACM SIGOPS Operating Systems Review"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3352020.3352029","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3352020.3352029","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T00:26:15Z","timestamp":1750206375000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3352020.3352029"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,7,25]]},"references-count":54,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2019,7,25]]}},"alternative-id":["10.1145\/3352020.3352029"],"URL":"https:\/\/doi.org\/10.1145\/3352020.3352029","relation":{},"ISSN":["0163-5980"],"issn-type":[{"value":"0163-5980","type":"print"}],"subject":[],"published":{"date-parts":[[2019,7,25]]},"assertion":[{"value":"2019-07-25","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}