{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T13:35:03Z","timestamp":1768311303841,"version":"3.49.0"},"reference-count":46,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2020,9,30]],"date-time":"2020-09-30T00:00:00Z","timestamp":1601424000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"High-end Generic Chips and Basic Software","award":["2018ZX01028101"],"award-info":[{"award-number":["2018ZX01028101"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2020,12,31]]},"abstract":"<jats:p>The training of modern deep learning neural network calls for large amounts of computation, which is often provided by GPUs or other specific accelerators. To scale out to achieve faster training speed, two update algorithms are mainly applied in the distributed training process, i.e., the Synchronous SGD algorithm (SSGD) and Asynchronous SGD algorithm (ASGD). SSGD obtains good convergence point while the training speed is slowed down by the synchronous barrier. ASGD has faster training speed but the convergence point is lower when compared to SSGD. To sufficiently utilize the advantages of SSGD and ASGD, we propose a novel technology named One-step Delay SGD (OD-SGD) to combine their strengths in the training process. Therefore, we can achieve similar convergence point and training speed as SSGD and ASGD separately.<\/jats:p>\n          <jats:p>To the best of our knowledge, we make the first attempt to combine the features of SSGD and ASGD to improve distributed training performance. Each iteration of OD-SGD contains a global update in the parameter server node and local updates in the worker nodes, the local update is introduced to update and compensate the delayed local weights. We evaluate our proposed algorithm on MNIST, CIFAR-10, and ImageNet datasets. 
Experimental results show that OD-SGD can obtain similar or even slightly better accuracy than SSGD, while its training speed is much faster, which even exceeds the training speed of ASGD.<\/jats:p>","DOI":"10.1145\/3417607","type":"journal-article","created":{"date-parts":[[2020,9,30]],"date-time":"2020-09-30T11:23:50Z","timestamp":1601465030000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":12,"title":["OD-SGD"],"prefix":"10.1145","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3639-5449","authenticated-orcid":false,"given":"Yemao","family":"Xu","sequence":"first","affiliation":[{"name":"National University of Defense Technology, China"}]},{"given":"Dezun","family":"Dong","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]},{"given":"Yawei","family":"Zhao","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]},{"given":"Weixia","family":"Xu","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]},{"given":"Xiangke","family":"Liao","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]}],"member":"320","published-online":{"date-parts":[[2020,9,30]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi , Ashish Agarwal , Paul Barham , Eugene Brevdo , Zhifeng Chen , Craig Citro , Greg S. Corrado , Andy Davis , Jeffrey Dean , Matthieu Devin , Sanjay Ghemawat , Ian Goodfellow , Andrew Harp , Geoffrey Irving , Michael Isard , Yangqing Jia , Rafal Jozefowicz , Lukasz Kaiser , Manjunath Kudlur , Josh Levenberg , Dan Man\u00e9 , Rajat Monga , Sherry Moore , Derek Murray , Chris Olah , Mike Schuster , Jonathon Shlens , Benoit Steiner , Ilya Sutskever , Kunal Talwar , Paul Tucker , Vincent Vanhoucke , Vijay Vasudevan , Fernanda Vi\u00e9gas , Oriol Vinyals , Pete Warden , Martin Wattenberg , Martin Wicke , Yuan Yu , and Xiaoqiang Zheng . 2016 . Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016). Mart\u00edn Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Man\u00e9, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Vi\u00e9gas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)."},{"key":"e_1_2_1_2_1","volume-title":"Duchi","author":"Agarwal Alekh","year":"2011","unstructured":"Alekh Agarwal and John C . Duchi . 2011 . Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems . 873--881. Alekh Agarwal and John C. Duchi. 2011. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems. 
873--881."},{"key":"e_1_2_1_3_1","volume-title":"Extremely large minibatch SGD: Training Resnet-50 on Imagenet in 15 minutes. arXiv preprint arXiv:1711.04325","author":"Akiba Takuya","year":"2017","unstructured":"Takuya Akiba , Shuji Suzuki , and Keisuke Fukuda . 2017. Extremely large minibatch SGD: Training Resnet-50 on Imagenet in 15 minutes. arXiv preprint arXiv:1711.04325 ( 2017 ). Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. 2017. Extremely large minibatch SGD: Training Resnet-50 on Imagenet in 15 minutes. arXiv preprint arXiv:1711.04325 (2017)."},{"key":"e_1_2_1_4_1","volume-title":"QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems. 1709--1720.","author":"Alistarh Dan","year":"2017","unstructured":"Dan Alistarh , Demjan Grubic , Jerry Li , Ryota Tomioka , and Milan Vojnovic . 2017 . QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems. 1709--1720. Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. 2017. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems. 1709--1720."},{"key":"e_1_2_1_5_1","volume-title":"Stochastic gradient push for distributed deep learning. arXiv preprint arXiv:1811.10792","author":"Assran Mahmoud","year":"2018","unstructured":"Mahmoud Assran , Nicolas Loizou , Nicolas Ballas , and Michael Rabbat . 2018. Stochastic gradient push for distributed deep learning. arXiv preprint arXiv:1811.10792 ( 2018 ). Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, and Michael Rabbat. 2018. Stochastic gradient push for distributed deep learning. arXiv preprint arXiv:1811.10792 (2018)."},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 193--205","author":"Awan Ammar Ahmad","unstructured":"Ammar Ahmad Awan , Khaled Hamidouche , Jahanzeb Maqbool Hashmi , and Dhabaleswar K. Panda . 2017. S-Caffe: Co-designing MPI runtimes and Caffe for scalable deep learning on modern GPU clusters . In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 193--205 . Ammar Ahmad Awan, Khaled Hamidouche, Jahanzeb Maqbool Hashmi, and Dhabaleswar K. Panda. 2017. S-Caffe: Co-designing MPI runtimes and Caffe for scalable deep learning on modern GPU clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 193--205."},{"key":"e_1_2_1_7_1","volume-title":"Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981","author":"Chen Jianmin","year":"2016","unstructured":"Jianmin Chen , Xinghao Pan , Rajat Monga , Samy Bengio , and Rafal Jozefowicz . 2016. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981 ( 2016 ). Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. 2016. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981 (2016)."},{"key":"e_1_2_1_8_1","volume-title":"Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274","author":"Chen Tianqi","year":"2015","unstructured":"Tianqi Chen , Mu Li , Yutian Li , Min Lin , Naiyan Wang , Minjie Wang , Tianjun Xiao , Bing Xu , Chiyuan Zhang , and Zheng Zhang . 2015 . Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. 
arXiv preprint arXiv:1512.01274 (2015). Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015)."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2996864"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.58"},{"key":"e_1_2_1_11_1","volume-title":"Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory. In ACM SIGARCH Computer Architecture News","author":"Chi Ping","year":"2016","unstructured":"Ping Chi , Shuangchen Li , Cong Xu , Tao Zhang , Jishen Zhao , Yongpan Liu , Yu Wang , and Yuan Xie . 2016 . Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory. In ACM SIGARCH Computer Architecture News , Vol. 44 . IEEE Press , 27--39. Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. 2016. Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory. In ACM SIGARCH Computer Architecture News, Vol. 44. IEEE Press, 27--39."},{"key":"e_1_2_1_12_1","first-page":"571","article-title":"Project Adam: Building an efficient and scalable deep learning training system","volume":"14","author":"Chilimbi Trishul M.","year":"2014","unstructured":"Trishul M. Chilimbi , Yutaka Suzue , Johnson Apacible , and Karthik Kalyanaraman . 2014 . Project Adam: Building an efficient and scalable deep learning training system . In OSDI , Vol. 14. 571 -- 582 . Trishul M. Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project Adam: Building an efficient and scalable deep learning training system. In OSDI, Vol. 14. 571--582.","journal-title":"OSDI"},{"key":"e_1_2_1_13_1","volume-title":"arXiv preprint arXiv:1708.02188","author":"Cho Minsik","year":"2017","unstructured":"Minsik Cho , Ulrich Finkler , Sameer Kumar , David Kung , Vaibhav Saxena , and Dheeraj Sreedhar . 2017. Powerai DDL. arXiv preprint arXiv:1708.02188 ( 2017 ). Minsik Cho, Ulrich Finkler, Sameer Kumar, David Kung, Vaibhav Saxena, and Dheeraj Sreedhar. 2017. Powerai DDL. arXiv preprint arXiv:1708.02188 (2017)."},{"key":"e_1_2_1_14_1","volume-title":"Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train. arXiv preprint arXiv:1711.04291","author":"Codreanu Valeriu","year":"2017","unstructured":"Valeriu Codreanu , Damian Podareanu , and Vikram Saletore . 2017. Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train. arXiv preprint arXiv:1711.04291 ( 2017 ). Valeriu Codreanu, Damian Podareanu, and Vikram Saletore. 2017. Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train. arXiv preprint arXiv:1711.04291 (2017)."},{"key":"e_1_2_1_15_1","unstructured":"The CIFAR-10 dataset. 2019. https:\/\/www.cs.toronto.edu\/kriz\/cifar.html.  The CIFAR-10 dataset. 2019. https:\/\/www.cs.toronto.edu\/kriz\/cifar.html."},{"key":"e_1_2_1_16_1","volume-title":"High-accuracy low-precision training. arXiv preprint arXiv:1803.03383","author":"Sa Christopher De","year":"2018","unstructured":"Christopher De Sa , Megan Leszczynski , Jian Zhang , Alana Marzoev , Christopher R. 
Aberger , Kunle Olukotun , and Christopher R\u00e9. 2018. High-accuracy low-precision training. arXiv preprint arXiv:1803.03383 ( 2018 ). Christopher De Sa, Megan Leszczynski, Jian Zhang, Alana Marzoev, Christopher R. Aberger, Kunle Olukotun, and Christopher R\u00e9. 2018. High-accuracy low-precision training. arXiv preprint arXiv:1803.03383 (2018)."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_2_1_18_1","volume-title":"large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677","author":"Goyal Priya","year":"2017","unstructured":"Priya Goyal , Piotr Doll\u00e1r , Ross Girshick , Pieter Noordhuis , Lukasz Wesolowski , Aapo Kyrola , Andrew Tulloch , Yangqing Jia , and Kaiming He. 2017. Accurate , large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677 ( 2017 ). Priya Goyal, Piotr Doll\u00e1r, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)."},{"key":"e_1_2_1_19_1","unstructured":"CUDA GPUs. 2019. https:\/\/developer.nvidia.com\/cuda-gpus.  CUDA GPUs. 2019. https:\/\/developer.nvidia.com\/cuda-gpus."},{"key":"e_1_2_1_20_1","unstructured":"Cuda GPUs. 2019. https:\/\/developer.nvidia.com\/cuda-gpus.  Cuda GPUs. 2019. https:\/\/developer.nvidia.com\/cuda-gpus."},{"key":"e_1_2_1_21_1","volume-title":"International Conference on Machine Learning. 1737--1746","author":"Gupta Suyog","year":"2015","unstructured":"Suyog Gupta , Ankur Agrawal , Kailash Gopalakrishnan , and Pritish Narayanan . 2015 . Deep learning with limited numerical precision . In International Conference on Machine Learning. 1737--1746 . Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep learning with limited numerical precision. In International Conference on Machine Learning. 1737--1746."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_23_1","volume-title":"Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861","author":"Howard Andrew G.","year":"2017","unstructured":"Andrew G. Howard , Menglong Zhu , Bo Chen , Dmitry Kalenichenko , Weijun Wang , Tobias Weyand , Marco Andreetto , and Hartwig Adam . 2017 . Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017). Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)."},{"key":"e_1_2_1_24_1","volume-title":"et\u00a0al","author":"Jia Xianyan","year":"2018","unstructured":"Xianyan Jia , Shutao Song , Wei He , Yangzihao Wang , Haidong Rong , Feihu Zhou , Liqiang Xie , Zhenyu Guo , Yuanzhou Yang , Liwei Yu , et\u00a0al . 2018 . Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. arXiv preprint arXiv:1807.11205 (2018). Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, et\u00a0al. 2018. Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. 
arXiv preprint arXiv:1807.11205 (2018)."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654889"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2018.2833074"},{"key":"e_1_2_1_28_1","unstructured":"Yann LeCun. 1998. The MNIST database of handwritten digits. http:\/\/yann.lecun.com\/exdb\/mnist\/ (1998).  Yann LeCun. 1998. The MNIST database of handwritten digits. http:\/\/yann.lecun.com\/exdb\/mnist\/ (1998)."},{"key":"e_1_2_1_29_1","volume-title":"Dally","author":"Lin Yujun","year":"2017","unstructured":"Yujun Lin , Song Han , Huizi Mao , Yu Wang , and William J . Dally . 2017 . Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887 (2017). Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J. Dally. 2017. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887 (2017)."},{"key":"e_1_2_1_30_1","volume-title":"et\u00a0al","author":"Micikevicius Paulius","year":"2017","unstructured":"Paulius Micikevicius , Sharan Narang , Jonah Alben , Gregory Diamos , Erich Elsen , David Garcia , Boris Ginsburg , Michael Houston , Oleksii Kuchaiev , Ganesh Venkatesh , et\u00a0al . 2017 . Mixed precision training. arXiv preprint arXiv:1710.03740 (2017). Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et\u00a0al. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740 (2017)."},{"key":"e_1_2_1_31_1","volume-title":"Jordan","author":"Moritz Philipp","year":"2015","unstructured":"Philipp Moritz , Robert Nishihara , Ion Stoica , and Michael I . Jordan . 2015 . Sparknet : Training de ep networks in spark. arXiv preprint arXiv:1511.06051 (2015). Philipp Moritz, Robert Nishihara, Ion Stoica, and Michael I. Jordan. 2015. Sparknet: Training deep networks in spark. arXiv preprint arXiv:1511.06051 (2015)."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00063"},{"key":"e_1_2_1_33_1","unstructured":"PyTorch. 2019. https:\/\/pytorch.org\/features.  PyTorch. 2019. https:\/\/pytorch.org\/features."},{"key":"e_1_2_1_34_1","volume-title":"Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799","author":"Sergeev Alexander","year":"2018","unstructured":"Alexander Sergeev and Mike Del Balso . 2018 . Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018). Alexander Sergeev and Mike Del Balso. 2018. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018)."},{"key":"e_1_2_1_35_1","volume-title":"Le","author":"Smith Samuel L.","year":"2017","unstructured":"Samuel L. Smith , Pieter-Jan Kindermans , Chris Ying , and Quoc V . Le . 2017 . Don\u2019t de cay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489 (2017). Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. 2017. Don\u2019t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489 (2017)."},{"key":"e_1_2_1_36_1","volume-title":"Optimizing network performance for distributed DNN training on GPU clusters: ImageNet\/AlexNet training in 1.5 minutes. 
arXiv preprint arXiv:1902.06855","author":"Sun Peng","year":"2019","unstructured":"Peng Sun , Wansen Feng , Ruobing Han , Shengen Yan , and Yonggang Wen . 2019. Optimizing network performance for distributed DNN training on GPU clusters: ImageNet\/AlexNet training in 1.5 minutes. arXiv preprint arXiv:1902.06855 ( 2019 ). Peng Sun, Wansen Feng, Ruobing Han, Shengen Yan, and Yonggang Wen. 2019. Optimizing network performance for distributed DNN training on GPU clusters: ImageNet\/AlexNet training in 1.5 minutes. arXiv preprint arXiv:1902.06855 (2019)."},{"key":"e_1_2_1_37_1","unstructured":"Baidu. 2017. Bringing HPC Techniques to Deep Learning. https:\/\/andrew.gibiansky.com\/blog\/machine-learning\/baidu-allreduce\/.  Baidu. 2017. Bringing HPC Techniques to Deep Learning. https:\/\/andrew.gibiansky.com\/blog\/machine-learning\/baidu-allreduce\/."},{"key":"e_1_2_1_38_1","unstructured":"TPU. 2019. https:\/\/www.nextplatform.com\/2018\/05\/10\/tearing-apart-googles-tpu-3-0-ai-coprocessor\/.  TPU. 2019. https:\/\/www.nextplatform.com\/2018\/05\/10\/tearing-apart-googles-tpu-3-0-ai-coprocessor\/."},{"key":"e_1_2_1_39_1","volume-title":"Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems. 1509--1519.","author":"Wen Wei","year":"2017","unstructured":"Wei Wen , Cong Xu , Feng Yan , Chunpeng Wu , Yandan Wang , Yiran Chen , and Hai Li . 2017 . Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems. 1509--1519. Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2017. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems. 1509--1519."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/TBDATA.2015.2472014"},{"key":"e_1_2_1_41_1","first-page":"7","article-title":"SketchDLC: A sketch on distributed deep learning communication via trace capturing","volume":"16","author":"Xu Yemao","year":"2019","unstructured":"Yemao Xu , Dezun Dong , Weixia Xu , and Xiangke Liao . 2019 . SketchDLC: A sketch on distributed deep learning communication via trace capturing . ACM Transactions on Architecture and Code Optimization (TACO) 16 , 2 (2019), 7 . Yemao Xu, Dezun Dong, Weixia Xu, and Xiangke Liao. 2019. SketchDLC: A sketch on distributed deep learning communication via trace capturing. ACM Transactions on Architecture and Code Optimization (TACO) 16, 2 (2019), 7.","journal-title":"ACM Transactions on Architecture and Code Optimization (TACO)"},{"key":"e_1_2_1_42_1","volume-title":"Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888","author":"You Yang","year":"2017","unstructured":"Yang You , Igor Gitman , and Boris Ginsburg . 2017. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888 ( 2017 ). Yang You, Igor Gitman, and Boris Ginsburg. 2017. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888 (2017)."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3225058.3225069"},{"key":"e_1_2_1_44_1","volume-title":"Alexander Schwing, Murali Annavaram, and Salman Avestimehr.","author":"Yu Mingchao","year":"2018","unstructured":"Mingchao Yu , Zhifeng Lin , Krishna Narra , Songze Li , Youjie Li , Nam Sung Kim , Alexander Schwing, Murali Annavaram, and Salman Avestimehr. 2018 . 
Gradiveq : Vector quantization for bandwidth-efficient gradient aggregation in distributed cnn training. In Advances in Neural Information Processing Systems . 5123--5133. Mingchao Yu, Zhifeng Lin, Krishna Narra, Songze Li, Youjie Li, Nam Sung Kim, Alexander Schwing, Murali Annavaram, and Salman Avestimehr. 2018. Gradiveq: Vector quantization for bandwidth-efficient gradient aggregation in distributed cnn training. In Advances in Neural Information Processing Systems. 5123--5133."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.5555\/3305890.3306107"},{"key":"e_1_2_1_46_1","volume-title":"Smola","author":"Zinkevich Martin","year":"2010","unstructured":"Martin Zinkevich , Markus Weimer , Lihong Li , and Alex J . Smola . 2010 . Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems . 2595--2603. Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J. Smola. 2010. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems. 2595--2603."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3417607","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3417607","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:01:14Z","timestamp":1750197674000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3417607"}},"subtitle":["One-Step Delay Stochastic Gradient Descent for Distributed Training"],"short-title":[],"issued":{"date-parts":[[2020,9,30]]},"references-count":46,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2020,12,31]]}},"alternative-id":["10.1145\/3417607"],"URL":"https:\/\/doi.org\/10.1145\/3417607","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,9,30]]},"assertion":[{"value":"2019-09-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-08-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-09-30","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}