{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:09:57Z","timestamp":1750219797720,"version":"3.41.0"},"reference-count":70,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2022,12,16]],"date-time":"2022-12-16T00:00:00Z","timestamp":1671148800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Excellent Youth Foundation of Hunan Province","award":["2021JJ10050"],"award-info":[{"award-number":["2021JJ10050"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2023,3,31]]},"abstract":"<jats:p>Intensive communication and synchronization cost for gradients and parameters is the well-known bottleneck of distributed deep learning training. Based on the observations that Synchronous SGD (SSGD) obtains good convergence accuracy while asynchronous SGD (ASGD) delivers a faster raw training speed, we propose Several Steps Delay SGD (SSD-SGD) to combine their merits, aiming at tackling the communication bottleneck via communication sparsification. SSD-SGD explores both global synchronous updates in the parameter servers and asynchronous local updates in the workers in each periodic iteration. The periodic and flexible synchronization makes SSD-SGD achieve good convergence accuracy and fast training speed. To the best of our knowledge, we strike the new balance between synchronization quality and communication sparsification, and improve the tradeoff between accuracy and training speed. Specifically, the core components of SSD-SGD include proper warm-up stage, steps delay stage, and the novel algorithm of global gradient for local update (GLU). GLU is critical for local update operations by using global gradient information to effectively compensate for the delayed local weights. Furthermore, we implement SSD-SGD on MXNet framework and comprehensively evaluate its performance with CIFAR-10 and ImageNet datasets. Experimental results show that SSD-SGD can accelerate distributed training speed under different experimental configurations, by up to 110% (or 2.1\u00d7 of the original speed), while achieving good convergence accuracy.<\/jats:p>","DOI":"10.1145\/3563038","type":"journal-article","created":{"date-parts":[[2022,9,14]],"date-time":"2022-09-14T13:23:18Z","timestamp":1663161798000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["SSD-SGD: Communication Sparsification for Distributed Deep Learning Training"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3639-5449","authenticated-orcid":false,"given":"Yemao","family":"Xu","sequence":"first","affiliation":[{"name":"Defense Innovation Institute, Academy of Military Sciences, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6243-8479","authenticated-orcid":false,"given":"Dezun","family":"Dong","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, Changsha, Hunan, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3192-7493","authenticated-orcid":false,"given":"Dongsheng","family":"Wang","sequence":"additional","affiliation":[{"name":"Defense Innovation Institute, Academy of Military Sciences, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3866-9060","authenticated-orcid":false,"given":"Shi","family":"Xu","sequence":"additional","affiliation":[{"name":"Defense Innovation Institute, Academy of Military Sciences, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2661-0889","authenticated-orcid":false,"given":"Enda","family":"Yu","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, Changsha, Hunan, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2412-6591","authenticated-orcid":false,"given":"Weixia","family":"Xu","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, Changsha, Hunan, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5437-0088","authenticated-orcid":false,"given":"Xiangke","family":"Liao","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, Changsha, Hunan, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,12,16]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Mart\u00edn Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen Craig Citro Greg S. Corrado Andy Davis Jeffrey Dean Matthieu Devin et\u00a0al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. CoRR abs\/1603.04467 (2016). Retrieved from https:\/\/arxiv.org\/abs\/1603.04467."},{"key":"e_1_3_1_3_2","first-page":"1709","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Alistarh D.","year":"2017","unstructured":"D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. 2017. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Proceedings of the Advances in Neural Information Processing Systems. 1709\u20131720."},{"key":"e_1_3_1_4_2","article-title":"Stochastic gradient push for distributed deep learning","author":"Assran M.","year":"2019","unstructured":"M. Assran, N. Loizou, N. Ballas, and M. Rabbat. 2019. Stochastic gradient push for distributed deep learning. In Proceedings of the ICML.","journal-title":"In Proceedings of the ICML."},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/3018743.3018769"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3369583.3392686"},{"key":"e_1_3_1_7_2","unstructured":"Chi-Chung Chen Chia-Lin Yang and Hsiang-Yun Cheng. 2018. Efficient and robust parallel dnn training through model parallelism on multi-gpu platform. arXiv:1809.02839. Retrieved from https:\/\/arxiv.org\/abs\/1809.02839."},{"key":"e_1_3_1_8_2","article-title":"Revisiting distributed synchronous SGD","author":"Chen J.","year":"2016","unstructured":"J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz. 2016. Revisiting distributed synchronous SGD. In Proceedings of the ICLR Workshop.","journal-title":"In Proceedings of the ICLR Workshop."},{"key":"e_1_3_1_9_2","unstructured":"T. Chen M. Li Y. Li M Lin N. Wang M. Wang T. Xiao B. Xu C. Zhang and Z. Zhang. 2015. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR abs\/1512.01274 (2015). Retrieved from https:\/\/arxiv.org\/abs\/1512.01274."},{"key":"e_1_3_1_10_2","first-page":"571","volume-title":"Proceedings of the USENIX OSDI","author":"Chilimbi T. M.","year":"2014","unstructured":"T. M. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. 2014. Project adam: Building an efficient and scalable deep learning training system. In Proceedings of the USENIX OSDI. 571\u2013582."},{"key":"e_1_3_1_11_2","unstructured":"M. Cho U. Finkler S. Kumar D. S. Kung V. Saxena and D. Sreedhar. 2017. Powerai ddl. CoRR abs\/1708.02188 (2017). Retrieved from https:\/\/arxiv.org\/abs\/1708.02188."},{"key":"e_1_3_1_12_2","first-page":"1337","volume-title":"Proceedings of the ICML","author":"Coates A.","year":"2013","unstructured":"A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and Ng Andrew. 2013. Deep learning with COTS HPC systems. In Proceedings of the ICML. 1337\u20131345."},{"key":"e_1_3_1_13_2","unstructured":"V. Codreanu D. Podareanu and V. Saletore. 2017. Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train. CoRR abs\/1711.04291 (2017). Retrieved from https:\/\/arxiv.org\/abs\/1711.04291."},{"key":"e_1_3_1_14_2","unstructured":"A. Krizhevsky G. Hinton and others. 2009. Learning multiple layers of features from tiny images. Master\u2019s thesis . University of Tront Citeseer 1\u201360."},{"key":"e_1_3_1_15_2","volume-title":"Proceedings of the Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012","author":"Dean J.","year":"2012","unstructured":"J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. W. Senior, and P. A. Tucker, et\u00a0al. 2012. Large scale distributed deep networks. In Proceedings of the Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, L\u00e9on Bottou, and Kilian Q. Weinberger (Eds.). Lake Tahoe, Nevada, United States, 1232\u20131240."},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00056"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/TBDATA.2015.2472014"},{"key":"e_1_3_1_19_2","unstructured":"P. Goyal P. Doll\u00e1r R. B. Girshick P. Noordhuis L. Wesolowski A. Kyrola A. Tulloch Y. Jia and K. He. 2017. Accurate large minibatch SGD: Training imagenet in 1 hour. CoRR abs\/1706.02677 (2017). Retrieved from https:\/\/arxiv.org\/abs\/1706.02677."},{"key":"e_1_3_1_20_2","first-page":"11082","volume-title":"Proceedings of the NeurIPS","author":"Haddadpour F.","year":"2019","unstructured":"F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe. 2019. Local SGD with periodic averaging: Tighter analysis and adaptive synchronization. In Proceedings of the NeurIPS. 11082\u201311094."},{"key":"e_1_3_1_21_2","first-page":"1135","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Han Song","year":"2015","unstructured":"Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. In Proceedings of the Advances in Neural Information Processing Systems. 1135\u20131143."},{"key":"e_1_3_1_22_2","article-title":"TicTac: Accelerating distributed deep learning with communication scheduling","author":"Hashemi S. H.","year":"2018","unstructured":"S. H. Hashemi, S. A. Jyothi, and R. H. Campbell. 2018. TicTac: Accelerating distributed deep learning with communication scheduling. In Proceedings of the SysML.","journal-title":"In Proceedings of the SysML."},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2018.00059"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_25_2","first-page":"1223","volume-title":"Proceedings of the NeurIPS","author":"Ho Q.","year":"2013","unstructured":"Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing. 2013. More effective distributed ml via a stale synchronous parallel parameter server. In Proceedings of the NeurIPS. 1223\u20131231."},{"key":"e_1_3_1_26_2","article-title":"Priority-based parameter propagation for distributed DNN training","author":"Jayarajan A.","year":"2019","unstructured":"A. Jayarajan, J. Wei, G. Gibson, A. Fedorova, and G. Pekhimenko. 2019. Priority-based parameter propagation for distributed DNN training. In Proceedings of the SysML.","journal-title":"In Proceedings of the SysML."},{"key":"e_1_3_1_27_2","unstructured":"X. Jia S. Song W. He Y. Wang H. Rong F. Zhou L. Xie Z. Guo Y. Yang L. Yu et\u00a0al. 2018. Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. CoRR abs\/1807.11205 (2018). Retrieved from https:\/\/arxiv.org\/abs\/1807.11205."},{"key":"e_1_3_1_28_2","first-page":"463","volume-title":"Proceedings of the USENIX OSDI","author":"Jiang Y. M.","year":"2020","unstructured":"Y. M. Jiang, Y. B. Zhu, L. Chang, B. R. Yi, Y. Cui, and C. X. Guo. 2020. A unified architecture for accelerating distributed DNN training in heterogeneous GPU\/CPU clusters. In Proceedings of the USENIX OSDI. 463\u2013479."},{"key":"e_1_3_1_29_2","unstructured":"A. Karpathy J. Johnson and Li F.2015. Visualizing and understanding recurrent networks. CoRR abs\/1506.02078 (2015). Retrieved from https:\/\/arxiv.org\/abs\/1506.02078."},{"key":"e_1_3_1_30_2","doi-asserted-by":"crossref","unstructured":"J. Kim M. El-Khamy and J. Lee. 2017. Residual LSTM: Design of a deep recurrent architecture for distant speech recognition. In Proceedings of the Interspeech 2017 18th Annual Conference of the International Speech Communication Association F. Lacerda (Ed.). ISCA Stockholm Sweden 1591\u20131595. http:\/\/www.isca-speech.org\/archive\/Interspeech_2017\/abstracts\/0477.html.","DOI":"10.21437\/Interspeech.2017-477"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00085"},{"key":"e_1_3_1_32_2","unstructured":"Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. CoRR abs\/1404.5997 (2014). Retrieved from https:\/\/arxiv.org\/abs\/1404.5997."},{"key":"e_1_3_1_33_2","first-page":"1097","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Krizhevsky A.","year":"2012","unstructured":"A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems. 1097\u20131105."},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/3307650.3322259"},{"key":"e_1_3_1_35_2","volume-title":"Proceedings of the ICLR","author":"Lin T.","year":"2020","unstructured":"T. Lin, S. U Stich, K. K. Patel, and M. Jaggi. 2020. Don\u2019t use large mini-batches, use local SGD. In Proceedings of the ICLR."},{"key":"e_1_3_1_36_2","unstructured":"Y. Lin S. Han H. Mao Y. Wang and B. Dally. 2018. Deep gradient compression: Reducing the communication bandwidth for distributed training. In Proceedings of the 6th International Conference on Learning Representations (ICLR\u201918) . OpenReview.net Vancouver BC Canada. https:\/\/openreview.net\/forum?id=SkhQHMW0W."},{"key":"e_1_3_1_37_2","article-title":"Heterogeneity-aware asynchronous decentralized training","author":"Luo Q.","year":"2019","unstructured":"Q. Luo, J. He, Y. Zhuo, and X. Qian. 2019. Heterogeneity-aware asynchronous decentralized training. In Proceedings of the ASPLOS.","journal-title":"In Proceedings of the ASPLOS."},{"key":"e_1_3_1_38_2","unstructured":"Eran Malach Gilad Yehudai Shai Shalev-Shwartz and Ohad Shamir. 2020. Proving the lottery ticket hypothesis: Pruning is all you need. In Proceedings of the 37th International Conference on Machine Learning (ICML\u201920) . Vol. 119 PMLR 6682\u20136691. http:\/\/proceedings.mlr.press\/v119\/malach20a.html."},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_3_1_40_2","unstructured":"NVLink and NVSwitch. 2020. Retrieved 17 October 2022 from https:\/\/www.nvidia.com\/en-us\/datacenter\/nvlink\/."},{"key":"e_1_3_1_41_2","unstructured":"Google Offers Glimpse of Third-Generation TPU Processor. 2018. Retrieved 11 May 2018 from https:\/\/www.top500.org\/news\/google-offers-glimpse-of-third-generation-tpu-processor\/."},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359642"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33014780"},{"key":"e_1_3_1_44_2","unstructured":"A. Sapio M. Canini C. Ho J. Nelson P. Kalnis C. Kim A. Krishnamurthy M. Moshref Dan R. K. Ports and P. Richt\u00e1rik. 2021. Scaling distributed machine learning with In-Network aggregation. In Proceedings of the 18th (USENIX) Symposium on Networked Systems Design and Implementation (NSDI\u201921) J. Mickens and R. Teixeira (Eds.). USENIX Association 785\u2013808. https:\/\/www.usenix.org\/conference\/nsdi21\/presentation\/sapio."},{"key":"e_1_3_1_45_2","unstructured":"A. Sergeev and M. D. Balso. 2018. Horovod: Fast and easy distributed deep learning in TensorFlow. CoRR abs\/1802.05799 (2018). Retrieved from https:\/\/arxiv.org\/abs\/1802.05799."},{"key":"e_1_3_1_46_2","unstructured":"K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. Retrieved from https:\/\/arxiv.org\/abs\/1409.1556."},{"key":"e_1_3_1_47_2","first-page":"957","volume-title":"Proceedings of the Artificial Intelligence and Statistics","author":"Sra S.","year":"2016","unstructured":"S. Sra, A. W. Yu, M. Li, and A. Smola. 2016. Adadelay: Delay adaptive distributed stochastic optimization. In Proceedings of the Artificial Intelligence and Statistics. 957\u2013965."},{"key":"e_1_3_1_48_2","volume-title":"Proceedings of the International Conference on Learning Representations.","author":"Stich Sebastian Urban","year":"2019","unstructured":"Sebastian Urban Stich. 2019. Local SGD converges fast and communicates little. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_1_49_2","doi-asserted-by":"crossref","unstructured":"P. Sun W. Feng R. Han S. Yan and Y. Wen. 2019. Optimizing network performance for distributed DNN training on GPU clusters: ImageNet\/AlexNet training in 1.5 minutes. CoRR abs\/1902.06855 (2019). Retrieved from https:\/\/arxiv.org\/abs\/1902.06855.","DOI":"10.1109\/TBDATA.2019.2957478"},{"key":"e_1_3_1_50_2","unstructured":"A. Gibiansky. 2017. Bringing HPC techniques to deep learning. Baidu Research Tech. Rep. Retrieved from http:\/\/andrew.gibiansky.com."},{"key":"e_1_3_1_51_2","volume-title":"Proceedings of the SysML","author":"Wang J.","year":"2019","unstructured":"J. Wang and G. Joshi. 2019. Adaptive communication strategies to achieve the best error-runtime tradeoff in local-update SGD. In Proceedings of the SysML."},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1145\/3369583.3392681"},{"key":"e_1_3_1_53_2","first-page":"4238","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Wang S.","year":"2018","unstructured":"S. Wang, D. Li, Y. Cheng, J. Geng, Y. Wang, S. Wang, S. Xia, and J. Wu. 2018. Bml: A high-performance, low-cost gradient synchronization algorithm for dml training. In Proceedings of the Advances in Neural Information Processing Systems. 4238\u20134248."},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM41043.2020.9155282"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6910"},{"key":"e_1_3_1_56_2","first-page":"1509","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Wen W.","year":"2017","unstructured":"W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. 2017. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Proceedings of the Advances in Neural Information Processing Systems. 1509\u20131519."},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/3417607"},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1145\/3417607"},{"key":"e_1_3_1_59_2","unstructured":"Y. You I. Gitman and B. Ginsburg. 2017. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888 . Retrieved from https:\/\/arxiv.org\/abs\/1708.03888."},{"key":"e_1_3_1_60_2","volume-title":"Proceedings of the ICLR","author":"You Y.","year":"2019","unstructured":"Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar, S. Bhojanapalli, X. Song, J. Demmel, K. Keutzer, and C. J. Hsieh. 2019. Large batch optimization for deep learning: Training BERT in 76 minutes. In Proceedings of the ICLR."},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2019.2913833"},{"key":"e_1_3_1_62_2","first-page":"175","volume-title":"Proceedings of the MICRO","author":"Youjie Li","year":"2018","unstructured":"Li Youjie, Park Jongse, Alian Mohammad, Yuan Yifan, Qu Zheng, Pan Peitian, Wang Ren, A. G. Schwing, Esmaeilzadeh Hadi, and N. S. Kim. 2018. A network-centric hardware\/algorithm co-design to accelerate distributed training of deep neural networks. In Proceedings of the MICRO. 175\u2013188."},{"key":"e_1_3_1_63_2","first-page":"8056","volume-title":"Proceedings of the NeurIPS","author":"Youjie Li","year":"2018","unstructured":"Li Youjie, Yu Mingchao, Li Songze, Yu Mingchao, Li Songze, Avestimehr Salman, Namsung Kim, and A. G. Schwing. 2018. Pipe-SGD: A decentralized pipelined SGD framework for distributed deep net training. In Proceedings of the NeurIPS. 8056\u20138067."},{"key":"e_1_3_1_64_2","article-title":"On the linear speedup analysis of communication efficient momentum sgd for distributed non-convex optimization","author":"Yu H.","year":"2019","unstructured":"H. Yu, R. Jin, and S. Yang. 2019. On the linear speedup analysis of communication efficient momentum sgd for distributed non-convex optimization. In Proceedings of the ICML.","journal-title":"In Proceedings of the ICML."},{"key":"e_1_3_1_65_2","first-page":"5123","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Yu M.","year":"2018","unstructured":"M. Yu, Z. Lin, K. Narra, S. Li, Y. Li, N. S. Kim, A. Schwing, M. Annavaram, and S. Avestimehr. 2018. Gradiveq: Vector quantization for bandwidth-efficient gradient aggregation in distributed cnn training. In Proceedings of the Advances in Neural Information Processing Systems. 5123\u20135133."},{"key":"e_1_3_1_66_2","article-title":"Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters","author":"Zhang H.","year":"2017","unstructured":"H. Zhang, Z. Zheng, S. Xu, W. Dai, Q. Ho, X. Liang, Z. Hu, J. Wei, P. Xie, and Eric P.2017. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. In Proceedings of the USENIX ATC.","journal-title":"In Proceedings of the USENIX ATC."},{"key":"e_1_3_1_67_2","unstructured":"Jian Zhang Christopher De Sa Ioannis Mitliagkas and Christopher R\u00e9. 2016. Parallel SGD: When does averaging help? CoRR abs\/1606.07365 (2016). Retrieved from https:\/\/arxiv.org\/abs\/1606.07365."},{"key":"e_1_3_1_68_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDCS.2019.00150"},{"key":"e_1_3_1_69_2","first-page":"4120","volume-title":"Proceedings of the ICML","author":"Zheng S.","year":"2017","unstructured":"S. Zheng, Q. Meng, T. Wang, W. Chen, N. Yu, Z. Ma, and T. Liu. 2017. Asynchronous stochastic gradient descent with delay compensation. In Proceedings of the ICML. 4120\u20134129."},{"key":"e_1_3_1_70_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2018\/447"},{"key":"e_1_3_1_71_2","volume-title":"Proceedings of the ICML","author":"Zhou Z.","year":"2018","unstructured":"Z. Zhou, P. Mertikopoulos, N. Bambos, P. W. Glynn, Y. Ye, L. Li, and F. Li. 2018. Distributed asynchronous optimization with unbounded delays: How slow can you go?. In Proceedings of the ICML."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3563038","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3563038","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:38:09Z","timestamp":1750178289000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3563038"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,12,16]]},"references-count":70,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,3,31]]}},"alternative-id":["10.1145\/3563038"],"URL":"https:\/\/doi.org\/10.1145\/3563038","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2022,12,16]]},"assertion":[{"value":"2022-01-03","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-08-28","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-12-16","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}