{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,4]],"date-time":"2026-05-04T23:21:52Z","timestamp":1777936912211,"version":"3.51.4"},"reference-count":105,"publisher":"Association for Computing Machinery (ACM)","issue":"6","license":[{"start":{"date-parts":[[2023,9,29]],"date-time":"2023-09-29T00:00:00Z","timestamp":1695945600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62172008 and 62102009"],"award-info":[{"award-number":["62172008 and 62102009"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"National Natural Science Fund for the Excellent Young Scientists Fund Program"},{"name":"Center for Data Space Technology and System, Peking University"},{"name":"ERC Advanced Grant","award":["741278"],"award-info":[{"award-number":["741278"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Softw. Eng. Methodol."],"published-print":{"date-parts":[[2023,11,30]]},"abstract":"<jats:p>\n            <jats:bold>Deep learning (DL)<\/jats:bold>\n            has become a key component of modern software. In the \u201c\n            <jats:italic>big model<\/jats:italic>\n            \u201d era, the rich features of DL-based software (i.e., DL software) substantially rely on powerful DL models, e.g., BERT, GPT-3, and the recently emerging GPT-4, which are trained on the powerful cloud with large datasets. Hence, training effective DL models has become a vital stage in the whole software lifecycle. 
When training deep learning models, especially those big models, developers need to parallelize and distribute the computation and memory resources amongst multiple devices (e.g., a cluster of GPUs) in the training process, which is known as\n            <jats:italic>distributed deep learning training<\/jats:italic>\n            , or\n            <jats:italic>\n              <jats:bold>distributed training<\/jats:bold>\n            <\/jats:italic>\n            for short. However, the unique challenges that developers encounter in the distributed training process have not been studied in the software engineering community. Given the increasingly heavy dependence of current DL-based software on distributed training, this paper aims to fill in the knowledge gap and presents the first comprehensive study on developers\u2019 issues in distributed training. To this end, we focus on popular DL frameworks that support distributed training (including TensorFlow, PyTorch, Keras, and Horovod) and analyze 1,131 real-world developers\u2019 issues about using these frameworks reported on Stack Overflow and GitHub. We construct a fine-grained taxonomy consisting of 30 categories regarding the fault symptoms and summarize common fix patterns for different symptoms. We find that: (1) many distributed-specific faults and non-distributed-specific faults inherently share the same fault symptoms, making it challenging to debug; (2) most of the fault symptoms have frequent fix patterns; (3) about half of the faults are related to system-level configurations. 
Based on the results, we suggest actionable implications on research avenues that can potentially facilitate the distributed training to develop DL-based software, such as focusing on the frequent and common fix patterns when designing testing or debugging tools, developing efficient testing and debugging techniques for communication configuration along with the synthesis of network configuration analysis, designing new multi-device checkpoint-and-replay techniques to help reproduction, and designing serverless APIs for cloud platforms.\n          <\/jats:p>","DOI":"10.1145\/3597204","type":"journal-article","created":{"date-parts":[[2023,5,13]],"date-time":"2023-05-13T11:15:54Z","timestamp":1683976554000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["Rise of Distributed Deep Learning Training in the Big Model Era: From a Software Engineering Perspective"],"prefix":"10.1145","volume":"32","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7908-8484","authenticated-orcid":false,"given":"Xuanzhe","family":"Liu","sequence":"first","affiliation":[{"name":"Peking University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3591-3892","authenticated-orcid":false,"given":"Diandian","family":"Gu","sequence":"additional","affiliation":[{"name":"Peking University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4765-1893","authenticated-orcid":false,"given":"Zhenpeng","family":"Chen","sequence":"additional","affiliation":[{"name":"University College London, UK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3023-1005","authenticated-orcid":false,"given":"Jinfeng","family":"Wen","sequence":"additional","affiliation":[{"name":"Peking University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4209-9451","authenticated-orcid":false,"given":"Zili","family":"Zhang","sequence":"additional","affiliation":[{"name":"Peking University, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7866-4075","authenticated-orcid":false,"given":"Yun","family":"Ma","sequence":"additional","affiliation":[{"name":"Peking University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1100-8633","authenticated-orcid":false,"given":"Haoyu","family":"Wang","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8741-5847","authenticated-orcid":false,"given":"Xin","family":"Jin","sequence":"additional","affiliation":[{"name":"Peking University, China"}]}],"member":"320","published-online":{"date-parts":[[2023,9,29]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"2012. Parallel array or array of structures [closed]. Retrieved on December 21 2022https:\/\/stackoverflow.com\/questions\/13239607. (2012)."},{"key":"e_1_3_2_3_2","unstructured":"2016. Distributed tensorflow on localhosts failed by \u201csocket error connection refused\u201d. Retrieved on March 16 2022https:\/\/stackoverflow.com\/questions\/38937984. (2016)."},{"key":"e_1_3_2_4_2","unstructured":"2016. Synchronous vs asynchronous computation in Tensorflow. Retrieved on March 16 2022https:\/\/stackoverflow.com\/questions\/34349316\/synchronous-vs-asynchronous-computation-in-tensorflow. (2016)."},{"key":"e_1_3_2_5_2","unstructured":"2016. Why neural network tends to output \u201cmean value\u201d?Retrieved on December 21 2022https:\/\/stackoverflow.com\/questions\/39863606. (2016)."},{"key":"e_1_3_2_6_2","unstructured":"2017. Baidu-Allreduce. Retrieved on March 16 2022https:\/\/github.com\/baidu-research\/baidu-allreduce. (2017)."},{"key":"e_1_3_2_7_2","unstructured":"2017. CUDA_ERROR_OUT_OF_MEMORY: How to activate multiple GPUs from Keras in Tensorflow. Retrieved on March 16 2022https:\/\/stackoverflow.com\/questions\/45546737. (2017)."},{"key":"e_1_3_2_8_2","unstructured":"2017. 
Horovod\u2019s Work Pattern?Retrieved on March 16 2022https:\/\/github.com\/horovod\/horovod\/issues\/117. (2017)."},{"key":"e_1_3_2_9_2","unstructured":"2017. How to use multiple GPUs effectively when training deep networks?Retrieved on March 16 2022https:\/\/stackoverflow.com\/questions\/43236349. (2017)."},{"key":"e_1_3_2_10_2","unstructured":"2017. Keras predict not working for multiple GPU\u2019s. Retrieved on March 16 2022https:\/\/stackoverflow.com\/questions\/43620478. (2017)."},{"key":"e_1_3_2_11_2","unstructured":"2017. Memory management when using GPU in TensorFlow?Retrieved on December 21 2022https:\/\/stackoverflow.com\/questions\/42307975. (2017)."},{"key":"e_1_3_2_12_2","unstructured":"2017. Run distributed : ERROR: ORTE_ERROR_LOG: Data unpack would read past end of buffer. Retrieved on March 16 2022https:\/\/github.com\/horovod\/horovod\/issues\/133. (2017)."},{"key":"e_1_3_2_13_2","unstructured":"2017. TensorFlow: Is There a Rule to Set the Port of Worker\/PS When Creating ClusterSpec?Retrieved on March 16 2022https:\/\/stackoverflow.com\/questions\/41649708\/tensorflow-is-there-a-rule-to-set-the-port-of-worker-ps-when-creating-clustersp. (2017)."},{"key":"e_1_3_2_14_2","unstructured":"2018. Difference Between Parallel and DistributedRetrieved on April 18 2023https:\/\/www.differencebetween.com\/difference-between-parallel-and-vs-distributed-computing\/. (2018)."},{"key":"e_1_3_2_15_2","unstructured":"2018. Horovod dosn\u2019t work with CUDA 9.1. Retrieved on March 16 2022https:\/\/github.com\/horovod\/horovod\/issues\/161. (2018)."},{"key":"e_1_3_2_16_2","unstructured":"2018. Horovod hangs with multi gpus on one machine. Retrieved on March 16 2022https:\/\/github.com\/horovod\/horovod\/issues\/638. (2018)."},{"key":"e_1_3_2_17_2","unstructured":"2018. Introducing HorovodRunner for Distributed Deep Learning Training. 
Retrieved on March 16 2022https:\/\/databricks.com\/blog\/2018\/11\/19\/introducing-horovodrunner-for-distributed-deep-learning-training.html. (2018)."},{"key":"e_1_3_2_18_2","unstructured":"2018. NVIDIA: Accelerating Deep Learning with Uber\u2019s Horovod. Retrieved on March 16 2022https:\/\/eng.uber.com\/nvidia-horovod-deep-learning\/. (2018)."},{"key":"e_1_3_2_19_2","unstructured":"2018. Open Source at Uber: Meet Alex Sergeev Horovod Project Lead. Retrieved on March 16 2022https:\/\/eng.uber.com\/alex-sergeev-horovod\/. (2018)."},{"key":"e_1_3_2_20_2","unstructured":"2018. Permission denied (publickey password) when I run on muti node. Retrieved on March 16 2022https:\/\/github.com\/horovod\/horovod\/issues\/467. (2018)."},{"key":"e_1_3_2_21_2","unstructured":"2019. AI and Compute. Retrieved on March 16 2022https:\/\/openai.com\/blog\/ai-and-compute\/. (2019)."},{"key":"e_1_3_2_22_2","unstructured":"2019. Distributed Deep Learning with Horovod. Retrieved on March 16 2022https:\/\/developer.download.nvidia.cn\/video\/gputechconf\/gtc\/2019\/presentation\/s9321-distributed-deep-learning-with-horovod.pdf. (2019)."},{"key":"e_1_3_2_23_2","unstructured":"2019. Fabric for Deep Learning (FfDL). Retrieved on March 16 2022https:\/\/github.com\/IBM\/FfDL. (2019)."},{"key":"e_1_3_2_24_2","unstructured":"2019. Horovod pypi release doesn\u2019t have horovod.spark package. Retrieved on December 19 2022https:\/\/github.com\/horovod\/horovod\/issues\/818. (2019)."},{"key":"e_1_3_2_25_2","unstructured":"2019. NCCL. Retrieved on March 16 2022https:\/\/developer.nvidia.com\/nccl. (2019)."},{"key":"e_1_3_2_26_2","unstructured":"2019. Script freezes with no output when using DistributedDataParallel. Retrieved on March 16 2022https:\/\/github.com\/pytorch\/pytorch\/issues\/22834. (2019)."},{"key":"e_1_3_2_27_2","unstructured":"2019. torch.distributed.launch receives RuntimeError: ProcessGroupNCCL does not support barrier. 
Retrieved on March 16 2022https:\/\/github.com\/pytorch\/pytorch\/issues\/17848. (2019)."},{"key":"e_1_3_2_28_2","unstructured":"2019. Where does the documentation point to a list of values for the loss property of the compile function?Retrieved on December 21 2022https:\/\/stackoverflow.com\/questions\/57244733. (2019)."},{"key":"e_1_3_2_29_2","unstructured":"2020. AssertionError: Default process group is not initialized. Retrieved on March 16 2022https:\/\/github.com\/pytorch\/pytorch\/issues\/38300. (2020)."},{"key":"e_1_3_2_30_2","unstructured":"2020. CIFAR Scaling Efficiency. Retrieved on March 16 2022https:\/\/github.com\/horovod\/horovod\/issues\/2103. (2020)."},{"key":"e_1_3_2_31_2","unstructured":"2020. Installation issue with MXNet built from source - No such file or dictionary <dmlc\/base.h>. Retrieved on March 16 2022.https:\/\/github.com\/horovod\/horovod\/issues\/1910. (2020)."},{"key":"e_1_3_2_32_2","unstructured":"2020. Open MPI: Open Source High Performance Computing. Retrieved on March 16 2022https:\/\/www.open-mpi.org. (2020)."},{"key":"e_1_3_2_33_2","unstructured":"2020. Popular Deep Learning Frameworks: An Overview. Retrieved on March 16 2022https:\/\/analyticsindiamag.com\/deep-learning-frameworks\/. (2020)."},{"key":"e_1_3_2_34_2","unstructured":"2020. Pytorch DataParallel doesn\u2019t work when the model contain tensor operation. Retrieved on March 16 2022https:\/\/stackoverflow.com\/questions\/60799655. (2020)."},{"key":"e_1_3_2_35_2","unstructured":"2020. What does \u201cwith strategy.scope():\u201d or \u201cwith tf.distribute.experimental.TPUStrategy(tpu).scope():\u201d do to the creation of a NN?Retrieved on January 16 2023https:\/\/stackoverflow.com\/questions\/65358676. (2020)."},{"key":"e_1_3_2_36_2","unstructured":"2020. When Build Docker Container with Ubuntu16.04 Install Horovod Failed with Error Code -4. Retrieved on March 16 2022https:\/\/github.com\/horovod\/horovod\/issues\/1798. 
(2020)."},{"key":"e_1_3_2_37_2","unstructured":"2020. When build docker container with ubuntu16.04 install horovod failed with error code -4. Retrieved on March 16 2022https:\/\/github.com\/horovod\/horovod\/issues\/1798. (2020)."},{"key":"e_1_3_2_38_2","unstructured":"2021. does it automatically use multiple gpu if availabe?Retrieved on February 11 2023https:\/\/github.com\/keras-team\/keras\/issues\/106. (2021)."},{"key":"e_1_3_2_39_2","unstructured":"2021. Github Search API. Retrieved on March 16 2022https:\/\/developer.github.com\/v3\/search\/. (2021)."},{"key":"e_1_3_2_40_2","unstructured":"2021. Gloo. Retrieved on March 16 2022https:\/\/github.com\/facebookincubator\/gloo. (2021)."},{"key":"e_1_3_2_41_2","unstructured":"2021. GPT-3 Powers the Next Generation of Apps. Retrieved on March 16 2022https:\/\/openai.com\/blog\/gpt-3-apps\/. (2021)."},{"key":"e_1_3_2_42_2","unstructured":"2021. Horovod. Retrieved on March 16 2022https:\/\/github.com\/horovod\/horovod. (2021)."},{"key":"e_1_3_2_43_2","unstructured":"2021. I meet deadlock problem when use horovod.Retrieved on March 16 2022https:\/\/github.com\/horovod\/horovod\/issues\/2506. (2021)."},{"key":"e_1_3_2_44_2","unstructured":"2021. init_rpc: TENSOR_PIPE backend sigaborts when CUDA is not available. Retrieved on December 19 2022https:\/\/github.com\/pytorch\/pytorch\/issues\/54266. (2021)."},{"key":"e_1_3_2_45_2","unstructured":"2021. Keras: Deep Learning for Python. Retrieved on March 16 2022https:\/\/github.com\/keras-team\/keras. (2021)."},{"key":"e_1_3_2_46_2","unstructured":"2021. PaddlePaddle. Retrieved on March 16 2022https:\/\/github.com\/PaddlePaddle\/Paddle. (2021)."},{"key":"e_1_3_2_47_2","unstructured":"2021. PyTorch. Retrieved on March 16 2022https:\/\/github.com\/pytorch\/pytorch. (2021)."},{"key":"e_1_3_2_48_2","unstructured":"2021. Running a Basic Distributed MNIST Solver in TensorFlow. 
Retrieved on March 16 2022https:\/\/stackoverflow.com\/questions\/49984317\/running-a-basic-distributed-mnist-solver-in-tensorflow. (2021)."},{"key":"e_1_3_2_49_2","unstructured":"2021. Stack Exchange Data Dump. Retrieved on December 6 2021https:\/\/archive.org\/details\/stackexchange. (2021)."},{"key":"e_1_3_2_50_2","unstructured":"2021. TensorFlow. Retrieved on March 16 2022https:\/\/github.com\/tensorflow\/tensorflow. (2021)."},{"key":"e_1_3_2_51_2","unstructured":"2021. Top 5 Deep Learning Frameworks You Should Try in 2021 . Retrieved on March 16 2022https:\/\/nexart.tech\/blog\/top-10-deep-learning-frameworks-you-should-try-it-in-2021\/"},{"key":"e_1_3_2_52_2","unstructured":"2021. Top 5 Deep Learning Frameworks in 2021 . Retrieved on March 16 2022 https:\/\/makeinbusiness.com\/top-5-deep-learning-frameworks\/"},{"key":"e_1_3_2_53_2","unstructured":"2022. Top 10 Deep Learning Frameworks in 2022 You Can\u2019t Ignore . Retrieved on March 16 2022 https:\/\/www.upgrad.com\/blog\/top-deep-learning-frameworks\/"},{"key":"e_1_3_2_54_2","unstructured":"2023. Distributed training of deep learning models on Azure. Retrieved on April 18 2023https:\/\/learn.microsoft.com\/en-us\/azure\/architecture\/reference-architectures\/ai\/training-deep-learning. (2023)."},{"key":"e_1_3_2_55_2","unstructured":"2023. Supplemental Materials. Retrieved on April 22 2023https:\/\/github.com\/gudiandian\/TOSEM23-DistributedTraining. (2023)."},{"key":"e_1_3_2_56_2","first-page":"265","volume-title":"Proceedings of 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. 
TensorFlow: A system for large-scale machine learning. In Proceedings of 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016. 265\u2013283."},{"key":"e_1_3_2_57_2","first-page":"1199","volume-title":"Proceedings of the 41st International Conference on Software Engineering, ICSE 2019","author":"Aghajani Emad","year":"2019","unstructured":"Emad Aghajani, Csaba Nagy, Olga Lucero Vega-M\u00e1rquez, Mario Linares-V\u00e1squez, Laura Moreno, Gabriele Bavota, and Michele Lanza. 2019. Software documentation issues unveiled. In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019. 1199\u20131210."},{"key":"e_1_3_2_58_2","volume-title":"Concurrency Bugs: Characterization, Debugging and Runtime Verification","author":"Asadollah Sara Abbaspour","year":"2018","unstructured":"Sara Abbaspour Asadollah. 2018. Concurrency Bugs: Characterization, Debugging and Runtime Verification. Ph.D. Dissertation. M\u00e4lardalen University College, V\u00e4ster\u00e5s, Eskilstuna, Sweden."},{"key":"e_1_3_2_59_2","doi-asserted-by":"crossref","first-page":"359","DOI":"10.1145\/3477132.3483553","volume-title":"Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, SIGOPS 2021","author":"Bai Youhui","year":"2021","unstructured":"Youhui Bai, Cheng Li, Quan Zhou, Jun Yi, Ping Gong, Feng Yan, Ruichuan Chen, and Yinlong Xu. 2021. Gradient compression supercharged high-performance data parallel DNN training. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, SIGOPS 2021. 359\u2013375."},{"issue":"4","key":"e_1_3_2_60_2","first-page":"65:1\u201365:43","article-title":"Demystifying parallel and distributed deep learning: An in-depth concurrency analysis","volume":"52","author":"Ben-Nun Tal","year":"2019","unstructured":"Tal Ben-Nun and Torsten Hoefler. 2019. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. ACM Comput. Surv. 
52, 4 (2019), 65:1\u201365:43.","journal-title":"ACM Comput. Surv."},{"key":"e_1_3_2_61_2","article-title":"Listen and translate: A proof of concept for end-to-end speech-to-text translation","volume":"1612","author":"Berard Alexandre","year":"2016","unstructured":"Alexandre Berard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. CoRR abs\/1612.01744 (2016).","journal-title":"CoRR"},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1177\/001316448104100307"},{"key":"e_1_3_2_63_2","first-page":"2722","volume-title":"Proceedings of 2015 IEEE International Conference on Computer Vision, ICCV 2015","author":"Chen Chenyi","year":"2015","unstructured":"Chenyi Chen, Ari Seff, Alain L. Kornhauser, and Jianxiong Xiao. 2015. DeepDriving: Learning affordance for direct perception in autonomous driving. In Proceedings of 2015 IEEE International Conference on Computer Vision, ICCV 2015. 2722\u20132730."},{"key":"e_1_3_2_64_2","first-page":"750","volume-title":"Proceedings of 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC\/SIGSOFT FSE 2020","author":"Chen Zhenpeng","year":"2020","unstructured":"Zhenpeng Chen, Yanbin Cao, Yuanqiang Liu, Haoyu Wang, Tao Xie, and Xuanzhe Liu. 2020. A comprehensive study on challenges in deploying deep learning based software. In Proceedings of 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC\/SIGSOFT FSE 2020. 750\u2013762."},{"key":"e_1_3_2_65_2","first-page":"674","volume-title":"Proceedings of the 43rd International Conference on Software Engineering, ICSE 2021","author":"Chen Zhenpeng","year":"2021","unstructured":"Zhenpeng Chen, Huihan Yao, Yiling Lou, Yanbin Cao, Yuanqiang Liu, Haoyu Wang, and Xuanzhe Liu. 2021. 
An empirical study on deployment faults of deep learning based mobile applications. In Proceedings of the 43rd International Conference on Software Engineering, ICSE 2021. 674\u2013685."},{"key":"e_1_3_2_66_2","doi-asserted-by":"publisher","DOI":"10.1177\/001316446002000104"},{"key":"e_1_3_2_67_2","first-page":"1232","volume-title":"Proceedings of 26th Annual Conference on Neural Information Processing Systems, NeurIPS 2012","author":"Dean Jeffrey","year":"2012","unstructured":"Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc\u2019Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, and Andrew Y. Ng. 2012. Large scale distributed deep networks. In Proceedings of 26th Annual Conference on Neural Information Processing Systems, NeurIPS 2012. 1232\u20131240."},{"key":"e_1_3_2_68_2","first-page":"4171","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019. 4171\u20134186."},{"key":"e_1_3_2_69_2","doi-asserted-by":"publisher","DOI":"10.1214\/009053606000000425"},{"key":"e_1_3_2_70_2","first-page":"469","volume-title":"Proceedings of 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2015","author":"Fogel Ari","year":"2015","unstructured":"Ari Fogel, Stanley Fung, Luis Pedrosa, Meg Walraed-Sullivan, Ramesh Govindan, Ratul Mahajan, and Todd D. Millstein. 2015. A general approach to network configuration analysis. In Proceedings of 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2015. 
469\u2013483."},{"key":"e_1_3_2_71_2","first-page":"509","volume-title":"Proceedings of the 32nd IEEE\/ACM International Conference on Automated Software Engineering, ASE 2017","author":"Franco Anthony Di","year":"2017","unstructured":"Anthony Di Franco, Hui Guo, and Cindy Rubio-Gonz\u00e1lez. 2017. A comprehensive study of real-world numerical bug characteristics. In Proceedings of the 32nd IEEE\/ACM International Conference on Automated Software Engineering, ASE 2017. 509\u2013519."},{"key":"e_1_3_2_72_2","first-page":"539","volume-title":"Proceedings of the 2018 ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC\/SIGSOFT FSE 2018","author":"Gao Yu","year":"2018","unstructured":"Yu Gao, Wensheng Dou, Feng Qin, Chushu Gao, Dong Wang, Jun Wei, Ruirui Huang, Li Zhou, and Yongming Wu. 2018. An empirical study on crash recovery bugs in large-scale distributed systems. In Proceedings of the 2018 ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC\/SIGSOFT FSE 2018. 539\u2013550."},{"key":"e_1_3_2_73_2","volume-title":"A Guide to Chi-Squared Testing","author":"Greenwood Priscilla E.","year":"1996","unstructured":"Priscilla E. Greenwood and Michael S. Nikulin. 1996. A Guide to Chi-Squared Testing, Vol. 280."},{"key":"e_1_3_2_74_2","first-page":"266","volume-title":"Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS 2023, Vancouver, BC, Canada, March 25-29, 2023","author":"Gu Diandian","year":"2023","unstructured":"Diandian Gu, Yihao Zhao, Yinmin Zhong, Yifan Xiong, Zhenhua Han, Peng Cheng, Fan Yang, Gang Huang, Xin Jin, and Xuanzhe Liu. 2023. ElasticFlow: An elastic serverless training platform for distributed deep learning. 
In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS 2023, Vancouver, BC, Canada, March 25-29, 2023. ACM, 266\u2013280."},{"key":"e_1_3_2_75_2","first-page":"620","volume-title":"Proceedings of IEEE International Symposium on High Performance Computer Architecture, HPCA 2018","author":"Hazelwood Kim M.","year":"2018","unstructured":"Kim M. Hazelwood, Sarah Bird, David M. Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, James Law, Kevin Lee, Jason Lu, Pieter Noordhuis, Misha Smelyanskiy, Liang Xiong, and Xiaodong Wang. 2018. Applied machine learning at Facebook: A datacenter infrastructure perspective. In Proceedings of IEEE International Symposium on High Performance Computer Architecture, HPCA 2018. 620\u2013629."},{"key":"e_1_3_2_76_2","first-page":"770","volume-title":"Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016","author":"He Kaiming","year":"2016","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016. 770\u2013778."},{"key":"e_1_3_2_77_2","first-page":"103","volume-title":"Proceedings of 32nd Annual Conference on Neural Information Processing Systems, NeurIPS 2019","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Xu Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Proceedings of 32nd Annual Conference on Neural Information Processing Systems, NeurIPS 2019. 
103\u2013112."},{"key":"e_1_3_2_78_2","first-page":"1110","volume-title":"Proceedings of 42nd International Conference on Software Engineering, ICSE 2020","author":"Humbatova Nargiz","year":"2020","unstructured":"Nargiz Humbatova, Gunel Jahangirova, Gabriele Bavota, Vincenzo Riccio, Andrea Stocco, and Paolo Tonella. 2020. Taxonomy of real faults in deep learning systems. In Proceedings of 42nd International Conference on Software Engineering, ICSE 2020. 1110\u20131121."},{"key":"e_1_3_2_79_2","doi-asserted-by":"publisher","DOI":"10.1145\/3338906.3338955"},{"key":"e_1_3_2_80_2","first-page":"1135","volume-title":"Proceedings of 42nd International Conference on Software Engineering, ICSE 2020","author":"Islam Md. Johirul","year":"2020","unstructured":"Md. Johirul Islam, Rangeet Pan, Giang Nguyen, and Hridesh Rajan. 2020. Repairing deep neural networks: Fix patterns and challenges. In Proceedings of 42nd International Conference on Software Engineering, ICSE 2020. 1135\u20131146."},{"key":"e_1_3_2_81_2","first-page":"947","volume-title":"Proceedings of 2019 USENIX Annual Technical Conference, USENIX ATC 2019","author":"Jeon Myeongjae","year":"2019","unstructured":"Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. 2019. Analysis of large-scale multi-tenant GPU clusters for DNN training workloads. In Proceedings of 2019 USENIX Annual Technical Conference, USENIX ATC 2019. 947\u2013960."},{"key":"e_1_3_2_82_2","volume-title":"Proceedings of Machine Learning and Systems 2019, MLSys 2019","author":"Jia Zhihao","year":"2019","unstructured":"Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Beyond data and model parallelism for deep neural networks. 
In Proceedings of Machine Learning and Systems 2019, MLSys 2019."},{"key":"e_1_3_2_83_2","first-page":"463","volume-title":"Proceedings of 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2020","author":"Jiang Yimin","year":"2020","unstructured":"Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanxiong Guo. 2020. A unified architecture for accelerating distributed DNN training in heterogeneous GPU\/CPU clusters. In Proceedings of 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2020. 463\u2013479."},{"key":"e_1_3_2_84_2","article-title":"Cloud programming simplified: A Berkeley view on serverless computing","author":"Jonas Eric","year":"2019","unstructured":"Eric Jonas, Johann Schleier-Smith, Vikram Sreekanti, Chia-Che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, Joao Carreira, Karl Krauth, Neeraja Yadwadkar, et\u00a0al. 2019. Cloud programming simplified: A Berkeley view on serverless computing. arXiv preprint arXiv:1902.03383 (2019).","journal-title":"arXiv preprint arXiv:1902.03383"},{"key":"e_1_3_2_85_2","first-page":"1106","volume-title":"Proceedings of 26th Annual Conference on Neural Information Processing Systems, NeurIPS 2012","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of 26th Annual Conference on Neural Information Processing Systems, NeurIPS 2012. 1106\u20131114."},{"key":"e_1_3_2_86_2","doi-asserted-by":"publisher","DOI":"10.2307\/2529310"},{"key":"e_1_3_2_87_2","first-page":"583","volume-title":"Proceedings of 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2014","author":"Li Mu","year":"2014","unstructured":"Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. 
Scaling distributed machine learning with the parameter server. In Proceedings of 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2014. 583\u2013598."},{"key":"e_1_3_2_88_2","doi-asserted-by":"publisher","DOI":"10.5555\/2486788.2486921"},{"key":"e_1_3_2_89_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-015-5499-z"},{"key":"e_1_3_2_90_2","first-page":"617","volume-title":"Proceedings of 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC\/SIGSOFT FSE 2020","author":"Lou Yiling","year":"2020","unstructured":"Yiling Lou, Zhenpeng Chen, Yanbin Cao, Dan Hao, and Lu Zhang. 2020. Understanding build issue resolution in practice: Symptoms and fix patterns. In Proceedings of 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC\/SIGSOFT FSE 2020. 617\u2013628."},{"issue":"1","key":"e_1_3_2_91_2","first-page":"3:1\u20133:37","article-title":"Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools","volume":"53","author":"Mayer Ruben","year":"2020","unstructured":"Ruben Mayer and Hans-Arno Jacobsen. 2020. Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools. ACM Comput. Surv. 53, 1 (2020), 3:1\u20133:37.","journal-title":"ACM Comput. Surv."},{"key":"e_1_3_2_92_2","first-page":"1","volume-title":"Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP 2019","author":"Narayanan Deepak","year":"2019","unstructured":"Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP 2019. 
1\u201315."},{"key":"e_1_3_2_93_2","doi-asserted-by":"publisher","DOI":"10.1145\/1365490.1365500"},{"key":"e_1_3_2_94_2","first-page":"8024","volume-title":"Proceedings of 32nd Annual Conference on Neural Information Processing Systems, NeurIPS 2019","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas K\u00f6pf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of 32nd Annual Conference on Neural Information Processing Systems, NeurIPS 2019. 8024\u20138035."},{"key":"e_1_3_2_95_2","doi-asserted-by":"crossref","first-page":"16","DOI":"10.1145\/3341301.3359642","volume-title":"Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP 2019","author":"Peng Yanghua","year":"2019","unstructured":"Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang Lan, Chuan Wu, and Chuanxiong Guo. 2019. A generic communication scheduler for distributed DNN training acceleration. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP 2019. 16\u201329."},{"key":"e_1_3_2_96_2","article-title":"Free-marginal multirater Kappa (multirater  \\(\\kappa\\) free): An alternative to Fleiss\u2019 Fixed-Marginal Multirater Kappa.","author":"Randolph Justus J.","year":"2005","unstructured":"Justus J. Randolph. 2005. Free-marginal multirater Kappa (multirater \\(\\kappa\\) free): An alternative to Fleiss\u2019 Fixed-Marginal Multirater Kappa. 
Online Submission (2005).","journal-title":"Online Submission"},{"key":"e_1_3_2_97_2","doi-asserted-by":"publisher","DOI":"10.1109\/32.799955"},{"key":"e_1_3_2_98_2","article-title":"Horovod: Fast and easy distributed deep learning in TensorFlow","volume":"1802","author":"Sergeev Alexander","year":"2018","unstructured":"Alexander Sergeev and Mike Del Balso. 2018. Horovod: Fast and easy distributed deep learning in TensorFlow. CoRR abs\/1802.05799 (2018).","journal-title":"CoRR"},{"issue":"2","key":"e_1_3_2_99_2","first-page":"30:1\u201330:33","article-title":"A survey on distributed machine learning","volume":"53","author":"Verbraeken Joost","year":"2020","unstructured":"Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S. Rellermeyer. 2020. A survey on distributed machine learning. ACM Comput. Surv. 53, 2 (2020), 30:1\u201330:33.","journal-title":"ACM Comput. Surv."},{"key":"e_1_3_2_100_2","doi-asserted-by":"crossref","first-page":"338","DOI":"10.1145\/3341301.3359653","volume-title":"Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP 2019","author":"Wang Stephanie","year":"2019","unstructured":"Stephanie Wang, John Liagouris, Robert Nishihara, Philipp Moritz, Ujval Misra, Alexey Tumanov, and Ion Stoica. 2019. Lineage stash: Fault tolerance off the critical path. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP 2019. 338\u2013352."},{"key":"e_1_3_2_101_2","first-page":"416","volume-title":"Proceedings of the 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC\/FSE 2021","author":"Wen Jinfeng","year":"2021","unstructured":"Jinfeng Wen, Zhenpeng Chen, Yi Liu, Yiling Lou, Yun Ma, Gang Huang, Xin Jin, and Xuanzhe Liu. 2021. An empirical study on challenges of application development in serverless computing. 
In Proceedings of the 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC\/FSE 2021. 416\u2013428."},{"key":"e_1_3_2_102_2","volume-title":"Proceedings of 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022","author":"Weng Qizhen","year":"2022","unstructured":"Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, and Yu Ding. 2022. MLaaS in the wild: Workload analysis and scheduling in large-scale heterogeneous GPU clusters. In Proceedings of 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022."},{"key":"e_1_3_2_103_2","first-page":"1159","volume-title":"Proceedings of 42nd International Conference on Software Engineering, ICSE 2020","author":"Zhang Ru","year":"2020","unstructured":"Ru Zhang, Wencong Xiao, Hongyu Zhang, Yu Liu, Haoxiang Lin, and Mao Yang. 2020. An empirical study on program failures of deep learning jobs. In Proceedings of 42nd International Conference on Software Engineering, ICSE 2020. 1159\u20131170."},{"key":"e_1_3_2_104_2","first-page":"129","volume-title":"Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2018","author":"Zhang Yuhao","year":"2018","unstructured":"Yuhao Zhang, Yifan Chen, Shing-Chi Cheung, Yingfei Xiong, and Lu Zhang. 2018. An empirical study on TensorFlow program bugs. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2018. 129\u2013140."},{"key":"e_1_3_2_105_2","first-page":"8","volume-title":"Proceedings of the 2020 Workshop on Network Meets AI & ML, NetAI@SIGCOMM 2020","author":"Zhang Zhen","year":"2020","unstructured":"Zhen Zhang, Chaokun Chang, Haibin Lin, Yida Wang, Raman Arora, and Xin Jin. 2020. Is network the bottleneck of distributed training?. In Proceedings of the 2020 Workshop on Network Meets AI & ML, NetAI@SIGCOMM 2020. 
8\u201313."},{"key":"e_1_3_2_106_2","doi-asserted-by":"crossref","first-page":"350","DOI":"10.1145\/3544216.3544247","volume-title":"Proceedings of ACM SIGCOMM 2022 Conference.","author":"Zheng Naiqian","year":"2022","unstructured":"Naiqian Zheng, Mengqi Liu, Ennan Zhai, Hongqiang Harry Liu, Yifan Li, Kaicheng Yang, Xuanzhe Liu, and Xin Jin. 2022. Meissa: Scalable network testing for programmable data planes. In Proceedings of ACM SIGCOMM 2022 Conference.350\u2013364."}],"container-title":["ACM Transactions on Software Engineering and Methodology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3597204","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3597204","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:06Z","timestamp":1750182546000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3597204"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,9,29]]},"references-count":105,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2023,11,30]]}},"alternative-id":["10.1145\/3597204"],"URL":"https:\/\/doi.org\/10.1145\/3597204","relation":{},"ISSN":["1049-331X","1557-7392"],"issn-type":[{"value":"1049-331X","type":"print"},{"value":"1557-7392","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,9,29]]},"assertion":[{"value":"2022-06-23","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-04-17","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-09-29","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}