{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,6]],"date-time":"2026-06-06T01:13:06Z","timestamp":1780708386269,"version":"3.54.1"},"publisher-location":"New York, NY, USA","reference-count":52,"publisher":"ACM","license":[{"start":{"date-parts":[[2024,4,22]],"date-time":"2024-04-22T00:00:00Z","timestamp":1713744000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,4,22]]},"DOI":"10.1145\/3627703.3629554","type":"proceedings-article","created":{"date-parts":[[2024,4,18]],"date-time":"2024-04-18T06:28:28Z","timestamp":1713421708000},"page":"163-181","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":17,"title":["Aceso: Efficient Parallel DNN Training through Iterative Bottleneck Alleviation"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0335-9628","authenticated-orcid":false,"given":"Guodong","family":"Liu","sequence":"first","affiliation":[{"name":"State Key Lab of Processors, Institute of Computing Technology, CAS, Univ. of Chinese Academy of Sciences, and Microsoft Research (Asia)"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2395-9965","authenticated-orcid":false,"given":"Youshan","family":"Miao","sequence":"additional","affiliation":[{"name":"Microsoft Research"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8050-6196","authenticated-orcid":false,"given":"Zhiqi","family":"Lin","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-6840-4691","authenticated-orcid":false,"given":"Xiaoxiang","family":"Shi","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7998-3681","authenticated-orcid":false,"given":"Saeed","family":"Maleki","sequence":"additional","affiliation":[{"name":"Microsoft Research"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0378-060X","authenticated-orcid":false,"given":"Fan","family":"Yang","sequence":"additional","affiliation":[{"name":"Microsoft Research"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6565-5276","authenticated-orcid":false,"given":"Yungang","family":"Bao","sequence":"additional","affiliation":[{"name":"State Key Lab of Processors, Institute of Computing Technology, CAS, Univ. of Chinese Academy of Sciences"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9629-0860","authenticated-orcid":false,"given":"Sa","family":"Wang","sequence":"additional","affiliation":[{"name":"State Key Lab of Processors, Institute of Computing Technology, CAS, Univ. of Chinese Academy of Sciences"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2024,4,22]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/3492321.3519584"},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2018.2877890"},{"key":"e_1_3_2_1_3_1","unstructured":"Tom B Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2465282"},{"key":"e_1_3_2_1_5_1","volume-title":"Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174","author":"Chen Tianqi","year":"2016","unstructured":"Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016)."},{"key":"e_1_3_2_1_6_1","unstructured":"Jeffrey Dean Greg Corrado Rajat Monga Kai Chen Matthieu Devin Mark Mao Marc'aurelio Ranzato Andrew Senior Paul Tucker Ke Yang et al. 2012. Large scale distributed deep networks. Advances in Neural Information Processing Systems 25 (2012)."},{"key":"e_1_3_2_1_7_1","volume-title":"BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_1_8_1","volume-title":"2021 USENIX Annual Technical Conference (USENIX ATC 21)","author":"Eliad Saar","year":"2021","unstructured":"Saar Eliad, Ido Hakimi, Alon De Jagger, Mark Silberstein, and Assaf Schuster. 2021. Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 381--396."},{"key":"e_1_3_2_1_9_1","volume-title":"Retrieved","year":"2022","unstructured":"Facebook. 2022. PyTorch. Retrieved May, 2022 from https:\/\/pytorch.org\/"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3437801.3441593"},{"key":"e_1_3_2_1_11_1","volume-title":"Retrieved","year":"2022","unstructured":"Google. 2022. XLA: Optimizing Compiler for Machine Learning. Retrieved May, 2022 from https:\/\/www.tensorflow.org\/xla"},{"key":"e_1_3_2_1_12_1","volume-title":"Memory-efficient backpropagation through time. Advances in Neural Information Processing Systems 29","author":"Gruslys Audrunas","year":"2016","unstructured":"Audrunas Gruslys, R\u00e9mi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. 2016. Memory-efficient backpropagation through time. Advances in Neural Information Processing Systems 29 (2016)."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_1_14_1","first-page":"623","article-title":"dPRO: A Generic Performance Diagnosis and Optimization Toolkit for Expediting Distributed DNN Training","volume":"4","author":"Hu Hanpeng","year":"2022","unstructured":"Hanpeng Hu, Chenyu Jiang, Yuchen Zhong, Yanghua Peng, Chuan Wu, Yibo Zhu, Haibin Lin, and Chuanxiong Guo. 2022. dPRO: A Generic Performance Diagnosis and Optimization Toolkit for Expediting Distributed DNN Training. Proceedings of Machine Learning and Systems 4 (2022), 623--637.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_15_1","unstructured":"Yanping Huang Youlong Cheng Ankur Bapna Orhan Firat Dehao Chen Mia Chen HyoukJoong Lee Jiquan Ngiam Quoc V Le Yonghui Wu et al. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems. 103--112."},{"key":"e_1_3_2_1_16_1","first-page":"497","article-title":"Checkmate: Breaking the memory wall with optimal tensor rematerialization","volume":"2","author":"Jain Paras","year":"2020","unstructured":"Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica. 2020. Checkmate: Breaking the memory wall with optimal tensor rematerialization. Proceedings of Machine Learning and Systems 2 (2020), 497--511.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_17_1","volume-title":"2022 USENIX Annual Technical Conference (USENIX ATC 22)","author":"Jia Xianyan","year":"2022","unstructured":"Xianyan Jia, Le Jiang, Ang Wang, Wencong Xiao, Ziji Shi, Jie Zhang, Xinyuan Li, Langshi Chen, Yong Li, Zhen Zheng, et al. 2022. Whale: Efficient Giant Model Training over Heterogeneous GPUs. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). 673--688."},{"key":"e_1_3_2_1_18_1","first-page":"1","article-title":"Beyond Data and Model Parallelism for Deep Neural Networks","volume":"1","author":"Jia Zhihao","year":"2019","unstructured":"Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Beyond Data and Model Parallelism for Deep Neural Networks. Proceedings of Machine Learning and Systems 1 (2019), 1--13.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2019.10.012"},{"key":"e_1_3_2_1_20_1","volume-title":"Dynamic Tensor Rematerialization. In International Conference on Learning Representations.","author":"Kirisame Marisa","year":"2020","unstructured":"Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, and Zachary Tatlock. 2020. Dynamic Tensor Rematerialization. In International Conference on Learning Representations."},{"key":"e_1_3_2_1_21_1","volume-title":"Reducing Activation Recomputation in Large Transformer Models. arXiv preprint arXiv:2205.05198","author":"Korthikanti Vijay","year":"2022","unstructured":"Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2022. Reducing Activation Recomputation in Large Transformer Models. arXiv preprint arXiv:2205.05198 (2022)."},{"key":"e_1_3_2_1_22_1","volume-title":"Efficient rematerialization for deep networks. Advances in Neural Information Processing Systems 32","author":"Kumar Ravi","year":"2019","unstructured":"Ravi Kumar, Manish Purohit, Zoya Svitkina, Erik Vee, and Joshua Wang. 2019. Efficient rematerialization for deep networks. Advances in Neural Information Processing Systems 32 (2019)."},{"key":"e_1_3_2_1_23_1","volume-title":"International Conference on Machine Learning. PMLR, 6437--6449","author":"Li Guohao","year":"2021","unstructured":"Guohao Li, Matthias M\u00fcller, Bernard Ghanem, and Vladlen Koltun. 2021. Training graph neural networks with 1000 layers. In International Conference on Machine Learning. PMLR, 6437--6449."},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476145"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"crossref","unstructured":"Ze Liu Han Hu Yutong Lin Zhuliang Yao Zhenda Xie Yixuan Wei Jia Ning Yue Cao Zheng Zhang Li Dong et al. 2021. Swin Transformer V2: Scaling Up Capacity and Resolution. arXiv preprint arXiv:2111.09883 (2021).","DOI":"10.1109\/CVPR52688.2022.01170"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_3_2_1_27_1","volume-title":"International Conference on Machine Learning. PMLR, 2430--2439","author":"Mirhoseini Azalia","year":"2017","unstructured":"Azalia Mirhoseini, Hieu Pham, Quoc V Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, and Jeff Dean. 2017. Device placement optimization with reinforcement learning. In International Conference on Machine Learning. PMLR, 2430--2439."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476209"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2009.28"},{"key":"e_1_3_2_1_31_1","unstructured":"NVIDIA. 2020. DGX-1. Retrieved Aug. 2022 from https:\/\/www.nvidia.com\/en-gb\/data-center\/dgx-systems\/dgx-1\/"},{"key":"e_1_3_2_1_32_1","volume-title":"CUDA C Programming Guide. Retrieved","author":"NVIDIA.","year":"2022","unstructured":"NVIDIA. 2022. CUDA C Programming Guide. Retrieved Aug. 2022 from https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/"},{"key":"e_1_3_2_1_33_1","volume-title":"12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15)","author":"Ousterhout Kay","year":"2015","unstructured":"Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, and Byung-Gon Chun. 2015. Making sense of performance in data analytics frameworks. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15). 293--307."},{"key":"e_1_3_2_1_34_1","volume-title":"Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350","author":"Patterson David","year":"2021","unstructured":"David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. 2021. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350 (2021)."},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/2814576.2814808"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359642"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCCN.2018.8487324"},{"key":"e_1_3_2_1_38_1","volume-title":"Improving language understanding by generative pre-training. arXiv preprint arXiv:1704.01444","author":"Radford Alec","year":"2017","unstructured":"Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. arXiv preprint arXiv:1704.01444, 2017 (2018)."},{"key":"e_1_3_2_1_39_1","unstructured":"Alec Radford Jeffrey Wu Rewon Child David Luan Dario Amodei Ilya Sutskever et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1 8 (2019) 9."},{"key":"e_1_3_2_1_40_1","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 140 (2020), 1--67.","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/1553374.1553486"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3406703"},{"key":"e_1_3_2_1_43_1","volume-title":"Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)."},{"key":"e_1_3_2_1_44_1","first-page":"24829","article-title":"Piper: Multidimensional planner for DNN parallelization","volume":"34","author":"Tarnawski Jakub M","year":"2021","unstructured":"Jakub M Tarnawski, Deepak Narayanan, and Amar Phanishayee. 2021. Piper: Multidimensional planner for DNN parallelization. Advances in Neural Information Processing Systems 34 (2021), 24829--24840.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_45_1","first-page":"15451","article-title":"Efficient algorithms for device placement of DNN graph operators","volume":"33","author":"Tarnawski Jakub M","year":"2020","unstructured":"Jakub M Tarnawski, Amar Phanishayee, Nikhil Devanur, Divya Mahajan, and Fanny Nina Paravecino. 2020. Efficient algorithms for device placement of DNN graph operators. Advances in Neural Information Processing Systems 33 (2020), 15451--15463.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_46_1","unstructured":"Uber. 2022. Horovod. Retrieved Aug. 2022 from https:\/\/horovod.ai\/"},{"key":"e_1_3_2_1_47_1","volume-title":"Deepnet: Scaling transformers to 1,000 layers. arXiv preprint arXiv:2203.00555","author":"Wang Hongyu","year":"2022","unstructured":"Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei. 2022. Deepnet: Scaling transformers to 1,000 layers. arXiv preprint arXiv:2203.00555 (2022)."},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3302424.3303953"},{"key":"e_1_3_2_1_49_1","volume-title":"GSPMD: General and Scalable Parallelization for ML Computation Graphs. arXiv preprint arXiv:2105.04663","author":"Xu Yuanzhong","year":"2021","unstructured":"Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, et al. 2021. GSPMD: General and Scalable Parallelization for ML Computation Graphs. arXiv preprint arXiv:2105.04663 (2021)."},{"key":"e_1_3_2_1_50_1","volume-title":"Wide residual networks. arXiv preprint arXiv:1605.07146","author":"Zagoruyko Sergey","year":"2016","unstructured":"Sergey Zagoruyko and Nikos Komodakis. 2016. Wide residual networks. arXiv preprint arXiv:1605.07146 (2016)."},{"key":"e_1_3_2_1_51_1","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating inter-and intra-Operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559--578."},{"key":"e_1_3_2_1_52_1","volume-title":"Daydream: Accurately Estimating the Efficacy of Optimizations for DNN Training. In 2020 USENIX Annual Technical Conference (USENIX ATC 20)","author":"Zhu Hongyu","year":"2020","unstructured":"Hongyu Zhu, Amar Phanishayee, and Gennady Pekhimenko. 2020. Daydream: Accurately Estimating the Efficacy of Optimizations for DNN Training. In 2020 USENIX Annual Technical Conference (USENIX ATC 20). 337--352."}],"event":{"name":"EuroSys '24: Nineteenth European Conference on Computer Systems","location":"Athens Greece","acronym":"EuroSys '24","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems"]},"container-title":["Proceedings of the Nineteenth European Conference on Computer Systems"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3627703.3629554","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3627703.3629554","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T01:12:38Z","timestamp":1755825158000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3627703.3629554"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,4,22]]},"references-count":52,"alternative-id":["10.1145\/3627703.3629554","10.1145\/3627703"],"URL":"https:\/\/doi.org\/10.1145\/3627703.3629554","relation":{},"subject":[],"published":{"date-parts":[[2024,4,22]]},"assertion":[{"value":"2024-04-22","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}