{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T01:51:20Z","timestamp":1773193880452,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":53,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,2,22]],"date-time":"2022-02-22T00:00:00Z","timestamp":1645488000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["CCF-2052696"],"award-info":[{"award-number":["CCF-2052696"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,2,28]]},"DOI":"10.1145\/3503222.3507778","type":"proceedings-article","created":{"date-parts":[[2022,2,22]],"date-time":"2022-02-22T20:49:01Z","timestamp":1645562941000},"page":"402-416","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":48,"title":["Breaking the computation and communication abstraction barrier in distributed machine learning workloads"],"prefix":"10.1145","author":[{"given":"Abhinav","family":"Jangda","sequence":"first","affiliation":[{"name":"University of Massachusetts at Amherst, USA"}]},{"given":"Jun","family":"Huang","sequence":"additional","affiliation":[{"name":"Ohio State University, USA"}]},{"given":"Guodong","family":"Liu","sequence":"additional","affiliation":[{"name":"Chinese Academy of Sciences, China"}]},{"given":"Amir Hossein Nodehi","family":"Sabet","sequence":"additional","affiliation":[{"name":"University of California at Riverside, USA"}]},{"given":"Saeed","family":"Maleki","sequence":"additional","affiliation":[{"name":"Microsoft Research, 
USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2395-9965","authenticated-orcid":false,"given":"Youshan","family":"Miao","sequence":"additional","affiliation":[{"name":"Microsoft Research, China"}]},{"given":"Madanlal","family":"Musuvathi","sequence":"additional","affiliation":[{"name":"Microsoft Research, USA"}]},{"given":"Todd","family":"Mytkowicz","sequence":"additional","affiliation":[{"name":"Microsoft Research, USA"}]},{"given":"Olli","family":"Saarikivi","sequence":"additional","affiliation":[{"name":"Microsoft Research, USA"}]}],"member":"320","published-online":{"date-parts":[[2022,2,22]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Accessed: 2022-01-12. Apache MXNet. https:\/\/mxnet.apache.org\/"},{"key":"e_1_3_2_1_2_1","unstructured":"Accessed: 2022-01-12. cuBLAS. https:\/\/docs.nvidia.com\/cuda\/cublas\/index.html"},{"key":"e_1_3_2_1_3_1","unstructured":"Accessed: 2022-01-12. cuDNN. https:\/\/docs.nvidia.com\/cuda\/cudnn\/index.html"},{"key":"e_1_3_2_1_4_1","unstructured":"Accessed: 2022-01-12. CUTLASS. https:\/\/github.com\/NVIDIA\/cutlass"},{"key":"e_1_3_2_1_5_1","unstructured":"Accessed: 2022-01-12. GPUDirect RDMA. https:\/\/docs.nvidia.com\/cuda\/gpudirect-rdma\/index.html"},{"key":"e_1_3_2_1_6_1","unstructured":"Accessed: 2022-01-12. NVIDIA Apex. https:\/\/github.com\/NVIDIA\/apex"},{"key":"e_1_3_2_1_7_1","unstructured":"Accessed: 2022-01-12. NVIDIA BERT. https:\/\/github.com\/NVIDIA\/DeepLearningExamples"},{"key":"e_1_3_2_1_8_1","unstructured":"Accessed: 2022-01-12. NVIDIA Collective Communication Library. https:\/\/github.com\/NVIDIA\/nccl"},{"key":"e_1_3_2_1_9_1","unstructured":"Accessed: 2022-01-12. NVIDIA Megatron-LM. https:\/\/github.com\/NVIDIA\/Megatron-LM\/"},{"key":"e_1_3_2_1_10_1","unstructured":"Accessed: 2022-01-12. OpenAI\u2019s GPT-3 Language Model: A Technical Overview. 
https:\/\/lambdalabs.com\/blog\/demystifying-gpt-3\/"},{"key":"e_1_3_2_1_11_1","unstructured":"Accessed: 2022-01-12. Parameter fusion in optimizer partition makes LAMB behaves differently. https:\/\/github.com\/microsoft\/DeepSpeed\/issues\/490"},{"key":"e_1_3_2_1_12_1","unstructured":"Accessed: 2022-01-12. Training with Mixed Precision. https:\/\/docs.nvidia.com\/deeplearning\/performance\/mixed-precision-training\/index.html"},{"key":"e_1_3_2_1_13_1","volume-title":"TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation.","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-016-0477-7"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/HiPC.2013.6799131"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2503210.2503289"},{"key":"e_1_3_2_1_17_1","unstructured":"Tom B. Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell Sandhini Agarwal Ariel Herbert-Voss Gretchen Krueger Tom Henighan Rewon Child Aditya Ramesh Daniel M. Ziegler Jeffrey Wu Clemens Winter Christopher Hesse Mark Chen Eric Sigler Mateusz Litwin Scott Gray Benjamin Chess Jack Clark Christopher Berner Sam McCandlish Alec Radford Ilya Sutskever and Dario Amodei. 2020. Language Models are Few-Shot Learners. 
In Advances in Neural Information Processing Systems."},{"key":"e_1_3_2_1_18_1","volume-title":"Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation.","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2016.37"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2851141.2851157"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1"},{"key":"e_1_3_2_1_22_1","volume-title":"MPI: A Message-Passing Interface Standard Version 3.0.","author":"Message Passing Interface Forum","year":"2012","unstructured":"Message Passing Interface Forum. 2012. MPI: A Message-Passing Interface Standard Version 3.0."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2016.51"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3410463.3414632"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3168824"},{"key":"e_1_3_2_1_26_1","unstructured":"Yanping Huang Youlong Cheng Ankur Bapna Orhan Firat Dehao Chen Mia Chen HyoukJoong Lee Jiquan Ngiam Quoc V Le Yonghui Wu and Zhifeng Chen. 2019. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. In Advances in Neural Information Processing Systems 32."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3410463.3414649"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.6084\/m9.figshare.18480953"},{"key":"e_1_3_2_1_29_1","volume-title":"Caffe: Convolutional Architecture for Fast Feature Embedding. 
arXiv preprint arXiv:1408.5093.","author":"Jia Yangqing","year":"2014","unstructured":"Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093."},{"key":"e_1_3_2_1_30_1","volume-title":"Proceedings of Machine Learning and Systems.","author":"Jia Zhihao","year":"2019","unstructured":"Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Beyond Data and Model Parallelism for Deep Neural Networks. In Proceedings of Machine Learning and Systems."},{"key":"e_1_3_2_1_31_1","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation. https:\/\/www.usenix.org\/system\/files\/osdi20-jiang.pdf","author":"Jiang Yimin","year":"2020","unstructured":"Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanxiong Guo. 2020. A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU\/CPU Clusters. In 14th USENIX Symposium on Operating Systems Design and Implementation. https:\/\/www.usenix.org\/system\/files\/osdi20-jiang.pdf"},{"key":"e_1_3_2_1_32_1","volume-title":"Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR","author":"Kingma Diederik P.","year":"2015","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015. arxiv:1412.6980"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0743-7315(03)00102-3"},{"key":"e_1_3_2_1_34_1","volume-title":"GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. In International Conference on Learning Representations.","author":"Lepikhin Dmitry","year":"2021","unstructured":"Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2021. 
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. In International Conference on Learning Representations."},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415530"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","unstructured":"H. Lu S. Seo and P. Balaji. 2015. MPI+ULT: Overlapping Communication and Computation with User-Level Threads. In 2015 IEEE 17th International Conference on High Performance Computing and Communications 2015 IEEE 7th International Symposium on Cyberspace Safety and Security and 2015 IEEE 12th International Conference on Embedded Software and Systems. https:\/\/doi.org\/10.1109\/HPCC-CSS-ICESS.2015.82 10.1109\/HPCC-CSS-ICESS.2015.82","DOI":"10.1109\/HPCC-CSS-ICESS.2015.82"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/1810085.1810091"},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476209"},{"key":"e_1_3_2_1_40_1","volume-title":"PyTorch: An Imperative Style","author":"Paszke Adam","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32."},{"key":"e_1_3_2_1_41_1","unstructured":"Alec Radford Jeff Wu Rewon Child David Luan Dario Amodei and Ilya Sutskever. 2019. 
Language Models are Unsupervised Multitask Learners."},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/2491956.2462176"},{"key":"e_1_3_2_1_43_1","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.","author":"Rajbhandari Samyam","year":"2020","unstructured":"Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory Optimizations toward Training Trillion Parameter Models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis."},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2017.2778161"},{"key":"e_1_3_2_1_45_1","unstructured":"Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arxiv:1802.05799."},{"key":"e_1_3_2_1_46_1","unstructured":"Noam Shazeer Youlong Cheng Niki Parmar Dustin Tran Ashish Vaswani Penporn Koanantakool Peter Hawkins HyoukJoong Lee Mingsheng Hong Cliff Young Ryan Sepassi and Blake Hechtman. 2018. Mesh-TensorFlow: Deep Learning for Supercomputers. In Advances in Neural Information Processing Systems."},{"key":"e_1_3_2_1_47_1","unstructured":"Mohammad Shoeybi Mostofa Patwary Raul Puri Patrick LeGresley Jared Casper and Bryan Catanzaro. 2020. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arxiv:1909.08053."},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2017.7863730"},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"crossref","unstructured":"Emma Strubell Ananya Ganesh and Andrew McCallum. 2019. Energy and Policy Considerations for Deep Learning in NLP. arxiv:1906.02243.","DOI":"10.18653\/v1\/P19-1355"},{"key":"e_1_3_2_1_50_1","volume-title":"High Performance Computing","author":"Subramoni Hari","year":"2017","unstructured":"Hari Subramoni, Sourav Chakraborty, and Dhabaleswar K. Panda. 2017. 
Designing Dynamic and Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation and Communication. In High Performance Computing."},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00089"},{"key":"e_1_3_2_1_52_1","volume-title":"International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=Syx4wnEtvH","author":"You Yang","year":"2020","unstructured":"Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2020. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=Syx4wnEtvH"},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2021.3094364"}],"event":{"name":"ASPLOS '22: 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems","location":"Lausanne Switzerland","acronym":"ASPLOS '22","sponsor":["SIGPLAN ACM Special Interest Group on Programming Languages","SIGOPS ACM Special Interest Group on Operating Systems","SIGARCH ACM Special Interest Group on Computer Architecture","SIGBED ACM Special Interest Group on Embedded Systems"]},"container-title":["Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating 
Systems"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503222.3507778","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503222.3507778","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503222.3507778","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:30:50Z","timestamp":1750188650000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503222.3507778"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,2,22]]},"references-count":53,"alternative-id":["10.1145\/3503222.3507778","10.1145\/3503222"],"URL":"https:\/\/doi.org\/10.1145\/3503222.3507778","relation":{},"subject":[],"published":{"date-parts":[[2022,2,22]]},"assertion":[{"value":"2022-02-22","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}