{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,5]],"date-time":"2026-03-05T16:03:29Z","timestamp":1772726609677,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":55,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,12,19]],"date-time":"2022-12-19T00:00:00Z","timestamp":1671408000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,12,19]]},"DOI":"10.1145\/3567955.3567959","type":"proceedings-article","created":{"date-parts":[[2022,12,21]],"date-time":"2022-12-21T18:24:44Z","timestamp":1671647084000},"page":"93-106","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":43,"title":["Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models"],"prefix":"10.1145","author":[{"given":"Shibo","family":"Wang","sequence":"first","affiliation":[{"name":"Google, USA"}]},{"given":"Jinliang","family":"Wei","sequence":"additional","affiliation":[{"name":"Google, USA"}]},{"given":"Amit","family":"Sabne","sequence":"additional","affiliation":[{"name":"Google, USA"}]},{"given":"Andy","family":"Davis","sequence":"additional","affiliation":[{"name":"Google, USA"}]},{"given":"Berkin","family":"Ilbeyi","sequence":"additional","affiliation":[{"name":"Google, USA"}]},{"given":"Blake","family":"Hechtman","sequence":"additional","affiliation":[{"name":"Google, USA"}]},{"given":"Dehao","family":"Chen","sequence":"additional","affiliation":[{"name":"Waymo, USA"}]},{"given":"Karthik Srinivasa","family":"Murthy","sequence":"additional","affiliation":[{"name":"Google, USA"}]},{"given":"Marcello","family":"Maggioni","sequence":"additional","affiliation":[{"name":"Google, USA"}]},{"given":"Qiao","family":"Zhang","sequence":"additional","affiliation":[{"name":"Google, USA"}]},{"given":"Sameer","family":"Kumar","sequence":"additional","affiliation":[{"name":"Google, USA"}]},{"given":"Tongfei","family":"Guo","sequence":"additional","affiliation":[{"name":"Google, USA"}]},{"given":"Yuanzhong","family":"Xu","sequence":"additional","affiliation":[{"name":"Google, USA"}]},{"given":"Zongwei","family":"Zhou","sequence":"additional","affiliation":[{"name":"Google, USA"}]}],"member":"320","published-online":{"date-parts":[[2022,12,21]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"2020. Google breaks AI performance records in MLPerf with world\u2019s fastest training supercomputer. https:\/\/cloud.google.com\/blog\/products\/ai-machine-learning\/google-breaks-ai-performance-records-in-mlperf-with-worlds-fastest-training-supercomputer \t\t\t\t  2020. Google breaks AI performance records in MLPerf with world\u2019s fastest training supercomputer. https:\/\/cloud.google.com\/blog\/products\/ai-machine-learning\/google-breaks-ai-performance-records-in-mlperf-with-worlds-fastest-training-supercomputer"},{"key":"e_1_3_2_1_2_1","unstructured":"2021. MLPerf Training v1.1. https:\/\/mlcommons.org\/en\/training-normal-11\/ \t\t\t\t  2021. MLPerf Training v1.1. https:\/\/mlcommons.org\/en\/training-normal-11\/"},{"key":"e_1_3_2_1_3_1","unstructured":"2021. XLA: Optimizing Compiler for TensorFlow. https:\/\/www.tensorflow.org\/xla \t\t\t\t  2021. XLA: Optimizing Compiler for TensorFlow. https:\/\/www.tensorflow.org\/xla"},{"key":"e_1_3_2_1_4_1","unstructured":"2022. NVIDIA H100 Tensor Core GPU Architecture. https:\/\/www.hpctech.co.jp\/catalog\/gtc22-whitepaper-hopper_v1.01.pdf \t\t\t\t  2022. NVIDIA H100 Tensor Core GPU Architecture. https:\/\/www.hpctech.co.jp\/catalog\/gtc22-whitepaper-hopper_v1.01.pdf"},{"key":"e_1_3_2_1_5_1","unstructured":"2022. XLA DynamicSlice Semantics. https:\/\/www.tensorflow.org\/xla\/operation_semantics##dynamicslice \t\t\t\t  2022. XLA DynamicSlice Semantics. https:\/\/www.tensorflow.org\/xla\/operation_semantics##dynamicslice"},{"key":"e_1_3_2_1_6_1","unstructured":"2022. XLA DynamicUpdateSlice Semantics. https:\/\/www.tensorflow.org\/xla\/operation_semantics##dynamicupdateslice \t\t\t\t  2022. XLA DynamicUpdateSlice Semantics. https:\/\/www.tensorflow.org\/xla\/operation_semantics##dynamicupdateslice"},{"key":"e_1_3_2_1_7_1","volume-title":"TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16)","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi , Paul Barham , Jianmin Chen , Zhifeng Chen , Andy Davis , Jeffrey Dean , Matthieu Devin , Sanjay Ghemawat , Geoffrey Irving , Michael Isard , Manjunath Kudlur , Josh Levenberg , Rajat Monga , Sherry Moore , Derek G. Murray , Benoit Steiner , Paul Tucker , Vijay Vasudevan , Pete Warden , Martin Wicke , Yuan Yu , and Xiaoqiang Zheng . 2016 . TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) . Savannah, GA. 265\u2013283. Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). Savannah, GA. 265\u2013283."},{"key":"e_1_3_2_1_8_1","volume-title":"Le","author":"Adiwardana Daniel","year":"2020","unstructured":"Daniel Adiwardana , Minh-Thang Luong , David R. So , Jamie Hall , Noah Fiedel , Romal Thoppilan , Zi Yang , Apoorv Kulshreshtha , Gaurav Nemade , Yifeng Lu , and Quoc V . Le . 2020 . Towards a Human-like Open-Domain Chatbot. CoRR , abs\/2001.09977 (2020), arXiv:2001.09977. arxiv:2001.09977 Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. 2020. Towards a Human-like Open-Domain Chatbot. CoRR, abs\/2001.09977 (2020), arXiv:2001.09977. arxiv:2001.09977"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1147\/rd.395.0575"},{"key":"e_1_3_2_1_10_1","volume-title":"Shah","author":"Bezanson Jeff","year":"2014","unstructured":"Jeff Bezanson , Alan Edelman , Stefan Karpinski , and Viral B . Shah . 2014 . Julia : A Fresh Approach to Numerical Computing. CoRR , abs\/1411.1607 (2014), arXiv:1411.1607. arxiv:1411.1607 Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. 2014. Julia: A Fresh Approach to Numerical Computing. CoRR, abs\/1411.1607 (2014), arXiv:1411.1607. arxiv:1411.1607"},{"key":"e_1_3_2_1_11_1","volume-title":"Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang.","author":"Bradbury James","year":"2018","unstructured":"James Bradbury , Roy Frostig , Peter Hawkins , Matthew James Johnson , Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018 . JAX: composable transformations of Python +NumPy programs. http:\/\/github.com\/google\/jax James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. JAX: composable transformations of Python+NumPy programs. http:\/\/github.com\/google\/jax"},{"key":"e_1_3_2_1_12_1","volume-title":"Language Models are Few-Shot Learners. CoRR, abs\/2005.14165","author":"Brown Tom B.","year":"2020","unstructured":"Tom B. Brown , Benjamin Mann , Nick Ryder , Melanie Subbiah , Jared Kaplan , Prafulla Dhariwal , Arvind Neelakantan , Pranav Shyam , Girish Sastry , Amanda Askell , Sandhini Agarwal , Ariel Herbert-Voss , Gretchen Krueger , Tom Henighan , Rewon Child , Aditya Ramesh , Daniel M. Ziegler , Jeffrey Wu , Clemens Winter , Christopher Hesse , Mark Chen , Eric Sigler , Mateusz Litwin , Scott Gray , Benjamin Chess , Jack Clark , Christopher Berner , Sam McCandlish , Alec Radford , Ilya Sutskever , and Dario Amodei . 2020. Language Models are Few-Shot Learners. CoRR, abs\/2005.14165 ( 2020 ), arXiv:2005.14165. arxiv:2005.14165 Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. CoRR, abs\/2005.14165 (2020), arXiv:2005.14165. arxiv:2005.14165"},{"key":"e_1_3_2_1_13_1","unstructured":"Lynn Elliot Cannon. 1969. A Cellular Computer to Implement the Kalman Filter Algorithm. Ph. D. Dissertation. USA. AAI7010025 \t\t\t\t  Lynn Elliot Cannon. 1969. A Cellular Computer to Implement the Kalman Filter Algorithm. Ph. D. Dissertation. USA. AAI7010025"},{"key":"e_1_3_2_1_14_1","volume-title":"TVM: End-to-End Optimization Stack for Deep Learning. abs\/1802.04799","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen , Thierry Moreau , Ziheng Jiang , Haichen Shen , Eddie Q. Yan , Leyuan Wang , Yuwei Hu , Luis Ceze , Carlos Guestrin , and Arvind Krishnamurthy . 2018 . TVM: End-to-End Optimization Stack for Deep Learning. abs\/1802.04799 (2018), arXiv:1802.04799. arxiv:1802.04799 Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Q. Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: End-to-End Optimization Stack for Deep Learning. abs\/1802.04799 (2018), arXiv:1802.04799. arxiv:1802.04799"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2005.75"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/1542275.1542321"},{"key":"e_1_3_2_1_17_1","volume-title":"Proceedings 16th International Parallel and Distributed Processing Symposium. 10 pp\u2013.","author":"Darte A.","unstructured":"A. Darte , D. Chavarria-Miranda , R. Fowler , and J. Mellor-Crummey . 2002. Generalized multipartitioning for multi-dimensional arrays . In Proceedings 16th International Parallel and Distributed Processing Symposium. 10 pp\u2013. A. Darte, D. Chavarria-Miranda, R. Fowler, and J. Mellor-Crummey. 2002. Generalized multipartitioning for multi-dimensional arrays. In Proceedings 16th International Parallel and Distributed Processing Symposium. 10 pp\u2013."},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1137\/0210049"},{"key":"e_1_3_2_1_19_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs\/1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2018 . BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs\/1810.04805 (2018), arXiv:1810.04805. arxiv:1810.04805 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs\/1810.04805 (2018), arXiv:1810.04805. arxiv:1810.04805"},{"key":"e_1_3_2_1_20_1","unstructured":"Nan Du Yanping Huang Andrew M. Dai Simon Tong Dmitry Lepikhin Yuanzhong Xu Maxim Krikun Yanqi Zhou Adams Wei Yu Orhan Firat Barret Zoph Liam Fedus Maarten Bosma Zongwei Zhou Tao Wang Yu Emma Wang Kellie Webster Marie Pellat Kevin Robinson Kathy Meier-Hellstern Toju Duke Lucas Dixon Kun Zhang Quoc V Le Yonghui Wu Zhifeng Chen and Claire Cui. 2021. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts. arxiv:2112.06905. \t\t\t\t  Nan Du Yanping Huang Andrew M. Dai Simon Tong Dmitry Lepikhin Yuanzhong Xu Maxim Krikun Yanqi Zhou Adams Wei Yu Orhan Firat Barret Zoph Liam Fedus Maarten Bosma Zongwei Zhou Tao Wang Yu Emma Wang Kellie Webster Marie Pellat Kevin Robinson Kathy Meier-Hellstern Toju Duke Lucas Dixon Kun Zhang Quoc V Le Yonghui Wu Zhifeng Chen and Claire Cui. 2021. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts. arxiv:2112.06905."},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.5555\/898758"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.5555\/2388996.2389132"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTER.2016.62"},{"key":"e_1_3_2_1_24_1","volume-title":"GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. CoRR, abs\/1811.06965","author":"Huang Yanping","year":"2018","unstructured":"Yanping Huang , Yonglong Cheng , Dehao Chen , HyoukJoong Lee , Jiquan Ngiam , Quoc V. Le , and Zhifeng Chen . 2018. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. CoRR, abs\/1811.06965 ( 2018 ), arXiv:1811.06965. arxiv:1811.06965 Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen. 2018. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. CoRR, abs\/1811.06965 (2018), arXiv:1811.06965. arxiv:1811.06965"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1007554715418"},{"key":"e_1_3_2_1_26_1","volume-title":"George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff Young, and David Patterson.","author":"Jouppi Norman P.","year":"2020","unstructured":"Norman P. Jouppi , Doe Hyun Yoon , George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff Young, and David Patterson. 2020 . A Domain-Specific Supercomputer for Training Deep Neural Networks. Commun. ACM , 63, 7 (2020), jun, 67\u201378. Norman P. Jouppi, Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff Young, and David Patterson. 2020. A Domain-Specific Supercomputer for Training Deep Neural Networks. Commun. ACM, 63, 7 (2020), jun, 67\u201378."},{"key":"e_1_3_2_1_27_1","volume-title":"Scaling Laws for Neural Language Models. CoRR, abs\/2001.08361","author":"Kaplan Jared","year":"2020","unstructured":"Jared Kaplan , Sam McCandlish , Tom Henighan , Tom B. Brown , Benjamin Chess , Rewon Child , Scott Gray , Alec Radford , Jeffrey Wu , and Dario Amodei . 2020. Scaling Laws for Neural Language Models. CoRR, abs\/2001.08361 ( 2020 ), arXiv:2001.08361. arxiv:2001.08361 Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. CoRR, abs\/2001.08361 (2020), arXiv:2001.08361. arxiv:2001.08361"},{"key":"e_1_3_2_1_28_1","volume-title":"On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. CoRR, abs\/1609.04836","author":"Keskar Nitish Shirish","year":"2016","unstructured":"Nitish Shirish Keskar , Dheevatsa Mudigere , Jorge Nocedal , Mikhail Smelyanskiy , and Ping Tak Peter Tang . 2016. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. CoRR, abs\/1609.04836 ( 2016 ), arXiv:1609.04836. arxiv:1609.04836 Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2016. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. CoRR, abs\/1609.04836 (2016), arXiv:1609.04836. arxiv:1609.04836"},{"key":"e_1_3_2_1_29_1","volume-title":"Pipelined Backpropagation at Scale: Training Large Models without Batches. CoRR, abs\/2003.11666","author":"Kosson Atli","year":"2020","unstructured":"Atli Kosson , Vitaliy Chiley , Abhinav Venigalla , Joel Hestness , and Urs K\u00f6ster . 2020. Pipelined Backpropagation at Scale: Training Large Models without Batches. CoRR, abs\/2003.11666 ( 2020 ), arXiv:2003.11666. arxiv:2003.11666 Atli Kosson, Vitaliy Chiley, Abhinav Venigalla, Joel Hestness, and Urs K\u00f6ster. 2020. Pipelined Backpropagation at Scale: Training Large Models without Batches. CoRR, abs\/2003.11666 (2020), arXiv:2003.11666. arxiv:2003.11666"},{"key":"e_1_3_2_1_30_1","volume-title":"GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. CoRR, abs\/2006.16668","author":"Lepikhin Dmitry","year":"2020","unstructured":"Dmitry Lepikhin , HyoukJoong Lee , Yuanzhong Xu , Dehao Chen , Orhan Firat , Yanping Huang , Maxim Krikun , Noam Shazeer , and Zhifeng Chen . 2020. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. CoRR, abs\/2006.16668 ( 2020 ), arXiv:2006.16668. arxiv:2006.16668 Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. CoRR, abs\/2006.16668 (2020), arXiv:2006.16668. arxiv:2006.16668"},{"key":"e_1_3_2_1_31_1","unstructured":"Nilesh Mahajan Sajith Sasidharan Arun Chauhan and Andrew Lumsdaine. 2012. Automatically Generating Coarse Grained Software Pipelining from Declaratively Specified Communication. 05. \t\t\t\t  Nilesh Mahajan Sajith Sasidharan Arun Chauhan and Andrew Lumsdaine. 2012. Automatically Generating Coarse Grained Software Pipelining from Declaratively Specified Communication. 05."},{"key":"e_1_3_2_1_32_1","unstructured":"Dheevatsa Mudigere Yuchen Hao Jianyu Huang Andrew Tulloch Srinivas Sridharan Xing Liu Mustafa Ozdal Jade Nie Jongsoo Park Liang Luo Jie Amy Yang Leon Gao Dmytro Ivchenko Aarti Basant Yuxi Hu Jiyan Yang Ehsan K. Ardestani Xiaodong Wang Rakesh Komuravelli Ching-Hsiang Chu Serhat Yilmaz Huayu Li Jiyuan Qian Zhuobo Feng Yinbin Ma Junjie Yang Ellie Wen Hong Li Lin Yang Chonglin Sun Whitney Zhao Dimitry Melts Krishna Dhulipala K. R. Kishore Tyler Graf Assaf Eisenman Kiran Kumar Matam Adi Gangidi Guoqiang Jerry Chen Manoj Krishnan Avinash Nayak Krishnakumar Nair Bharath Muthiah Mahmoud khorashadi Pallab Bhattacharya Petr Lapukhov Maxim Naumov Lin Qiao Mikhail Smelyanskiy Bill Jia and Vijay Rao. 2021. High-performance Distributed Training of Large-scale Deep Learning Recommendation Models. CoRR abs\/2104.05158 (2021) arXiv:2104.05158. arxiv:2104.05158 \t\t\t\t  Dheevatsa Mudigere Yuchen Hao Jianyu Huang Andrew Tulloch Srinivas Sridharan Xing Liu Mustafa Ozdal Jade Nie Jongsoo Park Liang Luo Jie Amy Yang Leon Gao Dmytro Ivchenko Aarti Basant Yuxi Hu Jiyan Yang Ehsan K. Ardestani Xiaodong Wang Rakesh Komuravelli Ching-Hsiang Chu Serhat Yilmaz Huayu Li Jiyuan Qian Zhuobo Feng Yinbin Ma Junjie Yang Ellie Wen Hong Li Lin Yang Chonglin Sun Whitney Zhao Dimitry Melts Krishna Dhulipala K. R. Kishore Tyler Graf Assaf Eisenman Kiran Kumar Matam Adi Gangidi Guoqiang Jerry Chen Manoj Krishnan Avinash Nayak Krishnakumar Nair Bharath Muthiah Mahmoud khorashadi Pallab Bhattacharya Petr Lapukhov Maxim Naumov Lin Qiao Mikhail Smelyanskiy Bill Jia and Vijay Rao. 2021. High-performance Distributed Training of Large-scale Deep Learning Recommendation Models. CoRR abs\/2104.05158 (2021) arXiv:2104.05158. arxiv:2104.05158"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_3_2_1_34_1","volume-title":"Memory-Efficient Pipeline-Parallel DNN Training. CoRR, abs\/2006.09503","author":"Narayanan Deepak","year":"2020","unstructured":"Deepak Narayanan , Amar Phanishayee , Kaiyu Shi , Xie Chen , and Matei Zaharia . 2020. Memory-Efficient Pipeline-Parallel DNN Training. CoRR, abs\/2006.09503 ( 2020 ), arXiv:2006.09503. arxiv:2006.09503 Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. 2020. Memory-Efficient Pipeline-Parallel DNN Training. CoRR, abs\/2006.09503 (2020), arXiv:2006.09503. arxiv:2006.09503"},{"key":"e_1_3_2_1_35_1","volume-title":"Efficient Large-Scale Language Model Training on GPU Clusters. CoRR, abs\/2104.04473","author":"Narayanan Deepak","year":"2021","unstructured":"Deepak Narayanan , Mohammad Shoeybi , Jared Casper , Patrick LeGresley , Mostofa Patwary , Vijay Korthikanti , Dmitri Vainbrand , Prethvi Kashinkunti , Julie Bernauer , Bryan Catanzaro , Amar Phanishayee , and Matei Zaharia . 2021. Efficient Large-Scale Language Model Training on GPU Clusters. CoRR, abs\/2104.04473 ( 2021 ), arXiv:2104.04473. arxiv:2104.04473 Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient Large-Scale Language Model Training on GPU Clusters. CoRR, abs\/2104.04473 (2021), arXiv:2104.04473. arxiv:2104.04473"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2021.3058217"},{"key":"e_1_3_2_1_37_1","unstructured":"Ali Alvi Paresh Kharya. 2021. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B the World\u2019s Largest and Most Powerful Generative Language Model. https:\/\/developer.nvidia.com\/blog\/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model\/ \t\t\t\t  Ali Alvi Paresh Kharya. 2021. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B the World\u2019s Largest and Most Powerful Generative Language Model. https:\/\/developer.nvidia.com\/blog\/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model\/"},{"key":"e_1_3_2_1_38_1","volume-title":"High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke , Sam Gross , Francisco Massa , Adam Lerer , James Bradbury , Gregory Chanan , Trevor Killeen , Zeming Lin , Natalia Gimelshein , Luca Antiga , Alban Desmaison , Andreas K\u00f6pf , Edward Z. Yang , Zachary DeVito , Martin Raison , Alykhan Tejani , Sasank Chilamkurthy , Benoit Steiner , Lu Fang , Junjie Bai , and Soumith Chintala . 2019 . PyTorch: An Imperative Style , High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019 , NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada. 8024\u20138035. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas K\u00f6pf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada. 8024\u20138035."},{"key":"e_1_3_2_1_39_1","volume-title":"Carbon Emissions and Large Neural Network Training. CoRR, abs\/2104.10350","author":"Patterson David A.","year":"2021","unstructured":"David A. Patterson , Joseph Gonzalez , Quoc V. Le , Chen Liang , Lluis-Miquel Munguia , Daniel Rothchild , David R. So , Maud Texier , and Jeff Dean . 2021. Carbon Emissions and Large Neural Network Training. CoRR, abs\/2104.10350 ( 2021 ), arXiv:2104.10350. arxiv:2104.10350 David A. Patterson, Joseph Gonzalez, Quoc V. Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David R. So, Maud Texier, and Jeff Dean. 2021. Carbon Emissions and Large Neural Network Training. CoRR, abs\/2104.10350 (2021), arXiv:2104.10350. arxiv:2104.10350"},{"key":"e_1_3_2_1_40_1","volume-title":"Recent Advances in the Message Passing Interface","author":"Pellegrini Simone","unstructured":"Simone Pellegrini , Torsten Hoefler , and Thomas Fahringer . 2012. Exact Dependence Analysis for Increased Communication Overlap . In Recent Advances in the Message Passing Interface . Springer Berlin Heidelberg , Berlin, Heidelberg . 89\u201399. isbn:978-3-642-33518-1 Simone Pellegrini, Torsten Hoefler, and Thomas Fahringer. 2012. Exact Dependence Analysis for Increased Communication Overlap. In Recent Advances in the Message Passing Interface. Springer Berlin Heidelberg, Berlin, Heidelberg. 89\u201399. isbn:978-3-642-33518-1"},{"key":"e_1_3_2_1_41_1","volume-title":"Liu","author":"Raffel Colin","year":"2019","unstructured":"Colin Raffel , Noam Shazeer , Adam Roberts , Katherine Lee , Sharan Narang , Michael Matena , Yanqi Zhou , Wei Li , and Peter J . Liu . 2019 . Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. CoRR , abs\/1910.10683 (2019), arXiv:1910.10683. arxiv:1910.10683 Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. CoRR, abs\/1910.10683 (2019), arXiv:1910.10683. arxiv:1910.10683"},{"key":"e_1_3_2_1_42_1","unstructured":"Aditya Ramesh Mikhail Pavlov Gabriel Goh Scott Gray Chelsea Voss Alec Radford Mark Chen and Ilya Sutskever. 2021. Zero-Shot Text-to-Image Generation. arxiv:2102.12092 \t\t\t\t  Aditya Ramesh Mikhail Pavlov Gabriel Goh Scott Gray Chelsea Voss Alec Radford Mark Chen and Ilya Sutskever. 2021. Zero-Shot Text-to-Image Generation. arxiv:2102.12092"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00049"},{"key":"e_1_3_2_1_44_1","volume-title":"Glow: Graph Lowering Compiler Techniques for Neural Networks. abs\/1805.00907","author":"Rotem Nadav","year":"2018","unstructured":"Nadav Rotem , Jordan Fix , Saleem Abdulrasool , Summer Deng , Roman Dzhabarov , James Hegeman , Roman Levenstein , Bert Maher , Nadathur Satish , Jakob Olesen , Jongsoo Park , Artem Rakhov , and Misha Smelyanskiy . 2018 . Glow: Graph Lowering Compiler Techniques for Neural Networks. abs\/1805.00907 (2018), arXiv:1805.00907. arxiv:1805.00907 Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Summer Deng, Roman Dzhabarov, James Hegeman, Roman Levenstein, Bert Maher, Nadathur Satish, Jakob Olesen, Jongsoo Park, Artem Rakhov, and Misha Smelyanskiy. 2018. Glow: Graph Lowering Compiler Techniques for Neural Networks. abs\/1805.00907 (2018), arXiv:1805.00907. arxiv:1805.00907"},{"key":"e_1_3_2_1_45_1","volume-title":"Hechtman","author":"Shazeer Noam","year":"2018","unstructured":"Noam Shazeer , Youlong Cheng , Niki Parmar , Dustin Tran , Ashish Vaswani , Penporn Koanantakool , Peter Hawkins , HyoukJoong Lee , Mingsheng Hong , Cliff Young , Ryan Sepassi , and Blake A . Hechtman . 2018 . Mesh-TensorFlow: Deep Learning for Supercomputers. CoRR , abs\/1811.02084 (2018), arXiv:1811.02084. arxiv:1811.02084 Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake A. Hechtman. 2018. Mesh-TensorFlow: Deep Learning for Supercomputers. CoRR, abs\/1811.02084 (2018), arXiv:1811.02084. arxiv:1811.02084"},{"key":"e_1_3_2_1_46_1","volume-title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. CoRR, abs\/1909.08053","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi , Mostofa Patwary , Raul Puri , Patrick LeGresley , Jared Casper , and Bryan Catanzaro . 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. CoRR, abs\/1909.08053 ( 2019 ), arXiv:1909.08053. arxiv:1909.08053 Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. CoRR, abs\/1909.08053 (2019), arXiv:1909.08053. arxiv:1909.08053"},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.5555\/2033408.2033420"},{"key":"e_1_3_2_1_48_1","volume-title":"van de Geijn and Jerrell Watts","author":"Robert","year":"1995","unstructured":"Robert A. van de Geijn and Jerrell Watts . 1995 . SUMMA : Scalable Universal Matrix Multiplication Algorithm. USA. Robert A. van de Geijn and Jerrell Watts. 1995. SUMMA: Scalable Universal Matrix Multiplication Algorithm. USA."},{"key":"e_1_3_2_1_49_1","volume-title":"\u0141 ukasz Kaiser, and Illia Polosukhin","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan N Gomez , \u0141 ukasz Kaiser, and Illia Polosukhin . 2017 . Attention is All you Need. In Advances in Neural Information Processing Systems . 30, https:\/\/proceedings.neurips.cc\/paper\/2017\/file\/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141 ukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems. 30, https:\/\/proceedings.neurips.cc\/paper\/2017\/file\/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"},{"key":"e_1_3_2_1_50_1","volume-title":"abs\/2105.14500","author":"Wang Boxiang","year":"2021","unstructured":"Boxiang Wang , Qifan Xu , Zhengda Bian , and Yang You . 2021. 2.5-dimensional distributed model training. CoRR , abs\/2105.14500 ( 2021 ), arXiv:2105.14500. arxiv:2105.14500 Boxiang Wang, Qifan Xu, Zhengda Bian, and Yang You. 2021. 2.5-dimensional distributed model training. CoRR, abs\/2105.14500 (2021), arXiv:2105.14500. arxiv:2105.14500"},{"key":"e_1_3_2_1_51_1","volume-title":"The Free Encyclopedia","unstructured":"Wikipedia. 2022. Einstein notation \u2014 Wikipedia , The Free Encyclopedia . http:\/\/en.wikipedia.org\/w\/index.php?title=Einstein%20notation&oldid=1083457917 [Online; accessed 21-June-2022] Wikipedia. 2022. Einstein notation \u2014 Wikipedia, The Free Encyclopedia. http:\/\/en.wikipedia.org\/w\/index.php?title=Einstein%20notation&oldid=1083457917 [Online; accessed 21-June-2022]"},{"key":"e_1_3_2_1_52_1","volume-title":"GSPMD: General and Scalable Parallelization for ML Computation Graphs. arxiv:2105.04663.","author":"Xu Yuanzhong","year":"2021","unstructured":"Yuanzhong Xu , HyoukJoong Lee , Dehao Chen , Blake Hechtman , Yanping Huang , Rahul Joshi , Maxim Krikun , Dmitry Lepikhin , Andy Ly , Marcello Maggioni , Ruoming Pang , Noam Shazeer , Shibo Wang , Tao Wang , Yonghui Wu , and Zhifeng Chen . 2021 . GSPMD: General and Scalable Parallelization for ML Computation Graphs. arxiv:2105.04663. Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, Ruoming Pang, Noam Shazeer, Shibo Wang, Tao Wang, Yonghui Wu, and Zhifeng Chen. 2021. GSPMD: General and Scalable Parallelization for ML Computation Graphs. arxiv:2105.04663."},{"key":"e_1_3_2_1_53_1","volume-title":"PipeMare: Asynchronous Pipeline Parallel DNN Training. CoRR, abs\/1910.05124","author":"Yang Bowen","year":"2019","unstructured":"Bowen Yang , Jian Zhang , Jonathan Li , Christopher R\u00e9 , Christopher R. Aberger , and Christopher De Sa. 2019. PipeMare: Asynchronous Pipeline Parallel DNN Training. CoRR, abs\/1910.05124 ( 2019 ), arXiv:1910.05124. arxiv:1910.05124 Bowen Yang, Jian Zhang, Jonathan Li, Christopher R\u00e9, Christopher R. Aberger, and Christopher De Sa. 2019. PipeMare: Asynchronous Pipeline Parallel DNN Training. CoRR, abs\/1910.05124 (2019), arXiv:1910.05124. arxiv:1910.05124"},{"key":"e_1_3_2_1_54_1","doi-asserted-by":"crossref","unstructured":"Xiaohua Zhai Alexander Kolesnikov Neil Houlsby and Lucas Beyer. 2021. Scaling Vision Transformers. arxiv:2106.04560. \t\t\t\t  Xiaohua Zhai Alexander Kolesnikov Neil Houlsby and Lucas Beyer. 2021. Scaling Vision Transformers. arxiv:2106.04560.","DOI":"10.1109\/CVPR52688.2022.01179"},{"key":"e_1_3_2_1_55_1","volume-title":"Bhuvana Ramabhadran, Tara N. Sainath, Fran\u00e7oise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, and Yonghui Wu.","author":"Zhang Yu","year":"2021","unstructured":"Yu Zhang , Daniel S. Park , Wei Han , James Qin , Anmol Gulati , Joel Shor , Aren Jansen , Yuanzhong Xu , Yanping Huang , Shibo Wang , Zongwei Zhou , Bo Li , Min Ma , William Chan , Jiahui Yu , Yongqiang Wang , Liangliang Cao , Khe Chai Sim , Bhuvana Ramabhadran, Tara N. Sainath, Fran\u00e7oise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, and Yonghui Wu. 2021 . BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition . arxiv:2109.13226. Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Fran\u00e7oise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, and Yonghui Wu. 2021. BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition. arxiv:2109.13226."}],"event":{"name":"ASPLOS '23: 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1","location":"Vancouver BC Canada","acronym":"ASPLOS '23","sponsor":["SIGARCH ACM Special Interest Group on Computer Architecture","SIGOPS ACM Special Interest Group on Operating Systems","SIGPLAN ACM Special Interest Group on Programming Languages"]},"container-title":["Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3567955.3567959","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3567955.3567959","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T21:26:14Z","timestamp":1750281974000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3567955.3567959"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,12,19]]},"references-count":55,"alternative-id":["10.1145\/3567955.3567959","10.1145\/3567955"],"URL":"https:\/\/doi.org\/10.1145\/3567955.3567959","relation":{},"subject":[],"published":{"date-parts":[[2022,12,19]]},"assertion":[{"value":"2022-12-21","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}