{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,5]],"date-time":"2026-03-05T15:37:45Z","timestamp":1772725065745,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":89,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,1,27]],"date-time":"2023-01-27T00:00:00Z","timestamp":1674777600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,1,27]]},"DOI":"10.1145\/3575693.3575712","type":"proceedings-article","created":{"date-parts":[[2023,1,30]],"date-time":"2023-01-30T22:56:55Z","timestamp":1675119415000},"page":"560-573","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":26,"title":["Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression"],"prefix":"10.1145","author":[{"given":"Jaeyong","family":"Song","sequence":"first","affiliation":[{"name":"Yonsei University, South Korea"}]},{"given":"Jinkyu","family":"Yim","sequence":"additional","affiliation":[{"name":"Seoul National University, South Korea"}]},{"given":"Jaewon","family":"Jung","sequence":"additional","affiliation":[{"name":"Yonsei University, South Korea"}]},{"given":"Hongsun","family":"Jang","sequence":"additional","affiliation":[{"name":"Seoul National University, South Korea"}]},{"given":"Hyung-Jin","family":"Kim","sequence":"additional","affiliation":[{"name":"Samsung Electronics, South Korea"}]},{"given":"Youngsok","family":"Kim","sequence":"additional","affiliation":[{"name":"Yonsei University, South Korea"}]},{"given":"Jinho","family":"Lee","sequence":"additional","affiliation":[{"name":"Seoul National University, South Korea"}]}],"member":"320","published-online":{"date-parts":[[2023,1,30]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Saurabh Agarwal Hongyi Wang Shivaram Venkataraman and Dimitris Papailiopoulos. 2022. On the Utility of Gradient Compression in Distributed Training Systems. In MLSys. Saurabh Agarwal Hongyi Wang Shivaram Venkataraman and Dimitris Papailiopoulos. 2022. On the Utility of Gradient Compression in Distributed Training Systems. In MLSys."},{"key":"e_1_3_2_1_2_1","unstructured":"Takuya Akiba Shuji Suzuki and Keisuke Fukuda. 2017. Extremely Large Minibatch SGD: Training Resnet-50 on Imagenet in 15 Minutes. arXiv preprint arXiv:1711.04325. Takuya Akiba Shuji Suzuki and Keisuke Fukuda. 2017. Extremely Large Minibatch SGD: Training Resnet-50 on Imagenet in 15 Minutes. arXiv preprint arXiv:1711.04325."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"crossref","unstructured":"Aida Amini Saadia Gabriel Shanchuan Lin Rik Koncel-Kedziorski Yejin Choi and Hannaneh Hajishirzi. 2019. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. In NAACL. 2357\u20132367. Aida Amini Saadia Gabriel Shanchuan Lin Rik Koncel-Kedziorski Yejin Choi and Hannaneh Hajishirzi. 2019. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. In NAACL. 2357\u20132367.","DOI":"10.18653\/v1\/N19-1245"},{"key":"e_1_3_2_1_4_1","volume-title":"Can Karakus, and Suhas Diggavi.","author":"Basu Debraj","year":"2019","unstructured":"Debraj Basu , Deepesh Data , Can Karakus, and Suhas Diggavi. 2019 . 
{"key":"e_1_3_2_1_5_1","unstructured":"Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar. 2018. signSGD: compressed optimisation for non-convex problems. In ICML."},{"key":"e_1_3_2_1_6_1","unstructured":"Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, and Anima Anandkumar. 2018. SignSGD with Majority Vote Is Communication Efficient and Fault Tolerant. arXiv preprint arXiv:1810.05291."},{"key":"e_1_3_2_1_7_1","volume-title":"AAAI","author":"Bisk Yonatan","year":"2020","unstructured":"Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about Physical Commonsense in Natural Language. In AAAI."},{"key":"e_1_3_2_1_8_1","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models Are Few-Shot Learners. In NeurIPS."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.5281\/zenodo.5181820"},{"key":"e_1_3_2_1_10_1","unstructured":"Chi-Chung Chen, Chia-Lin Yang, and Hsiang-Yun Cheng. 2018. Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-gpu Platform. arXiv preprint arXiv:1809.02839."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"crossref","unstructured":"Chia-Yu Chen, Jungwook Choi, Daniel Brand, Ankur Agrawal, Wei Zhang, and Kailash Gopalakrishnan. 2018. AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training. In AAAI.","DOI":"10.1609\/aaai.v32i1.11728"},
In AAAI.","DOI":"10.1609\/aaai.v32i1.11728"},{"key":"e_1_3_2_1_12_1","unstructured":"Chia-Yu Chen Jiamin Ni Songtao Lu Xiaodong Cui Pin-Yu Chen Xiao Sun Naigang Wang Swagath Venkataramani Vijayalakshmi (Viji) Srinivasan Wei Zhang and Kailash Gopalakrishnan. 2020. ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training. In NeurIPS. Chia-Yu Chen Jiamin Ni Songtao Lu Xiaodong Cui Pin-Yu Chen Xiao Sun Naigang Wang Swagath Venkataramani Vijayalakshmi (Viji) Srinivasan Wei Zhang and Kailash Gopalakrishnan. 2020. ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training. In NeurIPS."},{"key":"e_1_3_2_1_13_1","unstructured":"Tianqi Chen Bing Xu Chiyuan Zhang and Carlos Guestrin. 2016. Training Deep Nets with Sublinear Memory Cost. arXiv preprint arXiv:1604.06174. Tianqi Chen Bing Xu Chiyuan Zhang and Carlos Guestrin. 2016. Training Deep Nets with Sublinear Memory Cost. arXiv preprint arXiv:1604.06174."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"crossref","unstructured":"Minsik Cho Ulrich Finkler David Kung and Hillery Hunter. 2019. BlueConnect: Decomposing All-Reduce for Deep Learning on Heterogeneous Network Hierarchy. In MLSys. Minsik Cho Ulrich Finkler David Kung and Hillery Hunter. 2019. BlueConnect: Decomposing All-Reduce for Deep Learning on Heterogeneous Network Hierarchy. In MLSys.","DOI":"10.1147\/JRD.2019.2947013"},{"key":"e_1_3_2_1_15_1","unstructured":"Minsik Cho Vinod Muthusamy Brad Nemanich and Ruchir Puri. 2019. GradZip: Gradient Compression using Alternating Matrix Factorization for Large-scale Deep Learning. In NeurIPS. Minsik Cho Vinod Muthusamy Brad Nemanich and Ruchir Puri. 2019. GradZip: Gradient Compression using Alternating Matrix Factorization for Large-scale Deep Learning. In NeurIPS."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"crossref","unstructured":"Valeriu Codreanu Damian Podareanu and Vikram Saletore. 2017. Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train. arXiv preprint arXiv:1711.04291. Valeriu Codreanu Damian Podareanu and Vikram Saletore. 2017. Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train. arXiv preprint arXiv:1711.04291.","DOI":"10.1109\/MLHPC.2018.8638634"},{"key":"e_1_3_2_1_17_1","unstructured":"Dipankar Das Sasikanth Avancha Dheevatsa Mudigere Karthikeyan Vaidynathan Srinivas Sridharan Dhiraj Kalamkar Bharat Kaul and Pradeep Dubey. 2016. Distributed Deep Learning Using Synchronous Stochastic Gradient Descent. arXiv preprint arXiv:1602.06709. Dipankar Das Sasikanth Avancha Dheevatsa Mudigere Karthikeyan Vaidynathan Srinivas Sridharan Dhiraj Kalamkar Bharat Kaul and Pradeep Dubey. 2016. Distributed Deep Learning Using Synchronous Stochastic Gradient Descent. arXiv preprint arXiv:1602.06709."},{"key":"e_1_3_2_1_18_1","volume-title":"Andrew Senior, Paul Tucker, Ke Yang, Quoc Le, and Andrew Ng.","author":"Dean Jeffrey","year":"2012","unstructured":"Jeffrey Dean , Greg Corrado , Rajat Monga , Kai Chen , Matthieu Devin , Mark Mao , Marc' aurelio Ranzato , Andrew Senior, Paul Tucker, Ke Yang, Quoc Le, and Andrew Ng. 2012 . Large Scale Distributed Deep Networks. In NeurIPS. Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc' aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc Le, and Andrew Ng. 2012. Large Scale Distributed Deep Networks. 
In NeurIPS."},{"key":"e_1_3_2_1_19_1","volume-title":"Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2018 . Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805."},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"crossref","unstructured":"Nikoli Dryden Naoya Maruyama Tim Moon Tom Benson Marc Snir and Brian Van Essen. 2019. Channel and Filter Parallelism for Large-Scale CNN Training. In SC. Nikoli Dryden Naoya Maruyama Tim Moon Tom Benson Marc Snir and Brian Van Essen. 2019. Channel and Filter Parallelism for Large-Scale CNN Training. In SC.","DOI":"10.1145\/3295500.3356207"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"crossref","unstructured":"P. G. Emma. 1997. Understanding Some Simple Processor-performance Limits. IBM Journal of Research and Development 215\u2013232. P. G. Emma. 1997. Understanding Some Simple Processor-performance Limits. IBM Journal of Research and Development 215\u2013232.","DOI":"10.1147\/rd.413.0215"},{"key":"e_1_3_2_1_22_1","unstructured":"Fartash Faghri Iman Tabrizian Ilia Markov Dan Alistarh Daniel M Roy and Ali Ramezani-Kebrya. 2020. Adaptive Gradient Quantization for Data-Parallel SGD. In NeurIPS. Fartash Faghri Iman Tabrizian Ilia Markov Dan Alistarh Daniel M Roy and Ali Ramezani-Kebrya. 2020. Adaptive Gradient Quantization for Data-Parallel SGD. In NeurIPS."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3437801.3441593"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.5281\/zenodo.5371628"},{"key":"e_1_3_2_1_25_1","unstructured":"Priya Goyal Piotr Doll\u00e1r Ross Girshick Pieter Noordhuis Lukasz Wesolowski Aapo Kyrola Andrew Tulloch Yangqing Jia and Kaiming He. 2017. Accurate Large Minibatch SGD: Training Imagenet in 1 Hour. arXiv preprint arXiv:1706.02677. Priya Goyal Piotr Doll\u00e1r Ross Girshick Pieter Noordhuis Lukasz Wesolowski Aapo Kyrola Andrew Tulloch Yangqing Jia and Kaiming He. 2017. Accurate Large Minibatch SGD: Training Imagenet in 1 Hour. arXiv preprint arXiv:1706.02677."},{"key":"e_1_3_2_1_26_1","unstructured":"GRAPHCORE. 2022. IPU. https:\/\/www.graphcore.ai\/products GRAPHCORE. 2022. IPU. https:\/\/www.graphcore.ai\/products"},{"key":"e_1_3_2_1_27_1","volume-title":"Sangeetha Abdu Jyothi, and Roy Campbell","author":"Hashemi Sayed Hadi","year":"2019","unstructured":"Sayed Hadi Hashemi , Sangeetha Abdu Jyothi, and Roy Campbell . 2019 . TicTac: Accelerating Distributed Deep Learning with Communication Scheduling. In MLSys . Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, and Roy Campbell. 2019. TicTac: Accelerating Distributed Deep Learning with Communication Scheduling. In MLSys."},{"key":"e_1_3_2_1_28_1","unstructured":"Samuel Horv\u00e1th and Peter Richtarik. 2021. A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning. In ICLR. Samuel Horv\u00e1th and Peter Richtarik. 2021. A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning. 
In ICLR."},{"key":"e_1_3_2_1_29_1","unstructured":"Yanping Huang Youlong Cheng Ankur Bapna Orhan Firat Dehao Chen Mia Chen HyoukJoong Lee Jiquan Ngiam Quoc V Le Yonghui Wu and zhifeng Chen. 2019. GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism. In NeurIPS. Yanping Huang Youlong Cheng Ankur Bapna Orhan Firat Dehao Chen Mia Chen HyoukJoong Lee Jiquan Ngiam Quoc V Le Yonghui Wu and zhifeng Chen. 2019. GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism. In NeurIPS."},{"key":"e_1_3_2_1_30_1","volume-title":"Saeed Maleki, Youshan Miao, Madanlal Musuvathi, Todd Mytkowicz, and Olli Saarikivi.","author":"Jangda Abhinav","year":"2022","unstructured":"Abhinav Jangda , Jun Huang , Guodong Liu , Amir Hossein Nodehi Sabet , Saeed Maleki, Youshan Miao, Madanlal Musuvathi, Todd Mytkowicz, and Olli Saarikivi. 2022 . Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads. In ASPLOS. Abhinav Jangda, Jun Huang, Guodong Liu, Amir Hossein Nodehi Sabet, Saeed Maleki, Youshan Miao, Madanlal Musuvathi, Todd Mytkowicz, and Olli Saarikivi. 2022. Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads. In ASPLOS."},{"key":"e_1_3_2_1_31_1","unstructured":"Anand Jayarajan Jinliang Wei Garth Gibson Alexandra Fedorova and Gennady Pekhimenko. 2019. Priority-based Parameter Propagation for Distributed DNN Training. In MLSys. Anand Jayarajan Jinliang Wei Garth Gibson Alexandra Fedorova and Gennady Pekhimenko. 2019. Priority-based Parameter Propagation for Distributed DNN Training. In MLSys."},{"key":"e_1_3_2_1_32_1","unstructured":"Zhihao Jia Sina Lin Mingyu Gao Matei Zaharia and Alex Aiken. 2020. Improving the accuracy scalability and performance of graph neural networks with roc. MLSys. Zhihao Jia Sina Lin Mingyu Gao Matei Zaharia and Alex Aiken. 2020. Improving the accuracy scalability and performance of graph neural networks with roc. MLSys."},{"key":"e_1_3_2_1_33_1","unstructured":"Zhihao Jia Matei Zaharia and Alex Aiken. 2019. Beyond Data and Model Parallelism for Deep Neural Networks.. In MLSys. Zhihao Jia Matei Zaharia and Alex Aiken. 2019. Beyond Data and Model Parallelism for Deep Neural Networks.. In MLSys."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"crossref","unstructured":"Norman P. Jouppi Cliff Young Nishant Patil David Patterson Gaurav Agrawal Raminder Bajwa Sarah Bates Suresh Bhatia Nan Boden Al Borchers Rick Boyle Pierre-luc Cantin Clifford Chao Chris Clark Jeremy Coriell Mike Daley Matt Dau Jeffrey Dean Ben Gelb Tara Vazir Ghaemmaghami Rajendra Gottipati William Gulland Robert Hagmann C. Richard Ho Doug Hogberg John Hu Robert Hundt Dan Hurt Julian Ibarz Aaron Jaffey Alek Jaworski Alexander Kaplan Harshit Khaitan Daniel Killebrew Andy Koch Naveen Kumar Steve Lacy James Laudon James Law Diemthu Le Chris Leary Zhuyuan Liu Kyle Lucke Alan Lundin Gordon MacKean Adriana Maggiore Maire Mahony Kieran Miller Rahul Nagarajan Ravi Narayanaswami Ray Ni Kathy Nix Thomas Norrie Mark Omernick Narayana Penukonda Andy Phelps Jonathan Ross Matt Ross Amir Salek Emad Samadiani Chris Severn Gregory Sizikov Matthew Snelham Jed Souter Dan Steinberg Andy Swing Mercedes Tan Gregory Thorson Bo Tian Horia Toma Erick Tuttle Vijay Vasudevan Richard Walter Walter Wang Eric Wilcox and Doe Hyun Yoon. 2017. In-datacenter performance analysis of a tensor processing unit. In ISCA. Norman P. 
{"key":"e_1_3_2_1_35_1","unstructured":"Atli Kosson, Vitaliy Chiley, Abhinav Venigalla, Joel Hestness, and Urs Koster. 2021. Pipelined Backpropagation at Scale: Training Large Models Without Batches. In MLSys."},{"key":"e_1_3_2_1_36_1","unstructured":"Alex Krizhevsky. 2014. One Weird Trick for Parallelizing Convolutional Neural Networks. arXiv preprint arXiv:1404.5997."},{"key":"e_1_3_2_1_37_1","volume-title":"Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.","author":"Lai Guokun","year":"2017","unstructured":"Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683."},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"crossref","unstructured":"Jinho Lee, Inseok Hwang, Soham Shah, and Minsik Cho. 2020. FlexReduce: Flexible All-reduce for Distributed Deep Learning on Asymmetric Network Topology. In DAC.","DOI":"10.1109\/DAC18072.2020.9218538"},{"key":"e_1_3_2_1_39_1","volume-title":"OSDI","author":"Li Mu","year":"2014","unstructured":"Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In OSDI."},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476145"},{"key":"e_1_3_2_1_41_1","unstructured":"Shigang Li and Torsten Hoefler. 2022. Near-Optimal Sparse Allreduce for Distributed Deep Learning. In PPoPP."},
{"key":"e_1_3_2_1_42_1","doi-asserted-by":"crossref","unstructured":"Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. 2020. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. VLDB Endowment, 3005\u20133018.","DOI":"10.14778\/3415478.3415530"},{"key":"e_1_3_2_1_43_1","volume-title":"ICLR","author":"Lin Tao","year":"2020","unstructured":"Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, and Martin Jaggi. 2020. Don\u2019t Use Large Mini-batches, Use Local SGD. In ICLR."},{"key":"e_1_3_2_1_44_1","unstructured":"Yujun Lin, Song Han, Huizi Mao, Yu Wang, and Bill Dally. 2018. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. In ICLR."},{"key":"e_1_3_2_1_45_1","unstructured":"Ahmed M. Abdelmoniem, Ahmed Elzanaty, Mohamed-Slim Alouini, and Marco Canini. 2021. An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems. In MLSys."},{"key":"e_1_3_2_1_46_1","unstructured":"Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Olli Saarikivi, Tianju Xu, Vadim Eksarevskiy, Jaliya Ekanayake, and Emad Barsoum. 2021. Scaling Distributed Training with Adaptive Summation. In MLSys."},{"key":"e_1_3_2_1_47_1","volume-title":"Turing-nlg: A 17-billion-parameter Language Model by Microsoft. https:\/\/www.microsoft.com\/en-us\/research\/blog\/turing-nlg-a-17-billion-parameter-language-model-by-microsoft\/.","year":"2020","unstructured":"Microsoft. 2020. Turing-nlg: A 17-billion-parameter Language Model by Microsoft. https:\/\/www.microsoft.com\/en-us\/research\/blog\/turing-nlg-a-17-billion-parameter-language-model-by-microsoft\/."},{"key":"e_1_3_2_1_48_1","unstructured":"Hiroaki Mikami, Hisahiro Suganuma, Yoshiki Tanaka, and Yuichi Kageyama. 2018. Massively Distributed SGD: ImageNet\/ResNet-50 Training in a Flash. arXiv preprint arXiv:1811.05233."},
{"key":"e_1_3_2_1_49_1","doi-asserted-by":"crossref","unstructured":"Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized Pipeline Parallelism for DNN Training. In SOSP.","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"crossref","unstructured":"Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. In SC.","DOI":"10.1145\/3458817.3476209"},{"key":"e_1_3_2_1_51_1","volume-title":"Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training","author":"Pal Saptadeep","unstructured":"Saptadeep Pal, Eiman Ebrahimi, Arslan Zulfiqar, Yaosheng Fu, Victor Zhang, Szymon Migacz, David Nellans, and Puneet Gupta. 2019. Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training. IEEE Micro, 91\u2013101."},{"key":"e_1_3_2_1_52_1","volume-title":"The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031.","author":"Paperno Denis","year":"2016","unstructured":"Denis Paperno, Germ\u00e1n Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fern\u00e1ndez. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031."},{"key":"e_1_3_2_1_53_1","volume-title":"PyTorch: An Imperative Style, High-Performance Deep Learning Library","author":"Paszke Adam","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS."},
{"key":"e_1_3_2_1_54_1","doi-asserted-by":"crossref","unstructured":"Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang Lan, Chuan Wu, and Chuanxiong Guo. 2019. A Generic Communication Scheduler for Distributed DNN Training Acceleration. In SOSP.","DOI":"10.1145\/3341301.3359642"},{"key":"e_1_3_2_1_55_1","unstructured":"Xun Qian, Peter Richt\u00e1rik, and Tong Zhang. 2021. Error Compensated Distributed SGD Can Be Accelerated. In NeurIPS."},{"key":"e_1_3_2_1_56_1","unstructured":"Alec Radford, Jeffrey Wu, Dario Amodei, Daniela Amodei, Jack Clark, Miles Brundage, and Ilya Sutskever. 2019. Better Language Models and Their Implications. OpenAI blog."},{"key":"e_1_3_2_1_57_1","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models Are Unsupervised Multitask Learners. OpenAI blog."},{"key":"e_1_3_2_1_58_1","author":"Raffel Colin","year":"2019","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-text Transformer. arXiv preprint arXiv:1910.10683."},{"key":"e_1_3_2_1_59_1","doi-asserted-by":"crossref","unstructured":"Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. In SC.","DOI":"10.1109\/SC41405.2020.00024"},{"key":"e_1_3_2_1_60_1","doi-asserted-by":"crossref","unstructured":"Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. 2021. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. In SC.","DOI":"10.1145\/3458817.3476205"},{"key":"e_1_3_2_1_61_1","volume-title":"AAAI","author":"Real Esteban","year":"2019","unstructured":"Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. 2019. Regularized Evolution for Image Classifier Architecture Search. In AAAI."},
In AAAI."},{"key":"e_1_3_2_1_62_1","volume-title":"Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He.","author":"Ren Jie","year":"2021","unstructured":"Jie Ren , Samyam Rajbhandari , Reza Yazdani Aminabadi , Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021 . ZeRO-Offload: Democratizing Billion-Scale Model Training. In USENIX ATC. Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training. In USENIX ATC."},{"key":"e_1_3_2_1_63_1","volume-title":"Chandra Bhagavatula, and Yejin Choi.","author":"Sakaguchi Keisuke","year":"2019","unstructured":"Keisuke Sakaguchi , Ronan Le Bras , Chandra Bhagavatula, and Yejin Choi. 2019 . WinoGrande: An Adversarial Winograd Schema Challenge at Scale . arXiv preprint arXiv:1907.10641. Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. arXiv preprint arXiv:1907.10641."},{"key":"e_1_3_2_1_64_1","unstructured":"Shaohuai Shi Xianhao Zhou Shutao Song Xingyao Wang Zilin Zhu Xue Huang Xinan Jiang Feihu Zhou Zhenyu Guo Liqiang Xie Rui Lan Xianbin Ouyang Yan Zhang Jieqian Wei Jing Gong Weiliang Lin Ping Gao Peng Meng Xiaomin Xu Chenyang Guo Bo Yang Zhibo Chen Yongjian Wu and Xiaowen Chu. 2021. Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters. In MLSys. Shaohuai Shi Xianhao Zhou Shutao Song Xingyao Wang Zilin Zhu Xue Huang Xinan Jiang Feihu Zhou Zhenyu Guo Liqiang Xie Rui Lan Xianbin Ouyang Yan Zhang Jieqian Wei Jing Gong Weiliang Lin Ping Gao Peng Meng Xiaomin Xu Chenyang Guo Bo Yang Zhibo Chen Yongjian Wu and Xiaowen Chu. 2021. Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters. In MLSys."},{"key":"e_1_3_2_1_65_1","unstructured":"Mohammad Shoeybi Mostofa Patwary Raul Puri Patrick LeGresley Jared Casper and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053. Mohammad Shoeybi Mostofa Patwary Raul Puri Patrick LeGresley Jared Casper and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053."},{"key":"e_1_3_2_1_66_1","unstructured":"Shaden Smith Mostofa Patwary Brandon Norick Patrick LeGresley Samyam Rajbhandari Jared Casper Zhun Liu Shrimai Prabhumoye George Zerveas and Vijay Korthikanti. 2022. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B A Large-Scale Generative Language Model. arXiv preprint arXiv:2201.11990. Shaden Smith Mostofa Patwary Brandon Norick Patrick LeGresley Samyam Rajbhandari Jared Casper Zhun Liu Shrimai Prabhumoye George Zerveas and Vijay Korthikanti. 2022. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B A Large-Scale Generative Language Model. arXiv preprint arXiv:2201.11990."},{"key":"e_1_3_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.5281\/zenodo.7220824"},{"key":"e_1_3_2_1_68_1","doi-asserted-by":"crossref","unstructured":"Linghao Song Jiachen Mao Youwei Zhuo Xuehai Qian Hai Li and Yiran Chen. 2019. HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array. In HPCA. Linghao Song Jiachen Mao Youwei Zhuo Xuehai Qian Hai Li and Yiran Chen. 2019. HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array. 
In HPCA.","DOI":"10.1109\/HPCA.2019.00027"},{"key":"e_1_3_2_1_69_1","unstructured":"Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In ICML. Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In ICML."},{"key":"e_1_3_2_1_70_1","volume-title":"Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, and Yuxiong He.","author":"Tang Hanlin","year":"2021","unstructured":"Hanlin Tang , Shaoduo Gan , Ammar Ahmad Awan , Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, and Yuxiong He. 2021 . 1-bit Adam : Communication Efficient Large-Scale Training with Adam\u2019s Convergence Speed. In ICML. Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, and Yuxiong He. 2021. 1-bit Adam: Communication Efficient Large-Scale Training with Adam\u2019s Convergence Speed. In ICML."},{"key":"e_1_3_2_1_71_1","volume-title":"Piper: Multidimensional Planner for DNN Parallelization. In NeurIPS.","author":"Tarnawski Jakub","year":"2021","unstructured":"Jakub Tarnawski , Deepak Narayanan , and Amar Phanishayee . 2021 . Piper: Multidimensional Planner for DNN Parallelization. In NeurIPS. Jakub Tarnawski, Deepak Narayanan, and Amar Phanishayee. 2021. Piper: Multidimensional Planner for DNN Parallelization. In NeurIPS."},{"key":"e_1_3_2_1_72_1","doi-asserted-by":"crossref","unstructured":"Rajeev Thakur Rolf Rabenseifner and William Gropp. 2005. Optimization of Collective Communication Operations in MPICH. IJHPCA 49\u201366. Rajeev Thakur Rolf Rabenseifner and William Gropp. 2005. Optimization of Collective Communication Operations in MPICH. IJHPCA 49\u201366.","DOI":"10.1177\/1094342005051521"},{"key":"e_1_3_2_1_73_1","unstructured":"Trieu H Trinh and Quoc V Le. 2018. A Simple Method for Commonsense Reasoning. arXiv preprint arXiv:1806.02847. Trieu H Trinh and Quoc V Le. 2018. A Simple Method for Commonsense Reasoning. arXiv preprint arXiv:1806.02847."},{"key":"e_1_3_2_1_74_1","volume-title":"\u0141 ukasz Kaiser, and Illia Polosukhin","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan N Gomez , \u0141 ukasz Kaiser, and Illia Polosukhin . 2017 . Attention is All You Need. In NeurIPS. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141 ukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In NeurIPS."},{"key":"e_1_3_2_1_75_1","volume-title":"Sai Praneeth Karimireddy, and Martin Jaggi","author":"Vogels Thijs","year":"2019","unstructured":"Thijs Vogels , Sai Praneeth Karimireddy, and Martin Jaggi . 2019 . PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization. In NeurIPS. Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. 2019. PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization. In NeurIPS."},{"key":"e_1_3_2_1_76_1","volume-title":"Nam Sung Kim, and Yingyan Lin","author":"Wan Cheng","year":"2022","unstructured":"Cheng Wan , Youjie Li , Cameron R Wolfe , Anastasios Kyrillidis , Nam Sung Kim, and Yingyan Lin . 2022 . Pipegcn : Efficient full-graph training of graph convolutional networks with pipelined feature communication. In ICLR. Cheng Wan, Youjie Li, Cameron R Wolfe, Anastasios Kyrillidis, Nam Sung Kim, and Yingyan Lin. 2022. 
{"key":"e_1_3_2_1_77_1","volume-title":"Blink: Fast and Generic Collectives for Distributed ML. In MLSys.","author":"Wang Guanhua","year":"2020","unstructured":"Guanhua Wang, Shivaram Venkataraman, Amar Phanishayee, Nikhil Devanur, Jorgen Thelin, and Ion Stoica. 2020. Blink: Fast and Generic Collectives for Distributed ML. In MLSys."},{"key":"e_1_3_2_1_78_1","volume-title":"Pufferfish: Communication-efficient Models At No Extra Cost. In MLSys.","author":"Wang Hongyi","year":"2021","unstructured":"Hongyi Wang, Saurabh Agarwal, and Dimitris Papailiopoulos. 2021. Pufferfish: Communication-efficient Models At No Extra Cost. In MLSys."},{"key":"e_1_3_2_1_79_1","unstructured":"Minjie Wang, Chien-chin Huang, and Jinyang Li. 2018. Unifying Data, Model and Hybrid Parallelism in Deep Learning Via Tensor Tiling. arXiv preprint arXiv:1805.04170."},{"key":"e_1_3_2_1_80_1","volume-title":"ICLR workshop on representation learning on graphs and manifolds","author":"Wang Minjie Yu","year":"2019","unstructured":"Minjie Yu Wang. 2019. Deep graph library: Towards efficient and scalable deep learning on graphs. In ICLR workshop on representation learning on graphs and manifolds."},{"key":"e_1_3_2_1_81_1","unstructured":"Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2017. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning. In NeurIPS."},{"key":"e_1_3_2_1_82_1","doi-asserted-by":"crossref","unstructured":"An Xu, Zhouyuan Huo, and Heng Huang. 2021. Step-Ahead Error Feedback for Distributed Training with Compressed Gradient. In AAAI.","DOI":"10.1609\/aaai.v35i12.17254"},{"key":"e_1_3_2_1_83_1","unstructured":"Bowen Yang, Jian Zhang, Jonathan Li, Christopher Re, Christopher Aberger, and Christopher De Sa. 2021. PipeMare: Asynchronous Pipeline Parallel DNN Training. In MLSys."},{"key":"e_1_3_2_1_84_1","doi-asserted-by":"crossref","unstructured":"Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. 2018. ImageNet Training In Minutes. In ICPP.","DOI":"10.1145\/3225058.3225069"},{"key":"e_1_3_2_1_85_1","unstructured":"Yue Yu, Jiaxiang Wu, and Longbo Huang. 2019. Double Quantization for Communication-Efficient Distributed Optimization. In NeurIPS."},
{"key":"e_1_3_2_1_86_1","unstructured":"Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending Against Neural Fake News. In NeurIPS."},{"key":"e_1_3_2_1_87_1","volume-title":"USENIX ATC","author":"Zhang Hao","year":"2017","unstructured":"Hao Zhang, Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jinliang Wei, Pengtao Xie, and Eric P. Xing. 2017. Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters. In USENIX ATC."},{"key":"e_1_3_2_1_88_1","unstructured":"Sixin Zhang, Anna E Choromanska, and Yann LeCun. 2015. Deep Learning with Elastic Averaging SGD. In NeurIPS."},{"key":"e_1_3_2_1_89_1","volume-title":"Canary: Decentralized Distributed Deep Learning Via Gradient Sketch and Partition in Multi-Interface Networks","author":"Zhou Qihua","year":"2021","unstructured":"Qihua Zhou, Kun Wang, Haodong Lu, Wenyao Xu, Yanfei Sun, and Song Guo. 2021. Canary: Decentralized Distributed Deep Learning Via Gradient Sketch and Partition in Multi-Interface Networks. IEEE TPDS, 900\u2013917."}],"event":{"name":"ASPLOS '23: 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2","location":"Vancouver BC Canada","acronym":"ASPLOS '23","sponsor":["SIGARCH ACM Special Interest Group on Computer Architecture","SIGOPS ACM Special Interest Group on Operating Systems","SIGPLAN ACM Special Interest Group on Programming Languages"]},"container-title":["Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3575693.3575712","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3575693.3575712","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:51:20Z","timestamp":1750182680000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3575693.3575712"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,1,27]]},"references-count":89,"alternative-id":["10.1145\/3575693.3575712","10.1145\/3575693"],"URL":"https:\/\/doi.org\/10.1145\/3575693.3575712","relation":{},"subject":[],"published":{"date-parts":[[2023,1,27]]},"assertion":[{"value":"2023-01-30","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}