{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,14]],"date-time":"2026-03-14T09:51:14Z","timestamp":1773481874523,"version":"3.50.1"},"reference-count":82,"publisher":"Association for Computing Machinery (ACM)","issue":"4","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:p>\n            Recent years have witnessed a growing list of systems for distributed data-parallel training. Existing systems largely fit into two paradigms, i.e., parameter server and MPI-style collective operations. On the algorithmic side, researchers have proposed a wide range of techniques to lower the communication via \"system relaxations\":\n            <jats:italic>quantization, decentralization,<\/jats:italic>\n            and\n            <jats:italic>communication delay.<\/jats:italic>\n            However, most, if not all, existing systems only rely on standard synchronous and asynchronous stochastic gradient (SG) based optimization, therefore, cannot take advantage of all possible optimizations that the machine learning community has been developing recently. Given this emerging gap between the current landscapes of systems and theory, we build Bagua, a MPI-style communication library, providing a collection of primitives, that is both flexible and modular to support state-of-the-art system relaxation techniques of distributed training. Powered by this design, Bagua has a great ability to implement and extend various state-of-the-art distributed learning algorithms. In a production cluster with up to 16 machines (128 GPUs), Bagua can outperform PyTorch-DDP, Horovod and BytePS in the end-to-end training time by a significant margin (up to 2X) across a diverse range of tasks. 
Moreover, we conduct a rigorous tradeoff exploration showing that different algorithms and system relaxations achieve the best performance under different network conditions.\n          <\/jats:p>","DOI":"10.14778\/3503585.3503590","type":"journal-article","created":{"date-parts":[[2022,4,14]],"date-time":"2022-04-14T22:18:07Z","timestamp":1649974687000},"page":"804-813","source":"Crossref","is-referenced-by-count":23,"title":["Bagua"],"prefix":"10.14778","volume":"15","author":[{"given":"Shaoduo","family":"Gan","sequence":"first","affiliation":[{"name":"ETH Z\u00fcrich, Switzerland"}]},{"given":"Jiawei","family":"Jiang","sequence":"additional","affiliation":[{"name":"ETH Z\u00fcrich, Switzerland"}]},{"given":"Binhang","family":"Yuan","sequence":"additional","affiliation":[{"name":"ETH Z\u00fcrich, Switzerland"}]},{"given":"Ce","family":"Zhang","sequence":"additional","affiliation":[{"name":"ETH Z\u00fcrich, Switzerland"}]},{"given":"Xiangru","family":"Lian","sequence":"additional","affiliation":[{"name":"Kuaishou Technology, China"}]},{"given":"Rui","family":"Wang","sequence":"additional","affiliation":[{"name":"Kuaishou Technology, China"}]},{"given":"Jianbin","family":"Chang","sequence":"additional","affiliation":[{"name":"Kuaishou Technology, China"}]},{"given":"Chengjun","family":"Liu","sequence":"additional","affiliation":[{"name":"Kuaishou Technology, China"}]},{"given":"Hongmei","family":"Shi","sequence":"additional","affiliation":[{"name":"Kuaishou Technology, China"}]},{"given":"Shengzhuo","family":"Zhang","sequence":"additional","affiliation":[{"name":"Kuaishou Technology, China"}]},{"given":"Xianghong","family":"Li","sequence":"additional","affiliation":[{"name":"Kuaishou Technology, China"}]},{"given":"Tengxu","family":"Sun","sequence":"additional","affiliation":[{"name":"Kuaishou Technology, China"}]},{"given":"Sen","family":"Yang","sequence":"additional","affiliation":[{"name":"Kuaishou Technology, China"}]},{"given":"Ji","family":"Liu","sequence":"additional","affiliation":[{"name":"Kuaishou Technology, China"}]}],"member":"320","published-online":{"date-parts":[[2022,4,14]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"[n.d.]. Apex. https:\/\/nvidia.github.io\/apex\/optimizers.html."},{"key":"e_1_2_1_2_1","unstructured":"[n.d.]. NCCL. https:\/\/developer.nvidia.com\/nccl."},{"key":"e_1_2_1_3_1","volume-title":"Tensorflow: A system for large-scale machine learning. In 12th {USENIX} symposium on operating systems design and implementation ({OSDI} 16). 265--283.","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: A system for large-scale machine learning. In 12th {USENIX} symposium on operating systems design and implementation ({OSDI} 16). 265--283."},{"key":"e_1_2_1_4_1","volume-title":"QSGD: Communication-efficient SGD via gradient quantization and encoding.
arXiv preprint arXiv:1610.02132","author":"Alistarh Dan","year":"2016","unstructured":"Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. 2016. QSGD: Communication-efficient SGD via gradient quantization and encoding. arXiv preprint arXiv:1610.02132 (2016)."},{"key":"e_1_2_1_5_1","volume-title":"Proceedings of the 32nd International Conference on Neural Information Processing Systems. 5977--5987","author":"Alistarh Dan","year":"2018","unstructured":"Dan Alistarh, Torsten Hoefler, Mikael Johansson, Sarit Khirirat, Nikola Konstantinov, and C\u00e9dric Renggli. 2018. The convergence of sparsified gradient methods. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 5977--5987."},{"key":"e_1_2_1_6_1","volume-title":"Can Karakus, and Suhas Diggavi.","author":"Basu Debraj","year":"2019","unstructured":"Debraj Basu, Deepesh Data, Can Karakus, and Suhas Diggavi. 2019. Qsparse-local-SGD: Distributed SGD with quantization, sparsification, and local computations. arXiv preprint arXiv:1906.02367 (2019)."},{"key":"e_1_2_1_7_1","volume-title":"International Conference on Machine Learning. PMLR, 560--569","author":"Bernstein Jeremy","year":"2018","unstructured":"Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. 2018. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning. PMLR, 560--569."},{"key":"e_1_2_1_8_1","volume-title":"On biased compression for distributed learning. arXiv preprint arXiv:2002.12410","author":"Beznosikov Aleksandr","year":"2020","unstructured":"Aleksandr Beznosikov, Samuel Horv\u00e1th, Peter Richt\u00e1rik, and Mher Safaryan. 2020. On biased compression for distributed learning. arXiv preprint arXiv:2002.12410 (2020)."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.14778\/3007263.3007279"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732286.2732292"},{"key":"e_1_2_1_11_1","unstructured":"Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners.
arXiv preprint arXiv:2005.14165 (2020)."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939785"},{"key":"e_1_2_1_13_1","volume-title":"11th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 14). 571--582.","author":"Chilimbi Trishul","unstructured":"Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project adam: Building an efficient and scalable deep learning training system. In 11th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 14). 571--582."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.5555\/2887007.2887019"},{"key":"e_1_2_1_15_1","volume-title":"Proceedings of the 44th Annual International Symposium on Computer Architecture. 561--574","author":"Sa Christopher De","year":"2017","unstructured":"Christopher De Sa, Matthew Feldman, Christopher R\u00e9, and Kunle Olukotun. 2017. Understanding and optimizing asynchronous low-precision stochastic gradient descent. In Proceedings of the 44th Annual International Symposium on Computer Architecture. 561--574."},{"key":"e_1_2_1_16_1","volume-title":"Proceedings of the 25th International Conference on Neural Information Processing Systems-Volume 1. 1223--1231","author":"Dean Jeffrey","year":"2012","unstructured":"Jeffrey Dean, Greg S Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V Le, Mark Z Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, et al. 2012. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems-Volume 1. 1223--1231."},{"key":"e_1_2_1_17_1","doi-asserted-by":"crossref","unstructured":"J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_2_1_18_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_2_1_19_1","volume-title":"Aishell-2: Transforming mandarin asr research into industrial scale.
arXiv preprint arXiv:1808.10583","author":"Du Jiayu","year":"2018","unstructured":"Jiayu Du, Xingyu Na, Xuechen Liu, and Hui Bu. 2018. Aishell-2: Transforming mandarin asr research into industrial scale. arXiv preprint arXiv:1808.10583 (2018)."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3386137"},{"key":"e_1_2_1_21_1","volume-title":"Mehrdad Mahdavi, and Viveck R Cadambe.","author":"Haddadpour Farzin","year":"2019","unstructured":"Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck R Cadambe. 2019. Local sgd with periodic averaging: Tighter analysis and adaptive synchronization. arXiv preprint arXiv:1910.13598 (2019)."},{"key":"e_1_2_1_22_1","volume-title":"PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers. arXiv preprint arXiv:2102.03161","author":"He Chaoyang","year":"2021","unstructured":"Chaoyang He, Shen Li, Mahdi Soltanolkotabi, and Salman Avestimehr. 2021. PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers. arXiv preprint arXiv:2102.03161 (2021)."},{"key":"e_1_2_1_23_1","volume-title":"Long short-term memory. Neural computation 9, 8","author":"Hochreiter Sepp","year":"1997","unstructured":"Sepp Hochreiter and J\u00fcrgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735--1780."},{"key":"e_1_2_1_24_1","volume-title":"HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al.","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Xu Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. In NeurIPS."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3187009.3177734"},{"key":"e_1_2_1_26_1","unstructured":"Nikita Ivkin, Daniel Rothchild, Enayat Ullah, Ion Stoica, Raman Arora, et al. 2019. Communication-efficient distributed sgd with sketching. In Advances in Neural Information Processing Systems. 13144--13154."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.14778\/3317315.3317323"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380575"},{"key":"e_1_2_1_29_1","volume-title":"Beyond Data and Model Parallelism for Deep Neural Networks.
SysML 2019","author":"Jia Zhihao","year":"2019","unstructured":"Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Beyond Data and Model Parallelism for Deep Neural Networks. SysML 2019 (2019)."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3196892"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3035933"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1093\/nsr\/nwx018"},{"key":"e_1_2_1_33_1","volume-title":"14th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 20). 463--479.","author":"Jiang Yimin","unstructured":"Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanxiong Guo. 2020. A Unified Architecture for Accelerating Distributed {DNN} Training in Heterogeneous GPU\/CPU Clusters. In 14th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 20). 463--479."},{"key":"e_1_2_1_34_1","volume-title":"International Conference on Machine Learning. PMLR, 3478--3487","author":"Koloskova Anastasia","year":"2019","unstructured":"Anastasia Koloskova, Sebastian Stich, and Martin Jaggi. 2019. Decentralized stochastic optimization and gossip algorithms with compressed communication. In International Conference on Machine Learning. PMLR, 3478--3487."},{"key":"e_1_2_1_35_1","volume-title":"Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012), 1097--1105."},{"key":"e_1_2_1_36_1","volume-title":"Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668","author":"Lepikhin Dmitry","year":"2020","unstructured":"Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding.
arXiv preprint arXiv:2006.16668 (2020)."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/2741948.2741965"},{"key":"e_1_2_1_38_1","volume-title":"Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su.","author":"Li Mu","year":"2014","unstructured":"Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with the parameter server. In 11th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 14). 583--598."},{"key":"e_1_2_1_39_1","volume-title":"PyTorch Distributed: Experiences on Accelerating Data Parallel Training. Proceedings of the VLDB Endowment 13","author":"Li Shen","unstructured":"Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. [n.d.]. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. Proceedings of the VLDB Endowment 13, 12 ([n.d.])."},{"key":"e_1_2_1_40_1","volume-title":"Proceedings of the 32nd International Conference on Neural Information Processing Systems. 8056--8067","author":"Li Youjie","year":"2018","unstructured":"Youjie Li, Mingchao Yu, Songze Li, Salman Avestimehr, Nam Sung Kim, and Alexander Schwing. 2018. Pipe-SGD: a decentralized pipelined SGD framework for distributed deep net training. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 8056--8067."},{"key":"e_1_2_1_41_1","volume-title":"TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models. arXiv preprint arXiv:2102.07988","author":"Li Zhuohan","year":"2021","unstructured":"Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, and Ion Stoica. 2021. TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models. arXiv preprint arXiv:2102.07988 (2021)."},{"key":"e_1_2_1_42_1","volume-title":"Proceedings of the 31st International Conference on Neural Information Processing Systems. 5336--5346","author":"Lian Xiangru","year":"2017","unstructured":"Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. 2017. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent.
In Proceedings of the 31st International Conference on Neural Information Processing Systems. 5336--5346."},{"key":"e_1_2_1_43_1","volume-title":"International Conference on Machine Learning. PMLR, 3043--3052","author":"Lian Xiangru","year":"2018","unstructured":"Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. 2018. Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning. PMLR, 3043--3052."},{"key":"e_1_2_1_44_1","volume-title":"Use Local SGD. In International Conference on Learning Representations.","author":"Lin Tao","year":"2019","unstructured":"Tao Lin, Sebastian U Stich, Kumar Kshitij Patel, and Martin Jaggi. 2019. Don't Use Large Mini-batches, Use Local SGD. In International Conference on Learning Representations."},{"key":"e_1_2_1_45_1","volume-title":"Distributed learning systems with first-order methods. arXiv preprint arXiv:2104.05245","author":"Liu Ji","year":"2021","unstructured":"Ji Liu and Ce Zhang. 2021. Distributed learning systems with first-order methods. arXiv preprint arXiv:2104.05245 (2021)."},{"key":"e_1_2_1_46_1","doi-asserted-by":"crossref","unstructured":"Ji Liu, Ce Zhang, et al. 2020. Distributed Learning Systems with First-Order Methods. Foundations and Trends\u00ae in Databases 9, 1 (2020), 1--100.","DOI":"10.1561\/1900000062"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3363554"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_2_1_49_1","volume-title":"Memory-efficient pipeline-parallel dnn training. arXiv preprint arXiv:2006.09503","author":"Narayanan Deepak","year":"2020","unstructured":"Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. 2020. Memory-efficient pipeline-parallel dnn training. arXiv preprint arXiv:2006.09503 (2020)."},{"key":"e_1_2_1_50_1","volume-title":"International Conference on Machine Learning. PMLR, 3750--3758","author":"Nguyen Lam","year":"2018","unstructured":"Lam Nguyen, Phuong Ha Nguyen, Marten Dijk, Peter Richt\u00e1rik, Katya Scheinberg, and Martin Tak\u00e1c. 2018. SGD and Hogwild!
convergence without the bounded gradients assumption. In International Conference on Machine Learning. PMLR, 3750--3758."},{"key":"e_1_2_1_51_1","volume-title":"Proceedings of the 24th International Conference on Neural Information Processing Systems. 693--701","author":"Niu Feng","year":"2011","unstructured":"Feng Niu, Benjamin Recht, Christopher Re, and Stephen J Wright. 2011. HOGWILD!: a lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the 24th International Conference on Neural Information Processing Systems. 693--701."},{"key":"e_1_2_1_52_1","volume-title":"International Conference on Machine Learning. PMLR, 2788--2797","author":"Peng Hao","year":"2017","unstructured":"Hao Peng, Shandian Zhe, Xiao Zhang, and Yuan Qi. 2017. Asynchronous distributed variational Gaussian process for regression. In International Conference on Machine Learning. PMLR, 2788--2797."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-24685-5_1"},{"key":"e_1_2_1_54_1","volume-title":"100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250","author":"Rajpurkar Pranav","year":"2016","unstructured":"Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016)."},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352083"},{"key":"e_1_2_1_56_1","volume-title":"Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799","author":"Sergeev Alexander","year":"2018","unstructured":"Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018)."},{"key":"e_1_2_1_57_1","volume-title":"Proceedings of the 32nd International Conference on Neural Information Processing Systems. 10435--10444","author":"Shazeer Noam","year":"2018","unstructured":"Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, et al. 2018. Mesh-TensorFlow: deep learning for supercomputers. In Proceedings of the 32nd International Conference on Neural Information Processing Systems.
10435--10444."},{"key":"e_1_2_1_58_1","volume-title":"Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)."},{"key":"e_1_2_1_59_1","volume-title":"Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)."},{"key":"e_1_2_1_60_1","volume-title":"International Conference on Machine Learning. PMLR, 4674--4683","author":"Simsekli Umut","year":"2018","unstructured":"Umut Simsekli, Cagatay Yildiz, Than Huy Nguyen, Taylan Cemgil, and Gael Richard. 2018. Asynchronous stochastic quasi-Newton MCMC for non-convex optimization. In International Conference on Machine Learning. PMLR, 4674--4683."},{"key":"e_1_2_1_61_1","volume-title":"Local SGD Converges Fast and Communicates Little. In International Conference on Learning Representations.","author":"Stich Sebastian U","year":"2018","unstructured":"Sebastian U Stich. 2018. Local SGD Converges Fast and Communicates Little. In International Conference on Learning Representations."},{"key":"e_1_2_1_62_1","first-page":"4447","article-title":"Sparsified SGD with Memory","volume":"31","author":"Stich Sebastian U","year":"2018","unstructured":"Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. 2018. Sparsified SGD with Memory. Advances in Neural Information Processing Systems 31 (2018), 4447--4458.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_63_1","volume-title":"Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, and Yuxiong He.","author":"Tang Hanlin","year":"2021","unstructured":"Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, and Yuxiong He. 2021. 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed.
arXiv preprint arXiv:2102.02888 (2021)."},{"key":"e_1_2_1_64_1","volume-title":"Proceedings of the 32nd International Conference on Neural Information Processing Systems. 7663--7673","author":"Tang Hanlin","year":"2018","unstructured":"Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu. 2018. Communication compression for decentralized training. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 7663--7673."},{"key":"e_1_2_1_65_1","volume-title":"Deepsqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. arXiv preprint arXiv:1907.07346","author":"Tang Hanlin","year":"2019","unstructured":"Hanlin Tang, Xiangru Lian, Shuang Qiu, Lei Yuan, Ce Zhang, Tong Zhang, and Ji Liu. 2019. Deepsqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. arXiv preprint arXiv:1907.07346 (2019)."},{"key":"e_1_2_1_66_1","volume-title":"International Conference on Machine Learning. PMLR, 4848--4856","author":"Tang Hanlin","year":"2018","unstructured":"Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu. 2018. D2: Decentralized training over decentralized data. In International Conference on Machine Learning. PMLR, 4848--4856."},{"key":"e_1_2_1_67_1","volume-title":"International Conference on Machine Learning. PMLR, 6155--6165","author":"Tang Hanlin","year":"2019","unstructured":"Hanlin Tang, Chen Yu, Xiangru Lian, Tong Zhang, and Ji Liu. 2019. Doublesqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In International Conference on Machine Learning. PMLR, 6155--6165."},{"key":"e_1_2_1_68_1","volume-title":"Proceedings of the 32nd International Conference on Neural Information Processing Systems. 9872--9883","author":"Wang Hongyi","year":"2018","unstructured":"Hongyi Wang, Scott Sievert, Zachary Charles, Shengchao Liu, Stephen Wright, and Dimitris Papailiopoulos. 2018. ATOMO: communication-efficient learning via atomic sparsification. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 9872--9883."},{"key":"e_1_2_1_69_1","volume-title":"Systems and Machine Learning (SysML) Conference.","author":"Wang Jianyu","year":"2019","unstructured":"Jianyu Wang and Gauri Joshi. 2019.
Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-update SGD. In Systems and Machine Learning (SysML) Conference."},{"key":"e_1_2_1_70_1","volume-title":"International Conference on Machine Learning. PMLR, 3636--3645","author":"Wang Jialei","year":"2017","unstructured":"Jialei Wang, Mladen Kolar, Nathan Srebro, and Tong Zhang. 2017. Efficient distributed learning with sparsity. In International Conference on Machine Learning. PMLR, 3636--3645."},{"key":"e_1_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1145\/3302424.3303953"},{"key":"e_1_2_1_72_1","first-page":"1299","article-title":"Gradient sparsification for communication-efficient distributed optimization","volume":"31","author":"Wangni J","year":"2018","unstructured":"J Wangni, J Liu, J Wang, and T Zhang. 2018. Gradient sparsification for communication-efficient distributed optimization. Advances in Neural Information Processing Systems 31 (2018), 1299.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_73_1","volume-title":"Proceedings of the 31st International Conference on Neural Information Processing Systems. 1508--1518","author":"Wen Wei","year":"2017","unstructured":"Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2017. TernGrad: ternary gradients to reduce communication in distributed deep learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 1508--1518."},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33015693"},{"key":"e_1_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.14778\/3457390.3457399"},{"key":"e_1_2_1_76_1","volume-title":"International Conference on Machine Learning. PMLR, 4035--4043","author":"Zhang Hantian","year":"2017","unstructured":"Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. 2017. Zipml: Training linear models with end-to-end low precision, and a little bit of deep learning. In International Conference on Machine Learning. PMLR, 4035--4043."},{"key":"e_1_2_1_77_1","volume-title":"Poseidon: An efficient communication architecture for distributed deep learning on {GPU} clusters. In 2017 {USENIX} Annual Technical Conference ({USENIX}{ATC} 17).
181--193.","author":"Zhang Hao","year":"2017","unstructured":"Hao Zhang, Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jinliang Wei, Pengtao Xie, and Eric P Xing. 2017. Poseidon: An efficient communication architecture for distributed deep learning on {GPU} clusters. In 2017 {USENIX} Annual Technical Conference ({USENIX}{ATC} 17). 181--193."},{"key":"e_1_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3314038"},{"key":"e_1_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2019.00194"},{"key":"e_1_2_1_80_1","volume-title":"International Conference on Machine Learning. PMLR, 4120--4129","author":"Zheng Shuxin","year":"2017","unstructured":"Shuxin Zheng, Qi Meng, Taifeng Wang, Wei Chen, Nenghai Yu, Zhi-Ming Ma, and Tie-Yan Liu. 2017. Asynchronous stochastic gradient descent with delay compensation. In International Conference on Machine Learning. PMLR, 4120--4129."},{"key":"e_1_2_1_81_1","volume-title":"International Conference on Machine Learning. PMLR, 5970--5979","author":"Zhou Zhengyuan","year":"2018","unstructured":"Zhengyuan Zhou, Panayotis Mertikopoulos, Nicholas Bambos, Peter Glynn, Yinyu Ye, Li-Jia Li, and Li Fei-Fei. 2018. Distributed asynchronous optimization with unbounded delays: How slow can you go?. In International Conference on Machine Learning. PMLR, 5970--5979."},{"key":"e_1_2_1_82_1","doi-asserted-by":"publisher","DOI":"10.14778\/2733004.2733082"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3503585.3503590","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T10:26:24Z","timestamp":1672223184000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3503585.3503590"}},"subtitle":["scaling up distributed learning with system relaxations"],"short-title":[],"issued":{"date-parts":[[2021,12]]},"references-count":82,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["10.14778\/3503585.3503590"],"URL":"https:\/\/doi.org\/10.14778\/3503585.3503590","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2021,12]]}}}