{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,21]],"date-time":"2026-03-21T19:14:29Z","timestamp":1774120469948,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":50,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,8,9]],"date-time":"2021-08-09T00:00:00Z","timestamp":1628467200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key Research and Development Program of China","doi-asserted-by":"publisher","award":["No.2018YFB0204300"],"award-info":[{"award-number":["No.2018YFB0204300"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"publisher"}]},{"name":"National Natural Science Foundation of China under Grant","award":["No.62025208"],"award-info":[{"award-number":["No.62025208"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,8,9]]},"DOI":"10.1145\/3472456.3472497","type":"proceedings-article","created":{"date-parts":[[2021,10,5]],"date-time":"2021-10-05T18:46:04Z","timestamp":1633459564000},"page":"1-10","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":20,"title":["Hippie: A Data-Paralleled Pipeline Approach to Improve Memory-Efficiency and Scalability for Large DNN Training"],"prefix":"10.1145","author":[{"given":"Xiangyu","family":"Ye","sequence":"first","affiliation":[{"name":"National University of Defense Technology, China"}]},{"given":"Zhiquan","family":"Lai","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]},{"given":"Shengwei","family":"Li","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]},{"given":"Lei","family":"Cai","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]},{"given":"Ding","family":"Sun","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]},{"given":"Linbo","family":"Qiao","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]},{"given":"Dongsheng","family":"Li","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]}],"member":"320","published-online":{"date-parts":[[2021,10,5]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"2019. NCCL. https:\/\/developer.nvidia.com\/nccl  2019. NCCL. https:\/\/developer.nvidia.com\/nccl"},{"key":"e_1_3_2_1_2_1","unstructured":"2019. NVLink. https:\/\/www.nvidia.com\/en-us\/data-center\/nvlink\/  2019. NVLink. https:\/\/www.nvidia.com\/en-us\/data-center\/nvlink\/"},{"key":"e_1_3_2_1_3_1","unstructured":"Mart\u00edn Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen Craig Citro Greg\u00a0S Corrado Andy Davis Jeffrey Dean Matthieu Devin 2015. TensorFlow: Large-scale machine learning on heterogeneous systems.  Mart\u00edn Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen Craig Citro Greg\u00a0S Corrado Andy Davis Jeffrey Dean Matthieu Devin 2015. TensorFlow: Large-scale machine learning on heterogeneous systems."},{"key":"e_1_3_2_1_4_1","unstructured":"Tom\u00a0B Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell 2020. 
Language models are few-shot learners. arXiv preprint arXiv:2005.14165(2020).  Tom\u00a0B Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165(2020)."},{"key":"e_1_3_2_1_5_1","unstructured":"Chi-Chung Chen Chia-Lin Yang and Hsiang-Yun Cheng. 2018. Efficient and robust parallel dnn training through model parallelism on multi-gpu platform. arXiv preprint arXiv:1809.02839(2018).  Chi-Chung Chen Chia-Lin Yang and Hsiang-Yun Cheng. 2018. Efficient and robust parallel dnn training through model parallelism on multi-gpu platform. arXiv preprint arXiv:1809.02839(2018)."},{"key":"e_1_3_2_1_6_1","unstructured":"Tianqi Chen Bing Xu Chiyuan Zhang and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174(2016).  Tianqi Chen Bing Xu Chiyuan Zhang and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174(2016)."},{"key":"e_1_3_2_1_7_1","volume-title":"11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14)","author":"Chilimbi Trishul","year":"2014","unstructured":"Trishul Chilimbi , Yutaka Suzue , Johnson Apacible , and Karthik Kalyanaraman . 2014 . Project adam: Building an efficient and scalable deep learning training system . In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) . 571\u2013582. Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project adam: Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 571\u2013582."},{"key":"e_1_3_2_1_8_1","volume-title":"International conference on machine learning. PMLR, 1337\u20131345","author":"Coates Adam","year":"2013","unstructured":"Adam Coates , Brody Huval , Tao Wang , David Wu , Bryan Catanzaro , and Ng Andrew . 2013 . Deep learning with COTS HPC systems . In International conference on machine learning. PMLR, 1337\u20131345 . Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Ng Andrew. 2013. Deep learning with COTS HPC systems. In International conference on machine learning. PMLR, 1337\u20131345."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2959100.2959190"},{"key":"e_1_3_2_1_10_1","volume-title":"2014 USENIX Annual Technical Conference (USENIX ATC 14)","author":"Cui Henggang","year":"2014","unstructured":"Henggang Cui , James Cipar , Qirong Ho , Jin\u00a0Kyu Kim , Seunghak Lee , Abhimanu Kumar , Jinliang Wei , Wei Dai , Gregory\u00a0 R Ganger , Phillip\u00a0 B Gibbons , 2014 . Exploiting bounded staleness to speed up big data analytics . In 2014 USENIX Annual Technical Conference (USENIX ATC 14) . 37\u201348. Henggang Cui, James Cipar, Qirong Ho, Jin\u00a0Kyu Kim, Seunghak Lee, Abhimanu Kumar, Jinliang Wei, Wei Dai, Gregory\u00a0R Ganger, Phillip\u00a0B Gibbons, 2014. Exploiting bounded staleness to speed up big data analytics. In 2014 USENIX Annual Technical Conference (USENIX ATC 14). 37\u201348."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v29i1.9195"},{"key":"e_1_3_2_1_12_1","unstructured":"Jeffrey Dean Greg\u00a0S Corrado Rajat Monga Kai Chen Matthieu Devin Quoc\u00a0V Le Mark\u00a0Z Mao Marc\u2019Aurelio Ranzato Andrew Senior Paul Tucker 2012. Large scale distributed deep networks. (2012).  
Jeffrey Dean Greg\u00a0S Corrado Rajat Monga Kai Chen Matthieu Devin Quoc\u00a0V Le Mark\u00a0Z Mao Marc\u2019Aurelio Ranzato Andrew Senior Paul Tucker 2012. Large scale distributed deep networks. (2012)."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3437801.3441593"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2019.05.016"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1006\/jpdc.1994.1085"},{"key":"e_1_3_2_1_16_1","unstructured":"Priya Goyal Piotr Doll\u00e1r Ross Girshick Pieter Noordhuis Lukasz Wesolowski Aapo Kyrola Andrew Tulloch Yangqing Jia and Kaiming He. 2017. Accurate large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677(2017).  Priya Goyal Piotr Doll\u00e1r Ross Girshick Pieter Noordhuis Lukasz Wesolowski Aapo Kyrola Andrew Tulloch Yangqing Jia and Kaiming He. 2017. Accurate large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677(2017)."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/347837.347846"},{"key":"e_1_3_2_1_18_1","volume-title":"More effective distributed ml via a stale synchronous parallel parameter server. Advances in neural information processing systems 2013","author":"Ho Qirong","year":"2013","unstructured":"Qirong Ho , James Cipar , Henggang Cui , Jin\u00a0Kyu Kim , Seunghak Lee , Phillip\u00a0 B Gibbons , Garth\u00a0 A Gibson , Gregory\u00a0 R Ganger , and Eric\u00a0 P Xing . 2013. More effective distributed ml via a stale synchronous parallel parameter server. Advances in neural information processing systems 2013 ( 2013 ), 1223. Qirong Ho, James Cipar, Henggang Cui, Jin\u00a0Kyu Kim, Seunghak Lee, Phillip\u00a0B Gibbons, Garth\u00a0A Gibson, Gregory\u00a0R Ganger, and Eric\u00a0P Xing. 2013. More effective distributed ml via a stale synchronous parallel parameter server. Advances in neural information processing systems 2013 (2013), 1223."},{"key":"e_1_3_2_1_19_1","volume-title":"Gpipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965(2018).","author":"Huang Yanping","year":"2018","unstructured":"Yanping Huang , Youlong Cheng , Ankur Bapna , Orhan Firat , Mia\u00a0Xu Chen , Dehao Chen , HyoukJoong Lee , Jiquan Ngiam , Quoc\u00a0 V Le , Yonghui Wu , 2018 . Gpipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965(2018). Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia\u00a0Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc\u00a0V Le, Yonghui Wu, 2018. Gpipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965(2018)."},{"key":"e_1_3_2_1_20_1","unstructured":"Anand Jayarajan Jinliang Wei Garth Gibson Alexandra Fedorova and Gennady Pekhimenko. 2019. Priority-based parameter propagation for distributed DNN training. arXiv preprint arXiv:1905.03960(2019).  Anand Jayarajan Jinliang Wei Garth Gibson Alexandra Fedorova and Gennady Pekhimenko. 2019. Priority-based parameter propagation for distributed DNN training. arXiv preprint arXiv:1905.03960(2019)."},{"key":"e_1_3_2_1_21_1","unstructured":"Zhihao Jia Matei Zaharia and Alex Aiken. 2018. Beyond Data and Model Parallelism for Deep Neural Networks. CoRR abs\/1807.05358(2018). arxiv:1807.05358http:\/\/arxiv.org\/abs\/1807.05358  Zhihao Jia Matei Zaharia and Alex Aiken. 2018. Beyond Data and Model Parallelism for Deep Neural Networks. CoRR abs\/1807.05358(2018). 
arxiv:1807.05358http:\/\/arxiv.org\/abs\/1807.05358"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2901318.2901331"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3302424.3303957"},{"key":"e_1_3_2_1_24_1","unstructured":"Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. CoRR abs\/1404.5997(2014). arxiv:1404.5997http:\/\/arxiv.org\/abs\/1404.5997  Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. CoRR abs\/1404.5997(2014). arxiv:1404.5997http:\/\/arxiv.org\/abs\/1404.5997"},{"key":"e_1_3_2_1_25_1","volume-title":"On Model Parallelization and Scheduling Strategies for Distributed Machine Learning. (12","author":"Lee Seunghak","year":"2014","unstructured":"Seunghak Lee , Jin\u00a0Kyu Kim , Xun Zheng , Qirong Ho , Garth Gibson , and Eric\u00a0 P Xing . 2014. On Model Parallelization and Scheduling Strategies for Distributed Machine Learning. (12 2014 ). https:\/\/doi.org\/10.1184\/R1\/6476048.v1 10.1184\/R1 Seunghak Lee, Jin\u00a0Kyu Kim, Xun Zheng, Qirong Ho, Garth Gibson, and Eric\u00a0P Xing. 2014. On Model Parallelization and Scheduling Strategies for Distributed Machine Learning. (12 2014). https:\/\/doi.org\/10.1184\/R1\/6476048.v1"},{"key":"e_1_3_2_1_26_1","volume-title":"Scaling Distributed Machine Learning with the Parameter Server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14)","author":"Li Mu","year":"2014","unstructured":"Mu Li , David\u00a0 G. Andersen , Jun\u00a0Woo Park , Alexander\u00a0 J. Smola , Amr Ahmed , Vanja Josifovski , James Long , Eugene\u00a0 J. Shekita , and Bor-Yiing Su . 2014 . Scaling Distributed Machine Learning with the Parameter Server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) . USENIX Association, Broomfield, CO, 583\u2013598. https:\/\/www.usenix.org\/conference\/osdi14\/technical-sessions\/presentation\/li_mu Mu Li, David\u00a0G. Andersen, Jun\u00a0Woo Park, Alexander\u00a0J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene\u00a0J. Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). USENIX Association, Broomfield, CO, 583\u2013598. https:\/\/www.usenix.org\/conference\/osdi14\/technical-sessions\/presentation\/li_mu"},{"key":"e_1_3_2_1_27_1","volume-title":"Pipe-sgd: A decentralized pipelined sgd framework for distributed deep net training. arXiv preprint arXiv:1811.03619(2018).","author":"Li Youjie","year":"2018","unstructured":"Youjie Li , Mingchao Yu , Songze Li , Salman Avestimehr , Nam\u00a0Sung Kim , and Alexander Schwing . 2018 . Pipe-sgd: A decentralized pipelined sgd framework for distributed deep net training. arXiv preprint arXiv:1811.03619(2018). Youjie Li, Mingchao Yu, Songze Li, Salman Avestimehr, Nam\u00a0Sung Kim, and Alexander Schwing. 2018. Pipe-sgd: A decentralized pipelined sgd framework for distributed deep net training. arXiv preprint arXiv:1811.03619(2018)."},{"key":"e_1_3_2_1_28_1","volume-title":"International Conference on Machine Learning. PMLR, 3043\u20133052","author":"Lian Xiangru","year":"2018","unstructured":"Xiangru Lian , Wei Zhang , Ce Zhang , and Ji Liu . 2018 . Asynchronous decentralized parallel stochastic gradient descent . In International Conference on Machine Learning. PMLR, 3043\u20133052 . Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. 2018. 
Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning. PMLR, 3043\u20133052."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.5555\/2946645.2946679"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_3_2_1_31_1","unstructured":"Deepak Narayanan Amar Phanishayee Kaiyu Shi Xie Chen and Matei Zaharia. 2020. Memory-efficient pipeline-parallel dnn training. arXiv preprint arXiv:2006.09503(2020).  Deepak Narayanan Amar Phanishayee Kaiyu Shi Xie Chen and Matei Zaharia. 2020. Memory-efficient pipeline-parallel dnn training. arXiv preprint arXiv:2006.09503(2020)."},{"key":"e_1_3_2_1_32_1","unstructured":"Feng Niu Benjamin Recht Christopher R\u00e9 and Stephen\u00a0J Wright. 2011. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. arXiv preprint arXiv:1106.5730(2011).  Feng Niu Benjamin Recht Christopher R\u00e9 and Stephen\u00a0J Wright. 2011. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. arXiv preprint arXiv:1106.5730(2011)."},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3410463.3414656"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2019.2935967"},{"key":"e_1_3_2_1_35_1","volume-title":"2020 USENIX Annual Technical Conference (USENIX ATC 20)","author":"Park H","year":"2020","unstructured":"Jay\u00a0 H Park , Gyeongchan Yun , M\u00a0Yi Chang , Nguyen\u00a0 T Nguyen , Seungmin Lee , Jaesik Choi , Sam\u00a0 H Noh , and Young-ri Choi. 2020 . HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism . In 2020 USENIX Annual Technical Conference (USENIX ATC 20) . 307\u2013321. Jay\u00a0H Park, Gyeongchan Yun, M\u00a0Yi Chang, Nguyen\u00a0T Nguyen, Seungmin Lee, Jaesik Choi, Sam\u00a0H Noh, and Young-ri Choi. 2020. HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism. In 2020 USENIX Annual Technical Conference (USENIX ATC 20). 307\u2013321."},{"key":"e_1_3_2_1_36_1","unstructured":"Adam Paszke Sam Gross Soumith Chintala Gregory Chanan Edward Yang Zachary DeVito Zeming Lin Alban Desmaison Luca Antiga and Adam Lerer. 2017. Automatic differentiation in pytorch. (2017).  Adam Paszke Sam Gross Soumith Chintala Gregory Chanan Edward Yang Zachary DeVito Zeming Lin Alban Desmaison Luca Antiga and Adam Lerer. 2017. Automatic differentiation in pytorch. (2017)."},{"key":"e_1_3_2_1_37_1","unstructured":"Alec Radford Karthik Narasimhan Tim Salimans and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. (2018).  Alec Radford Karthik Narasimhan Tim Salimans and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. (2018)."},{"key":"e_1_3_2_1_38_1","volume-title":"Language models are unsupervised multitask learners. OpenAI blog 1, 8","author":"Radford Alec","year":"2019","unstructured":"Alec Radford , Jeffrey Wu , Rewon Child , David Luan , Dario Amodei , and Ilya Sutskever . 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 ( 2019 ), 9. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. 
OpenAI blog 1, 8 (2019), 9."},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33014780"},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"crossref","unstructured":"Rico Sennrich Barry Haddow and Alexandra Birch. 2016. Edinburgh neural machine translation systems for wmt 16. arXiv preprint arXiv:1606.02891(2016).  Rico Sennrich Barry Haddow and Alexandra Birch. 2016. Edinburgh neural machine translation systems for wmt 16. arXiv preprint arXiv:1606.02891(2016).","DOI":"10.18653\/v1\/W16-2323"},{"key":"e_1_3_2_1_42_1","unstructured":"Alexander Sergeev and Mike\u00a0Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. CoRR abs\/1802.05799(2018). arxiv:1802.05799http:\/\/arxiv.org\/abs\/1802.05799  Alexander Sergeev and Mike\u00a0Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. CoRR abs\/1802.05799(2018). arxiv:1802.05799http:\/\/arxiv.org\/abs\/1802.05799"},{"key":"e_1_3_2_1_43_1","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556(2014).  Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556(2014)."},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3219869"},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178487.3178491"},{"key":"e_1_3_2_1_46_1","unstructured":"Yonghui Wu Mike Schuster Zhifeng Chen Quoc\u00a0V Le Mohammad Norouzi Wolfgang Macherey Maxim Krikun Yuan Cao Qin Gao Klaus Macherey 2016. Google\u2019s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144(2016).  Yonghui Wu Mike Schuster Zhifeng Chen Quoc\u00a0V Le Mohammad Norouzi Wolfgang Macherey Maxim Krikun Yuan Cao Qin Gao Klaus Macherey 2016. Google\u2019s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144(2016)."},{"key":"e_1_3_2_1_47_1","volume-title":"Proceedings of Machine Learning and Systems 3","author":"Yang Bowen","year":"2021","unstructured":"Bowen Yang , Jian Zhang , Jonathan Li , Christopher R\u00e9 , Christopher Aberger , and Christopher De\u00a0Sa . 2021 . Pipemare: Asynchronous pipeline parallel dnn training . Proceedings of Machine Learning and Systems 3 (2021). Bowen Yang, Jian Zhang, Jonathan Li, Christopher R\u00e9, Christopher Aberger, and Christopher De\u00a0Sa. 2021. Pipemare: Asynchronous pipeline parallel dnn training. Proceedings of Machine Learning and Systems 3 (2021)."},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CBD.2019.00020"},{"key":"e_1_3_2_1_49_1","volume-title":"2017 USENIX Annual Technical Conference (USENIX ATC 17)","author":"Zhang Hao","year":"2017","unstructured":"Hao Zhang , Zeyu Zheng , Shizhen Xu , Wei Dai , Qirong Ho , Xiaodan Liang , Zhiting Hu , Jinliang Wei , Pengtao Xie , and Eric\u00a0 P Xing . 2017 . Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters . In 2017 USENIX Annual Technical Conference (USENIX ATC 17) . 181\u2013193. Hao Zhang, Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jinliang Wei, Pengtao Xie, and Eric\u00a0P Xing. 2017. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. 
In 2017 USENIX Annual Technical Conference (USENIX ATC 17). 181\u2013193."},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3123978"}],"event":{"name":"ICPP 2021: 50th International Conference on Parallel Processing","location":"Lemont IL USA","acronym":"ICPP 2021"},"container-title":["50th International Conference on Parallel Processing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3472456.3472497","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3472456.3472497","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:48:12Z","timestamp":1750193292000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3472456.3472497"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,8,9]]},"references-count":50,"alternative-id":["10.1145\/3472456.3472497","10.1145\/3472456"],"URL":"https:\/\/doi.org\/10.1145\/3472456.3472497","relation":{},"subject":[],"published":{"date-parts":[[2021,8,9]]},"assertion":[{"value":"2021-10-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
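A minimal sketch of how a record like the one above can be retrieved and cleaned. The Crossref REST API endpoint (https://api.crossref.org/works/{DOI}) and the response envelope ({"status": "ok", "message-type": "work", "message": {...}}) are real; the dedupe() heuristic is an assumption based on the duplication pattern visible in this record's "unstructured" reference fields, not an official Crossref cleanup rule, and will miss duplicates whose two halves differ in punctuation spacing.

import json
import urllib.request

DOI = "10.1145/3472456.3472497"  # DOI of the Hippie ICPP 2021 paper, from the record above

def fetch_work(doi):
    """Fetch a work record from the public Crossref REST API."""
    url = "https://api.crossref.org/works/" + doi
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    # Crossref wraps the record: {"status": "ok", "message-type": "work", "message": {...}}
    return payload["message"]

def dedupe(text):
    """Collapse a verbatim self-duplicated string ("X  X") back to "X".

    Heuristic (an assumption): after whitespace normalization, a duplicated
    field is an odd-length string whose two halves around the middle space
    are identical.
    """
    t = " ".join(text.split())  # normalize runs of whitespace
    half, rem = divmod(len(t), 2)
    if rem == 1 and t[:half] == t[half + 1:]:
        return t[:half]
    return t

work = fetch_work(DOI)
print(work["title"][0])
for ref in work.get("reference", []):
    if "unstructured" in ref:
        print("-", dedupe(ref["unstructured"]))

For example, the first reference's raw field "2019. NCCL. https://developer.nvidia.com/nccl  2019. NCCL. https://developer.nvidia.com/nccl" normalizes to an odd-length string with identical halves, so dedupe() returns the single citation string.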