{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,4]],"date-time":"2026-06-04T12:26:33Z","timestamp":1780575993615,"version":"3.54.1"},"publisher-location":"New York, NY, USA","reference-count":54,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,2,17]],"date-time":"2021-02-17T00:00:00Z","timestamp":1613520000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,2,17]]},"DOI":"10.1145\/3437801.3441593","type":"proceedings-article","created":{"date-parts":[[2021,2,20]],"date-time":"2021-02-20T23:04:20Z","timestamp":1613862260000},"page":"431-445","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":190,"title":["DAPPLE"],"prefix":"10.1145","author":[{"given":"Shiqing","family":"Fan","sequence":"first","affiliation":[{"name":"Alibaba Group"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yi","family":"Rong","sequence":"additional","affiliation":[{"name":"Alibaba Group"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Chen","family":"Meng","sequence":"additional","affiliation":[{"name":"Alibaba Group"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zongyan","family":"Cao","sequence":"additional","affiliation":[{"name":"Alibaba Group"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Siyu","family":"Wang","sequence":"additional","affiliation":[{"name":"Alibaba Group"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zhen","family":"Zheng","sequence":"additional","affiliation":[{"name":"Alibaba Group"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Chuan","family":"Wu","sequence":"additional","affiliation":[{"name":"The University of Hong 
Kong"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Guoping","family":"Long","sequence":"additional","affiliation":[{"name":"Alibaba Group"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jun","family":"Yang","sequence":"additional","affiliation":[{"name":"Alibaba Group"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Lixue","family":"Xia","sequence":"additional","affiliation":[{"name":"Alibaba Group"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Lansong","family":"Diao","sequence":"additional","affiliation":[{"name":"Alibaba Group"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Xiaoyong","family":"Liu","sequence":"additional","affiliation":[{"name":"Alibaba Group"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Wei","family":"Lin","sequence":"additional","affiliation":[{"name":"Alibaba Group"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2021,2,17]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"2016. A New Lightweight Modular and Scalable Deep Learning Framework. https:\/\/caffe2.ai\/."},{"key":"e_1_3_2_1_2_1","unstructured":"2018. Baidu-allreduce. https:\/\/github.com\/baidu-research\/baidu-allreduce."},{"key":"e_1_3_2_1_3_1","unstructured":"2019. Byteps A high performance and generic framework for distributed DNN training. https:\/\/github.com\/bytedance\/byteps."},{"key":"e_1_3_2_1_4_1","unstructured":"2019. GpipeTalk. https:\/\/www.youtube.com\/watch?v=9s2cum25Kkc."},{"key":"e_1_3_2_1_5_1","unstructured":"2019. 
Gradients Accumulation-PyTorch. https:\/\/gist.github.com\/thomwolf\/ac7a7da6b1888c2eeac8ac8b9b05d3d3."},{"key":"e_1_3_2_1_6_1","unstructured":"2019. Gradients Accumulation-Tensorflow. https:\/\/github.com\/tensorflow\/tensorflow\/pull\/32576."},{"key":"e_1_3_2_1_7_1","unstructured":"2019. NCCL. https:\/\/developer.nvidia.com\/nccl."},{"key":"e_1_3_2_1_8_1","unstructured":"2019. NVLink. https:\/\/www.nvidia.com\/en-us\/data-center\/nvlink\/."},{"key":"e_1_3_2_1_9_1","unstructured":"Mart\u00edn Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen Craig Citro Greg S. 
Corrado Andy Davis Jeffrey Dean Matthieu Devin Sanjay Ghemawat Ian Goodfellow Andrew Harp Geoffrey Irving Michael Isard Yangqing Jia Rafal Jozefowicz Lukasz Kaiser Manjunath Kudlur Josh Levenberg Dan Man\u00e9 Rajat Monga Sherry Moore Derek Murray Chris Olah Mike Schuster Jonathon Shlens Benoit Steiner Ilya Sutskever Kunal Talwar Paul Tucker Vincent Vanhoucke Vijay Vasudevan Fernanda Vi\u00e9gas Oriol Vinyals Pete Warden Martin Wattenberg Martin Wicke Yuan Yu and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. http:\/\/tensorflow.org\/ Software available from tensorflow.org."},{"key":"e_1_3_2_1_10_1","volume-title":"Yuan Cao, George Foster, Colin Cherry, et al.","author":"Arivazhagan Naveen","year":"2019","unstructured":"Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019 (2019)."},{"key":"e_1_3_2_1_11_1","volume-title":"Efficient and robust parallel dnn training through model parallelism on multi-gpu platform. arXiv preprint arXiv:1809.02839","author":"Chen Chi-Chung","year":"2018","unstructured":"Chi-Chung Chen, Chia-Lin Yang, and Hsiang-Yun Cheng. 2018. Efficient and robust parallel dnn training through model parallelism on multi-gpu platform. 
arXiv preprint arXiv:1809.02839 (2018)."},{"key":"e_1_3_2_1_12_1","volume-title":"Training Deep Nets with Sublinear Memory Cost. arXiv preprint arXiv:1604.06174","author":"Chen Tianqi","year":"2016","unstructured":"Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training Deep Nets with Sublinear Memory Cost. arXiv preprint arXiv:1604.06174 (2016)."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2959100.2959190"},{"key":"e_1_3_2_1_14_1","unstructured":"Jeffrey Dean Greg Corrado Rajat Monga Kai Chen Matthieu Devin Mark Mao Marc'aurelio Ranzato Andrew Senior Paul Tucker Ke Yang et al. 2012. Large scale distributed deep networks. In Advances in neural information processing systems. 1223--1231."},{"key":"e_1_3_2_1_15_1","unstructured":"Julien Demouth. 2015. CUDA Pro Tip: Minimize the Tail Effect. https:\/\/devblogs.nvidia.com\/cuda-pro-tip-minimize-the-tail-effect\/"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3295500.3356207"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3322795.3331463"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3322795.3331461"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2019.00018"},{"key":"e_1_3_2_1_20_1","volume-title":"large minibatch sgd: Training imagenet in 1 hour. 
arXiv preprint arXiv:1706.02677","author":"Goyal Priya","year":"2017","unstructured":"Priya Goyal, Piotr Doll\u00e1r, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)."},{"key":"e_1_3_2_1_21_1","volume-title":"Tiresias: A {GPU} cluster manager for distributed deep learning. In 16th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 19). 485--500.","author":"Gu Juncheng","year":"2019","unstructured":"Juncheng Gu, Mosharaf Chowdhury, Kang G Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang Liu, and Chuanxiong Guo. 2019. Tiresias: A {GPU} cluster manager for distributed deep learning. In 16th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 19). 485--500."},{"key":"e_1_3_2_1_22_1","volume-title":"XPipe: Efficient Pipeline Model Parallelism for Multi-GPU DNN Training. arXiv preprint arXiv:1911.04610","author":"Guan Lei","year":"2019","unstructured":"Lei Guan, Wotao Yin, Dongsheng Li, and Xicheng Lu. 2019. XPipe: Efficient Pipeline Model Parallelism for Multi-GPU DNN Training. 
arXiv preprint arXiv:1911.04610 (2019)."},{"key":"e_1_3_2_1_23_1","volume-title":"Pipedream: Fast and efficient pipeline parallel dnn training. arXiv preprint arXiv:1806.03377","author":"Harlap Aaron","year":"2018","unstructured":"Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. 2018. Pipedream: Fast and efficient pipeline parallel dnn training. arXiv preprint arXiv:1806.03377 (2018)."},{"key":"e_1_3_2_1_24_1","volume-title":"Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems. 103--112.","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems. 103--112."},{"key":"e_1_3_2_1_25_1","volume-title":"Decoupled parallel backpropagation with convergence guarantee. arXiv preprint arXiv:1804.10574","author":"Huo Zhouyuan","year":"2018","unstructured":"Zhouyuan Huo, Bin Gu, Qian Yang, and Heng Huang. 2018. Decoupled parallel backpropagation with convergence guarantee. 
arXiv preprint arXiv:1804.10574 (2018)."},{"key":"e_1_3_2_1_26_1","first-page":"497","article-title":"Checkmate: Breaking the memory wall with optimal tensor rematerialization","volume":"2","author":"Jain Paras","year":"2020","unstructured":"Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica. 2020. Checkmate: Breaking the memory wall with optimal tensor rematerialization. Proceedings of Machine Learning and Systems 2 (2020), 497--511.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_27_1","volume-title":"Priority-based parameter propagation for distributed DNN training. arXiv preprint arXiv:1905.03960","author":"Jayarajan Anand","year":"2019","unstructured":"Anand Jayarajan, Jinliang Wei, Garth Gibson, Alexandra Fedorova, and Gennady Pekhimenko. 2019. Priority-based parameter propagation for distributed DNN training. arXiv preprint arXiv:1905.03960 (2019)."},{"key":"e_1_3_2_1_28_1","unstructured":"Xianyan Jia Shutao Song Wei He Yangzihao Wang Haidong Rong Feihu Zhou Liqiang Xie Zhenyu Guo Yuanzhou Yang Liwei Yu et al. 2018. Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. 
arXiv preprint arXiv:1807.11205 (2018)."},{"key":"e_1_3_2_1_29_1","unstructured":"Zhihao Jia Sina Lin Charles R Qi and Alex Aiken. 2018. Exploring the Hidden Dimension in Accelerating Convolutional Neural Networks. (2018)."},{"key":"e_1_3_2_1_30_1","volume-title":"Beyond data and model parallelism for deep neural networks. arXiv preprint arXiv:1807.05358","author":"Jia Zhihao","year":"2018","unstructured":"Zhihao Jia, Matei Zaharia, and Alex Aiken. 2018. Beyond data and model parallelism for deep neural networks. arXiv preprint arXiv:1807.05358 (2018)."},{"key":"e_1_3_2_1_31_1","volume-title":"torchgpipe: On-the-fly Pipeline Parallelism for Training Giant Models. arXiv preprint arXiv:2004.09910","author":"Kim Chiheon","year":"2020","unstructured":"Chiheon Kim, Heungsub Lee, Myungryong Jeong, Woonhyuk Baek, Boogeon Yoon, Ildoo Kim, Sungbin Lim, and Sungwoong Kim. 2020. torchgpipe: On-the-fly Pipeline Parallelism for Training Giant Models. arXiv preprint arXiv:2004.09910 (2020)."},{"key":"e_1_3_2_1_32_1","volume-title":"One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997","author":"Krizhevsky Alex","year":"2014","unstructured":"Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. 
arXiv preprint arXiv:1404.5997 (2014)."},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3184407.3184424"},{"key":"e_1_3_2_1_34_1","volume-title":"Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668","author":"Lepikhin Dmitry","year":"2020","unstructured":"Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 (2020)."},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.5555\/3305890.3305932"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_3_2_1_37_1","volume-title":"Memory-Efficient Pipeline-Parallel DNN Training. arXiv preprint arXiv:2006.09503","author":"Narayanan Deepak","year":"2020","unstructured":"Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. 2020. Memory-Efficient Pipeline-Parallel DNN Training. arXiv preprint arXiv:2006.09503 (2020)."},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2019.2935967"},{"key":"e_1_3_2_1_39_1","volume-title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. 
https:\/\/arxiv.org\/abs\/1910.10683","author":"Raffel Colin","year":"2019","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and J. Peter Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. https:\/\/arxiv.org\/abs\/1910.10683 (2019)."},{"key":"e_1_3_2_1_40_1","volume-title":"Know What You Don't Know: Unanswerable Questions for SQuAD. arXiv preprint arXiv:1806.03822","author":"Rajpurkar Pranav","year":"2018","unstructured":"Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don't Know: Unanswerable Questions for SQuAD. arXiv preprint arXiv:1806.03822 (2018)."},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"crossref","unstructured":"Olga Russakovsky Jia Deng Hao Su Jonathan Krause Sanjeev Satheesh Sean Ma Zhiheng Huang Andrej Karpathy Aditya Khosla Michael Bernstein et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision 115 3 (2015) 211--252.","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_3_2_1_42_1","volume-title":"Edinburgh neural machine translation systems for WMT 16. arXiv preprint arXiv:1606.02891","author":"Sennrich Rico","year":"2016","unstructured":"Rico Sennrich, Barry Haddow, and Alexandra Birch. 
2016. Edinburgh neural machine translation systems for WMT 16. arXiv preprint arXiv:1606.02891 (2016)."},{"key":"e_1_3_2_1_43_1","volume-title":"Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799","author":"Sergeev Alexander","year":"2018","unstructured":"Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018)."},{"key":"e_1_3_2_1_44_1","unstructured":"Rich Sutton. 2019. The Bitter Lesson. http:\/\/www.incompleteideas.net\/IncIdeas\/BitterLesson.html."},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3219869"},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3302424.3303953"},{"key":"e_1_3_2_1_47_1","volume-title":"Characterizing Deep Learning Training Workloads on Alibaba-PAI. arXiv preprint arXiv:1910.05930","author":"Wang Mengdi","year":"2019","unstructured":"Mengdi Wang, Chen Meng, Guoping Long, Chuan Wu, Jun Yang, Wei Lin, and Yangqing Jia. 2019. Characterizing Deep Learning Training Workloads on Alibaba-PAI. arXiv preprint arXiv:1910.05930 (2019)."},{"key":"e_1_3_2_1_48_1","volume-title":"Auto-MAP: A DQN Framework for Exploring Distributed Execution Plans for DNN Workloads. 
arXiv preprint arXiv:2007.04069","author":"Wang Siyu","year":"2020","unstructured":"Siyu Wang, Yi Rong, Shiqing Fan, Zhen Zheng, LanSong Diao, Guoping Long, Jun Yang, Xiaoyong Liu, and Wei Lin. 2020. Auto-MAP: A DQN Framework for Exploring Distributed Execution Plans for DNN Workloads. arXiv preprint arXiv:2007.04069 (2020)."},{"key":"e_1_3_2_1_49_1","volume-title":"PipeMare: Asynchronous Pipeline Parallel DNN Training. arXiv preprint arXiv:1910.05124","author":"Yang Bowen","year":"2019","unstructured":"Bowen Yang, Jian Zhang, Jonathan Li, Christopher R\u00e9, Christopher R Aberger, and Christopher De Sa. 2019. PipeMare: Asynchronous Pipeline Parallel DNN Training. arXiv preprint arXiv:1910.05124 (2019)."},{"key":"e_1_3_2_1_50_1","volume-title":"XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237","author":"Yang Zhilin","year":"2019","unstructured":"Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. 
arXiv preprint arXiv:1906.08237 (2019)."},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3295500.3356137"},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/CBD.2019.00020"},{"key":"e_1_3_2_1_53_1","volume-title":"Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters. In 2017 USENIX Annual Technical Conference (USENIX ATC 17)","author":"Zhang Hao","unstructured":"Hao Zhang, Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jinliang Wei, Pengtao Xi, and Eric P. Xing. 2017. Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters. In 2017 USENIX Annual Technical Conference (USENIX ATC 17). USENIX Association, Santa Clara, CA, 181--193. https:\/\/www.usenix.org\/conference\/atc17\/technical-sessions\/presentation\/zhang"},{"key":"e_1_3_2_1_54_1","volume-title":"Fusion-stitching: boosting memory intensive computations for deep learning workloads. arXiv preprint arXiv:2009.10924","author":"Zheng Zhen","year":"2020","unstructured":"Zhen Zheng, Pengzhan Zhao, Guoping Long, Feiwen Zhu, Kai Zhu, Wenyi Zhao, Lansong Diao, Jun Yang, and Wei Lin. 2020. Fusion-stitching: boosting memory intensive computations for deep learning workloads. 
arXiv preprint arXiv:2009.10924 (2020)."}],"event":{"name":"PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","location":"Virtual Event Republic of Korea","acronym":"PPoPP '21","sponsor":["SIGPLAN ACM Special Interest Group on Programming Languages","SIGHPC ACM Special Interest Group on High Performance Computing, Special Interest Group on High Performance Computing"]},"container-title":["Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3437801.3441593","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3437801.3441593","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:17:25Z","timestamp":1750191445000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3437801.3441593"}},"subtitle":["a pipelined data parallel approach for training large models"],"short-title":[],"issued":{"date-parts":[[2021,2,17]]},"references-count":54,"alternative-id":["10.1145\/3437801.3441593","10.1145\/3437801"],"URL":"https:\/\/doi.org\/10.1145\/3437801.3441593","relation":{},"subject":[],"published":{"date-parts":[[2021,2,17]]},"assertion":[{"value":"2021-02-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}