{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,22]],"date-time":"2026-06-22T22:46:59Z","timestamp":1782168419859,"version":"3.54.5"},"publisher-location":"New York, NY, USA","reference-count":45,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,11,13]],"date-time":"2021-11-13T00:00:00Z","timestamp":1636761600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,11,14]]},"DOI":"10.1145\/3458817.3476205","type":"proceedings-article","created":{"date-parts":[[2021,10,21]],"date-time":"2021-10-21T04:49:21Z","timestamp":1634791761000},"page":"1-14","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":238,"title":["ZeRO-infinity"],"prefix":"10.1145","author":[{"given":"Samyam","family":"Rajbhandari","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Olatunji","family":"Ruwase","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jeff","family":"Rasley","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Shaden","family":"Smith","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yuxiong","family":"He","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2021,11,13]]},"reference":[{"key":"e_1_3_2_2_1_1","volume-title":"BERT: pretraining of deep bidirectional transformers for language understanding. CoRR, abs\/1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . BERT: pretraining of deep bidirectional transformers for language understanding. CoRR, abs\/1810.04805 , 2018 . Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pretraining of deep bidirectional transformers for language understanding. CoRR, abs\/1810.04805, 2018."},{"key":"e_1_3_2_2_2_1","volume-title":"Language models are unsupervised multitask learners","author":"Radford Alec","year":"2019","unstructured":"Alec Radford , Jeff Wu , Rewon Child , David Luan , Dario Amodei , and Ilya Sutskever . Language models are unsupervised multitask learners . 2019 . Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019."},{"key":"e_1_3_2_2_3_1","volume-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","author":"Raffel Colin","year":"2019","unstructured":"Colin Raffel , Noam Shazeer , Adam Roberts , Katherine Lee , Sharan Narang , Michael Matena , Yanqi Zhou , Wei Li , and Peter J. Liu . Exploring the limits of transfer learning with a unified text-to-text transformer , 2019 . Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2019."},{"key":"e_1_3_2_2_4_1","volume-title":"Language models are few-shot learners. arXiv preprint arXiv:2005.14165","author":"Brown Tom B","year":"2020","unstructured":"Tom B Brown , Benjamin Mann , Nick Ryder , Melanie Subbiah , Jared Kaplan , Prafulla Dhariwal , Arvind Neelakantan , Pranav Shyam , Girish Sastry , Amanda Askell , Language models are few-shot learners. arXiv preprint arXiv:2005.14165 , 2020 . Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020."},{"key":"e_1_3_2_2_5_1","volume-title":"Scaling laws for neural language models","author":"Kaplan Jared","year":"2020","unstructured":"Jared Kaplan , Sam McCandlish , Tom Henighan , Tom B. Brown , Benjamin Chess , Rewon Child , Scott Gray , Alec Radford , Jeffrey Wu , and Dario Amodei . Scaling laws for neural language models , 2020 . Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020."},{"key":"e_1_3_2_2_6_1","volume-title":"Deep contextualized word representations. arXiv preprint arXiv:1802.05365","author":"Peters Matthew E","year":"2018","unstructured":"Matthew E Peters , Mark Neumann , Mohit Iyyer , Matt Gardner , Christopher Clark , Kenton Lee , and Luke Zettlemoyer . Deep contextualized word representations. arXiv preprint arXiv:1802.05365 , 2018 . Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018."},{"key":"e_1_3_2_2_7_1","volume-title":"Megatron-lm: Training multi-billion parameter language models using model parallelism","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi , Mostofa Patwary , Raul Puri , Patrick LeGresley , Jared Casper , and Bryan Catanzaro . Megatron-lm: Training multi-billion parameter language models using model parallelism , 2019 . Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2019."},{"key":"e_1_3_2_2_8_1","volume-title":"Pipedream: Fast and efficient pipeline parallel DNN training. CoRR, abs\/1806.03377","author":"Harlap Aaron","year":"2018","unstructured":"Aaron Harlap , Deepak Narayanan , Amar Phanishayee , Vivek Seshadri , Nikhil R. Devanur , Gregory R. Ganger , and Phillip B. Gibbons . Pipedream: Fast and efficient pipeline parallel DNN training. CoRR, abs\/1806.03377 , 2018 . Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, and Phillip B. Gibbons. Pipedream: Fast and efficient pipeline parallel DNN training. CoRR, abs\/1806.03377, 2018."},{"key":"e_1_3_2_2_9_1","volume-title":"Gpipe: Efficient training of giant neural networks using pipeline parallelism. ArXiv, abs\/1811.06965","author":"Huang Yanping","year":"2018","unstructured":"Yanping Huang , Yonglong Cheng , Dehao Chen , HyoukJoong Lee , Jiquan Ngiam , Quoc V. Le , and Zhifeng Chen . Gpipe: Efficient training of giant neural networks using pipeline parallelism. ArXiv, abs\/1811.06965 , 2018 . Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism. ArXiv, abs\/1811.06965, 2018."},{"key":"e_1_3_2_2_10_1","volume-title":"Memory-efficient pipeline-parallel dnn training. arXiv preprint arXiv:2006.09503","author":"Narayanan Deepak","year":"2020","unstructured":"Deepak Narayanan , Amar Phanishayee , Kaiyu Shi , Xie Chen , and Matei Zaharia . Memory-efficient pipeline-parallel dnn training. arXiv preprint arXiv:2006.09503 , 2020 . Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. Memory-efficient pipeline-parallel dnn training. arXiv preprint arXiv:2006.09503, 2020."},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00024"},{"key":"e_1_3_2_2_12_1","volume-title":"Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. ZeRO-Offload: Democratizing Billion-Scale Model Training","author":"Ren Jie","year":"2021","unstructured":"Jie Ren , Samyam Rajbhandari , Reza Yazdani Aminabadi , Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. ZeRO-Offload: Democratizing Billion-Scale Model Training , 2021 . Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. ZeRO-Offload: Democratizing Billion-Scale Model Training, 2021."},{"key":"e_1_3_2_2_13_1","volume-title":"DeepSpeed: Extreme-scale model training for everyone. https:\/\/www.microsoft.com\/en-us\/research\/blog\/deepspeed-extreme-scale-model-training-for-everyone\/","year":"2020","unstructured":"Microsoft. DeepSpeed: Extreme-scale model training for everyone. https:\/\/www.microsoft.com\/en-us\/research\/blog\/deepspeed-extreme-scale-model-training-for-everyone\/ , 2020 . Microsoft. DeepSpeed: Extreme-scale model training for everyone. https:\/\/www.microsoft.com\/en-us\/research\/blog\/deepspeed-extreme-scale-model-training-for-everyone\/, 2020."},{"key":"e_1_3_2_2_14_1","volume-title":"Megatron-LM: Software repository. https:\/\/github.com\/NVIDIA\/Megatron-LM","author":"Shoeybi Mohammad","year":"2021","unstructured":"Mohammad Shoeybi , Mostofa Patwary , Raul Puri , Patrick LeGresley , Jared Casper , and Bryan Catanzaro . Megatron-LM: Software repository. https:\/\/github.com\/NVIDIA\/Megatron-LM , 2021 . Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Software repository. https:\/\/github.com\/NVIDIA\/Megatron-LM, 2021."},{"key":"e_1_3_2_2_15_1","volume-title":"DeepSpeed: Extreme-scale model training for everyone. https:\/\/www.microsoft.com\/en-us\/research\/blog\/deepspeed-extreme-scale-model-training-for-everyone\/","author":"Team DeepSpeed","year":"2020","unstructured":"DeepSpeed Team and Rangan Majumder . DeepSpeed: Extreme-scale model training for everyone. https:\/\/www.microsoft.com\/en-us\/research\/blog\/deepspeed-extreme-scale-model-training-for-everyone\/ , 2020 . DeepSpeed Team and Rangan Majumder. DeepSpeed: Extreme-scale model training for everyone. https:\/\/www.microsoft.com\/en-us\/research\/blog\/deepspeed-extreme-scale-model-training-for-everyone\/, 2020."},{"key":"e_1_3_2_2_16_1","volume-title":"RiseLab Medium Post","author":"Gholami Amir","year":"2021","unstructured":"Amir Gholami , Zhewei Yao , Sehoon Kim , Michael W. Mahoney , and Kurt Keutzer . Ai and memory wall . RiseLab Medium Post , 2021 . Amir Gholami, Zhewei Yao, Sehoon Kim, Michael W. Mahoney, and Kurt Keutzer. Ai and memory wall. RiseLab Medium Post, 2021."},{"key":"e_1_3_2_2_17_1","volume-title":"Mesh-tensorflow: Deep learning for supercomputers. CoRR, abs\/1811.02084","author":"Shazeer Noam","year":"2018","unstructured":"Noam Shazeer , Youlong Cheng , Niki Parmar , Dustin Tran , Ashish Vaswani , Penporn Koanantakool , Peter Hawkins , HyoukJoong Lee , Mingsheng Hong , Cliff Young , Ryan Sepassi , and Blake A. Hechtman . Mesh-tensorflow: Deep learning for supercomputers. CoRR, abs\/1811.02084 , 2018 . Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake A. Hechtman. Mesh-tensorflow: Deep learning for supercomputers. CoRR, abs\/1811.02084, 2018."},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3302424.3303953"},{"key":"e_1_3_2_2_19_1","volume-title":"Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. ACM Computing Surveys (CSUR), 52(4):1--43","author":"Ben-Nun Tal","year":"2019","unstructured":"Tal Ben-Nun and Torsten Hoefler . Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. ACM Computing Surveys (CSUR), 52(4):1--43 , 2019 . Tal Ben-Nun and Torsten Hoefler. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. ACM Computing Surveys (CSUR), 52(4):1--43, 2019."},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378465"},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378530"},{"key":"e_1_3_2_2_22_1","volume-title":"Layer-centric memory reuse and data migration for extreme-scale deep learning on many-core architectures. ACM Transactions on Architecture and Code Optimization (TACO), 15(3):1--26","author":"Jin Hai","year":"2018","unstructured":"Hai Jin , Bo Liu , Wenbin Jiang , Yang Ma , Xuanhua Shi , Bingsheng He , and Shaofeng Zhao . Layer-centric memory reuse and data migration for extreme-scale deep learning on many-core architectures. ACM Transactions on Architecture and Code Optimization (TACO), 15(3):1--26 , 2018 . Hai Jin, Bo Liu, Wenbin Jiang, Yang Ma, Xuanhua Shi, Bingsheng He, and Shaofeng Zhao. Layer-centric memory reuse and data migration for extreme-scale deep learning on many-core architectures. ACM Transactions on Architecture and Code Optimization (TACO), 15(3):1--26, 2018."},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378505"},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00057"},{"key":"e_1_3_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2016.7783721"},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178487.3178491"},{"key":"e_1_3_2_2_27_1","volume-title":"Distributed hierarchical gpu parameter server for massive scale deep learning ads systems","author":"Zhao Weijie","year":"2020","unstructured":"Weijie Zhao , Deping Xie , Ronglai Jia , Yulei Qian , Ruiquan Ding , Mingming Sun , and Ping Li . Distributed hierarchical gpu parameter server for massive scale deep learning ads systems , 2020 . Weijie Zhao, Deping Xie, Ronglai Jia, Yulei Qian, Ruiquan Ding, Mingming Sun, and Ping Li. Distributed hierarchical gpu parameter server for massive scale deep learning ads systems, 2020."},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00070"},{"key":"e_1_3_2_2_29_1","volume-title":"Training deep nets with sublinear memory cost. CoRR, abs\/1604.06174","author":"Chen Tianqi","year":"2016","unstructured":"Tianqi Chen , Bing Xu , Chiyuan Zhang , and Carlos Guestrin . Training deep nets with sublinear memory cost. CoRR, abs\/1604.06174 , 2016 . Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. CoRR, abs\/1604.06174, 2016."},{"key":"e_1_3_2_2_30_1","volume-title":"Checkmate: Breaking the memory wall with optimal tensor rematerialization. ArXiv, abs\/1910.02653","author":"Jain Paras","year":"2019","unstructured":"Paras Jain , Ajay Jain , Aniruddha Nrusimha , Amir Gholami , Pieter Abbeel , Kurt Keutzer , Ion Stoica , and Joseph E. Gonzalez . Checkmate: Breaking the memory wall with optimal tensor rematerialization. ArXiv, abs\/1910.02653 , 2019 . Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, and Joseph E. Gonzalez. Checkmate: Breaking the memory wall with optimal tensor rematerialization. ArXiv, abs\/1910.02653, 2019."},{"key":"e_1_3_2_2_31_1","volume-title":"Zenglin Xu, and Tim Kraska. Superneurons: Dynamic GPU memory management for training deep neural networks. CoRR, abs\/1801.04380","author":"Wang Linnan","year":"2018","unstructured":"Linnan Wang , Jinmian Ye , Yiyang Zhao , Wei Wu , Ang Li , Shuaiwen Leon Song , Zenglin Xu, and Tim Kraska. Superneurons: Dynamic GPU memory management for training deep neural networks. CoRR, abs\/1801.04380 , 2018 . Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, and Tim Kraska. Superneurons: Dynamic GPU memory management for training deep neural networks. CoRR, abs\/1801.04380, 2018."},{"key":"e_1_3_2_2_32_1","volume-title":"July","author":"Duchi John","year":"2011","unstructured":"John Duchi , Elad Hazan , and Yoram Singer . Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12(null):2121--2159 , July 2011 . John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12(null):2121--2159, July 2011."},{"key":"e_1_3_2_2_33_1","volume-title":"3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7--9, 2015, Conference Track Proceedings","author":"Diederik","year":"2015","unstructured":"Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors , 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7--9, 2015, Conference Track Proceedings , 2015 . Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7--9, 2015, Conference Track Proceedings, 2015."},{"key":"e_1_3_2_2_34_1","volume-title":"Scaling SGD batch size to 32k for imagenet training. CoRR, abs\/1708.03888","author":"You Yang","year":"2017","unstructured":"Yang You , Igor Gitman , and Boris Ginsburg . Scaling SGD batch size to 32k for imagenet training. CoRR, abs\/1708.03888 , 2017 . Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. CoRR, abs\/1708.03888, 2017."},{"key":"e_1_3_2_2_35_1","volume-title":"Reducing BERT pre-training time from 3 days to 76 minutes. CoRR, abs\/1904.00962","author":"You Yang","year":"2019","unstructured":"Yang You , Jing Li , Jonathan Hseu , Xiaodan Song , James Demmel , and Cho-Jui Hsieh . Reducing BERT pre-training time from 3 days to 76 minutes. CoRR, abs\/1904.00962 , 2019 . Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. Reducing BERT pre-training time from 3 days to 76 minutes. CoRR, abs\/1904.00962, 2019."},{"key":"e_1_3_2_2_36_1","volume-title":"Mixed precision training","author":"Micikevicius Paulius","year":"2017","unstructured":"Paulius Micikevicius , Sharan Narang , Jonah Alben , Gregory Diamos , Erich Elsen , David Garcia , Boris Ginsburg , Michael Houston , Oleksii Kuchaiev , Ganesh Venkatesh , and Hao Wu . Mixed precision training , 2017 . Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training, 2017."},{"key":"e_1_3_2_2_37_1","volume-title":"https:\/\/www.nvidia.com\/en-us\/data-center\/tensor-cores\/","author":"Tensor Cores NVIDIA","year":"2018","unstructured":"NVIDIA Tensor Cores . https:\/\/www.nvidia.com\/en-us\/data-center\/tensor-cores\/ , 2018 . [Online, accessed 5-April-2021]. NVIDIA Tensor Cores. https:\/\/www.nvidia.com\/en-us\/data-center\/tensor-cores\/, 2018. [Online, accessed 5-April-2021]."},{"key":"e_1_3_2_2_38_1","volume-title":"Attention is all you need. CoRR, abs\/1706.03762","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan N. Gomez , Lukasz Kaiser , and Illia Polosukhin . Attention is all you need. CoRR, abs\/1706.03762 , 2017 . Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs\/1706.03762, 2017."},{"key":"e_1_3_2_2_39_1","volume-title":"Generating long sequences with sparse transformers. CoRR, abs\/1904.10509","author":"Child Rewon","year":"2019","unstructured":"Rewon Child , Scott Gray , Alec Radford , and Ilya Sutskever . Generating long sequences with sparse transformers. CoRR, abs\/1904.10509 , 2019 . Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. CoRR, abs\/1904.10509, 2019."},{"key":"e_1_3_2_2_40_1","volume-title":"Longformer: The long-document transformer. CoRR, abs\/2004.05150","author":"Beltagy Iz","year":"2020","unstructured":"Iz Beltagy , Matthew E. Peters , and Arman Cohan . Longformer: The long-document transformer. CoRR, abs\/2004.05150 , 2020 . Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. CoRR, abs\/2004.05150, 2020."},{"key":"e_1_3_2_2_41_1","unstructured":"turing-nlg: A 17-billion-parameter language model by microsoft.  turing-nlg: A 17-billion-parameter language model by microsoft."},{"key":"e_1_3_2_2_42_1","volume-title":"NVIDIA DGX SuperPOD delivers world record supercomputing to any enterprise. https:\/\/developer.nvidia.com\/blog\/dgx-superpod-world-record-supercomputing-enterprise\/","author":"NVIDIA.","year":"2019","unstructured":"NVIDIA. NVIDIA DGX SuperPOD delivers world record supercomputing to any enterprise. https:\/\/developer.nvidia.com\/blog\/dgx-superpod-world-record-supercomputing-enterprise\/ , 2019 . NVIDIA. NVIDIA DGX SuperPOD delivers world record supercomputing to any enterprise. https:\/\/developer.nvidia.com\/blog\/dgx-superpod-world-record-supercomputing-enterprise\/, 2019."},{"key":"e_1_3_2_2_43_1","volume-title":"Pytorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704","author":"Li Shen","year":"2020","unstructured":"Shen Li , Yanli Zhao , Rohan Varma , Omkar Salpekar , Pieter Noordhuis , Teng Li , Adam Paszke , Jeff Smith , Brian Vaughan , Pritam Damania , Pytorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704 , 2020 . Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. Pytorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704, 2020."},{"key":"e_1_3_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2017.37"},{"key":"e_1_3_2_2_45_1","volume-title":"ORNL Launches Summit Supercomputer. https:\/\/www.ornl.gov\/news\/ornl-launches-summit-supercomputer","author":"Oak Ridge National Laboratory.","year":"2018","unstructured":"Oak Ridge National Laboratory. ORNL Launches Summit Supercomputer. https:\/\/www.ornl.gov\/news\/ornl-launches-summit-supercomputer , 2018 . [Online; accessed 08-April-2021]. Oak Ridge National Laboratory. ORNL Launches Summit Supercomputer. https:\/\/www.ornl.gov\/news\/ornl-launches-summit-supercomputer, 2018. [Online; accessed 08-April-2021]."}],"event":{"name":"SC '21: The International Conference for High Performance Computing, Networking, Storage and Analysis","location":"St. Louis Missouri","acronym":"SC '21","sponsor":["SIGHPC ACM Special Interest Group on High Performance Computing, Special Interest Group on High Performance Computing","IEEE CS"]},"container-title":["Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3458817.3476205","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3458817.3476205","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:12:21Z","timestamp":1750191141000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3458817.3476205"}},"subtitle":["breaking the GPU memory wall for extreme scale deep learning"],"short-title":[],"issued":{"date-parts":[[2021,11,13]]},"references-count":45,"alternative-id":["10.1145\/3458817.3476205","10.1145\/3458817"],"URL":"https:\/\/doi.org\/10.1145\/3458817.3476205","relation":{},"subject":[],"published":{"date-parts":[[2021,11,13]]},"assertion":[{"value":"2021-11-13","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}