{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,27]],"date-time":"2025-07-27T07:13:35Z","timestamp":1753600415842,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":16,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,8,4]],"date-time":"2023-08-04T00:00:00Z","timestamp":1691107200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,8,6]]},"DOI":"10.1145\/3580305.3599573","type":"proceedings-article","created":{"date-parts":[[2023,8,4]],"date-time":"2023-08-04T18:10:58Z","timestamp":1691172658000},"page":"5821-5822","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["Training Large-scale Foundation Models on Emerging AI Chips"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8657-0439","authenticated-orcid":false,"given":"Aashiq","family":"Muhamed","sequence":"first","affiliation":[{"name":"AWS AI Labs, Santa Clara, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0701-5868","authenticated-orcid":false,"given":"Christian","family":"Bock","sequence":"additional","affiliation":[{"name":"AWS AI Labs, Munich, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-8176-4594","authenticated-orcid":false,"given":"Rahul","family":"Solanki","sequence":"additional","affiliation":[{"name":"AWS Neuron, Cupertino, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0970-9214","authenticated-orcid":false,"given":"Youngsuk","family":"Park","sequence":"additional","affiliation":[{"name":"AWS AI Labs, Santa Clara, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8165-840X","authenticated-orcid":false,"given":"Yida","family":"Wang","sequence":"additional","affiliation":[{"name":"AWS AIRE, Santa Clara, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7020-1604","authenticated-orcid":false,"given":"Jun","family":"Huan","sequence":"additional","affiliation":[{"name":"AWS AI Labs, Santa Clara, CA, USA"}]}],"member":"320","published-online":{"date-parts":[[2023,8,4]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020","author":"Tom","year":"2020","unstructured":"Tom B. Brown et al. 2020. Language models are few-shot learners . In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020 , NeurIPS 2020 , December 6-12, 2020, virtual. Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors. https:\/\/proceedings.neurips.cc\/paper\/2020\/hash\/1 457c0d6bfcb4967418bfb8ac142f64a-Abstract.html. Tom B. Brown et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors. 
https:\/\/proceedings.neurips.cc\/paper\/2020\/hash\/1 457c0d6bfcb4967418bfb8ac142f64a-Abstract.html."},{"key":"e_1_3_2_1_2_1","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019","volume":"1","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2019 . BERT: pre-training of deep bidirectional transformers for language understanding . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019 , Minneapolis, MN, USA , June 2-7, 2019, Volume 1 (Long and Short Papers). Jill Burstein, Christy Doran, and Thamar Solorio, editors. Association for Computational Linguistics, 4171--4186. doi: 10.18653\/v1\/n19-1423. 10.18653\/v1 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers). Jill Burstein, Christy Doran, and Thamar Solorio, editors. Association for Computational Linguistics, 4171--4186. doi: 10.18653\/v1\/n19-1423."},{"key":"e_1_3_2_1_3_1","unstructured":"Yanping Huang et al. 2019. Gpipe: efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32.  Yanping Huang et al. 2019. Gpipe: efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32."},{"key":"e_1_3_2_1_4_1","first-page":"1","article-title":"Beyond data and model parallelism for deep neural networks","volume":"1","author":"Jia Zhihao","year":"2019","unstructured":"Zhihao Jia , Matei Zaharia , and Alex Aiken . 2019 . Beyond data and model parallelism for deep neural networks . Proceedings of Machine Learning and Systems , 1 , 1 -- 13 . Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Beyond data and model parallelism for deep neural networks. Proceedings of Machine Learning and Systems, 1, 1--13.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_5_1","volume-title":"Roberta: A robustly optimized BERT pretraining approach. CoRR, abs\/1907.11692","author":"Yinhan Liu","year":"2019","unstructured":"Yinhan Liu et al. 2019 . Roberta: A robustly optimized BERT pretraining approach. CoRR, abs\/1907.11692 . http:\/\/arxiv.org\/abs\/1907.11692 arXiv: 1907.116 92. Yinhan Liu et al. 2019. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs\/1907.11692. http:\/\/arxiv.org\/abs\/1907.11692 arXiv: 1907.116 92."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1--15","author":"Deepak","key":"e_1_3_2_1_7_1","unstructured":"Deepak Narayanan et al. 2021. Efficient large-scale language model training on gpu clusters using megatron-lm . In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1--15 . Deepak Narayanan et al. 2021. Efficient large-scale language model training on gpu clusters using megatron-lm. 
{"key":"e_1_3_2_1_8_1","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. [n. d.]. Language models are unsupervised multitask learners."},{"key":"e_1_3_2_1_9_1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21, 140:1--140:67. http:\/\/jmlr.org\/papers\/v21\/20-074.html.","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00024"},{"key":"e_1_3_2_1_11_1","volume-title":"USENIX Annual Technical Conference, 551--564","author":"Ren Jie","year":"2021","unstructured":"Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: democratizing billion-scale model training. In USENIX Annual Technical Conference, 551--564."},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_3_2_1_13_1","unstructured":"Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR, abs\/1910.01108. http:\/\/arxiv.org\/abs\/1910.01108 arXiv:1910.01108."},{"key":"e_1_3_2_1_14_1","unstructured":"Yuanzhong Xu et al. 2021. GSPMD: general and scalable parallelization for ML computation graphs. arXiv preprint arXiv:2105.04663."},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"crossref","unstructured":"Zhen Zhang, Shuai Zheng, Yida Wang, Justin Chiu, George Karypis, Trishul Chilimbi, Mu Li, and Xin Jin. 2022. MiCS: near-linear scaling for training gigantic model on public cloud. arXiv preprint arXiv:2205.00119.","DOI":"10.14778\/3561261.3561265"},{"key":"e_1_3_2_1_16_1","unstructured":"Yanli Zhao et al. 2023. PyTorch FSDP: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277."}
],"event":{"name":"KDD '23: The 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining","sponsor":["SIGMOD ACM Special Interest Group on Management of Data","SIGKDD ACM Special Interest Group on Knowledge Discovery in Data"],"location":"Long Beach CA USA","acronym":"KDD '23"},"container-title":["Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3580305.3599573","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3580305.3599573","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:37:52Z","timestamp":1750178272000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3580305.3599573"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,8,4]]},"references-count":16,"alternative-id":["10.1145\/3580305.3599573","10.1145\/3580305"],"URL":"https:\/\/doi.org\/10.1145\/3580305.3599573","relation":{},"subject":[],"published":{"date-parts":[[2023,8,4]]},"assertion":[{"value":"2023-08-04","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
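
For context, the record above is the JSON envelope that the Crossref REST API returns for DOI 10.1145/3580305.3599573 (the KDD '23 paper "Training Large-scale Foundation Models on Emerging AI Chips"). A minimal sketch of fetching and reading such a record, assuming the public api.crossref.org endpoint and the third-party requests library; the script name and mailto address are placeholders, not real identifiers:

import requests

DOI = "10.1145/3580305.3599573"

# GET /works/{doi} returns the envelope seen above:
# {"status": "ok", "message-type": "work", ..., "message": {...}}
resp = requests.get(
    f"https://api.crossref.org/works/{DOI}",
    # Crossref asks for a mailto in the User-Agent to route requests to
    # its "polite" pool; you@example.org is a placeholder (assumption).
    headers={"User-Agent": "crossref-fetch/0.1 (mailto:you@example.org)"},
    timeout=30,
)
resp.raise_for_status()
work = resp.json()["message"]

print(work["title"][0])           # Training Large-scale Foundation Models on Emerging AI Chips
print(work["DOI"], work["page"])  # 10.1145/3580305.3599573 5821-5822
for a in work["author"]:          # authors with given/family names and affiliations
    print(a["given"], a["family"], "-", a["affiliation"][0]["name"])
print("cited by:", work["is-referenced-by-count"])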