{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,26]],"date-time":"2026-03-26T16:02:52Z","timestamp":1774540972610,"version":"3.50.1"},"reference-count":63,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2023,8]]},"abstract":"<jats:p>\n            Recent years have witnessed the unprecedented achievements of large-scale pre-trained models, especially Transformer models. Many products and services in Tencent Inc., such as WeChat, QQ, and Tencent Advertisement, have been opted in to gain the power of pre-trained models. In this work, we present Angel-PTM, a productive deep learning system designed for pre-training and fine-tuning Transformer models. Angel-PTM can train extremely large-scale models with hierarchical memory efficiently. The key designs of Angel-PTM are a fine-grained memory management via the\n            <jats:italic toggle=\"yes\">Page<\/jats:italic>\n            abstraction and a unified scheduling method that coordinates computations, data movements, and communications. Furthermore, Angel-PTM supports extreme model scaling with SSD storage and implements a lock-free updating mechanism to address the SSD I\/O bottlenecks. Experimental results demonstrate that Angel-PTM outperforms existing systems by up to 114.8% in terms of maximum model scale as well as up to 88.9% in terms of training throughput. Additionally, experiments on GPT3-175B and T5-MoE-1.2T models utilizing hundreds of GPUs verify our strong scalability.\n          <\/jats:p>","DOI":"10.14778\/3611540.3611564","type":"journal-article","created":{"date-parts":[[2023,9,15]],"date-time":"2023-09-15T11:32:37Z","timestamp":1694777557000},"page":"3781-3794","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":8,"title":["Angel-PTM: A Scalable and Economical Large-Scale Pre-Training System in Tencent"],"prefix":"10.14778","volume":"16","author":[{"given":"Xiaonan","family":"Nie","sequence":"first","affiliation":[{"name":"Peking University"}]},{"given":"Yi","family":"Liu","sequence":"additional","affiliation":[{"name":"Tencent Inc."}]},{"given":"Fangcheng","family":"Fu","sequence":"additional","affiliation":[{"name":"Peking University"}]},{"given":"Jinbao","family":"Xue","sequence":"additional","affiliation":[{"name":"Tencent Inc."}]},{"given":"Dian","family":"Jiao","sequence":"additional","affiliation":[{"name":"Tencent Inc."}]},{"given":"Xupeng","family":"Miao","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University"}]},{"given":"Yangyu","family":"Tao","sequence":"additional","affiliation":[{"name":"Tencent Inc."}]},{"given":"Bin","family":"Cui","sequence":"additional","affiliation":[{"name":"Peking University"}]}],"member":"320","published-online":{"date-parts":[[2023,8]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2022. ChatGPT: Optimizing Language Models for Dialogue. https:\/\/openai.com\/blog\/chatgpt\/."},{"key":"e_1_2_1_2_1","unstructured":"2022. CLUE Benchmark. https:\/\/www.cluebenchmarks.com\/rank.html."},{"key":"e_1_2_1_3_1","unstructured":"2022. Overall DGX A100 System Architecture. https:\/\/www.microway.com\/hpc-tech-tips\/dgx-a100-review-throughput-and-hardware-summary\/."},{"key":"e_1_2_1_4_1","volume-title":"TensorFlow: A System for Large-Scale Machine Learning. 
In USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, 265--283","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, 265--283."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3123975"},{"key":"e_1_2_1_6_1","volume-title":"Jamie Ryan Kiros, and Geoffrey E. Hinton","author":"Ba Lei Jimmy","year":"2016","unstructured":"Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. arXiv preprint arXiv:1607.06450 (2016)."},{"key":"e_1_2_1_7_1","unstructured":"Rishi Bommasani Drew A Hudson Ehsan Adeli Russ Altman Simran Arora Sydney von Arx Michael S Bernstein Jeannette Bohg Antoine Bosselut Emma Brunskill et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)."},{"key":"e_1_2_1_8_1","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877--1901."},{"key":"e_1_2_1_9_1","volume-title":"Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174","author":"Chen Tianqi","year":"2016","unstructured":"Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016)."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-021-00156-2"},{"key":"e_1_2_1_11_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In The North American","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In The North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 4171--4186."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8462506"},{"key":"e_1_2_1_13_1","volume-title":"International Conference on Learning Representations. OpenReview.net.","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations. OpenReview.net."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2022.3219819"},{"key":"e_1_2_1_15_1","first-page":"1","article-title":"Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity","volume":"23","author":"Fedus William","year":"2022","unstructured":"William Fedus, Barret Zoph, and Noam Shazeer. 2022. 
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Journal of Machine Learning Research 23, 120 (2022), 1--39.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_2_1_16_1","volume-title":"International Conference on Machine Learning","volume":"119","author":"Fu Fangcheng","year":"2020","unstructured":"Fangcheng Fu, Yuzheng Hu, Yihan He, Jiawei Jiang, Yingxia Shao, Ce Zhang, and Bin Cui. 2020. Don't Waste Your Bits! Squeeze Activations and Gradients for Deep Neural Networks via TinyScript. In International Conference on Machine Learning, Vol. 119. PMLR, 3304--3314."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.14778\/3547305.3547316"},{"key":"e_1_2_1_18_1","volume-title":"International Conference on Management of Data. ACM, 563--576","author":"Fu Fangcheng","year":"2021","unstructured":"Fangcheng Fu, Yingxia Shao, Lele Yu, Jiawei Jiang, Huanran Xue, Yangyu Tao, and Bin Cui. 2021. VF2Boost: Very Fast Vertical Federated Gradient Boosting for Cross-Enterprise Learning. In International Conference on Management of Data. ACM, 563--576."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11390-021-1350-8"},{"key":"e_1_2_1_20_1","unstructured":"Amir Gholami Zhewei Yao Sehoon Kim Michael W Mahoney and Kurt Keutzer. 2021. AI and Memory Wall."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_22_1","volume-title":"Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415","author":"Hendrycks Dan","year":"2016","unstructured":"Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415 (2016)."},{"key":"e_1_2_1_23_1","volume-title":"Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32 (2019), 103--112."},{"key":"e_1_2_1_24_1","volume-title":"GPU Technology Conference (GTC)","volume":"2","author":"Jeaugey Sylvain","year":"2017","unstructured":"Sylvain Jeaugey. 2017. Nccl 2.0. In GPU Technology Conference (GTC), Vol. 2."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-021-3367-y"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3196894"},{"key":"e_1_2_1_27_1","volume-title":"OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning. arXiv preprint arXiv:2209.13258","author":"Jiang Youhe","year":"2023","unstructured":"Youhe Jiang, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, and Bin Cui. 2023. OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning. arXiv preprint arXiv:2209.13258 (2023)."},{"key":"e_1_2_1_28_1","volume-title":"Scaling laws for neural language models. arXiv preprint arXiv:2001.08361","author":"Kaplan Jared","year":"2020","unstructured":"Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)."},{"key":"e_1_2_1_29_1","volume-title":"Adam: A Method for Stochastic Optimization. 
In International Conference on Learning Representations.","author":"Diederik","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations."},{"key":"e_1_2_1_30_1","volume-title":"GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. In International Conference on Learning Representations.","author":"Lepikhin Dmitry","year":"2021","unstructured":"Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2021. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. In International Conference on Learning Representations."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2640087.2644155"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3452773"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-022-3581-9"},{"key":"e_1_2_1_35_1","volume-title":"Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism. arXiv preprint arXiv:2211.13878","author":"Miao Xupeng","year":"2022","unstructured":"Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, and Bin Cui. 2022. Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism. arXiv preprint arXiv:2211.13878 (2022)."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.14778\/3489496.3489511"},{"key":"e_1_2_1_37_1","volume-title":"Mixed Precision Training. In International Conference on Learning Representations. OpenReview.net.","author":"Micikevicius Paulius","year":"2018","unstructured":"Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed Precision Training. In International Conference on Learning Representations. OpenReview.net."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476209"},{"key":"e_1_2_1_39_1","volume-title":"Evomoe: An evolutional mixture-of-experts training framework via dense-to-sparse gate. arXiv preprint arXiv:2112.14397","author":"Nie Xiaonan","year":"2021","unstructured":"Xiaonan Nie, Xupeng Miao, Shijie Cao, Lingxiao Ma, Qibin Liu, Jilong Xue, Youshan Miao, Yi Liu, Zhi Yang, and Bin Cui. 2021. Evomoe: An evolutional mixture-of-experts training framework via dense-to-sparse gate. arXiv preprint arXiv:2112.14397 (2021)."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588964"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE53745.2022.00241"},{"key":"e_1_2_1_42_1","unstructured":"Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L Wainwright Pamela Mishkin Chong Zhang Sandhini Agarwal Katarina Slama Alex Ray et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155 (2022)."},{"key":"e_1_2_1_43_1","volume-title":"Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. 
Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019)."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1007\/s41019-021-00155-3"},{"key":"e_1_2_1_45_1","volume-title":"Language Models are Unsupervised Multitask Learners. OpenAI blog 1, 8","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. OpenAI blog 1, 8 (2019), 9."},{"key":"e_1_2_1_46_1","first-page":"1","article-title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","volume":"21","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21 (2020), 1--67.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_2_1_47_1","volume-title":"International Conference on Machine Learning. PMLR","author":"Rajbhandari Samyam","year":"2022","unstructured":"Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022. DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. In International Conference on Machine Learning. PMLR, 18332--18346."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00024"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476205"},{"key":"e_1_2_1_50_1","volume-title":"USENIX Annual Technical Conference. USENIX Association, 551--564","author":"Ren Jie","year":"2021","unstructured":"Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training.. In USENIX Annual Technical Conference. USENIX Association, 551--564."},{"key":"e_1_2_1_51_1","first-page":"8583","article-title":"Scaling vision with sparse mixture of experts","volume":"34","author":"Riquelme Carlos","year":"2021","unstructured":"Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, Andr\u00e9 Susano Pinto, Daniel Keysers, and Neil Houlsby. 2021. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems 34 (2021), 8583--8595.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_2_1_53_1","unstructured":"Shaden Smith Mostofa Patwary Brandon Norick Patrick LeGresley Samyam Rajbhandari Jared Casper Zhun Liu Shrimai Prabhumoye George Zerveas Vijay Korthikanti et al. 2022. Using deepspeed and megatron to train megatron-turing nlg 530b a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022)."},{"key":"e_1_2_1_54_1","volume-title":"Attention is all you need. Advances in neural information processing systems 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. 
Advances in neural information processing systems 30 (2017)."},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-022-3557-9"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/GreenCom-CPSCom.2010.102"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732967.2732976"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178487.3178491"},{"key":"e_1_2_1_59_1","volume-title":"Improving Automatic Parallel Training via Balanced Memory Workload Optimization. arXiv preprint arXiv:2307.02031","author":"Wang Yujie","year":"2023","unstructured":"Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Xiaonan Nie, and Bin Cui. 2023. Improving Automatic Parallel Training via Balanced Memory Workload Optimization. arXiv preprint arXiv:2307.02031 (2023)."},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9054345"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11390-022-2140-7"},{"key":"e_1_2_1_62_1","volume-title":"Yao Shu, Bingsheng He, and Wei Wang.","author":"Zhang Junzhe","year":"2019","unstructured":"Junzhe Zhang, Sai Ho Yeung, Yao Shu, Bingsheng He, and Wei Wang. 2019. Efficient memory management for gpu-based deep learning systems. arXiv preprint arXiv:1903.06631 (2019)."},{"key":"e_1_2_1_63_1","volume-title":"USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, 559--578","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating Inter-and {Intra-Operator} Parallelism for Distributed Deep Learning. In USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, 559--578."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3611540.3611564","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,10]],"date-time":"2025-09-10T22:34:32Z","timestamp":1757543672000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3611540.3611564"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,8]]},"references-count":63,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2023,8]]}},"alternative-id":["10.14778\/3611540.3611564"],"URL":"https:\/\/doi.org\/10.14778\/3611540.3611564","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2023,8]]},"assertion":[{"value":"2023-08-01","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
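The abstract's first key design is fine-grained memory management via a Page abstraction over a hierarchical GPU/CPU/SSD memory. The paper's actual data structures are not reproduced in this record, so the following is a minimal, hypothetical Python sketch of the idea only: fixed-size pages that migrate down the tiers as each tier fills. All names (Page, PageManager, PAGE_BYTES), the page size, and the oldest-first eviction policy are assumptions for illustration, and the SSD tier is simulated with a temp directory so the sketch runs anywhere.

```python
# Hypothetical sketch of a page-based memory manager in the spirit of
# Angel-PTM's Page abstraction as described in the abstract; not the
# paper's actual design or API. Tiers are simulated with dicts and a
# temp directory standing in for GPU memory, host memory, and SSD.
import os
import tempfile

PAGE_BYTES = 4 * 1024 * 1024  # fixed page size; the value is an assumption


class Page:
    """A fixed-size unit of tensor storage tracked across memory tiers."""

    def __init__(self, page_id: int):
        self.page_id = page_id
        self.tier = "gpu"          # one of: "gpu", "cpu", "ssd"
        self.data = bytearray(PAGE_BYTES)


class PageManager:
    """Moves pages down the hierarchy (gpu -> cpu -> ssd) under pressure."""

    def __init__(self, gpu_capacity_pages: int, cpu_capacity_pages: int):
        self.gpu_capacity = gpu_capacity_pages
        self.cpu_capacity = cpu_capacity_pages
        self.gpu, self.cpu = {}, {}
        self.ssd_dir = tempfile.mkdtemp(prefix="pages_")
        self.next_id = 0

    def allocate(self) -> Page:
        page = Page(self.next_id)
        self.next_id += 1
        if len(self.gpu) >= self.gpu_capacity:
            self._evict_gpu()
        self.gpu[page.page_id] = page
        return page

    def _evict_gpu(self):
        # Evict the oldest GPU page to CPU; spill CPU to SSD if CPU is full.
        victim_id, victim = next(iter(self.gpu.items()))
        del self.gpu[victim_id]
        if len(self.cpu) >= self.cpu_capacity:
            self._spill_cpu_to_ssd()
        victim.tier = "cpu"
        self.cpu[victim_id] = victim

    def _spill_cpu_to_ssd(self):
        victim_id, victim = next(iter(self.cpu.items()))
        del self.cpu[victim_id]
        path = os.path.join(self.ssd_dir, f"{victim_id}.page")
        with open(path, "wb") as f:
            f.write(victim.data)
        victim.tier = "ssd"
        victim.data = bytearray(0)  # drop the in-memory copy


if __name__ == "__main__":
    mgr = PageManager(gpu_capacity_pages=2, cpu_capacity_pages=2)
    pages = [mgr.allocate() for _ in range(5)]
    print([p.tier for p in pages])  # oldest pages have migrated down-tier
```

A fixed page size is what would make the bookkeeping fine-grained: tensors of any shape map onto uniform units that can be tracked, moved, and scheduled independently of whole tensors.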
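The abstract also credits a lock-free updating mechanism with addressing the SSD I/O bottleneck. Again purely as a hypothetical illustration of the general idea, the essence can be sketched as double buffering: the updater mutates one copy of a parameter while a writer thread flushes the other, so no lock serializes updates against I/O. The class and method names below (DoubleBufferedParam, flush_async) are invented for this sketch and are not the paper's API.

```python
# Hypothetical sketch of overlapping parameter updates with SSD write-back,
# in the spirit of the lock-free updating the abstract describes; not the
# paper's code. Double buffering lets step t+1 update one buffer while
# step t's result is still flushing, so the updater never blocks on a
# lock around the copy being written.
import os
import tempfile
import threading


class DoubleBufferedParam:
    def __init__(self, size: int, path: str):
        self.buffers = [bytearray(size), bytearray(size)]
        self.active = 0            # buffer the updater currently writes to
        self.path = path
        self.flusher = None        # in-flight background write, if any

    def update(self, delta: int):
        buf = self.buffers[self.active]
        for i in range(len(buf)):  # stand-in for an optimizer step
            buf[i] = (buf[i] + delta) % 256

    def flush_async(self):
        # Allow one outstanding write at a time: wait for the previous
        # flush, hand the just-updated buffer to a writer thread, then
        # keep updating the other buffer -- no lock guards the flushed copy.
        if self.flusher is not None:
            self.flusher.join()
        snapshot = self.active
        self.buffers[1 - snapshot][:] = self.buffers[snapshot]  # carry state
        self.flusher = threading.Thread(target=self._write, args=(snapshot,))
        self.flusher.start()
        self.active = 1 - snapshot

    def _write(self, idx: int):
        with open(self.path, "wb") as f:
            f.write(self.buffers[idx])


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "param.bin")
    p = DoubleBufferedParam(size=1024, path=path)
    for step in range(4):
        p.update(delta=1)
        p.flush_async()    # overlaps with the next step's update
    p.flusher.join()
    print("flushed", os.path.getsize(path), "bytes after 4 steps")
```

The design point is that the writer only ever touches the snapshot buffer and the updater only ever touches the active one, so the two threads never contend on the same bytes between flushes.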