{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,15]],"date-time":"2025-12-15T19:53:57Z","timestamp":1765828437604,"version":"3.41.0"},"reference-count":36,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2025,3,20]],"date-time":"2025-03-20T00:00:00Z","timestamp":1742428800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2025,3,31]]},"abstract":"<jats:p>Due to the limited GPU memory, the performance of large DNNs training is constrained by the unscalable batch size. Existing studies partially address the issue of GPU memory limit through tensor recomputation and swapping, but overlook the exploration of optimal performance.<\/jats:p>\n          <jats:p>In response, we propose ATP, a recomputation and swapping based GPU memory management framework that aims to maximize training performance by breaking GPU memory constraints. ATP utilizes a throughput model and we propose to evaluate the theoretical peak performance achievable by DNN training on GPU, and provide the optimum memory size required for recomputation and swapping. We optimize the mechanisms for GPU memory pool and CUDA stream control, employ an optimization method to search for specific tensors requiring recomputation and swapping, thereby bringing the actual DNN training performance on ATP closer to theoretical values. Evaluations with different types of large DNN models indicate that ATP achieve throughput improvements ranging from 1.14\u223c 1.49\u00d7, while support model training exceeding the GPU memory limit by up to 9.2\u00d7.<\/jats:p>","DOI":"10.1145\/3701996","type":"journal-article","created":{"date-parts":[[2024,11,13]],"date-time":"2024-11-13T11:19:59Z","timestamp":1731496799000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5284-959X","authenticated-orcid":false,"given":"Weiduo","family":"Chen","sequence":"first","affiliation":[{"name":"School of Computer Science and Technology, Xi\u2019an Jiaotong University, Xi\u2019an, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9003-2625","authenticated-orcid":false,"given":"Xiaoshe","family":"Dong","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Xi\u2019an Jiaotong University, Xi\u2019an, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-0887-3328","authenticated-orcid":false,"given":"Fan","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Xi\u2019an Jiaotong University, Xi\u2019an, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-8630-2324","authenticated-orcid":false,"given":"Bowen","family":"Li","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Xi\u2019an Jiaotong University, Xi\u2019an, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6403-4259","authenticated-orcid":false,"given":"Yufei","family":"Wang","sequence":"additional","affiliation":[{"name":"Information and Telecommunications Branch, State Grid Shaanxi 
Electric Power Company, Xi\u2019an, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9179-6611","authenticated-orcid":false,"given":"Qiang","family":"Wang","sequence":"additional","affiliation":[{"name":"Xi\u2019an Jiaotong University, Xi\u2019an, China"}]}],"member":"320","published-online":{"date-parts":[[2025,3,20]]},"reference":[{"key":"e_1_3_2_2_2","first-page":"265","volume-title":"12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2\u20134, 2016","author":"Keeton Kimberly","unstructured":"Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et\u00a0al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2\u20134, 2016, Kimberly Keeton and Timothy Roscoe (Eds.). USENIX Association, 265\u2013283."},{"key":"e_1_3_2_3_2","article-title":"Layer normalization","author":"Ba Jimmy Lei","year":"2016","unstructured":"Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).","journal-title":"arXiv preprint arXiv:1607.06450"},{"key":"e_1_3_2_4_2","article-title":"Training deep nets with sublinear memory cost","volume":"1604","author":"Chen Tianqi","year":"2016","unstructured":"Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. CoRR abs\/1604.06174 (2016). arXiv:1604.06174http:\/\/arxiv.org\/abs\/1604.06174","journal-title":"CoRR"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-021-03759-8"},{"issue":"3","key":"e_1_3_2_6_2","doi-asserted-by":"crossref","first-page":"646","DOI":"10.1109\/TPDS.2018.2866582","article-title":"moDNN: Memory optimal deep neural network training on graphics processing units","volume":"30","author":"Chen Xiaoming","year":"2018","unstructured":"Xiaoming Chen, Danny Ziyi Chen, Yinhe Han, and Xiaobo Sharon Hu. 2018. moDNN: Memory optimal deep neural network training on graphics processing units. IEEE Transactions on Parallel and Distributed Systems 30, 3 (2018), 646\u2013661.","journal-title":"IEEE Transactions on Parallel and Distributed Systems"},{"key":"e_1_3_2_7_2","article-title":"A survey of model compression and acceleration for deep neural networks","author":"Cheng Yu","year":"2017","unstructured":"Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. 2017. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282 (2017).","journal-title":"arXiv preprint arXiv:1710.09282"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_9_2","first-page":"4171","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2\u20137, 2019, Volume 1 (Long and Short Papers)","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. 
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2\u20137, 2019, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, 4171\u20134186."},{"key":"e_1_3_2_10_2","first-page":"630","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV\u201916)","author":"He Kaiming","year":"2016","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision (ECCV\u201916). 630\u2013645."},{"issue":"3","key":"e_1_3_2_11_2","first-page":"826","article-title":"HOME: A holistic GPU memory management framework for deep learning","volume":"72","year":"2023","unstructured":"Shuibing He, Ping Chen, Shuaiben Chen, Zheng Li, Siling Yang, Weijian Chen, and Lidan Shou. 2023. HOME: A holistic GPU memory management framework for deep learning. IEEE Trans. Computers 72, 3 (2023), 826\u2013838.","journal-title":"IEEE Trans. Computers"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378530"},{"key":"e_1_3_2_14_2","volume-title":"Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2\u20134, 2020","author":"Dhillon Inderjit S.","year":"2020","unstructured":"Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, and Joseph Gonzalez. 2020. Checkmate: Breaking the memory wall with optimal tensor rematerialization. In Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2\u20134, 2020, Inderjit S. Dhillon, Dimitris S. Papailiopoulos, and Vivienne Sze (Eds.). mlsys.org. https:\/\/proceedings.mlsys.org\/book\/320.pdf"},{"key":"e_1_3_2_15_2","doi-asserted-by":"crossref","first-page":"675","DOI":"10.1145\/2647868.2654889","volume-title":"Proceedings of the 22nd ACM International Conference on Multimedia","year":"2014","unstructured":"Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross B. Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 675\u2013678."},{"key":"e_1_3_2_16_2","first-page":"65","volume-title":"Proceedings of the 11th ACM SIGPLAN\/SIGOPS International Conference on Virtual Execution Environments, Istanbul, Turkey, March 14\u201315, 2015","author":"Kehne Jens","year":"2015","unstructured":"Jens Kehne, Jonathan Metter, and Frank Bellosa. 2015. GPUswap: Enabling oversubscription of GPU memory through transparent swapping. In Proceedings of the 11th ACM SIGPLAN\/SIGOPS International Conference on Virtual Execution Environments, Istanbul, Turkey, March 14\u201315, 2015, Ada Gavrilovska, Angela Demke Brown, and Bjarne Steensgaard (Eds.). ACM, 65\u201377."},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2023.3274957"},{"key":"e_1_3_2_18_2","first-page":"1724","volume-title":"Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP\u201914)","year":"2014","unstructured":"Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. 
Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP\u201914). Association for Computational Linguistics, 1724\u20131734."},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-019-01247-4"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2022.12.004"},{"issue":"1","key":"e_1_3_2_21_2","doi-asserted-by":"crossref","first-page":"11","DOI":"10.1007\/s10723-023-09646-1","article-title":"An intelligent framework for oversubscription management in CPU-GPU unified memory","volume":"21","author":"Long Xinjian","year":"2023","unstructured":"Xinjian Long, Xiangyang Gong, Bo Zhang, and Huiyang Zhou. 2023. An intelligent framework for oversubscription management in CPU-GPU unified memory. J. Grid Comput. 21, 1 (2023), 11.","journal-title":"J. Grid Comput."},{"volume-title":"6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30\u2013May 3, 2018, Conference Track Proceedings","year":"2018","key":"e_1_3_2_22_2","unstructured":"Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory F. Diamos, Erich Elsen, David Garc\u00eda, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed precision training. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30\u2013May 3, 2018, Conference Track Proceedings. OpenReview.net. https:\/\/openreview.net\/forum?id=r1gs9JgRZ"},{"key":"e_1_3_2_23_2","volume-title":"CUDA Toolkit","author":"Corporation NVIDIA","year":"2019","unstructured":"NVIDIA Corporation. 2019. CUDA Toolkit. NVIDIA Corporation. https:\/\/developer.nvidia.com\/cuda-toolkit"},{"key":"e_1_3_2_24_2","volume-title":"cuDNN: CUDA Deep Neural Network library","author":"Corporation NVIDIA","year":"2020","unstructured":"NVIDIA Corporation. 2020. cuDNN: CUDA Deep Neural Network library. NVIDIA Corporation. https:\/\/developer.nvidia.com\/cudnn"},{"first-page":"304","volume-title":"IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2019, Madison, WI, USA, March 24\u201326, 2019","key":"e_1_3_2_25_2","unstructured":"Angshuman Parashar, Priyanka Raina, Yakun Sophia Shao, Yu-Hsin Chen, Victor A. Ying, Anurag Mukkara, Rangharajan Venkatesan, Brucek Khailany, Stephen W. Keckler, and Joel S. Emer. 2019. Timeloop: A systematic approach to DNN accelerator evaluation. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2019, Madison, WI, USA, March 24\u201326, 2019. IEEE, 304\u2013315."},{"key":"e_1_3_2_26_2","first-page":"8024","volume-title":"Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8\u201314, 2019, Vancouver, BC, Canada","author":"Wallach Hanna M.","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et\u00a0al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8\u201314, 2019, Vancouver, BC, Canada, Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d\u2019Alch\u00e9-Buc, Emily B. Fox, and Roman Garnett (Eds.). 
8024\u20138035."},{"key":"e_1_3_2_27_2","doi-asserted-by":"crossref","first-page":"891","DOI":"10.1145\/3373376.3378505","volume-title":"Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems","year":"2020","unstructured":"Xuan Peng, Xuanhua Shi, Hulin Dai, Hai Jin, Weiliang Ma, Qian Xiong, Fan Yang, and Xuehai Qian. 2020. Capuchin: Tensor-based GPU memory management for deep learning. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 891\u2013905."},{"key":"e_1_3_2_28_2","article-title":"Improving language understanding by generative pre-training","author":"Radford Alec","year":"2018","unstructured":"Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI Technical Report (2018).","journal-title":"OpenAI Technical Report"},{"key":"e_1_3_2_29_2","first-page":"1","volume-title":"2016 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916)","year":"2016","unstructured":"Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W. Keckler. 2016. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. In 2016 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916). IEEE, 1\u201313."},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1038\/323533a0"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1007\/s13735-021-00218-1"},{"key":"e_1_3_2_32_2","first-page":"3168","volume-title":"39th IEEE International Conference on Data Engineering, ICDE 2023, Anaheim, CA, USA, April 3\u20137, 2023","year":"2023","unstructured":"Luming Sun, Shijin Gong, Tieying Zhang, Fuxin Jiang, Zhibing Zhao, Jianjun Chen, and Xinyu Zhang. 2023. SUFS: A generic storage usage forecasting service through adaptive ensemble learning. In 39th IEEE International Conference on Data Engineering, ICDE 2023, Anaheim, CA, USA, April 3\u20137, 2023. IEEE, 3168\u20133181."},{"key":"e_1_3_2_33_2","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1145\/3178487.3178491","volume-title":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","year":"2018","unstructured":"Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, and Tim Kraska. 2018. SuperNeurons: Dynamic GPU memory management for training deep neural networks. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 41\u201353."},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/1498765.1498785"},{"key":"e_1_3_2_35_2","first-page":"1593","volume-title":"Proceedings of the Symposium on Applied Computing, SAC 2017, Marrakech, Morocco, April 3\u20137, 2017","author":"Wrede Fabian","year":"2017","unstructured":"Fabian Wrede and Vincent von Hof. 2017. Enabling efficient use of algorithmic skeletons in cloud environments: Container-based virtualization for hybrid CPU-GPU execution of data-parallel skeletons. In Proceedings of the Symposium on Applied Computing, SAC 2017, Marrakech, Morocco, April 3\u20137, 2017, Ahmed Seffah, Birgit Penzenstadler, Carina Alves, and Xin Peng (Eds.). 
ACM, 1593\u20131596."},{"key":"e_1_3_2_36_2","first-page":"1222","volume-title":"SAC\u201920: The 35th ACM\/SIGAPP Symposium on Applied Computing, March 30\u2013April 3, 2020","author":"Yang Su-Wei","year":"2020","unstructured":"Su-Wei Yang, Zhao-Wei Qiu, and Ya-Shu Chen. 2020. GPU swap-aware scheduler: Virtual memory management for GPU applications. In SAC\u201920: The 35th ACM\/SIGAPP Symposium on Applied Computing, March 30\u2013April 3, 2020. ACM, 1222\u20131227."},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2022.3221426"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3701996","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3701996","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:57:16Z","timestamp":1750298236000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3701996"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,20]]},"references-count":36,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,3,31]]}},"alternative-id":["10.1145\/3701996"],"URL":"https:\/\/doi.org\/10.1145\/3701996","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2025,3,20]]},"assertion":[{"value":"2024-02-20","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-09-04","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-03-20","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
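
For reference, the record above has the exact shape of a work response from the public Crossref REST API. Below is a minimal Python sketch of how an equivalent payload can be fetched and inspected; this is an illustrative addition, not part of the deposited record. The endpoint (https://api.crossref.org/works/{doi}) and the envelope fields checked here are taken from the record itself; everything else is an assumption.

import json
import urllib.request

# DOI taken from the "DOI" field of the record above.
DOI = "10.1145/3701996"
url = f"https://api.crossref.org/works/{DOI}"

# Fetch the JSON payload; Crossref returns the same envelope seen above.
with urllib.request.urlopen(url) as resp:
    record = json.load(resp)

# Sanity-check the envelope: {"status": "ok", "message-type": "work", "message": {...}}.
assert record["status"] == "ok" and record["message-type"] == "work"

msg = record["message"]
print(msg["title"][0])                # article title (Crossref stores titles as lists)
print(msg["container-title"][0])      # journal name
print("DOI:", msg["DOI"], "| references:", msg["references-count"])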