{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,15]],"date-time":"2025-12-15T19:53:55Z","timestamp":1765828435379,"version":"3.41.0"},"reference-count":61,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2024,11,19]],"date-time":"2024-11-19T00:00:00Z","timestamp":1731974400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62025208"],"award-info":[{"award-number":["62025208"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Xiangjiang Laboratory Fund","award":["22XJ01012"],"award-info":[{"award-number":["22XJ01012"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2024,12,31]]},"abstract":"<jats:p>To accommodate the increasingly large-scale models within limited-capacity GPU memory, various coarse-grained techniques, such as recomputation and swapping, have been proposed to optimize memory usage. However, these methods have encountered limitations, either in terms of inefficient memory reduction or diminished training performance. In response to this, our article introduces dynamic tensor offloading and recomputation (DELTA), an innovative approach for memory-efficient large-scale model training that combines fine-grained memory optimization and prefetching technology to reduce memory usage while maintaining high training throughput concurrently. Initially, we formulate the problem of memory-throughput joint optimization as an easy-solving 0\/1 Knapsack problem. Leveraging this formalization, we use an improving polynomial complexity heuristic algorithm to address the problem effectively. 
Furthermore, we introduce, to the best of our knowledge, a novel bidirectional prefetching technology into dynamic memory management that significantly accelerates the model training when compared to relying solely on recomputation or swapping. Finally, DELTA offers users an automated training execution library, eliminating the need for manual configuration or specialized expertise. Experimental results demonstrate the effectiveness of DELTA in reducing GPU memory consumption. Compared to state-of-the-art methods, DELTA achieves substantial memory savings ranging from 40% to 72%, while maintaining comparable convergence performance for various models, including ResNet-50, ResNet-101, and BERT-Large. Notably, DELTA enables the training of GPT2-Large and GPT2-XL with batch sizes increased by 5.5\u00d7 and 6\u00d7, respectively, showcasing its versatility and practicality in enabling large-scale model training on GPU hardware.<\/jats:p>","DOI":"10.1145\/3689338","type":"journal-article","created":{"date-parts":[[2024,8,20]],"date-time":"2024-08-20T11:25:15Z","timestamp":1724153115000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["DELTA: Memory-Efficient Training via Dynamic Fine-Grained Recomputation and Swapping"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8595-1547","authenticated-orcid":false,"given":"Yu","family":"Tang","sequence":"first","affiliation":[{"name":"National University of Defense Technology, Changsha, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4579-4268","authenticated-orcid":false,"given":"Qiao","family":"Li","sequence":"additional","affiliation":[{"name":"Xiamen University, Xiamen, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-1494-2853","authenticated-orcid":false,"given":"Lujia","family":"Yin","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, Changsha, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9743-2034","authenticated-orcid":false,"given":"Dongsheng","family":"Li","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, Changsha, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6450-8485","authenticated-orcid":false,"given":"Yiming","family":"Zhang","sequence":"additional","affiliation":[{"name":"Xiamen University, Xiamen, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-8122-498X","authenticated-orcid":false,"given":"Chenyu","family":"Wang","sequence":"additional","affiliation":[{"name":"Sensetime, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-8525-0608","authenticated-orcid":false,"given":"Xingcheng","family":"Zhang","sequence":"additional","affiliation":[{"name":"Shanghai Artificial Intelligence Laboratory, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8285-2738","authenticated-orcid":false,"given":"Linbo","family":"Qiao","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, Changsha, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7518-1385","authenticated-orcid":false,"given":"Zhaoning","family":"Zhang","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, Changsha, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6378-7002","authenticated-orcid":false,"given":"Kai","family":"Lu","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, Changsha, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,11,19]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Amir Gholami Zhewei Yao Sehoon Kim Michael W. Mahoney and Kurt Keutzer. 2021. AI and memory wall. https:\/\/medium.com\/riselab\/ai-and-memory-wall-2cb4265cb0b8. Accessed March 29 2021."},{"key":"e_1_3_1_3_2","article-title":"Efficient combination of rematerialization and offloading for training DNNs","volume":"34","author":"Beaumont Olivier","year":"2021","unstructured":"Olivier Beaumont, Lionel Eyraud-Dubois, and Alena Shilova. 2021. Efficient combination of rematerialization and offloading for training DNNs. In Advances in Neural Information Processing Systems, Vol. 34. ACM, New York, NY.","journal-title":"Advances in Neural Information Processing Systems"},{"issue":"5","key":"e_1_3_1_4_2","doi-asserted-by":"crossref","first-page":"1518","DOI":"10.1109\/TPDS.2016.2616314","article-title":"The Unicorn Runtime: Efficient distributed shared memory programming for hybrid CPU-GPU clusters","volume":"28","author":"Beri Tarun","year":"2016","unstructured":"Tarun Beri, Sorav Bansal, and Subodh Kumar. 2016. The Unicorn Runtime: Efficient distributed shared memory programming for hybrid CPU-GPU clusters. IEEE Trans. Parallel Distrib. Syst. 28, 5 (2016), 1518\u20131534.","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"e_1_3_1_5_2","first-page":"1877","article-title":"Language models are few-shot learners","volume":"33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. 
Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et\u00a0al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, Vol. 33. ACM, New York, NY, 1877\u20131901.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00591"},{"key":"e_1_3_1_7_2","unstructured":"Cbc. 2023. COIN-OR Branch-and-Cut Solver. Retrieved February 5, 2024 from https:\/\/github.com\/coin-or\/Cbc"},{"key":"e_1_3_1_8_2","article-title":"Training deep nets with sublinear memory cost","author":"Chen Tianqi","year":"2016","unstructured":"Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv:1604.06174. Retrieved from https:\/\/arxiv.org\/abs\/1604.06174","journal-title":"arXiv:1604.06174"},{"key":"e_1_3_1_9_2","first-page":"625","volume-title":"Proceedings of the USENIX Annual Technical Conference (USENIX ATC\u201922)","author":"Choi Sangjin","year":"2022","unstructured":"Sangjin Choi, Taeksoo Kim, Jinwoo Jeong, Rachata Ausavarungnirun, Myeongjae Jeon, Youngjin Kwon, and Jeongseob Ahn. 2022. Memory harvesting in Multi-GPU systems with hierarchical unified virtual memory. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC\u201922). 625\u2013638."},{"key":"e_1_3_1_10_2","volume-title":"International Conference on Learning Representations","author":"Dosovitskiy Alexey","year":"2020","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et\u00a0al. 2020. An image is worth 16\u00d716 words: Transformers for image recognition at scale. 
In International Conference on Learning Representations."},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1016\/0377-2217(87)90165-2"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/3368089.3417050"},{"key":"e_1_3_1_13_2","unstructured":"gurobi. 2024. gurobi. Retrieved February 5 2024 from https:\/\/www.gurobi.com\/"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2022.3152247"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378530"},{"key":"e_1_3_1_17_2","first-page":"448","volume-title":"International Conference on Machine Learning","author":"Ioffe Sergey","year":"2015","unstructured":"Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning. PMLR, 448\u2013456."},{"key":"e_1_3_1_18_2","doi-asserted-by":"crossref","unstructured":"Hiroaki Ishii Toshihide Ibaraki and Hisashi Mine. 1977. Fractional knapsack problems. Mathematical Programming 13 (1977) 255\u2013271.","DOI":"10.1007\/BF01584342"},{"key":"e_1_3_1_19_2","first-page":"497","article-title":"Checkmate: Breaking the memory wall with optimal tensor rematerialization","volume":"2","author":"Jain Paras","year":"2020","unstructured":"Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica. 2020. Checkmate: Breaking the memory wall with optimal tensor rematerialization. In Proceedings of Machine Learning and Systems, Vol. 
2 (2020), 497\u2013511.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_1_20_2","first-page":"4171","volume-title":"Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT\u201919)","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT\u201919). 4171\u20134186."},{"key":"e_1_3_1_21_2","volume-title":"ICLR (Poster)","author":"Kingma Diederik P.","year":"2015","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR (Poster)."},{"key":"e_1_3_1_22_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Kirisame Marisa","year":"2020","unstructured":"Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, and Zachary Tatlock. 2020. Dynamic tensor rematerialization. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_1_23_2","first-page":"1097","article-title":"Imagenet classification with deep convolutional neural networks","volume":"25","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, Vol. 25. 
ACM, New York, NY, 1097\u20131105.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_24_2","article-title":"Efficient rematerialization for deep networks","volume":"32","author":"Kumar Ravi","year":"2019","unstructured":"Ravi Kumar, Manish Purohit, Zoya Svitkina, Erik Vee, and Joshua Wang. 2019. Efficient rematerialization for deep networks. In Advances in Neural Information Processing Systems, Vol. 32. ACM, New York, NY.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2023.3247001"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.2172\/1761619"},{"issue":"12","key":"e_1_3_1_27_2","article-title":"PyTorch distributed: Experiences on accelerating data parallel training","volume":"13","author":"Li Shen","unstructured":"Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et\u00a0al. [n.d.]. PyTorch distributed: Experiences on accelerating data parallel training. Proc. VLDB Endow. 13, 12 ([n.d.]).","journal-title":"Proc. VLDB Endow."},{"key":"e_1_3_1_28_2","volume-title":"International Conference on Learning Representations","author":"Lin Yujun","year":"2018","unstructured":"Yujun Lin, Song Han, Huizi Mao, Yu Wang, and Bill Dally. 2018. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations."},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_3_1_30_2","doi-asserted-by":"crossref","unstructured":"Brady D. Lund and Ting Wang. 2023. Chatting about ChatGPT: How may AI and GPT impact academia and libraries? 
Library Hi Tech News 40 3 (2023) 26\u201329.","DOI":"10.1108\/LHTN-01-2023-0009"},{"key":"e_1_3_1_31_2","doi-asserted-by":"crossref","first-page":"647","DOI":"10.1007\/978-3-540-68279-0_17","article-title":"Symmetry in integer linear programming","author":"Margot Fran\u00e7ois","year":"2010","unstructured":"Fran\u00e7ois Margot. 2010. Symmetry in integer linear programming. In 50 Years of Integer Programming 1958\u20132008 (2010), 647\u2013686.","journal-title":"50 Years of Integer Programming 1958\u20132008"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0304-0208(08)73237-7"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_3_1_34_2","first-page":"2615","volume-title":"Proceedings of the IEEE 38th International Conference on Data Engineering (ICDE\u201922)","author":"Nie Xiaonan","year":"2022","unstructured":"Xiaonan Nie, Xupeng Miao, Zhi Yang, and Bin Cui. 2022. TSPLIT: Fine-grained GPU memory management for efficient DNN training via tensor splitting. In Proceedings of the IEEE 38th International Conference on Data Engineering (ICDE\u201922). IEEE, 2615\u20132628."},{"key":"e_1_3_1_35_2","unstructured":"Nvidia. 2020. cudnn 8.0. Retrieved February 5 2024 from https:\/\/developer.nvidia.com\/rdp\/cudnn-archive"},{"key":"e_1_3_1_36_2","unstructured":"Nvidia. 2022. CUDA Toolkit 11.2. Retrieved February 5 2024 from https:\/\/developer.nvidia.com\/cuda-toolkit"},{"key":"e_1_3_1_37_2","first-page":"8026","article-title":"Pytorch: An imperative style, high-performance deep learning library","volume":"32","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et\u00a0al. 2019. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, Vol. 
32, ACM, New York, NY, 8026\u20138037.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_38_2","first-page":"17573","volume-title":"Proceedings of the 39th International Conference on Machine Learning","author":"Patil Shishir G.","year":"2022","unstructured":"Shishir G. Patil, Paras Jain, Prabal Dutta, Ion Stoica, and Joseph Gonzalez. 2022. Poet: Training neural networks on tiny devices with integrated rematerialization and paging. In Proceedings of the 39th International Conference on Machine Learning. PMLR, 17573\u201317583."},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378505"},{"key":"e_1_3_1_40_2","article-title":"Training large neural networks with constant memory using a new execution algorithm","author":"Pudipeddi Bharadwaj","year":"2020","unstructured":"Bharadwaj Pudipeddi, Maral Mesmakhosroshahi, Jinwen Xi, and Sujeeth Bharadwaj. 2020. Training large neural networks with constant memory using a new execution algorithm. arXiv:2002.05645. Retrieved from https:\/\/arxiv.org\/abs\/2002.05645","journal-title":"arXiv:2002.05645"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-78295-7_2"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476205"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1264"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3406703"},{"key":"e_1_3_1_45_2","first-page":"551","volume-title":"Proceedings of the USENIX Annual Technical Conference (USENIX ATC\u201921)","author":"Ren Jie","year":"2021","unstructured":"Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale model training. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC\u201921). 
551\u2013564."},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2016.7783721"},{"issue":"2","key":"e_1_3_1_47_2","article-title":"Turing-NLG: A 17-billion-parameter language model by Microsoft","volume":"1","author":"Rosset Corby","year":"2020","unstructured":"Corby Rosset. 2020. Turing-NLG: A 17-billion-parameter language model by Microsoft. Microsoft Blog 1, 2 (2020).","journal-title":"Microsoft Blog"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/321864.321873"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1002\/nav.3800220110"},{"issue":"4","key":"e_1_3_1_50_2","first-page":"1875","article-title":"Nonparametric regression using deep neural networks with ReLU activation function","volume":"48","author":"Schmidt-Hieber Johannes","year":"2020","unstructured":"Johannes Schmidt-Hieber. 2020. Nonparametric regression using deep neural networks with ReLU activation function. Ann. Stat. 48, 4 (2020), 1875\u20131897.","journal-title":"Ann. Stat."},{"issue":"1","key":"e_1_3_1_51_2","first-page":"1","article-title":"Xengine: Optimal tensor rematerialization for neural networks in heterogeneous environments","volume":"20","author":"Schuler Manuela","year":"2022","unstructured":"Manuela Schuler, Richard Membarth, and Philipp Slusallek. 2022. Xengine: Optimal tensor rematerialization for neural networks in heterogeneous environments. ACM Trans. Arch. Code Optimiz. 20, 1 (2022), 1\u201325.","journal-title":"ACM Trans. Arch. Code Optimiz."},{"key":"e_1_3_1_52_2","article-title":"Megatron-lm: Training multi-billion parameter language models using model parallelism","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv:1909.08053. 
Retrieved from https:\/\/arxiv.org\/abs\/1909.08053","journal-title":"arXiv:1909.08053"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v31i1.11231"},{"key":"e_1_3_1_54_2","first-page":"5998","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. ACM, New York, NY, 5998\u20136008."},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1145\/3178487.3178491"},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/3506705"},{"key":"e_1_3_1_57_2","volume-title":"Proceedings of the 37th Conference on Neural Information Processing Systems","author":"Wang Yuzhong","year":"2023","unstructured":"Yuzhong Wang, Xu Han, Weilin Zhao, Guoyang Zeng, Zhiyuan Liu, and Maosong Sun. 2023. H3T: Efficient integration of memory optimization and parallelism for large-scale transformer training. In Proceedings of the 37th Conference on Neural Information Processing Systems."},{"key":"e_1_3_1_58_2","article-title":"H3T: Efficient integration of memory optimization and parallelism for large-scale transformer training","volume":"36","author":"Wang Yuzhong","year":"2024","unstructured":"Yuzhong Wang, Xu Han, Weilin Zhao, Guoyang Zeng, Zhiyuan Liu, and Maosong Sun. 2024. H3T: Efficient integration of memory optimization and parallelism for large-scale transformer training. In Advances in Neural Information Processing Systems, Vol. 36. ACM, New York, NY.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_59_2","article-title":"Fastensor: Optimise the tensor I\/O path from SSD to GPU for deep learning training","author":"Wei Jia","year":"2023","unstructured":"Jia Wei, Xingjun Zhang, Longxiang Wang, and Zheng Wei. 2023. 
Fastensor: Optimise the tensor I\/O path from SSD to GPU for deep learning training. ACM Trans. Arch. Code Optim. (2023).","journal-title":"ACM Trans. Arch. Code Optim."},{"key":"e_1_3_1_60_2","volume-title":"Proceedings of the 37th Conference on Neural Information Processing Systems","author":"Zhang Jianhao","year":"2023","unstructured":"Jianhao Zhang, Shihan Ma, Peihong Liu, and Jinhui Yuan. 2023. Coop: Memory is not a Commodity. In Proceedings of the 37th Conference on Neural Information Processing Systems."},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1145\/3477603"},{"key":"e_1_3_1_62_2","article-title":"Tbd: Benchmarking and analyzing deep neural network training","author":"Zhu Hongyu","year":"2018","unstructured":"Hongyu Zhu, Mohamed Akrout, Bojian Zheng, Andrew Pelegris, Amar Phanishayee, Bianca Schroeder, and Gennady Pekhimenko. 2018. Tbd: Benchmarking and analyzing deep neural network training. arXiv:1803.06905. Retrieved from https:\/\/arxiv.org\/abs\/1803.06905","journal-title":"arXiv:1803.06905"}],"container-title":["ACM Transactions on Architecture and Code 
Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3689338","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3689338","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:05:45Z","timestamp":1750291545000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3689338"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,19]]},"references-count":61,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,12,31]]}},"alternative-id":["10.1145\/3689338"],"URL":"https:\/\/doi.org\/10.1145\/3689338","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2024,11,19]]},"assertion":[{"value":"2024-02-05","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-08-06","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-11-19","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}