{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,12]],"date-time":"2026-02-12T17:38:26Z","timestamp":1770917906843,"version":"3.50.1"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2023,12,14]],"date-time":"2023-12-14T00:00:00Z","timestamp":1702512000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62172327"],"award-info":[{"award-number":["62172327"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2023,12,31]]},"abstract":"<jats:p>In recent years, benefiting from the increase in model size and complexity, deep learning has achieved tremendous success in computer vision (CV) and natural language processing (NLP). Training deep learning models using accelerators such as GPUs often requires much iterative data to be transferred from NVMe SSD to GPU memory. Much recent work has focused on data transfer during the pre-processing phase and has introduced techniques such as multiprocessing and GPU Direct Storage (GDS) to accelerate it. However, tensor data during training (such as checkpoints, logs, and intermediate feature maps), which is also time-consuming, is often transferred using traditional serial, long-I\/O-path transfer methods.<\/jats:p>\n          <jats:p>In this article, based on GDS technology, we built Fastensor, an efficient tool for tensor data transfer between the NVMe SSDs and GPUs. To achieve higher tensor data I\/O throughput, we optimized the traditional data I\/O process. We also proposed a data and runtime context-aware tensor I\/O algorithm. 
Fastensor can select the most suitable data transfer tool for the current tensor from a candidate set of tools during model training. The optimal tool is derived from a dictionary generated by our adaptive exploration algorithm in the first few training iterations. We used Fastensor\u2019s unified interface to test the read\/write bandwidth and energy consumption of different transfer tools for different sizes of tensor blocks. We found that the execution efficiency of different tensor transfer tools is related to both the tensor block size and the runtime context.<\/jats:p>\n          <jats:p>\n            We then deployed Fastensor in the widely applicable PyTorch deep learning framework. We showed that Fastensor could achieve superior performance in typical scenarios of model parameter saving and intermediate feature map transfer with the same hardware configuration. Fastensor achieves a 5.37x read performance improvement compared to\n            <jats:italic>torch.save<\/jats:italic>\n            () when used for model parameter saving. 
When used for intermediate feature map transfer, Fastensor can increase the supported training batch size by 20x, while the total read and write speed is increased by 2.96x compared to the torch I\/O API.\n          <\/jats:p>","DOI":"10.1145\/3630108","type":"journal-article","created":{"date-parts":[[2023,10,25]],"date-time":"2023-10-25T21:37:02Z","timestamp":1698269822000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["Fastensor: Optimise the Tensor I\/O Path from SSD to GPU for Deep Learning Training"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2234-0378","authenticated-orcid":false,"given":"Jia","family":"Wei","sequence":"first","affiliation":[{"name":"Xi\u2019an Jiaotong University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1434-7016","authenticated-orcid":false,"given":"Xingjun","family":"Zhang","sequence":"additional","affiliation":[{"name":"Xi\u2019an Jiaotong University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2005-114X","authenticated-orcid":false,"given":"Longxiang","family":"Wang","sequence":"additional","affiliation":[{"name":"Xi\u2019an Jiaotong University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2293-5427","authenticated-orcid":false,"given":"Zheng","family":"Wei","sequence":"additional","affiliation":[{"name":"Xi\u2019an Jiaotong University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,12,14]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2019.00030"},{"key":"e_1_3_1_3_2","first-page":"387","volume-title":"Proceedings of the 19th USENIX Conference on File and Storage Technologies","author":"Bae 
Jonghyun","year":"2021","unstructured":"Jonghyun Bae, Jongsung Lee, Yunho Jin, Sam Son, Shine Kim, Hakbeom Jang, Tae Jun Ham, and Jae W. Lee. 2021. FlashNeuron: SSD-enabled large-batch training of very deep neural networks. In Proceedings of the 19th USENIX Conference on File and Storage Technologies, Marcos K. Aguilera and Gala Yadgar (Eds.). USENIX Association, 387\u2013401."},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/3309987"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/Cluster48925.2021.00019"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2018.2866582"},{"key":"e_1_3_1_7_2","doi-asserted-by":"crossref","unstructured":"Yiming Cui Wanxiang Che Ting Liu Bing Qin and Ziqing Yang. 2021. Pre-training with whole word masking for chinese bert. IEEE\/ACM Transactions on Audio Speech and Language Processing 29 (2021) 3504\u20133514.","DOI":"10.1109\/TASLP.2021.3124365"},{"key":"e_1_3_1_8_2","volume-title":"Proceedings of the 9th International Conference on Learning Representations","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations. OpenReview.net. Retrieved from https:\/\/openreview.net\/forum?id=YicbFdNTTy"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/TFUZZ.2020.3048577"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2022.117904"},{"key":"e_1_3_1_11_2","unstructured":"Thomas Elsken Jan Hendrik Metzen and Frank Hutter. 2021. Neural architecture search: A survey. 
The Journal of Machine Learning Research 20 1 (2021) 1997\u20132017."},{"key":"e_1_3_1_12_1","unstructured":"Trevor Gale Przemek Tredak Simon Layton and et\u00a0al. 2023. NVIDIA DALI. Released on March 15 2023. [Online]. Available: https:\/\/github.com\/NVIDIA\/DALI. Accessed on May 1 2023."},{"key":"e_1_3_1_13_2","first-page":"689","volume-title":"Proceedings of the 2022 USENIX Annual Technical Conference","author":"Graur Dan","year":"2022","unstructured":"Dan Graur, Damien Aymon, Dan Kluser, and Tanguy Albrici. 2022. Cachew: Machine learning input data processing as a service. In Proceedings of the 2022 USENIX Annual Technical Conference. 689\u2013706."},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378465"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00140"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378530"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00070"},{"key":"e_1_3_1_19_2","doi-asserted-by":"crossref","unstructured":"Daniel Kang Ankit Mathur Teja Veeramacheneni Matei Zaharia and Peter Bailis. 2020. Jointly optimizing preprocessing and inference for DNN-based visual analytics. Proceedings of the VLDB Endowment 14 2 (2020) 87\u2013100.","DOI":"10.14778\/3425879.3425881"},{"key":"e_1_3_1_20_2","volume-title":"Proceedings of the 3rd International Conference on Learning Representations","author":"Kingma Diederik P.","year":"2015","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, Yoshua Bengio and Yann LeCun (Eds.). 
Retrieved from http:\/\/arxiv.org\/abs\/1412.6980"},{"key":"e_1_3_1_21_2","first-page":"33","article-title":"Plumber: Diagnosing and removing performance bottlenecks in machine learning data pipelines","volume":"4","author":"Kuchnik Michael","year":"2022","unstructured":"Michael Kuchnik, Ana Klimovic, Jiri Simsa, George Amvrosiadis, and Virginia Smith. 2022. Plumber: Diagnosing and removing performance bottlenecks in machine learning data pipelines. Proceedings of Machine Learning and Systems 4 (2022), 33\u201351.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3534678.3539070"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3466752.3480122"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_3_1_25_2","doi-asserted-by":"crossref","first-page":"414","DOI":"10.1109\/SC.2018.00035","volume-title":"Proceedings of the SC18: International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Markthub Pak","year":"2018","unstructured":"Pak Markthub, Mehmet E Belviranli, Seyong Lee, Jeffrey S. Vetter, and Satoshi Matsuoka. 2018. DRAGON: Breaking GPU memory capacity limits with direct NVM access. In Proceedings of the SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 414\u2013426."},{"key":"e_1_3_1_26_2","doi-asserted-by":"crossref","unstructured":"Jayashree Mohan Amar Phanishayee Ashish Raniwala and Vijay Chidambaram. 2021. Analyzing and mitigating data stalls in DNN training. Proceedings of the VLDB Endowment 14 5 (2021) 771\u2013784.","DOI":"10.14778\/3446095.3446100"},{"key":"e_1_3_1_27_2","doi-asserted-by":"crossref","unstructured":"Derek G. Murray Jiri Simsa Ana Klimovic and Ihor Indyk. 2021. tf.data: A machine learning data processing framework. 
Proceedings of the VLDB Endowment 14 12 (2021) 2945\u20132958.","DOI":"10.14778\/3476311.3476374"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378505"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/n18-1202"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01044"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476205"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2016.7783721"},{"key":"e_1_3_1_33_2","volume-title":"Proceedings of the 3rd International Conference on Learning Representations","author":"Simonyan K.","year":"2015","unstructured":"K. Simonyan and A. Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, Yoshua Bengio and Yann LeCun (Eds.). Retrieved from http:\/\/arxiv.org\/abs\/1409.1556"},{"key":"e_1_3_1_34_2","article-title":"Training deep nets with sublinear memory cost","author":"Tianqi C.","year":"2016","unstructured":"C. Tianqi, X. Bing, Z. Chiyuan, and G. Carlos. 2016. Training deep nets with sublinear memory cost. arXiv:1604.06174. Retrieved from https:\/\/arxiv.org\/abs\/1604.06174","journal-title":"arXiv:1604.06174"},{"key":"e_1_3_1_35_2","unstructured":"Chien-Yao Wang Alexey Bochkovskiy and Hong-Yuan Mark Liao. 2023. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3178487.3178491"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE53745.2022.00303"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSAC.2022.3192050"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.5244\/C.30.87"},{"key":"e_1_3_1_40_2","article-title":"Adadelta: An adaptive learning rate method","author":"Zeiler M. D.","year":"2012","unstructured":"M. D. Zeiler. 2012. Adadelta: An adaptive learning rate method. arXiv:1212.5701. Retrieved from https:\/\/arxiv.org\/abs\/1212.5701","journal-title":"arXiv:1212.5701"},{"key":"e_1_3_1_41_2","volume-title":"Proceedings of the 5th International Conference on Learning Representations","author":"Zoph Barret","year":"2017","unstructured":"Barret Zoph and Quoc V. Le. 2017. Neural architecture search with reinforcement learning. In Proceedings of the 5th International Conference on Learning Representations. OpenReview.net. 
Retrieved from https:\/\/openreview.net\/forum?id=r1Ue8Hcxg"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3630108","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3630108","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T23:57:08Z","timestamp":1750291028000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3630108"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12,14]]},"references-count":40,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,12,31]]}},"alternative-id":["10.1145\/3630108"],"URL":"https:\/\/doi.org\/10.1145\/3630108","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,12,14]]},"assertion":[{"value":"2023-05-23","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-09-24","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-12-14","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}