{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T00:15:25Z","timestamp":1775607325455,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":71,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,10,28]],"date-time":"2023-10-28T00:00:00Z","timestamp":1698451200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"NSF (National Science Foundation)","doi-asserted-by":"publisher","award":["CCF-2107470"],"award-info":[{"award-number":["CCF-2107470"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,10,28]]},"DOI":"10.1145\/3613424.3614309","type":"proceedings-article","created":{"date-parts":[[2023,12,8]],"date-time":"2023-12-08T17:22:15Z","timestamp":1702056135000},"page":"395-410","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":20,"title":["G10: Enabling An Efficient Unified GPU Memory and Storage Architecture with Smart Tensor Migrations"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0006-9877-411X","authenticated-orcid":false,"given":"Haoyang","family":"Zhang","sequence":"first","affiliation":[{"name":"University of Illinois Urbana-Champaign, United States of America"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-6195-2269","authenticated-orcid":false,"given":"Yirui","family":"Zhou","sequence":"additional","affiliation":[{"name":"University of Illinois Urbana-Champaign, United States of 
America"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-0363-9486","authenticated-orcid":false,"given":"Yuqi","family":"Xue","sequence":"additional","affiliation":[{"name":"University of Illinois Urbana-Champaign, United States of America"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-8171-4970","authenticated-orcid":false,"given":"Yiqi","family":"Liu","sequence":"additional","affiliation":[{"name":"University of Illinois Urbana-Champaign, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1125-671X","authenticated-orcid":false,"given":"Jian","family":"Huang","sequence":"additional","affiliation":[{"name":"University of Illinois Urbana-Champaign, United States of America"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,12,8]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"[n. d.]. PCIe 3.0 Specification. https:\/\/pcisig.com\/specifications.  [n. d.]. PCIe 3.0 Specification. https:\/\/pcisig.com\/specifications."},{"key":"e_1_3_2_1_2_1","volume-title":"Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201916)","author":"Abadi Martin","year":"2016","unstructured":"Martin Abadi , Paul Barham , Jianmin Chen , Zhifeng Chen , Andy Davis , Jeffrey Dean , Matthieu Devin , Sanjay Ghemawat , Geoffrey Irving , Michael Isard , Manjunath Kudlur , Josh Levenberg , Rajat Monga , Sherry Moore , Derek\u00a0Gordon Murray , Benoit Steiner , Paul\u00a0 A. Tucker , Vijay Vasudevan , Pete Warden , Martin Wicke , Yuan Yu , and Xiaoqiang Zhang . 2016 . TensorFlow: A System for Large-Scale Machine Learning . In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201916) . Savannah, GA. 
Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek\u00a0Gordon Murray, Benoit Steiner, Paul\u00a0A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zhang. 2016. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201916). Savannah, GA."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304061"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/2694344.2694381"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.5555\/1404014.1404019"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3480855"},{"key":"e_1_3_2_1_7_1","unstructured":"[\n  7\n  ]  AMD DirectGMA. [n. d.]. https:\/\/www.bitflow.com\/technology\/directgma\/.  [7] AMD DirectGMA. [n. d.]. https:\/\/www.bitflow.com\/technology\/directgma\/."},{"key":"e_1_3_2_1_8_1","unstructured":"[\n  8\n  ]  AMD High Bandwidth Memory. [n. d.]. https:\/\/www.amd.com\/en\/technologies\/hbm.  [8] AMD High Bandwidth Memory. [n. d.]. https:\/\/www.amd.com\/en\/technologies\/hbm."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3123975"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/HiPC.2018.00024"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2524211.2524221"},{"key":"e_1_3_2_1_12_1","volume-title":"FlashNeuron: SSD-Enabled Large-Batch Training of Very Deep Neural Networks. In 19th USENIX Conference on File and Storage Technologies (FAST 21)","author":"Bae Jonghyun","year":"2021","unstructured":"Jonghyun Bae , Jongsung Lee , Yunho Jin , Sam Son , Shine Kim , Hakbeom Jang , Tae\u00a0Jun Ham , and Jae\u00a0 W. Lee . 2021 . 
FlashNeuron: SSD-Enabled Large-Batch Training of Very Deep Neural Networks. In 19th USENIX Conference on File and Storage Technologies (FAST 21) . USENIX Association, 387\u2013401. https:\/\/www.usenix.org\/conference\/fast21\/presentation\/bae Jonghyun Bae, Jongsung Lee, Yunho Jin, Sam Son, Shine Kim, Hakbeom Jang, Tae\u00a0Jun Ham, and Jae\u00a0W. Lee. 2021. FlashNeuron: SSD-Enabled Large-Batch Training of Very Deep Neural Networks. In 19th USENIX Conference on File and Storage Technologies (FAST 21). USENIX Association, 387\u2013401. https:\/\/www.usenix.org\/conference\/fast21\/presentation\/bae"},{"key":"e_1_3_2_1_13_1","volume-title":"Advances in Neural Information Processing Systems, H.\u00a0Larochelle, M.\u00a0Ranzato, R.\u00a0Hadsell, M.F. Balcan, and H.\u00a0Lin (Eds.). Vol.\u00a033. Curran Associates","author":"Brown Tom","year":"2020","unstructured":"Tom Brown , Benjamin Mann , Nick Ryder , Melanie Subbiah , Jared\u00a0 D Kaplan , Prafulla Dhariwal , Arvind Neelakantan , Pranav Shyam , Girish Sastry , Amanda Askell , Sandhini Agarwal , Ariel Herbert-Voss , Gretchen Krueger , Tom Henighan , Rewon Child , Aditya Ramesh , Daniel Ziegler , Jeffrey Wu , Clemens Winter , Chris Hesse , Mark Chen , Eric Sigler , Mateusz Litwin , Scott Gray , Benjamin Chess , Jack Clark , Christopher Berner , Sam McCandlish , Alec Radford , Ilya Sutskever , and Dario Amodei . 2020. Language Models are Few-Shot Learners . In Advances in Neural Information Processing Systems, H.\u00a0Larochelle, M.\u00a0Ranzato, R.\u00a0Hadsell, M.F. Balcan, and H.\u00a0Lin (Eds.). Vol.\u00a033. Curran Associates , Inc ., 1877 \u20131901. 
https:\/\/proceedings.neurips.cc\/paper\/2020\/file\/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared\u00a0D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H.\u00a0Larochelle, M.\u00a0Ranzato, R.\u00a0Hadsell, M.F. Balcan, and H.\u00a0Lin (Eds.). Vol.\u00a033. Curran Associates, Inc., 1877\u20131901. https:\/\/proceedings.neurips.cc\/paper\/2020\/file\/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1508244.1508270"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2541940.2541967"},{"key":"e_1_3_2_1_16_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2018 . Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_1_17_1","volume-title":"International Conference on Learning Representations. 
https:\/\/openreview.net\/forum?id=YicbFdNTTy","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy , Lucas Beyer , Alexander Kolesnikov , Dirk Weissenborn , Xiaohua Zhai , Thomas Unterthiner , Mostafa Dehghani , Matthias Minderer , Georg Heigold , Sylvain Gelly , Jakob Uszkoreit , and Neil Houlsby . 2021 . An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale . In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=YicbFdNTTy Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=YicbFdNTTy"},{"key":"e_1_3_2_1_18_1","volume-title":"Proceedings of the Conference on Systems and Machine Learning (SysML\u201919)","author":"Eisenman Assaf","year":"2019","unstructured":"Assaf Eisenman , Maxim Naumov , Darryl Gardner , Misha Smelyanskiy , Sergey Pupyrev , Kim Hazelwood , Asaf Cidon , and Sachin Katti . 2019 . Bandana: Using Non-Volatile Memory for Storing Deep Learning Models . In Proceedings of the Conference on Systems and Machine Learning (SysML\u201919) . Stanford, CA. Assaf Eisenman, Maxim Naumov, Darryl Gardner, Misha Smelyanskiy, Sergey Pupyrev, Kim Hazelwood, Asaf Cidon, and Sachin Katti. 2019. Bandana: Using Non-Volatile Memory for Storing Deep Learning Models. In Proceedings of the Conference on Systems and Machine Learning (SysML\u201919). Stanford, CA."},{"key":"e_1_3_2_1_19_1","unstructured":"[\n  19\n  ]  Examining AMD Radeon Pro SSG: How NAND Changes the GPU Game. [n. d.]. https:\/\/www.tomshardware.com\/news\/amd-radeon-pro-ssg 32365.html.  [19] Examining AMD Radeon Pro SSG: How NAND Changes the GPU Game. [n. d.]. 
https:\/\/www.tomshardware.com\/news\/amd-radeon-pro-ssg 32365.html."},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3307650.3322224"},{"key":"e_1_3_2_1_21_1","unstructured":"GPUDirect Storage: A Direct Path Between Storage and GPU Memory. [n. d.]. https:\/\/developer.nvidia.com\/blog\/gpudirect-storage\/."},{"key":"e_1_3_2_1_22_1","volume-title":"Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA\u201917)","author":"Hao Yuchen","year":"2017","unstructured":"Yuchen Hao, Zhenman Fang, Glenn Reinman, and Jason Cong. 2017. Supporting Address Translation for Accelerator-Centric Architectures. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA\u201917)."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_1_24_1","unstructured":"High Bandwidth Memory. [n. d.]. https:\/\/en.wikipedia.org\/wiki\/High_Bandwidth_Memory."},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378465"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00745"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378530"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750420"},{"key":"e_1_3_2_1_29_1","unstructured":"Huggingface 2023. Transformers. [n. d.]. 
https:\/\/github.com\/huggingface\/transformers\/tree\/main\/examples\/pytorch.  [29] Huggingface 2023. Transformers. [n. d.]. https:\/\/github.com\/huggingface\/transformers\/tree\/main\/examples\/pytorch."},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378494"},{"key":"e_1_3_2_1_31_1","unstructured":"Intel. 2018. 3D XPoint: A Breakthrough in Non-Volatile Memory Technology. https:\/\/www.intel.com\/content\/www\/us\/en\/architecture-and-technology\/intel-micron-3d-xpoint-webcast.html.  Intel. 2018. 3D XPoint: A Breakthrough in Non-Volatile Memory Technology. https:\/\/www.intel.com\/content\/www\/us\/en\/architecture-and-technology\/intel-micron-3d-xpoint-webcast.html."},{"key":"e_1_3_2_1_32_1","volume-title":"Proceedings of Machine Learning and Systems (MLSys\u201920)","author":"Jain Paras","year":"2020","unstructured":"Paras Jain , Ajay Jain , Aniruddha Nrusimha , Amir Gholami , Pieter Abbeel , Joseph Gonzalez , Kurt Keutzer , and Ion Stoica . 2020 . Breaking the Memory Wall with Optimal Tensor Rematerialization . In Proceedings of Machine Learning and Systems (MLSys\u201920) . Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica. 2020. Breaking the Memory Wall with Optimal Tensor Rematerialization. In Proceedings of Machine Learning and Systems (MLSys\u201920)."},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/1837915.1837922"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3575693.3575736"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378529"},{"key":"e_1_3_2_1_36_1","volume-title":"Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201914)","author":"Kim Sangman","year":"2014","unstructured":"Sangman Kim , Seonggu Huh , Xinya Zhang , Yige Hu , Amir Wated , Emmett Witchel , and Mark Silberstein . 2014 . 
GPUnet: Networking Abstractions for GPU Programs . In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201914) . Broomfield, CO. Sangman Kim, Seonggu Huh, Xinya Zhang, Yige Hu, Amir Wated, Emmett Witchel, and Mark Silberstein. 2014. GPUnet: Networking Abstractions for GPU Programs. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201914). Broomfield, CO."},{"key":"e_1_3_2_1_37_1","unstructured":"Alex Krizhevsky Ilya Sutskever and Geoffrey\u00a0E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems.  Alex Krizhevsky Ilya Sutskever and Geoffrey\u00a0E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems."},{"key":"e_1_3_2_1_38_1","volume-title":"Proceedings of the 52nd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201919)","author":"Kwon Youngeun","year":"2019","unstructured":"Youngeun Kwon , Yunjae Lee , and Minsoo Rhu . 2019 . TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning . In Proceedings of the 52nd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201919) . Columbus, OH, USA. Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2019. TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning. In Proceedings of the 52nd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201919). Columbus, OH, USA."},{"key":"e_1_3_2_1_39_1","volume-title":"Proceedings of the 51st Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201918)","author":"Kwon Youngeun","year":"2018","unstructured":"Youngeun Kwon and Minsoo Rhu . 2018 . Beyond the Memory Wall: A Case for Memory-Centric HPC System for Deep Learning . 
In Proceedings of the 51st Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201918) . Fukuoka, Japan. Youngeun Kwon and Minsoo Rhu. 2018. Beyond the Memory Wall: A Case for Memory-Centric HPC System for Deep Learning. In Proceedings of the 51st Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201918). Fukuoka, Japan."},{"key":"e_1_3_2_1_40_1","volume-title":"TFLMS: Large Model Support in TensorFlow by Graph Rewriting. arxiv:1807.02037\u00a0[cs.LG]","author":"Le D.","year":"2019","unstructured":"Tung\u00a0 D. Le , Haruki Imai , Yasushi Negishi , and Kiyokuni Kawachiya . 2019 . TFLMS: Large Model Support in TensorFlow by Graph Rewriting. arxiv:1807.02037\u00a0[cs.LG] Tung\u00a0D. Le, Haruki Imai, Yasushi Negishi, and Kiyokuni Kawachiya. 2019. TFLMS: Large Model Support in TensorFlow by Graph Rewriting. arxiv:1807.02037\u00a0[cs.LG]"},{"key":"e_1_3_2_1_41_1","volume-title":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).","author":"Lew Jonathan","year":"2019","unstructured":"Jonathan Lew , Deval\u00a0 A. Shah , Suchita Pati , Shaylin Cattell , Mengchi Zhang , Amruth Sandhupatla , Christopher Ng , Negar Goli , Matthew\u00a0 D. Sinclair , Timothy\u00a0 G. Rogers , and Tor\u00a0 M. Aamodt . 2019 . Analyzing Machine Learning Workloads Using a Detailed GPU Simulator . In 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). Jonathan Lew, Deval\u00a0A. Shah, Suchita Pati, Shaylin Cattell, Mengchi Zhang, Amruth Sandhupatla, Christopher Ng, Negar Goli, Matthew\u00a0D. Sinclair, Timothy\u00a0G. Rogers, and Tor\u00a0M. Aamodt. 2019. Analyzing Machine Learning Workloads Using a Detailed GPU Simulator. 
In 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)."},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2019.2928289"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2008.15"},{"key":"e_1_3_2_1_44_1","unstructured":"[\n  44\n  ]  NVIDIA H100 Tensor Core GPU. [n. d.]. https:\/\/www.nvidia.com\/en-us\/data-center\/h100\/.  [44] NVIDIA H100 Tensor Core GPU. [n. d.]. https:\/\/www.nvidia.com\/en-us\/data-center\/h100\/."},{"key":"e_1_3_2_1_45_1","volume-title":"Proceedings of the 13th European Conference on Computer Systems (EuroSys\u201918)","author":"Peng Yanghua","year":"2018","unstructured":"Yanghua Peng , Yixin Bao , Yangrui Chen , Chuan Wu , and Chuanxiong Guo . 2018 . Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters . In Proceedings of the 13th European Conference on Computer Systems (EuroSys\u201918) . Porto, Portugal. Yanghua Peng, Yixin Bao, Yangrui Chen, Chuan Wu, and Chuanxiong Guo. 2018. Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters. In Proceedings of the 13th European Conference on Computer Systems (EuroSys\u201918). Porto, Portugal."},{"key":"e_1_3_2_1_46_1","unstructured":"[\n  46\n  ]  PyTorch 2023. PyTorch Examples. [n. d.]. https:\/\/pytorch.org\/examples\/#pytorch-examples.  [46] PyTorch 2023. PyTorch Examples. [n. d.]. https:\/\/pytorch.org\/examples\/#pytorch-examples."},{"key":"e_1_3_2_1_47_1","volume-title":"BaM: A Case for Enabling Fine-grain High Throughput GPU-Orchestrated Access to Storage. arXiv preprint arXiv:2203.04910","author":"Qureshi Zaid","year":"2022","unstructured":"Zaid Qureshi , Vikram\u00a0Sharma Mailthody , Isaac Gelado , Seung\u00a0Won Min , Amna Masood , Jeongmin Park , Jinjun Xiong , CJ Newburn , Dmitri Vainbrand , I Chung , 2022. BaM: A Case for Enabling Fine-grain High Throughput GPU-Orchestrated Access to Storage. arXiv preprint arXiv:2203.04910 ( 2022 ). 
Zaid Qureshi, Vikram\u00a0Sharma Mailthody, Isaac Gelado, Seung\u00a0Won Min, Amna Masood, Jeongmin Park, Jinjun Xiong, CJ Newburn, Dmitri Vainbrand, I Chung, 2022. BaM: A Case for Enabling Fine-grain High Throughput GPU-Orchestrated Access to Storage. arXiv preprint arXiv:2203.04910 (2022)."},{"key":"e_1_3_2_1_48_1","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Rajbhandari Samyam","year":"2020","unstructured":"Samyam Rajbhandari , Jeff Rasley , Olatunji Ruwase , and Yuxiong He . 2020 . ZeRO: Memory Optimizations toward Training Trillion Parameter Models . In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis ( Atlanta, Georgia) (SC \u201920). IEEE Press, Article 20, 16\u00a0pages. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory Optimizations toward Training Trillion Parameter Models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Atlanta, Georgia) (SC \u201920). IEEE Press, Article 20, 16\u00a0pages."},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476205"},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1147\/rd.524.0465"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00057"},{"key":"e_1_3_2_1_52_1","volume-title":"ZeRO-Offload: Democratizing Billion-Scale Model Training. CoRR abs\/2101.06840","author":"Ren Jie","year":"2021","unstructured":"Jie Ren , Samyam Rajbhandari , Reza\u00a0Yazdani Aminabadi , Olatunji Ruwase , Shuangyan Yang , Minjia Zhang , Dong Li , and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training. CoRR abs\/2101.06840 ( 2021 ). 
https:\/\/arxiv.org\/abs\/2101.06840 Jie Ren, Samyam Rajbhandari, Reza\u00a0Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training. CoRR abs\/2101.06840 (2021). https:\/\/arxiv.org\/abs\/2101.06840"},{"key":"e_1_3_2_1_53_1","unstructured":"Samsung. [n. d.]. Samsung Z-SSD SZ985. https:\/\/semiconductor.samsung.com\/resources\/brochure\/Brochure_Samsung_S-ZZD_SZ985_1804.pdf"},{"key":"e_1_3_2_1_54_1","unstructured":"Samsung Z-NAND. [n. d.]. https:\/\/www.samsung.com\/semiconductor\/ssd\/z-ssd\/."},{"key":"e_1_3_2_1_55_1","volume-title":"Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference(USENIXATC\u201910)","author":"Saxena Mohit","year":"2010","unstructured":"Mohit Saxena and Michael\u00a0M. Swift. 2010. FlashVM: Virtual Memory Management on Flash. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference(USENIXATC\u201910). Boston, MA, 187\u2013200."},{"key":"e_1_3_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/3007787.3001200"},{"key":"e_1_3_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/2451116.2451169"},{"key":"e_1_3_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.5555\/3298023.3298188"},{"key":"e_1_3_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.308"},{"key":"e_1_3_2_1_60_1","volume-title":"International conference on machine learning. 
PMLR, 6105\u20136114","author":"Tan Mingxing","year":"2019","unstructured":"Mingxing Tan and Quoc Le. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning. PMLR, 6105\u20136114."},{"key":"e_1_3_2_1_61_1","unstructured":"The World\u2019s First GPU to Break the Terabyte Memory Barrier. [n. d.]. https:\/\/www.amd.com\/en\/products\/professional-graphics\/radeon-pro-ssg."},{"key":"e_1_3_2_1_62_1","unstructured":"Unified CPU\/GPU Memory Architecture Raises the Performance Bar. [n. d.]. https:\/\/www.electronicdesign.com\/technologies\/microcontrollers\/article\/21796296\/unified-cpugpu-memory-architecture-raises-the-performance-bar."},{"key":"e_1_3_2_1_63_1","unstructured":"Unified Memory for CUDA Beginners. [n. d.]. https:\/\/developer.nvidia.com\/blog\/unified-memory-cuda-beginners\/."},{"key":"e_1_3_2_1_64_1","volume-title":"Neural Network Acceptability Judgments. arXiv preprint arXiv:1805.12471","author":"Warstadt Alex","year":"2018","unstructured":"Alex Warstadt, Amanpreet Singh, and Samuel\u00a0R Bowman. 2018. 
Neural Network Acceptability Judgments. arXiv preprint arXiv:1805.12471 (2018)."},{"key":"e_1_3_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.634"},{"key":"e_1_3_2_1_66_1","volume-title":"Wide residual networks. arXiv preprint arXiv:1605.07146","author":"Zagoruyko Sergey","year":"2016","unstructured":"Sergey Zagoruyko and Nikos Komodakis . 2016. Wide residual networks. arXiv preprint arXiv:1605.07146 ( 2016 ). Sergey Zagoruyko and Nikos Komodakis. 2016. Wide residual networks. arXiv preprint arXiv:1605.07146 (2016)."},{"key":"e_1_3_2_1_67_1","volume-title":"Proceedings of the ACM\/IEEE 47th Annual International Symposium on Computer Architecture (ISCA\u201920)","author":"Zhang Jie","year":"2020","unstructured":"Jie Zhang and Myoungsoo Jung . 2020 . ZnG: Architecting GPU Multi-Processors with New Flash for Scalable Data Analysis . In Proceedings of the ACM\/IEEE 47th Annual International Symposium on Computer Architecture (ISCA\u201920) . Jie Zhang and Myoungsoo Jung. 2020. ZnG: Architecting GPU Multi-Processors with New Flash for Scalable Data Analysis. In Proceedings of the ACM\/IEEE 47th Annual International Symposium on Computer Architecture (ISCA\u201920)."},{"key":"e_1_3_2_1_68_1","volume-title":"Proceedings of the 56th Annual Design Automation Conference (DAC\u201919)","author":"Zhang Jie","year":"2019","unstructured":"Jie Zhang , Miryeong Kwon , Hyojong Kim , Hyesoon Kim , and Myoungsoo Jung . 2019 . FlashGPU: Placing New Flash Next to GPU Cores . In Proceedings of the 56th Annual Design Automation Conference (DAC\u201919) (Las Vegas, NV, USA). Jie Zhang, Miryeong Kwon, Hyojong Kim, Hyesoon Kim, and Myoungsoo Jung. 2019. FlashGPU: Placing New Flash Next to GPU Cores. In Proceedings of the 56th Annual Design Automation Conference (DAC\u201919) (Las Vegas, NV, USA)."},{"key":"e_1_3_2_1_69_1","volume-title":"Proc. 10th USENIX FAST","author":"Zhang Yiying","year":"2012","unstructured":"Yiying Zhang , Leo\u00a0Prasath Arulraj , Andrea\u00a0 C. 
Arpaci-Dusseau , and Remzi\u00a0 H. Arpaci-Dusseau . 2012 . De-indirection for Flash-based SSDs with Nameless Writes . In Proc. 10th USENIX FAST . San Jose, CA. Yiying Zhang, Leo\u00a0Prasath Arulraj, Andrea\u00a0C. Arpaci-Dusseau, and Remzi\u00a0H. Arpaci-Dusseau. 2012. De-indirection for Flash-based SSDs with Nameless Writes. In Proc. 10th USENIX FAST. San Jose, CA."},{"key":"e_1_3_2_1_70_1","article-title":"Optimizing GPU Energy Efficiency with 3D Die-Stacking Graphics Memory and Reconfigurable Memory Interface","volume":"10","author":"Zhao Jishen","year":"2013","unstructured":"Jishen Zhao , Guangyu Sun , Gabriel\u00a0 H. Loh , and Yuan Xie . 2013 . Optimizing GPU Energy Efficiency with 3D Die-Stacking Graphics Memory and Reconfigurable Memory Interface . ACM Trans. Archit. Code Optim. 10 , 4, Article 24 (Dec. 2013). Jishen Zhao, Guangyu Sun, Gabriel\u00a0H. Loh, and Yuan Xie. 2013. Optimizing GPU Energy Efficiency with 3D Die-Stacking Graphics Memory and Reconfigurable Memory Interface. ACM Trans. Archit. Code Optim. 10, 4, Article 24 (Dec. 2013).","journal-title":"ACM Trans. Archit. 
Code Optim."},{"key":"e_1_3_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2016.7446077"}],"event":{"name":"MICRO '23: 56th Annual IEEE\/ACM International Symposium on Microarchitecture","location":"Toronto ON Canada","acronym":"MICRO '23","sponsor":["SIGMICRO ACM Special Interest Group on Microarchitectural Research and Processing"]},"container-title":["56th Annual IEEE\/ACM International Symposium on Microarchitecture"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3613424.3614309","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3613424.3614309","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3613424.3614309","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:36:30Z","timestamp":1750178190000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3613424.3614309"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,28]]},"references-count":71,"alternative-id":["10.1145\/3613424.3614309","10.1145\/3613424"],"URL":"https:\/\/doi.org\/10.1145\/3613424.3614309","relation":{},"subject":[],"published":{"date-parts":[[2023,10,28]]},"assertion":[{"value":"2023-12-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}