{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,20]],"date-time":"2026-05-20T16:28:17Z","timestamp":1779294497764,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":62,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,10,28]],"date-time":"2023-10-28T00:00:00Z","timestamp":1698451200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,10,28]]},"DOI":"10.1145\/3613424.3614299","type":"proceedings-article","created":{"date-parts":[[2023,12,8]],"date-time":"2023-12-08T17:22:15Z","timestamp":1702056135000},"page":"438-451","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Improving Data Reuse in NPU On-chip Memory with Interleaved Gradient Order for DNN Training"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4773-9251","authenticated-orcid":false,"given":"Jungwoo","family":"Kim","sequence":"first","affiliation":[{"name":"KAIST, Republic of Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-0734-8126","authenticated-orcid":false,"given":"Seonjin","family":"Na","sequence":"additional","affiliation":[{"name":"KAIST, Republic of Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-7060-6315","authenticated-orcid":false,"given":"Sanghyeon","family":"Lee","sequence":"additional","affiliation":[{"name":"KAIST, Republic of Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},
{"ORCID":"https:\/\/orcid.org\/0000-0003-4362-9565","authenticated-orcid":false,"given":"Sunho","family":"Lee","sequence":"additional","affiliation":[{"name":"KAIST, Republic of Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1742-047X","authenticated-orcid":false,"given":"Jaehyuk","family":"Huh","sequence":"additional","affiliation":[{"name":"KAIST, Republic of Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,12,8]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"International Solid-State Circuits Conference (ISSCC).","author":"Agrawal Ankur","year":"2021","unstructured":"Ankur Agrawal , Sae\u00a0Kyu Lee , Joel Silberman , Matthew Ziegler , Mingu Kang , Swagath Venkataramani , Nianzheng Cao , Bruce Fleischer , Michael Guillorn , Matthew Cohen , 2021 . 9.1 a 7nm 4-core AI chip with 25.6 TFLOPS hybrid FP8 training, 102.4 TOPS INT4 inference and workload-aware throttling . In International Solid-State Circuits Conference (ISSCC). Ankur Agrawal, Sae\u00a0Kyu Lee, Joel Silberman, Matthew Ziegler, Mingu Kang, Swagath Venkataramani, Nianzheng Cao, Bruce Fleischer, Michael Guillorn, Matthew Cohen, 2021. 9.1 a 7nm 4-core AI chip with 25.6 TFLOPS hybrid FP8 training, 102.4 TOPS INT4 inference and workload-aware throttling. In International Solid-State Circuits Conference (ISSCC)."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.5555\/3195638.3195664"},{"key":"e_1_3_2_1_4_1","volume-title":"ACM\/IEEE 47th Annual International Symposium on Computer Architecture (ISCA).","author":"Baek Eunjin","year":"2020","unstructured":"Eunjin Baek , Dongup Kwon , and Jangwoo Kim . 2020 . A multi-neural network acceleration architecture . In ACM\/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). Eunjin Baek, Dongup Kwon, and Jangwoo Kim. 2020. A multi-neural network acceleration architecture. In ACM\/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)."},{"key":"e_1_3_2_1_5_1","unstructured":"Simon Boehm. 2022. 
How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog. https:\/\/siboehm.com\/articles\/22\/CUDA-MMM  Simon Boehm. 2022. How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog. https:\/\/siboehm.com\/articles\/22\/CUDA-MMM"},{"key":"e_1_3_2_1_6_1","volume-title":"Conference on Neural Information Processing Systems (NIPS).","author":"Brown Tom","year":"2020","unstructured":"Tom Brown , Benjamin Mann , Nick Ryder , Melanie Subbiah , Jared\u00a0 D Kaplan , Prafulla Dhariwal , Arvind Neelakantan , Pranav Shyam , Girish Sastry , Amanda Askell , 2020 . Language models are few-shot learners . In Conference on Neural Information Processing Systems (NIPS). Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared\u00a0D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, 2020. Language models are few-shot learners. In Conference on Neural Information Processing Systems (NIPS)."},{"key":"e_1_3_2_1_7_1","volume-title":"Marvel: A data-centric approach for mapping deep learning operators on spatial accelerators. In Journal of ACM Transactions on Architecture and Code Optimization (TACO).","author":"Chatarasi Prasanth","year":"2021","unstructured":"Prasanth Chatarasi , Hyoukjun Kwon , Angshuman Parashar , Michael Pellauer , Tushar Krishna , and Vivek Sarkar . 2021 . Marvel: A data-centric approach for mapping deep learning operators on spatial accelerators. In Journal of ACM Transactions on Architecture and Code Optimization (TACO). Prasanth Chatarasi, Hyoukjun Kwon, Angshuman Parashar, Michael Pellauer, Tushar Krishna, and Vivek Sarkar. 2021. Marvel: A data-centric approach for mapping deep learning operators on spatial accelerators. In Journal of ACM Transactions on Architecture and Code Optimization (TACO)."},{"key":"e_1_3_2_1_8_1","volume-title":"Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. 
In ACM SIGARCH Computer Architecture News.","author":"Chen Tianshi","year":"2014","unstructured":"Tianshi Chen , Zidong Du , Ninghui Sun , Jia Wang , Chengyong Wu , Yunji Chen , and Olivier Temam . 2014 . Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In ACM SIGARCH Computer Architecture News. Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In ACM SIGARCH Computer Architecture News."},{"key":"e_1_3_2_1_9_1","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI).","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen , Thierry Moreau , Ziheng Jiang , Lianmin Zheng , Eddie Yan , Haichen Shen , Meghan Cowan , Leyuan Wang , Yuwei Hu , Luis Ceze , 2018 . TVM: An automated end-to-end optimizing compiler for deep learning . In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI). Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI)."},{"key":"e_1_3_2_1_10_1","first-page":"06174","article-title":"Training deep nets with sublinear memory cost","volume":"1604","author":"Chen Tianqi","year":"2016","unstructured":"Tianqi Chen , Bing Xu , Chiyuan Zhang , and Carlos Guestrin . 2016 . Training deep nets with sublinear memory cost . In ArXiv Preprint ArXiv : 1604 . 06174 . Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. 
In ArXiv Preprint ArXiv:1604.06174.","journal-title":"ArXiv Preprint ArXiv"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3007787.3001177"},{"key":"e_1_3_2_1_12_1","volume-title":"USENIX Annual Technical Conference (ATC).","author":"Choi Seungbeom","year":"2022","unstructured":"Seungbeom Choi , Sunho Lee , Yeonjae Kim , Jongse Park , Youngjin Kwon , and Jaehyuk Huh . 2022 . Serving heterogeneous machine learning models on Multi-GPU servers with Spatio-Temporal sharing . In USENIX Annual Technical Conference (ATC). Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon, and Jaehyuk Huh. 2022. Serving heterogeneous machine learning models on Multi-GPU servers with Spatio-Temporal sharing. In USENIX Annual Technical Conference (ATC)."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"crossref","unstructured":"Seungkyu Choi Jaehyeong Sim Myeonggu Kang Yeongjae Choi Hyeonuk Kim and Lee-Sup Kim. 2020. An energy-efficient deep convolutional neural network training accelerator for in Situ personalization on smart devices. In Journal of Solid-State Circuits (JSSC).  Seungkyu Choi Jaehyeong Sim Myeonggu Kang Yeongjae Choi Hyeonuk Kim and Lee-Sup Kim. 2020. An energy-efficient deep convolutional neural network training accelerator for in Situ personalization on smart devices. In Journal of Solid-State Circuits (JSSC).","DOI":"10.1109\/JSSC.2020.3005786"},{"key":"e_1_3_2_1_14_1","unstructured":"Tri Dao. 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning. In arXiv preprint arXiv:2307.08691.  Tri Dao. 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning. In arXiv preprint arXiv:2307.08691."},{"key":"e_1_3_2_1_15_1","unstructured":"Tri Dao Dan Fu Stefano Ermon Atri Rudra and Christopher R\u00e9. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NIPS).  
Tri Dao Dan Fu Stefano Ermon Atri Rudra and Christopher R\u00e9. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NIPS)."},{"key":"e_1_3_2_1_16_1","volume-title":"International Symposium on Microarchitecture (MICRO).","author":"Drumond Mario","year":"2021","unstructured":"Mario Drumond , Louis Coulon , Arash Pourhabibi , Ahmet\u00a0Caner Y\u00fcz\u00fcg\u00fcler , Babak Falsafi , and Martin Jaggi . 2021 . Equinox: Training (for Free) on a custom inference accelerator . In International Symposium on Microarchitecture (MICRO). Mario Drumond, Louis Coulon, Arash Pourhabibi, Ahmet\u00a0Caner Y\u00fcz\u00fcg\u00fcler, Babak Falsafi, and Martin Jaggi. 2021. Equinox: Training (for Free) on a custom inference accelerator. In International Symposium on Microarchitecture (MICRO)."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750389"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3037697.3037702"},{"key":"e_1_3_2_1_19_1","volume-title":"Design Automation Conference (DAC).","author":"Genc Hasan","year":"2021","unstructured":"Hasan Genc , Seah Kim , Alon Amid , Ameer Haj-Ali , Vighnesh Iyer , Pranav Prakash , Jerry Zhao , Daniel Grubb , Harrison Liew , Howard Mao , 2021 . Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration . In Design Automation Conference (DAC). Hasan Genc, Seah Kim, Alon Amid, Ameer Haj-Ali, Vighnesh Iyer, Pranav Prakash, Jerry Zhao, Daniel Grubb, Harrison Liew, Howard Mao, 2021. Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration. In Design Automation Conference (DAC)."},{"key":"e_1_3_2_1_20_1","unstructured":"Google. 2023. Tensorflow XLA. https:\/\/www.tensorflow.org\/xla  Google. 2023. Tensorflow XLA. https:\/\/www.tensorflow.org\/xla"},{"key":"e_1_3_2_1_21_1","unstructured":"Google. 2023. 
Understanding data sharding (data parallelism) : Google Cloud TPU. https:\/\/cloud.google.com\/tpu\/docs\/troubleshooting\/trouble-tf#data-parallelism  Google. 2023. Understanding data sharding (data parallelism) : Google Cloud TPU. https:\/\/cloud.google.com\/tpu\/docs\/troubleshooting\/trouble-tf#data-parallelism"},{"key":"e_1_3_2_1_22_1","unstructured":"Google Cloud. 2023. Cloud TPU programming model. https:\/\/cloud.google.com\/tpu\/docs\/tpus#programming_model  Google Cloud. 2023. Cloud TPU programming model. https:\/\/cloud.google.com\/tpu\/docs\/tpus#programming_model"},{"key":"e_1_3_2_1_23_1","volume-title":"System Architecture: TPU Chip. https:\/\/cloud.google.com\/tpu\/docs\/system-architecture-tpu-vm#tpu_chip","author":"Google Cloud","year":"2023","unstructured":"Google Cloud . 2023 . System Architecture: TPU Chip. https:\/\/cloud.google.com\/tpu\/docs\/system-architecture-tpu-vm#tpu_chip Google Cloud. 2023. System Architecture: TPU Chip. https:\/\/cloud.google.com\/tpu\/docs\/system-architecture-tpu-vm#tpu_chip"},{"key":"e_1_3_2_1_24_1","volume-title":"System Architecture: TPU v4. https:\/\/cloud.google.com\/tpu\/docs\/system-architecture-tpu-vm#tpu_v4","author":"Google Cloud","year":"2023","unstructured":"Google Cloud . 2023 . System Architecture: TPU v4. https:\/\/cloud.google.com\/tpu\/docs\/system-architecture-tpu-vm#tpu_v4 Google Cloud. 2023. System Architecture: TPU v4. https:\/\/cloud.google.com\/tpu\/docs\/system-architecture-tpu-vm#tpu_v4"},{"key":"e_1_3_2_1_25_1","first-page":"02677","article-title":"Accurate, large minibatch sgd: Training imagenet in 1 hour","volume":"1706","author":"Goyal Priya","year":"2017","unstructured":"Priya Goyal , Piotr Doll\u00e1r , Ross Girshick , Pieter Noordhuis , Lukasz Wesolowski , Aapo Kyrola , Andrew Tulloch , Yangqing Jia , and Kaiming He . 2017 . Accurate, large minibatch sgd: Training imagenet in 1 hour . In ArXiv Preprint ArXiv : 1706 . 02677 . 
Priya Goyal, Piotr Doll\u00e1r, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch sgd: Training imagenet in 1 hour. In ArXiv Preprint ArXiv:1706.02677.","journal-title":"ArXiv Preprint ArXiv"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"crossref","unstructured":"Andreas Griewank and Andrea Walther. 2000. Algorithm 799: Revolve: An implementation of checkpointing for the reverse or adjoint mode of computational differentiation. In Transactions on Mathematical Software (TOMS).  Andreas Griewank and Andrea Walther. 2000. Algorithm 799: Revolve: An implementation of checkpointing for the reverse or adjoint mode of computational differentiation. In Transactions on Mathematical Software (TOMS).","DOI":"10.1145\/347837.347846"},{"key":"e_1_3_2_1_27_1","volume-title":"Proceedings of the Fourth Workshop on Data analytics in the Cloud (DanaC).","author":"Hadjis Stefan","year":"2015","unstructured":"Stefan Hadjis , Firas Abuzaid , Ce Zhang , and Christopher R\u00e9 . 2015 . Caffe con troll: Shallow ideas to speed up deep learning . In Proceedings of the Fourth Workshop on Data analytics in the Cloud (DanaC). Stefan Hadjis, Firas Abuzaid, Ce Zhang, and Christopher R\u00e9. 2015. Caffe con troll: Shallow ideas to speed up deep learning. In Proceedings of the Fourth Workshop on Data analytics in the Cloud (DanaC)."},{"key":"e_1_3_2_1_28_1","volume-title":"2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT).","author":"Han Myeonggyun","year":"2019","unstructured":"Myeonggyun Han , Jihoon Hyun , Seongbeom Park , Jinsu Park , and Woongki Baek . 2019 . Mosaic: Heterogeneity-, communication-, and constraint-aware model slicing and execution for accurate and efficient inference . In 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT). Myeonggyun Han, Jihoon Hyun, Seongbeom Park, Jinsu Park, and Woongki Baek. 
2019. Mosaic: Heterogeneity-, communication-, and constraint-aware model slicing and execution for accurate and efficient inference. In 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT)."},{"key":"e_1_3_2_1_29_1","first-page":"11205","article-title":"Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes","volume":"1807","author":"Jia Xianyan","year":"2018","unstructured":"Xianyan Jia , Shutao Song , Wei He , Yangzihao Wang , Haidong Rong , Feihu Zhou , Liqiang Xie , Zhenyu Guo , Yuanzhou Yang , Liwei Yu , 2018 . Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes . In ArXiv Preprint ArXiv : 1807 . 11205 . Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, 2018. Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. In ArXiv Preprint ArXiv: 1807.11205.","journal-title":"ArXiv Preprint ArXiv"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"crossref","unstructured":"Norman\u00a0P Jouppi Doe\u00a0Hyun Yoon George Kurian Sheng Li Nishant Patil James Laudon Cliff Young and David Patterson. 2020. A domain-specific supercomputer for training deep neural networks. In Communications of the ACM (CACM).  Norman\u00a0P Jouppi Doe\u00a0Hyun Yoon George Kurian Sheng Li Nishant Patil James Laudon Cliff Young and David Patterson. 2020. A domain-specific supercomputer for training deep neural networks. In Communications of the ACM (CACM).","DOI":"10.1145\/3360307"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},{"key":"e_1_3_2_1_32_1","volume-title":"International Solid-State Circuits Conference (ISSCC).","author":"Kang Sanghoon","year":"2020","unstructured":"Sanghoon Kang , Donghyeon Han , Juhyoung Lee , Dongseok Im , Sangyeob Kim , Soyeon Kim , and Hoi-Jun Yoo . 2020 . 
7.4 GANPU: A 135TFLOPS\/W multi-DNN training processor for GANs with speculative dual-sparsity exploitation . In International Solid-State Circuits Conference (ISSCC). Sanghoon Kang, Donghyeon Han, Juhyoung Lee, Dongseok Im, Sangyeob Kim, Soyeon Kim, and Hoi-Jun Yoo. 2020. 7.4 GANPU: A 135TFLOPS\/W multi-DNN training processor for GANs with speculative dual-sparsity exploitation. In International Solid-State Circuits Conference (ISSCC)."},{"key":"e_1_3_2_1_33_1","volume-title":"Proceedings of the 39th International Conference on Computer-Aided Design (ICCAD).","author":"Kao Sheng-Chun","year":"2020","unstructured":"Sheng-Chun Kao and Tushar Krishna . 2020 . GAMMA: Automating the HW Mapping of DNN Models on Accelerators via Genetic Algorithm . In Proceedings of the 39th International Conference on Computer-Aided Design (ICCAD). Sheng-Chun Kao and Tushar Krishna. 2020. GAMMA: Automating the HW Mapping of DNN Models on Accelerators via Genetic Algorithm. In Proceedings of the 39th International Conference on Computer-Aided Design (ICCAD)."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3575693.3575747"},{"key":"e_1_3_2_1_35_1","volume-title":"International Conference on Learning Representations (ICLR).","author":"Keskar Nitish\u00a0Shirish","year":"2017","unstructured":"Nitish\u00a0Shirish Keskar , Dheevatsa Mudigere , Jorge Nocedal , Mikhail Smelyanskiy , and Ping Tak\u00a0Peter Tang . 2017 . On large-batch training for deep learning: Generalization gap and sharp minima . In International Conference on Learning Representations (ICLR). Nitish\u00a0Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak\u00a0Peter Tang. 2017. On large-batch training for deep learning: Generalization gap and sharp minima. 
In International Conference on Learning Representations (ICLR)."},{"key":"e_1_3_2_1_36_1","volume-title":"Maestro: A data-centric approach to understand reuse, performance, and hardware cost of DNN mappings","author":"Kwon Hyoukjun","year":"2020","unstructured":"Hyoukjun Kwon , Prasanth Chatarasi , Vivek Sarkar , Tushar Krishna , Michael Pellauer , and Angshuman Parashar . 2020 . Maestro: A data-centric approach to understand reuse, performance, and hardware cost of DNN mappings . In IEEE micro. Hyoukjun Kwon, Prasanth Chatarasi, Vivek Sarkar, Tushar Krishna, Michael Pellauer, and Angshuman Parashar. 2020. Maestro: A data-centric approach to understand reuse, performance, and hardware cost of DNN mappings. In IEEE micro."},{"key":"e_1_3_2_1_37_1","volume-title":"IEEE International Symposium on High-Performance Computer Architecture (HPCA).","author":"Kwon Hyoukjun","year":"2021","unstructured":"Hyoukjun Kwon , Liangzhen Lai , Michael Pellauer , Tushar Krishna , Yu-Hsin Chen , and Vikas Chandra . 2021 . Heterogeneous dataflow accelerators for multi-DNN workloads . In IEEE International Symposium on High-Performance Computer Architecture (HPCA). Hyoukjun Kwon, Liangzhen Lai, Michael Pellauer, Tushar Krishna, Yu-Hsin Chen, and Vikas Chandra. 2021. Heterogeneous dataflow accelerators for multi-DNN workloads. In IEEE International Symposium on High-Performance Computer Architecture (HPCA)."},{"key":"e_1_3_2_1_38_1","volume-title":"International Solid-State Circuits Conference (ISSCC).","author":"Lee Jinsu","year":"2019","unstructured":"Jinsu Lee , Juhyoung Lee , Donghyeon Han , Jinmook Lee , Gwangtae Park , and Hoi-Jun Yoo . 2019 . 7.7 LNPU: A 25.3 TFLOPS\/W sparse deep-neural-network learning processor with fine-grained mixed precision of FP8-FP16 . In International Solid-State Circuits Conference (ISSCC). Jinsu Lee, Juhyoung Lee, Donghyeon Han, Jinmook Lee, Gwangtae Park, and Hoi-Jun Yoo. 2019. 
7.7 LNPU: A 25.3 TFLOPS\/W sparse deep-neural-network learning processor with fine-grained mixed precision of FP8-FP16. In International Solid-State Circuits Conference (ISSCC)."},{"key":"e_1_3_2_1_39_1","volume-title":"IEEE International Symposium on High-Performance Computer Architecture (HPCA).","author":"Lee Sunho","year":"2022","unstructured":"Sunho Lee , Jungwoo Kim , Seonjin Na , Jongse Park , and Jaehyuk Huh . 2022 . TNPU: Supporting trusted execution with tree-less integrity protection for neural processing unit . In IEEE International Symposium on High-Performance Computer Architecture (HPCA). Sunho Lee, Jungwoo Kim, Seonjin Na, Jongse Park, and Jaehyuk Huh. 2022. TNPU: Supporting trusted execution with tree-less integrity protection for neural processing unit. In IEEE International Symposium on High-Performance Computer Architecture (HPCA)."},{"key":"e_1_3_2_1_40_1","volume-title":"International Conference on Parallel Processing (ICPP).","author":"Li Xiaqing","year":"2016","unstructured":"Xiaqing Li , Guangyan Zhang , H\u00a0Howie Huang , Zhufan Wang , and Weimin Zheng . 2016 . Performance analysis of GPU-based convolutional neural networks . In International Conference on Parallel Processing (ICPP). Xiaqing Li, Guangyan Zhang, H\u00a0Howie Huang, Zhufan Wang, and Weimin Zheng. 2016. Performance analysis of GPU-based convolutional neural networks. In International Conference on Parallel Processing (ICPP)."},{"key":"e_1_3_2_1_41_1","volume-title":"Federated learning in mobile edge networks: A comprehensive survey","author":"Lim Wei Yang\u00a0Bryan","unstructured":"Wei Yang\u00a0Bryan Lim , Nguyen\u00a0Cong Luong , Dinh\u00a0Thai Hoang , Yutao Jiao , Ying-Chang Liang , Qiang Yang , Dusit Niyato , and Chunyan Miao . 2020. Federated learning in mobile edge networks: A comprehensive survey . In IEEE Communications Surveys & Tutorials . 
Wei Yang\u00a0Bryan Lim, Nguyen\u00a0Cong Luong, Dinh\u00a0Thai Hoang, Yutao Jiao, Ying-Chang Liang, Qiang Yang, Dusit Niyato, and Chunyan Miao. 2020. Federated learning in mobile edge networks: A comprehensive survey. In IEEE Communications Surveys & Tutorials."},{"key":"e_1_3_2_1_42_1","first-page":"07612","article-title":"Revisiting small batch training for deep neural networks","volume":"1804","author":"Masters Dominic","year":"2018","unstructured":"Dominic Masters and Carlo Luschi . 2018 . Revisiting small batch training for deep neural networks . In ArXiv Preprint ArXiv : 1804 . 07612 . Dominic Masters and Carlo Luschi. 2018. Revisiting small batch training for deep neural networks. In ArXiv Preprint ArXiv:1804.07612.","journal-title":"ArXiv Preprint ArXiv"},{"key":"e_1_3_2_1_43_1","volume-title":"Evaluating spatial accelerator architectures with tiled matrix-matrix multiplication","author":"Moon Gordon\u00a0Euhyun","unstructured":"Gordon\u00a0Euhyun Moon , Hyoukjun Kwon , Geonhwa Jeong , Prasanth Chatarasi , Sivasankaran Rajamanickam , and Tushar Krishna . 2021. Evaluating spatial accelerator architectures with tiled matrix-matrix multiplication . In IEEE Transactions on Parallel and Distributed Systems (TPDS) . Gordon\u00a0Euhyun Moon, Hyoukjun Kwon, Geonhwa Jeong, Prasanth Chatarasi, Sivasankaran Rajamanickam, and Tushar Krishna. 2021. Evaluating spatial accelerator architectures with tiled matrix-matrix multiplication. In IEEE Transactions on Parallel and Distributed Systems (TPDS)."},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3453483.3454083"},{"key":"e_1_3_2_1_45_1","volume-title":"The design process for Google\u2019s training chips: TPUv2 and TPUv3","author":"Norrie Thomas","unstructured":"Thomas Norrie , Nishant Patil , Doe\u00a0Hyun Yoon , George Kurian , Sheng Li , James Laudon , Cliff Young , Norman Jouppi , and David Patterson . 2021. The design process for Google\u2019s training chips: TPUv2 and TPUv3 . In IEEE Micro . 
Thomas Norrie, Nishant Patil, Doe\u00a0Hyun Yoon, George Kurian, Sheng Li, James Laudon, Cliff Young, Norman Jouppi, and David Patterson. 2021. The design process for Google\u2019s training chips: TPUv2 and TPUv3. In IEEE Micro."},{"key":"e_1_3_2_1_46_1","unstructured":"NVIDIA. 2023. FasterTransformer. https:\/\/github.com\/NVIDIA\/FasterTransformer  NVIDIA. 2023. FasterTransformer. https:\/\/github.com\/NVIDIA\/FasterTransformer"},{"key":"e_1_3_2_1_47_1","volume-title":"European Conference on Computer Systems (EuroSys).","author":"Oh Hyungjun","year":"2022","unstructured":"Hyungjun Oh , Junyeol Lee , Hyeongju Kim , and Jiwon Seo . 2022 . Out-of-order backprop: An effective scheduling technique for deep learning . In European Conference on Computer Systems (EuroSys). Hyungjun Oh, Junyeol Lee, Hyeongju Kim, and Jiwon Seo. 2022. Out-of-order backprop: An effective scheduling technique for deep learning. In European Conference on Computer Systems (EuroSys)."},{"key":"e_1_3_2_1_48_1","volume-title":"Symposium on VLSI Circuits (VLSI-circuits).","author":"Oh Jinwook","year":"2020","unstructured":"Jinwook Oh , Sae\u00a0Kyu Lee , Mingu Kang , Matthew Ziegler , Joel Silberman , Ankur Agrawal , Swagath Venkataramani , Bruce Fleischer , Michael Guillorn , Jungwook Choi , 2020 . A 3.0 TFLOPS 0.62 V scalable processor core for high compute utilization AI training and inference . In Symposium on VLSI Circuits (VLSI-circuits). Jinwook Oh, Sae\u00a0Kyu Lee, Mingu Kang, Matthew Ziegler, Joel Silberman, Ankur Agrawal, Swagath Venkataramani, Bruce Fleischer, Michael Guillorn, Jungwook Choi, 2020. A 3.0 TFLOPS 0.62 V scalable processor core for high compute utilization AI training and inference. 
In Symposium on VLSI Circuits (VLSI-circuits)."},{"key":"e_1_3_2_1_49_1","volume-title":"International Symposium on High Performance Computer Architecture (HPCA).","author":"Oh Young\u00a0H","year":"2021","unstructured":"Young\u00a0 H Oh , Seonghak Kim , Yunho Jin , Sam Son , Jonghyun Bae , Jongsung Lee , Yeonhong Park , Dong\u00a0Uk Kim , Tae\u00a0Jun Ham , and Jae\u00a0 W Lee . 2021 . Layerweaver: Maximizing resource utilization of neural processing units via layer-wise scheduling . In International Symposium on High Performance Computer Architecture (HPCA). Young\u00a0H Oh, Seonghak Kim, Yunho Jin, Sam Son, Jonghyun Bae, Jongsung Lee, Yeonhong Park, Dong\u00a0Uk Kim, Tae\u00a0Jun Ham, and Jae\u00a0W Lee. 2021. Layerweaver: Maximizing resource utilization of neural processing units via layer-wise scheduling. In International Symposium on High Performance Computer Architecture (HPCA)."},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2019.00042"},{"key":"e_1_3_2_1_51_1","volume-title":"Conference on Neural Information Processing Systems (NIPS).","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke , Sam Gross , Francisco Massa , Adam Lerer , James Bradbury , Gregory Chanan , Trevor Killeen , Zeming Lin , Natalia Gimelshein , Luca Antiga , Alban Desmaison , Andreas Kopf , Edward Yang , Zachary DeVito , Martin Raison , Alykhan Tejani , Sasank Chilamkurthy , Benoit Steiner , Lu Fang , Junjie Bai , and Soumith Chintala . 2019 . PyTorch: An imperative style, high-performance deep learning library . In Conference on Neural Information Processing Systems (NIPS). Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. 
PyTorch: An imperative style, high-performance deep learning library. In Conference on Neural Information Processing Systems (NIPS)."},{"key":"e_1_3_2_1_52_1","volume-title":"International Symposium on High Performance Computer Architecture (HPCA).","author":"Qin Eric","year":"2020","unstructured":"Eric Qin , Ananda Samajdar , Hyoukjun Kwon , Vineet Nadella , Sudarshan Srinivasan , Dipankar Das , Bharat Kaul , and Tushar Krishna . 2020 . SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training . In International Symposium on High Performance Computer Architecture (HPCA). Eric Qin, Ananda Samajdar, Hyoukjun Kwon, Vineet Nadella, Sudarshan Srinivasan, Dipankar Das, Bharat Kaul, and Tushar Krishna. 2020. SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training. In International Symposium on High Performance Computer Architecture (HPCA)."},{"key":"e_1_3_2_1_53_1","volume-title":"Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Acm Sigplan Notices.","author":"Ragan-Kelley Jonathan","year":"2013","unstructured":"Jonathan Ragan-Kelley , Connelly Barnes , Andrew Adams , Sylvain Paris , Fr\u00e9do Durand , and Saman Amarasinghe . 2013 . Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Acm Sigplan Notices. Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Fr\u00e9do Durand, and Saman Amarasinghe. 2013. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Acm Sigplan Notices."},{"key":"e_1_3_2_1_54_1","volume-title":"Conference on Artificial Intelligence (AAAI).","author":"Real Esteban","year":"2019","unstructured":"Esteban Real , Alok Aggarwal , Yanping Huang , and Quoc\u00a0 V Le . 2019 . Regularized evolution for image classifier architecture search . 
In Conference on Artificial Intelligence (AAAI). Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc\u00a0V Le. 2019. Regularized evolution for image classifier architecture search. In Conference on Artificial Intelligence (AAAI)."},{"key":"e_1_3_2_1_55_1","volume-title":"International Symposium on Performance Analysis of Systems and Software (ISPASS).","author":"Samajdar Ananda","year":"2020","unstructured":"Ananda Samajdar , Jan\u00a0Moritz Joseph , Yuhao Zhu , Paul Whatmough , Matthew Mattina , and Tushar Krishna . 2020 . A systematic methodology for characterizing scalability of DNN accelerators using SCALE-Sim . In International Symposium on Performance Analysis of Systems and Software (ISPASS). Ananda Samajdar, Jan\u00a0Moritz Joseph, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. 2020. A systematic methodology for characterizing scalability of DNN accelerators using SCALE-Sim. In International Symposium on Performance Analysis of Systems and Software (ISPASS)."},{"key":"e_1_3_2_1_56_1","doi-asserted-by":"crossref","unstructured":"Wonik Seo Sanghoon Cha Yeonjae Kim Jaehyuk Huh and Jongse Park. 2021. SLO-aware inference scheduler for heterogeneous processors in edge platforms. (2021).  Wonik Seo Sanghoon Cha Yeonjae Kim Jaehyuk Huh and Jongse Park. 2021. SLO-aware inference scheduler for heterogeneous processors in edge platforms. (2021).","DOI":"10.1145\/3460352"},{"key":"e_1_3_2_1_57_1","volume-title":"Megatron-LM: Training multi-billion parameter language models using model parallelism. In arXiv preprint arXiv","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi , Mostofa Patwary , Raul Puri , Patrick LeGresley , Jared Casper , and Bryan Catanzaro . 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. In arXiv preprint arXiv : 1909 .08053. Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. 
Megatron-LM: Training multi-billion parameter language models using model parallelism. In arXiv preprint arXiv: 1909.08053."},{"key":"e_1_3_2_1_58_1","volume-title":"International Symposium on High Performance Computer Architecture (HPCA).","author":"Song Mingcong","year":"2018","unstructured":"Mingcong Song , Kan Zhong , Jiaqi Zhang , Yang Hu , Duo Liu , Weigong Zhang , Jing Wang , and Tao Li . 2018 . In-Situ AI: Towards autonomous and incremental deep learning for IoT systems . In International Symposium on High Performance Computer Architecture (HPCA). Mingcong Song, Kan Zhong, Jiaqi Zhang, Yang Hu, Duo Liu, Weigong Zhang, Jing Wang, and Tao Li. 2018. In-Situ AI: Towards autonomous and incremental deep learning for IoT systems. In International Symposium on High Performance Computer Architecture (HPCA)."},{"key":"e_1_3_2_1_59_1","volume-title":"International Symposium on Circuits and Systems (ISCAS).","author":"Vucetic Danilo","year":"2022","unstructured":"Danilo Vucetic , Mohammadreza Tayaranian , Maryam Ziaeefard , James\u00a0 J Clark , Brett\u00a0 H Meyer , and Warren\u00a0 J Gross . 2022 . Efficient fine-tuning of BERT models on the edge . In International Symposium on Circuits and Systems (ISCAS). Danilo Vucetic, Mohammadreza Tayaranian, Maryam Ziaeefard, James\u00a0J Clark, Brett\u00a0H Meyer, and Warren\u00a0J Gross. 2022. Efficient fine-tuning of BERT models on the edge. In International Symposium on Circuits and Systems (ISCAS)."},{"key":"e_1_3_2_1_60_1","volume-title":"PL-NPU: An energy-efficient edge-device DNN training processor with posit-based logarithm-domain computing","author":"Wang Yang","unstructured":"Yang Wang , Dazheng Deng , Leibo Liu , Shaojun Wei , and Shouyi Yin . 2022. PL-NPU: An energy-efficient edge-device DNN training processor with posit-based logarithm-domain computing . In IEEE Transactions on Circuits and Systems (TCAS) . Yang Wang, Dazheng Deng, Leibo Liu, Shaojun Wei, and Shouyi Yin. 2022. 
PL-NPU: An energy-efficient edge-device DNN training processor with posit-based logarithm-domain computing. In IEEE Transactions on Circuits and Systems (TCAS)."},{"key":"e_1_3_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378514"},{"key":"e_1_3_2_1_62_1","volume-title":"Parana: A parallel neural architecture considering thermal problem of 3D stacked memory","author":"Yin Shouyi","year":"2018","unstructured":"Shouyi Yin , Shibin Tang , Xinhan Lin , Peng Ouyang , Fengbin Tu , Jishen Zhao , Cong Xu , Shuangcheng Li , Yuan Xie , ShaoJun Wei , 2018 . Parana: A parallel neural architecture considering thermal problem of 3D stacked memory . In IEEE Transactions on Parallel and Distributed Systems (TPDS) . Shouyi Yin, Shibin Tang, Xinhan Lin, Peng Ouyang, Fengbin Tu, Jishen Zhao, Cong Xu, Shuangcheng Li, Yuan Xie, ShaoJun Wei, 2018. Parana: A parallel neural architecture considering thermal problem of 3D stacked memory. In IEEE Transactions on Parallel and Distributed Systems (TPDS)."},{"key":"e_1_3_2_1_63_1","volume-title":"Conference on Neural Information Processing Systems (NIPS).","author":"Zhang Zhilu","year":"2018","unstructured":"Zhilu Zhang and Mert Sabuncu . 2018 . Generalized cross entropy loss for training deep neural networks with noisy labels . In Conference on Neural Information Processing Systems (NIPS). Zhilu Zhang and Mert Sabuncu. 2018. Generalized cross entropy loss for training deep neural networks with noisy labels. 
In Conference on Neural Information Processing Systems (NIPS)."}],"event":{"name":"MICRO '23: 56th Annual IEEE\/ACM International Symposium on Microarchitecture","location":"Toronto ON Canada","acronym":"MICRO '23","sponsor":["SIGMICRO ACM Special Interest Group on Microarchitectural Research and Processing"]},"container-title":["56th Annual IEEE\/ACM International Symposium on Microarchitecture"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3613424.3614299","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3613424.3614299","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:36:30Z","timestamp":1750178190000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3613424.3614299"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,28]]},"references-count":62,"alternative-id":["10.1145\/3613424.3614299","10.1145\/3613424"],"URL":"https:\/\/doi.org\/10.1145\/3613424.3614299","relation":{},"subject":[],"published":{"date-parts":[[2023,10,28]]},"assertion":[{"value":"2023-12-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}