{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,5]],"date-time":"2026-06-05T15:30:12Z","timestamp":1780673412949,"version":"3.54.1"},"reference-count":62,"publisher":"Association for Computing Machinery (ACM)","issue":"3","funder":[{"name":"Institute of Information and communications Technology Planning and Evaluation","award":["2020-0-01305"],"award-info":[{"award-number":["2020-0-01305"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2025,9,30]]},"abstract":"<jats:p>\n            Modern deep neural networks (DNNs) are widely utilized across a broad range of domains, scaling rapidly and often comprising hundreds of diverse layers with varying types and configurations. To accelerate DNN execution, specialized hardware solutions, known as neural processing units (NPUs), have been developed. However, this heterogeneity of layers in a DNN model may cause performance degradation on NPUs. For example, while a layer\u2019s execution or dataflow is generally associated with a specific data access order, the data layout in on-chip memory may not be well aligned with it, introducing bubble cycles for layout reordering. Given the hundreds of diverse layers in DNNs, this\n            <jats:italic toggle=\"yes\">layout reordering<\/jats:italic>\n            overhead presents a new challenge for achieving efficient end-to-end DNN inference on NPUs.\n          <\/jats:p>\n          <jats:p>\n            To address this problem, this article introduces HopScotch, a holistic approach to data layout-aware mapping of DNNs on NPUs. First, HopScotch adopts a routing interconnect between the on-chip memory and the systolic array utilizing three-input multiplexers, paired with an on-chip programmable vector processor to manage arbitrary data layout reordering at runtime. Additionally, it introduces a tailored data layout to accommodate a variety of convolutional configurations within the proposed microarchitecture. Second, HopScotch presents a novel layout mapping solver that employs a top-k selection strategy based on a beam search algorithm, facilitating the efficient exploration of the vast layout mapping space at compile time. Third, the proposed layout mapping solver is integrated into the HopScotch mapping framework (HMF) to explore the layout mapping space and evaluate the resulting performance. Experiments with popular DNN models show that HopScotch reduces layout reordering costs by up to 98.2% and 90.3%, resulting in speedups of 2.62\u00d7 and 1.64\u00d7 in end-to-end latency, compared to XLA and\n            <jats:italic toggle=\"yes\">\n              GCD\n              <jats:sup>2<\/jats:sup>\n            <\/jats:italic>\n            , respectively.\n          <\/jats:p>","DOI":"10.1145\/3711821","type":"journal-article","created":{"date-parts":[[2025,6,25]],"date-time":"2025-06-25T07:39:23Z","timestamp":1750837163000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["HopScotch: A Holistic Approach to Data Layout-Aware Mapping on NPUs for High-Performance DNN Inference"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-3639-9223","authenticated-orcid":false,"given":"Suhong","family":"Lee","sequence":"first","affiliation":[{"name":"Electrical and Computer Engineering, Seoul National University","place":["Gwanak-gu, Korea (the Republic of)"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4855-3600","authenticated-orcid":false,"given":"Boyeal","family":"Kim","sequence":"additional","affiliation":[{"name":"Seoul National University","place":["Gwanak-gu, Korea (the Republic of)"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yongseok","family":"Choi","sequence":"additional","affiliation":[{"name":"SAPEON Korea","place":["Korea (the Republic of)"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6811-9647","authenticated-orcid":false,"given":"Hyuk-Jae","family":"Lee","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering, Seoul National University","place":["Gwanak-gu, Korea (the Republic of)"]}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2025,9,19]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"Mart\u00edn Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen Craig Citro Greg S. Corrado Andy Davis Jeffrey Dean Matthieu Devin et al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Retrieved from http:\/\/tensorflow.org\/Software available from tensorflow.org."},{"key":"e_1_3_2_3_2","volume-title":"Compilers Principles, Techniques and Tools","author":"Alfred V. Aho","year":"2007","unstructured":"V. Aho Alfred, S. Lam Monica, and D. Ullman Jeffrey. 2007. Compilers Principles, Techniques and Tools. Pearson Education."},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCD50377.2020.00036"},{"key":"e_1_3_2_5_2","unstructured":"AWS. 2024. Inferentia2 Architecture. Retrieved from https:\/\/awsdocs-neuron.readthedocs-hosted.com\/en\/latest\/index.html"},{"key":"e_1_3_2_6_2","unstructured":"Jimmy Lei Ba Jamie Ryan Kiros and Geoffrey E Hinton. 2016. Layer normalization. arXiv:1607.06450. Retrieved from https:\/\/arxiv.org\/abs\/1607.06450"},{"key":"e_1_3_2_7_2","article-title":"ONNX: Open Neural Network Exchange [Online]","author":"Bai Junjie","year":"2024","unstructured":"Junjie Bai, Fang Lu, and Ke Zhang. 2024. ONNX: Open Neural Network Exchange [Online]. Available: Retrieved from https:\/\/github.com\/onnx\/onnx","journal-title":"Available: Retrieved from https:\/\/github.com\/onnx\/onnx"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/3485137"},{"key":"e_1_3_2_9_2","first-page":"579","volume-title":"Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (Carlsbad, CA, USA). USENIX Association, USA, 579\u2013594."},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/JETCAS.2019.2910232"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.5555\/2821589"},{"key":"e_1_3_2_12_2","unstructured":"Jacob Devlin. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Retrieved from https:\/\/arxiv.org\/abs\/1810.04805"},{"key":"e_1_3_2_13_2","first-page":"11963","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Ding Xiaohan","year":"2022","unstructured":"Xiaohan Ding, Xiangyu Zhang, Jungong Han, and Guiguang Ding. 2022. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 11963\u201311975."},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01352"},{"key":"e_1_3_2_15_2","article-title":"RepVGG: Making VGG-style ConvNets Great Again [Online]","author":"Ding Xiaohan","year":"2024","unstructured":"Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. 2024. RepVGG: Making VGG-style ConvNets Great Again [Online]. Available: Retrieved from https:\/\/github.com\/DingXiaoH\/RepVGG","journal-title":"Available: Retrieved from https:\/\/github.com\/DingXiaoH\/RepVGG"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/2872887.2750389"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/3579371.3589348"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640365"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3582016.3582018"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_21_2","unstructured":"Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. arXiv:1606.08415. Retrieved from http:\/\/arxiv.org\/abs\/1606.08415"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/tpami.2019.2913372"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2022.3198246"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00010"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/JETCAS.2024.3421553"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2025.3556956"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSI.2023.3335949"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3065386"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358252"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/CGO57630.2024.10444871"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3470496.3533042"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.5555\/3358807.3358895"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01167"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00062"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2023.3337208"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2020.2975185"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC53511.2021.00028"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSI.2022.3153288"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2019.2905242"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO56248.2022.00044"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2021.3058217"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2019.00042"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","unstructured":"Chunmyung Park Jicheon Kim Eunjae Hyun Xuan Truong Nguyen and Hyuk-Jae Lee. 2025. Leveraging hot data in a multi-tenant accelerator for effective shared memory management. In Proceedings of the 2025 Design Automation and Test in Europe Conference. 7 pages. DOI:DOI:10.23919\/DATE64628.2025.10992845","DOI":"10.23919\/DATE64628.2025.10992845"},{"key":"e_1_3_2_44_2","volume-title":"PyTorch: An Imperative Style, High-performance Deep Learning Library","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An Imperative Style, High-performance Deep Learning Library. Curran Associates Inc., NY, USA."},{"issue":"8","key":"e_1_3_2_45_2","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.","journal-title":"OpenAI Blog"},{"key":"e_1_3_2_46_2","article-title":"Xla: Compiling machine learning for peak performance","author":"Sabne Amit","year":"2020","unstructured":"Amit Sabne. 2020. Xla: Compiling machine learning for peak performance. Google Res (2020).","journal-title":"Google Res"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS48437.2020.00016"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00474"},{"key":"e_1_3_2_49_2","article-title":"Product of SAPEON - X330 [Online]","year":"2023","unstructured":"SAPEON. 2023. Product of SAPEON - X330 [Online]. Available: Retrieved from https:\/\/www.sapeon.com\/products\/sapeon-x330","journal-title":"Available: Retrieved from https:\/\/www.sapeon.com\/products\/sapeon-x330"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1145\/3620666.3651324"},{"key":"e_1_3_2_51_2","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. arxiv:1409.1556. Retrieved from https:\/\/arxiv.org\/abs\/1409.1556"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/CICC53496.2022.9772810"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2020.2975764"},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00293"},{"key":"e_1_3_2_55_2","series-title":"Proceedings of Machine Learning Research","first-page":"6105","volume-title":"Proceedings of the 36th International Conference on Machine Learning","volume":"97","author":"Tan Mingxing","year":"2019","unstructured":"Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning(Proceedings of Machine Learning Research, Vol. 97).Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.), PMLR, 6105\u20136114."},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA59077.2024.00024"},{"key":"e_1_3_2_57_2","unstructured":"Apache TVM. 2023. Apache TVM. Retrieved from https:\/\/tvm.apache.org\/docs\/"},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295349"},{"key":"e_1_3_2_59_2","doi-asserted-by":"crossref","unstructured":"Ziheng Wang Jeremy Wohlwend and Tao Lei. 2019. Structured pruning of large language models. arXiv:1910.04732. Retrieved from https:\/\/arxiv.org\/abs\/1910.04732","DOI":"10.18653\/v1\/2020.emnlp-main.496"},{"key":"e_1_3_2_60_2","doi-asserted-by":"crossref","unstructured":"Thomas Wolf Lysandre Debut Victor Sanh Julien Chaumond Clement Delangue Anthony Moi Pierric Cistac Tim Rault R\u00e9mi Louf Morgan Funtowicz et\u00a0al. 2019. Huggingface\u2019s transformers: State-of-the-art natural language processing. arXiv:1910.03771. Retrieved from https:\/\/arxiv.org\/abs\/1910.03771","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1145\/3424669"},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/3460776"},{"key":"e_1_3_2_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASID.2018.8693202"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3711821","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,20]],"date-time":"2025-09-20T00:48:05Z","timestamp":1758329285000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3711821"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,19]]},"references-count":62,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,9,30]]}},"alternative-id":["10.1145\/3711821"],"URL":"https:\/\/doi.org\/10.1145\/3711821","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,19]]},"assertion":[{"value":"2024-12-29","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-06-09","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-19","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}