{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,20]],"date-time":"2026-05-20T16:00:34Z","timestamp":1779292834277,"version":"3.51.4"},"reference-count":120,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2023,12,14]],"date-time":"2023-12-14T00:00:00Z","timestamp":1702512000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100009023","name":"PRESTO","doi-asserted-by":"crossref","award":["JPMJPR20MA"],"award-info":[{"award-number":["JPMJPR20MA"]}],"id":[{"id":"10.13039\/501100009023","id-type":"DOI","asserted-by":"crossref"}]},{"name":"New Energy and Industrial Technology Development Organization"},{"name":"AIST\/TokyoTech Real-world Big-Data Computation Open Innovation Laboratory"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2023,12,31]]},"abstract":"<jats:p>Over the last three decades, innovations in the memory subsystem were primarily targeted at overcoming the data movement bottleneck. In this paper, we focus on a specific market trend in memory technology: 3D-stacked memory and caches. We investigate the impact of extending the on-chip memory capabilities in future HPC-focused processors, particularly by 3D-stacked SRAM. First, we propose a method oblivious to the memory subsystem to gauge the upper-bound in performance improvements when data movement costs are eliminated. Then, using the gem5 simulator, we model two variants of a hypothetical LARge Cache processor (LARC), fabricated in 1.5\u00a0nm and enriched with high-capacity 3D-stacked cache. 
With a volume of experiments involving a broad set of proxy-applications and benchmarks, we aim to reveal how HPC CPU performance will evolve, and conclude an average boost of 9.56\u00d7 for cache-sensitive HPC applications, on a per-chip basis. Additionally, we exhaustively document our methodological exploration to motivate HPC centers to drive their own technological agenda through enhanced co-design.<\/jats:p>","DOI":"10.1145\/3629520","type":"journal-article","created":{"date-parts":[[2023,10,25]],"date-time":"2023-10-25T21:37:02Z","timestamp":1698269822000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5343-414X","authenticated-orcid":false,"given":"Jens","family":"Domke","sequence":"first","affiliation":[{"name":"RIKEN Center for Computational Science, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7494-5048","authenticated-orcid":false,"given":"Emil","family":"Vatai","sequence":"additional","affiliation":[{"name":"RIKEN Center for Computational Science, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-8585-6031","authenticated-orcid":false,"given":"Balazs","family":"Gerofi","sequence":"additional","affiliation":[{"name":"Intel Corporation, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5787-0363","authenticated-orcid":false,"given":"Yuetsu","family":"Kodama","sequence":"additional","affiliation":[{"name":"RIKEN Center for Computational Science, 
Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7165-2095","authenticated-orcid":false,"given":"Mohamed","family":"Wahib","sequence":"additional","affiliation":[{"name":"RIKEN Center for Computational Science, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5452-6794","authenticated-orcid":false,"given":"Artur","family":"Podobas","sequence":"additional","affiliation":[{"name":"KTH Royal Institute of Technology, Sweden"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2908-993X","authenticated-orcid":false,"given":"Sparsh","family":"Mittal","sequence":"additional","affiliation":[{"name":"Indian Institute of Technology, Roorkee, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7583-6609","authenticated-orcid":false,"given":"Miquel","family":"Peric\u00e0s","sequence":"additional","affiliation":[{"name":"Chalmers University of Technology, Sweden"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2452-1551","authenticated-orcid":false,"given":"Lingqi","family":"Zhang","sequence":"additional","affiliation":[{"name":"Tokyo Institute of Technology, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1244-3151","authenticated-orcid":false,"given":"Peng","family":"Chen","sequence":"additional","affiliation":[{"name":"National Institute of Advanced Industrial Science and Technology, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4575-7213","authenticated-orcid":false,"given":"Aleksandr","family":"Drozd","sequence":"additional","affiliation":[{"name":"RIKEN Center for Computational Science, 
Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1910-8532","authenticated-orcid":false,"given":"Satoshi","family":"Matsuoka","sequence":"additional","affiliation":[{"name":"RIKEN Center for Computational Science, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,12,14]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"2023. Heterogeneous Integration Roadmap 2023 Edition - Chapter 20: Thermal . Technical Report. IEEE Electronics Packaging Society. 1\u201339. https:\/\/eps.ieee.org\/images\/files\/HIR_2023\/ch20_thermalfinal.pdf"},{"key":"e_1_3_2_3_2","unstructured":"Andreas Abel and Jan Reineke. 2021. A Parametric Microarchitecture Model for Accurate Basic Block Throughput Prediction on Recent Intel CPUs. https:\/\/arxiv.org\/pdf\/2107.14210.pdf"},{"key":"e_1_3_2_4_2","unstructured":"Advanced Micro Devices Inc. 2021. AMD Instinct\u2122 MI250X Accelerator. https:\/\/www.amd.com\/en\/products\/server-accelerators\/instinct-mi250x"},{"key":"e_1_3_2_5_2","unstructured":"ADVENTURE Project. 2019. Development of Computational Mechanics System for Large Scale Analysis and Design \u2014 ADVENTURE Project. https:\/\/adventure.sys.t.u-tokyo.ac.jp\/"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.2002.807414"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2917698"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/PMBS49563.2019.00012"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.6512"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1021\/ct400203a"},{"key":"e_1_3_2_11_2","unstructured":"Argonne National Laboratory. 2022. NEK5000. 
http:\/\/nek5000.mcs.anl.gov"},{"key":"e_1_3_2_12_2","doi-asserted-by":"crossref","first-page":"158","DOI":"10.1145\/125826.125925","volume-title":"Proceedings of the 1991 ACM\/IEEE Conference on Supercomputing (SC\u201991)","author":"Bailey D. H.","year":"1991","unstructured":"D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS parallel benchmarks \u2013 summary and preliminary results. In Proceedings of the 1991 ACM\/IEEE Conference on Supercomputing (SC\u201991). ACM, New York, NY, USA, 158\u2013165. 10.1145\/125826.125925"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2014.55"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/2024716.2024718"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2006.18"},{"key":"e_1_3_2_16_2","doi-asserted-by":"crossref","first-page":"125","DOI":"10.1007\/978-1-5041-2940-4_9","volume-title":"Proceedings of the IFIP TC2\/WG2.5 Working Conference on Quality of Numerical Software: Assessment and Enhancement","author":"Boisvert Ronald F.","year":"1997","unstructured":"Ronald F. Boisvert, Roldan Pozo, Karin Remington, Richard F. Barrett, and Jack J. Dongarra. 1997. Matrix market: A web resource for test matrix collections. In Proceedings of the IFIP TC2\/WG2.5 Working Conference on Quality of Numerical Software: Assessment and Enhancement. Chapman & Hall, Ltd., London, UK, 125\u2013137. http:\/\/dl.acm.org\/citation.cfm?id=265834.265854"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.22323\/1.164.0188"},{"key":"e_1_3_2_18_2","unstructured":"Gavin Bonshor. 2022. AMD Releases Milan-X CPUs With 3D V-Cache: EPYC 7003 Up to 64 Cores and 768 MB L3 Cache. 
https:\/\/www.anandtech.com\/show\/17323\/amd-releases-milan-x-cpus-with-3d-vcache-epyc-7003"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.sysarc.2019.01.003"},{"key":"e_1_3_2_20_2","first-page":"95","volume-title":"Proceedings of the Twenty-Third Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA\u201911)","author":"Cabezas Victoria Caparr\u00f3s","year":"2011","unstructured":"Victoria Caparr\u00f3s Cabezas and Phillip Stanley-Marbell. 2011. Parallelism and data movement characterization of contemporary application classes. In Proceedings of the Twenty-Third Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA\u201911). Association for Computing Machinery, New York, NY, USA, 95\u2013104. 10.1145\/1989493.1989506"},{"issue":"4","key":"e_1_3_2_21_2","doi-asserted-by":"crossref","first-page":"289","DOI":"10.1109\/TSUSC.2018.2823542","article-title":"Analysing the role of last level caches in controlling chip temperature","volume":"3","author":"Chakraborty Shounak","year":"2018","unstructured":"Shounak Chakraborty and Hemangee K. Kapoor. 2018. Analysing the role of last level caches in controlling chip temperature. IEEE Transactions on Sustainable Computing 3, 4 (2018), 289\u2013305.","journal-title":"IEEE Transactions on Sustainable Computing"},{"key":"e_1_3_2_22_2","unstructured":"Cheese. 2022. AMD\u2019s V-Cache Tested: The Latency Teaser. https:\/\/chipsandcheese.com\/2022\/01\/14\/amds-v-cache-tested-the-latency-teaser\/"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC47752.2019.9042166"},{"key":"e_1_3_2_24_2","doi-asserted-by":"crossref","first-page":"198","DOI":"10.1145\/2749469.2750387","volume-title":"Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA\u201915)","author":"Chou Chiachen","year":"2015","unstructured":"Chiachen Chou, Aamer Jaleel, and Moinuddin K. Qureshi. 2015. 
BEAR: Techniques for mitigating bandwidth bloat in gigascale DRAM caches. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA\u201915). Association for Computing Machinery, New York, NY, USA, 198\u2013210. 10.1145\/2749469.2750387"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.2172\/1311761"},{"key":"e_1_3_2_26_2","doi-asserted-by":"crossref","first-page":"23","DOI":"10.1145\/3323439.3323988","volume-title":"Proceedings of the 22nd International Workshop on Software and Compilers for Embedded Systems (SCOPES\u201919)","author":"Corda Stefano","year":"2019","unstructured":"Stefano Corda, Gagandeep Singh, Ahsan Javed Awan, Roel Jordans, and Henk Corporaal. 2019. Memory and parallelism analysis using a platform-independent approach. In Proceedings of the 22nd International Workshop on Software and Compilers for Embedded Systems (SCOPES\u201919). Association for Computing Machinery, New York, NY, USA, 23\u201326. 10.1145\/3323439.3323988"},{"key":"e_1_3_2_27_2","unstructured":"Ian Cutress. 2021. AMD Demonstrates Stacked 3D V-Cache Technology: 192 MB at 2 TB\/Sec. https:\/\/www.anandtech.com\/show\/16725\/amd-demonstrates-stacked-vcache-technology-2-tbsec-for-15-gaming"},{"key":"e_1_3_2_28_2","unstructured":"Ian Cutress. 2021. Did IBM Just Preview the Future of Caches? https:\/\/www.anandtech.com\/show\/16924\/did-ibm-just-preview-the-future-of-caches"},{"key":"e_1_3_2_29_2","unstructured":"William James Dally, Carl Thomas Gray, Stephen W. Keckler, and James Michael O\u2019Connor. [n. d.]. Memory Stacked on Processor for High Bandwidth. https:\/\/patents.justia.com\/patent\/20230275068"},{"key":"e_1_3_2_30_2","doi-asserted-by":"crossref","first-page":"489","DOI":"10.1007\/978-3-319-46079-6_34","volume-title":"High Performance Computing","author":"Deakin Tom","year":"2016","unstructured":"Tom Deakin, James Price, Matt Martineau, and Simon McIntosh-Smith. 2016. 
GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models. In High Performance Computing, Michela Taufer, Bernd Mohr, and Julian M. Kunkel (Eds.). Springer, Cham, 489\u2013507."},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.1974.1050511"},{"key":"e_1_3_2_32_2","first-page":"13","volume-title":"Proceedings of the 1st Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS\u201916)","author":"Dickson James","year":"2016","unstructured":"James Dickson, Steven Wright, Satheesh Maheswaran, Andy Herdmant, Mark C. Miller, and Stephen Jarvis. 2016. Replicating HPC I\/O Workloads with proxy applications. In Proceedings of the 1st Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS\u201916). IEEE Press, Piscataway, NJ, USA, 13\u201318. 10.1109\/PDSW-DISCS.2016.6"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1137\/120864672"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2019.00019"},{"key":"e_1_3_2_35_2","unstructured":"Jens Domke and Emil Vatai. 2021. Matrix Engine Study. https:\/\/gitlab.com\/domke\/MEstudy"},{"key":"e_1_3_2_36_2","first-page":"1056","volume-title":"2021 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021, Portland, Oregon, USA, May 17\u201321, 2021","author":"Domke Jens","year":"2021","unstructured":"Jens Domke, Emil Vatai, Aleksandr Drozd, Chen Peng, Yosuke Oyama, Lingqi Zhang, Shweta Salaria, Daichi Mukunoki, Artur Podobas, Mohamed Wahib, and Satoshi Matsuoka. 2021. Matrix engines for high performance computing: A paragon of performance or grasping at straws?. In 2021 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021, Portland, Oregon, USA, May 17\u201321, 2021. 
IEEE Press, Portland, Oregon, USA, 1056\u20131065."},{"key":"e_1_3_2_37_2","first-page":"456","volume-title":"Proceedings of the 1st International Conference on Supercomputing","author":"Dongarra Jack","year":"1988","unstructured":"Jack Dongarra. 1988. The LINPACK benchmark: An explanation. In Proceedings of the 1st International Conference on Supercomputing. Springer-Verlag, London, UK, 456\u2013474. http:\/\/dl.acm.org\/citation.cfm?id=647970.742568"},{"key":"e_1_3_2_38_2","volume-title":"HPCG Benchmark: A New Metric for Ranking High Performance Computing Systems","author":"Dongarra Jack","year":"2015","unstructured":"Jack Dongarra, Michael Heroux, and Piotr Luszczek. 2015. HPCG Benchmark: A New Metric for Ranking High Performance Computing Systems. Technical Report ut-eecs-15-736. University of Tennessee. https:\/\/library.eecs.utk.edu\/pub\/594"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1093\/nsr\/nwv084"},{"key":"e_1_3_2_40_2","doi-asserted-by":"crossref","first-page":"365","DOI":"10.1145\/2000064.2000108","volume-title":"Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA\u201911)","author":"Esmaeilzadeh Hadi","year":"2011","unstructured":"Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA\u201911). Association for Computing Machinery, New York, NY, USA, 365\u2013376. 10.1145\/2000064.2000108"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2022.3152788"},{"key":"e_1_3_2_42_2","unstructured":"Exascale Computing Project. 2018. ECP Proxy Apps Suite. 
https:\/\/proxyapps.exascaleproject.org\/ecp-proxy-apps-suite\/"},{"key":"e_1_3_2_43_2","doi-asserted-by":"crossref","first-page":"144","DOI":"10.1109\/ISSCC19947.2020.9062957","volume-title":"2020 IEEE International Solid- State Circuits Conference - (ISSCC)","author":"Gomes Wilfred","year":"2020","unstructured":"Wilfred Gomes, Sanjeev Khushu, Doug B. Ingerly, Patrick N. Stover, Nasirul I. Chowdhury, Frank O\u2019Mahony, Ajay Balankutty, Noam Dolev, Martin G. Dixon, Lei Jiang, Surya Prekke, Biswajit Patra, Pavel V. Rott, and Rajesh Kumar. 2020. 8.1 Lakefield and mobility compute: A 3D stacked 10nm and 22FFL hybrid processor system in 12\u00d712mm2, 1mm package-on-package. In 2020 IEEE International Solid- State Circuits Conference - (ISSCC). IEEE Press, San Francisco, CA, USA, 144\u2013146. 10.1109\/ISSCC19947.2020.9062957"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1983.1676201"},{"key":"e_1_3_2_45_2","first-page":"659","volume-title":"Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE\u201915)","author":"Goud A. Arun","year":"2015","unstructured":"A. Arun Goud, Rangharajan Venkatesan, Anand Raghunathan, and Kaushik Roy. 2015. Asymmetric underlapped FinFET based Robust SRAM design at 7nm node. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE\u201915). EDA Consortium, San Jose, CA, USA, 659\u2013664."},{"key":"e_1_3_2_46_2","first-page":"526","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201916)","author":"Grass Thomas","year":"2016","unstructured":"Thomas Grass, C\u00e9sar Allande, Adri\u00e0 Armejach, Alejandro Rico, Eduard Ayguad\u00e9, Jesus Labarta, Mateo Valero, Marc Casas, and Miquel Moreto. 2016. MUSA: A multi-level simulation approach for next-generation HPC machines. 
In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201916). IEEE Press, Salt Lake City, UT, USA, 526\u2013537. 10.1109\/SC.2016.44"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.11188\/seisankenkyu.58.11"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/3015569"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2020.3029615"},{"key":"e_1_3_2_50_2","unstructured":"Nicole Hemsoth. 2018. A Rogues Gallery of Post-Moore\u2019s Law Options. https:\/\/www.nextplatform.com\/2018\/08\/27\/a-rogues-gallery-of-post-moores-law-options\/"},{"key":"e_1_3_2_51_2","volume-title":"Improving Performance via Mini-applications","author":"Heroux Michael A.","year":"2009","unstructured":"Michael A. Heroux, Douglas W. Doerfler, Paul S. Crozier, James M. Willenbring, H. Carter Edwards, Alan Williams, Mahesh Rajan, Eric R. Keiter, Heidi K. Thornquist, and Robert W. Numrich. 2009. Improving Performance via Mini-applications. Technical Report SAND2009-5574. Sandia National Laboratories."},{"key":"e_1_3_2_52_2","unstructured":"Joel Hruska. 2012. The Death of CPU Scaling: From One Core to Many \u2013 and Why We\u2019re Still Stuck. https:\/\/www.extremetech.com\/computing\/116561-the-death-of-cpu-scaling-from-one-core-to-many-and-why-were-still-stuck"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2018.011441561"},{"key":"e_1_3_2_54_2","first-page":"64","volume-title":"International Roadmap for Devices and Systems (IRDS\u2122) 2021 Edition \u2013 Executive Summary","author":"IRDS\u2122 IEEE","year":"2021","unstructured":"IEEE IRDS\u2122. 2021. International Roadmap for Devices and Systems (IRDS\u2122) 2021 Edition \u2013 Executive Summary. IEEE IRDS\u2122 Roadmap. IEEE. 64 pages. 
https:\/\/irds.ieee.org\/images\/files\/pdf\/2021\/2021IRDS_ES.pdf"},{"key":"e_1_3_2_55_2","first-page":"23","volume-title":"International Roadmap for Devices and Systems (IRDS\u2122) 2021 Edition \u2013 Systems and Architectures","author":"IRDS\u2122 IEEE","year":"2021","unstructured":"IEEE IRDS\u2122. 2021. International Roadmap for Devices and Systems (IRDS\u2122) 2021 Edition \u2013 Systems and Architectures. IEEE IRDS\u2122 Roadmap. IEEE. 23 pages. https:\/\/irds.ieee.org\/images\/files\/pdf\/2021\/2021IRDS_SA.pdf"},{"key":"e_1_3_2_56_2","unstructured":"Intel Corporation. 2012. Intel\u00ae Architecture Code Analyzer \u2013 User\u2019s Guide. https:\/\/www.intel.com\/content\/dam\/develop\/external\/us\/en\/documents\/intel-architecture-code-analyzer-2-0-users-guide-157548.pdf"},{"key":"e_1_3_2_57_2","unstructured":"Intel Corporation. 2020. Dynamic Control- Flow Graph (DCFG) and DCFG-Trace Format Specifications \u2013 For Format Version 1.00. https:\/\/www.intel.com\/content\/dam\/develop\/external\/us\/en\/documents\/dcfg-format-548994.pdf"},{"key":"e_1_3_2_58_2","unstructured":"Intel Corporation. 2021. Intel\u00ae Software Development Emulator. https:\/\/www.intel.com\/content\/www\/us\/en\/developer\/articles\/tool\/software-development-emulator.html"},{"key":"e_1_3_2_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2021.3114903"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/HCS52781.2021.9567422"},{"key":"e_1_3_2_61_2","first-page":"26","volume-title":"The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance","author":"Jin H.","year":"1999","unstructured":"H. Jin, M. Frumkin, and J. Yan. 1999. The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance. Technical Report NAS-99-011. NASA Ames Research Center. 26 pages. 
https:\/\/www.nas.nasa.gov\/assets\/pdf\/techreports\/1999\/nas-99-011.pdf"},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1002\/wcms.1220"},{"key":"e_1_3_2_63_2","doi-asserted-by":"crossref","first-page":"142","DOI":"10.1145\/3368474.3368483","volume-title":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (HPCAsia2020)","author":"Kodama Yuetsu","year":"2020","unstructured":"Yuetsu Kodama, Tetsuya Odajima, Akira Asato, and Mitsuhisa Sato. 2020. Accuracy improvement of memory system simulation for modern shared memory processor. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (HPCAsia2020). Association for Computing Machinery, New York, NY, USA, 142\u2013149. 10.1145\/3368474.3368483"},{"key":"e_1_3_2_64_2","first-page":"315","volume-title":"Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA\u201918)","author":"Korgaonkar Kunal","year":"2018","unstructured":"Kunal Korgaonkar, Ishwar Bhati, Huichu Liu, Jayesh Gaur, Sasikanth Manipatruni, Sreenivas Subramoney, Tanay Karnik, Steven Swanson, Ian Young, and Hong Wang. 2018. Density tradeoffs of non-volatile memory as a replacement for SRAM based last level cache. In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA\u201918). IEEE Press, Los Angeles, CA, USA, 315\u2013327. 10.1109\/ISCA.2018.00035"},{"key":"e_1_3_2_65_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-981-16-1376-0_7"},{"key":"e_1_3_2_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/PMBS.2018.8641578"},{"key":"e_1_3_2_67_2","unstructured":"LLVM Project. 2022. Llvm-Mca - LLVM Machine Code Analyzer. 
https:\/\/llvm.org\/docs\/CommandGuide\/llvm-mca.html"},{"key":"e_1_3_2_68_2","first-page":"454","volume-title":"Proceedings of the 44th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO-44)","author":"Loh Gabriel H.","year":"2011","unstructured":"Gabriel H. Loh and Mark D. Hill. 2011. Efficiently enabling conventional block sizes for very large die-stacked DRAM caches. In Proceedings of the 44th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO-44). Association for Computing Machinery, New York, NY, USA, 454\u2013464. 10.1145\/2155620.2155673"},{"key":"e_1_3_2_69_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2007.59"},{"key":"e_1_3_2_70_2","first-page":"29:1\u201329:16","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201921)","author":"Ltaief Hatem","year":"2021","unstructured":"Hatem Ltaief, Jesse Cranney, Damien Gratadour, Yuxi Hong, Laurent Gatineau, and David E. Keyes. 2021. Meeting the real-time challenges of ground-based telescopes using low-rank matrix computations. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201921). ACM, New York, NY, USA, 29:1\u201329:16. 10.1145\/3458817.3476225"},{"issue":"19","key":"e_1_3_2_71_2","first-page":"1","article-title":"Memory bandwidth and machine balance in current high performance computers","volume":"2","author":"McCalpin J. D.","year":"1995","unstructured":"J. D. McCalpin. 1995. Memory bandwidth and machine balance in current high performance computers. IEEE Technical Committee on Computer Architecture (TCCA) Newsletter 2, 19\u201325 (Dec. 
1995), 1\u20137.","journal-title":"IEEE Technical Committee on Computer Architecture (TCCA) Newsletter"},{"key":"e_1_3_2_72_2","doi-asserted-by":"crossref","first-page":"162","DOI":"10.1145\/977091.977115","volume-title":"Proceedings of the 1st Conference on Computing Frontiers (CF\u201904)","author":"McKee Sally A.","year":"2004","unstructured":"Sally A. McKee. 2004. Reflections on the memory wall. In Proceedings of the 1st Conference on Computing Frontiers (CF\u201904). Association for Computing Machinery, New York, NY, USA, 162. 10.1145\/977091.977115"},{"key":"e_1_3_2_73_2","series-title":"Proceedings of the 36th International Conference on Machine Learning","first-page":"4505","volume":"97","author":"Mendis Charith","year":"2019","unstructured":"Charith Mendis, Alex Renda, Dr. Saman Amarasinghe, and Michael Carbin. 2019. Ithemal: Accurate, portable and fast basic block throughput estimation using deep neural networks. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, Long Beach, California, USA, 4505\u20134515. https:\/\/proceedings.mlr.press\/v97\/mendis19a.html"},{"key":"e_1_3_2_74_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cpc.2018.08.014"},{"key":"e_1_3_2_75_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2015.2461155"},{"key":"e_1_3_2_76_2","volume-title":"Co-Design for Molecular Dynamics: An Exascale Proxy Application","author":"Mohd-Yusof Jamaludin","year":"2013","unstructured":"Jamaludin Mohd-Yusof, Sriram Swaminarayan, and Timothy C. Germann. 2013. Co-Design for Molecular Dynamics: An Exascale Proxy Application. Technical Report LA-UR 13-20839. Los Alamos National Laboratory. 
http:\/\/www.lanl.gov\/orgs\/adtsc\/publications\/science_highlights_2013\/docs\/Pg88_89.pdf"},{"key":"e_1_3_2_77_2","first-page":"11","article-title":"Progress in digital integrated electronics","volume":"21","author":"Moore Gordon E.","year":"1975","unstructured":"Gordon E. Moore. 1975. Progress in digital integrated electronics. International Electron Devices Meeting, IEEE 21 (1975), 11\u201313.","journal-title":"International Electron Devices Meeting, IEEE"},{"key":"e_1_3_2_78_2","unstructured":"Timothy P. Morgan. 2022. \u201cMilan-X\u201d 3D Vertical Cache Yields Epyc HPC Bang for the Buck Boost. https:\/\/www.nextplatform.com\/2022\/03\/21\/milan-x-3d-vertical-cache-yields-epyc-hpc-bang-for-the-buck-boost\/"},{"key":"e_1_3_2_79_2","doi-asserted-by":"publisher","DOI":"10.1002\/qua.24860"},{"key":"e_1_3_2_80_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2010.41"},{"key":"e_1_3_2_81_2","first-page":"96","volume-title":"Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA\u201918)","author":"Nori Anant Vithal","year":"2018","unstructured":"Anant Vithal Nori, Jayesh Gaur, Siddharth Rai, Sreenivas Subramoney, and Hong Wang. 2018. Criticality aware tiered cache hierarchy: A fundamental relook at multi-level cache hierarchies. In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA\u201918). IEEE Press, Los Angeles, CA, USA, 96\u2013109. 10.1109\/ISCA.2018.00019"},{"key":"e_1_3_2_82_2","unstructured":"NVIDIA Corporation. 2022. NVIDIA H100 Tensor Core GPU. https:\/\/www.nvidia.com\/en-us\/data-center\/h100\/"},{"key":"e_1_3_2_83_2","first-page":"9","volume-title":"Supercomputer Fugaku CPU A64FX Realizing High Performance, High-Density Packaging, and Low Power Consumption","author":"Okazaki Ryohei","year":"2020","unstructured":"Ryohei Okazaki, Takekazu Tabata, Sota Sakashita, Kenichi Kitamura, Noriko Takagi, Hideki Sakata, Takeshi Ishibashi, Takeo Nakamura, and Yuichiro Ajima. 2020. 
Supercomputer Fugaku CPU A64FX Realizing High Performance, High-Density Packaging, and Low Power Consumption. Fujitsu Technical Review. Fujitsu Limited. 9 pages. https:\/\/www.fujitsu.com\/global\/documents\/about\/resources\/publications\/technicalreview\/2020-03\/article03.pdf"},{"key":"e_1_3_2_84_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2021.3110993"},{"key":"e_1_3_2_85_2","unstructured":"Kenji Ono Masako Iwata Tsuyoshi Tamaki Yasuhiro Kawashima Kei Akasaka Soichiro Suzuki Junya Onishi Ken Uzawa Kazuhiro Hamaguchi Yohei Miyazaki and Masashi Imano. 2016. FFV-C Package. http:\/\/avr-aics-riken.github.io\/ffvc_package\/"},{"key":"e_1_3_2_86_2","first-page":"1","volume-title":"2017 IEEE SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S)","author":"Or-Bach Zvi","year":"2017","unstructured":"Zvi Or-Bach. 2017. A 1,000x improvement in computer systems by bridging the processor-memory gap. In 2017 IEEE SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S). IEEE Press, Burlingame, CA, USA, 1\u20134. 10.1109\/S3S.2017.8309202"},{"key":"e_1_3_2_87_2","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2008.917757"},{"key":"e_1_3_2_88_2","first-page":"54:1\u201354:12","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201915)","author":"Park Jongsoo","year":"2015","unstructured":"Jongsoo Park, Mikhail Smelyanskiy, Ulrike Meier Yang, Dheevatsa Mudigere, and Pradeep Dubey. 2015. High-performance algebraic multigrid solver optimized for multi-core based distributed parallel systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201915). ACM, Austin, TX, USA, 54:1\u201354:12. 10.1145\/2807591.2807603"},{"key":"e_1_3_2_89_2","volume-title":"User\u2019s Guide to SW4, Version 2.0","author":"Petersson N. A.","year":"2017","unstructured":"N. A. Petersson and B. Sj\u00f6green. 
2017. User\u2019s Guide to SW4, Version 2.0. Technical Report LLNL-SM-741439. Lawrence Livermore National Laboratory."},{"key":"e_1_3_2_90_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.3012084"},{"key":"e_1_3_2_91_2","unstructured":"Louis-Noel Pouchet and Mark Taylor. 2016. PolyBench\/C 4.2.1 (Beta). https:\/\/sourceforge.net\/projects\/polybench\/"},{"key":"e_1_3_2_92_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO50266.2020.00045"},{"key":"e_1_3_2_93_2","unstructured":"RIKEN AICS. 2015. Fiber Miniapp Suite. https:\/\/fiber-miniapp.github.io\/"},{"key":"e_1_3_2_94_2","unstructured":"RIKEN Center for Computational Science. 2021. The Kernel Codes from Priority Issue Target Applications. https:\/\/github.com\/RIKEN-RCCS\/fs2020-tapp-kernels"},{"key":"e_1_3_2_95_2","unstructured":"RIKEN-RCCS. 2020. Riken_simulator. https:\/\/github.com\/RIKEN-RCCS\/riken_simulator"},{"key":"e_1_3_2_96_2","first-page":"190","volume-title":"Proceedings of the 5th International ICST Conference on Simulation Tools and Techniques (SIMUTOOLS\u201912)","author":"Rodrigues Arun","year":"2012","unstructured":"Arun Rodrigues, Elliot Cooper-Balis, Keren Bergman, Kurt Ferreira, David Bunde, and K. Scott Hemmert. 2012. Improvements to the structural simulation toolkit. In Proceedings of the 5th International ICST Conference on Simulation Tools and Techniques (SIMUTOOLS\u201912). ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), Brussels, Belgium, 190\u2013195."},{"key":"e_1_3_2_97_2","doi-asserted-by":"publisher","DOI":"10.5555\/3433701.3433763"},{"key":"e_1_3_2_98_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.2022.3224421"},{"key":"e_1_3_2_99_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSI.2020.3037892"},{"key":"e_1_3_2_100_2","unstructured":"Anton Shilov. 2022. TSMC Roadmap Update: N3E in 2024 N2 in 2026 Major Changes Incoming. 
https:\/\/www.anandtech.com\/show\/17356\/tsmc-roadmap-update-n3e-in-2024-n2-in-2026-major-changes-incoming"},{"key":"e_1_3_2_101_2","first-page":"1197","volume-title":"Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE\u201915)","author":"Shulaker Max M.","year":"2015","unstructured":"Max M. Shulaker, Tony F. Wu, Mohamed M. Sabry, Hai Wei, H.-S. Philip Wong, and Subhasish Mitra. 2015. Monolithic 3D integration: A path from concept to reality. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE\u201915). EDA Consortium, San Jose, CA, USA, 1197\u20131202."},{"key":"e_1_3_2_102_2","unstructured":"Hugh Sorby. 2017. MPI Stub. https:\/\/github.com\/hsorby\/mpistub"},{"key":"e_1_3_2_103_2","unstructured":"Standard Performance Evaluation Corporation. 2020. SPEC\u2019s Benchmarks. https:\/\/www.spec.org\/benchmarks.html"},{"key":"e_1_3_2_104_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2017.35"},{"key":"e_1_3_2_105_2","unstructured":"Erich Strohmaier, Jack Dongarra, Horst Simon, and Martin Meuer. 2021. TOP500. http:\/\/www.top500.org\/"},{"key":"e_1_3_2_106_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2020.2974217"},{"key":"e_1_3_2_107_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ijheatmasstransfer.2016.02.010"},{"key":"e_1_3_2_108_2","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2017.29"},{"issue":"6","key":"e_1_3_2_109_2","doi-asserted-by":"crossref","first-page":"357","DOI":"10.1016\/j.fluiddyn.2004.03.003","article-title":"A new dynamical framework of nonhydrostatic global model using the icosahedral grid","volume":"34","author":"Tomita Hirofumi","year":"2004","unstructured":"Hirofumi Tomita and Masaki Satoh. 2004. A new dynamical framework of nonhydrostatic global model using the icosahedral grid. Fluid Dynamics Research 34, 6 (2004), 357\u2013400. 
http:\/\/stacks.iop.org\/1873-7005\/34\/i=6\/a=A03","journal-title":"Fluid Dynamics Research"},{"key":"e_1_3_2_110_2","doi-asserted-by":"publisher","DOI":"10.11484\/jaea-conf-2014-003"},{"key":"e_1_3_2_111_2","first-page":"8","volume-title":"The NAS Parallel Benchmarks 2.4","author":"Wijngaart Rob F. Van der","year":"2002","unstructured":"Rob F. Van der Wijngaart. 2002. The NAS Parallel Benchmarks 2.4. Technical Report NAS-02-007. NASA Ames Research Center. 8 pages. https:\/\/www.nas.nasa.gov\/assets\/pdf\/techreports\/2002\/nas-02-007.pdf"},{"key":"e_1_3_2_112_2","doi-asserted-by":"crossref","first-page":"19","DOI":"10.1109\/ASAP.2017.7995254","volume-title":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","author":"Vasudevan Aravind","year":"2017","unstructured":"Aravind Vasudevan, Andrew Anderson, and David Gregg. 2017. Parallel multi channel convolution using general matrix multiplication. In 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE Press, Seattle, WA, USA, 19\u201324. 10.1109\/ASAP.2017.7995254"},{"key":"e_1_3_2_113_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2017.3211127"},{"key":"e_1_3_2_114_2","doi-asserted-by":"crossref","first-page":"204","DOI":"10.1145\/2989081.2989116","volume-title":"Proceedings of the Second International Symposium on Memory Systems (MEMSYS\u201916)","author":"Voskuilen Gwendolyn","year":"2016","unstructured":"Gwendolyn Voskuilen, Arun F. Rodrigues, and Simon D. Hammond. 2016. Analyzing allocation behavior for multi-level memory. In Proceedings of the Second International Symposium on Memory Systems (MEMSYS\u201916). Association for Computing Machinery, New York, NY, USA, 204\u2013207. 
10.1145\/2989081.2989116"},{"key":"e_1_3_2_115_2","doi-asserted-by":"publisher","DOI":"10.3390\/mi9060287"},{"key":"e_1_3_2_116_2","first-page":"1","volume-title":"2015 IEEE International Solid-State Circuits Conference - (ISSCC) Digest of Technical Papers","author":"Warnock James","year":"2015","unstructured":"James Warnock, Brian Curran, John Badar, Gregory Fredeman, Donald Plass, Yuen Chan, Sean Carey, Gerard Salem, Friedrich Schroeder, Frank Malgioglio, Guenter Mayer, Christopher Berry, Michael Wood, Yiu-Hing Chan, Mark Mayo, John Isakson, Charudhattan Nagarajan, Tobias Werner, Leon Sigal, Ricardo Nigaglioni, Mark Cichanowski, Jeffrey Zitz, Matthew Ziegler, Tim Bronson, Gerald Strevig, Daniel Dreps, Ruchir Puri, Douglas Malone, Dieter Wendel, Pak-Kin Mak, and Michael Blake. 2015. 4.1 22nm Next-generation IBM system z microprocessor. In 2015 IEEE International Solid-State Circuits Conference - (ISSCC) Digest of Technical Papers. IEEE Press, San Francisco, CA, USA, 1\u20133. 10.1109\/ISSCC.2015.7062930"},{"key":"e_1_3_2_117_2","first-page":"1","volume-title":"2015 IEEE High Performance Extreme Computing Conference (HPEC\u201915)","author":"Wolf M. M.","year":"2015","unstructured":"M. M. Wolf, J. W. Berry, and D. T. Stark. 2015. A task-based linear algebra building blocks approach for scalable graph analytics. In 2015 IEEE High Performance Extreme Computing Conference (HPEC\u201915). IEEE Press, Waltham, MA, USA, 1\u20136. 10.1109\/HPEC.2015.7322450"},{"key":"e_1_3_2_118_2","first-page":"352","volume-title":"IEEE International Solid-State Circuits Conference, ISSCC 2022, San Francisco, CA, USA, February 20\u201326, 2022","author":"Yamamura Shuji","year":"2022","unstructured":"Shuji Yamamura, Yasunobu Akizuki, Hideyuki Sekiguchi, Takumi Maruyama, Tsutomu Sano, Hiroyuki Miyazaki, and Toshio Yoshida. 2022. A64FX: 52-core processor designed for the 442PetaFLOPS Supercomputer Fugaku. 
In IEEE International Solid-State Circuits Conference, ISSCC 2022, San Francisco, CA, USA, February 20\u201326, 2022. IEEE, San Francisco, CA, USA, 352\u2013354. 10.1109\/ISSCC42614.2022.9731627"},{"key":"e_1_3_2_119_2","first-page":"22","volume-title":"2018 IEEE Hot Chips 30 Symposium (HCS)","author":"Yoshida Toshio","year":"2018","unstructured":"Toshio Yoshida. 2018. Fujitsu high performance CPU for the Post-K computer. In 2018 IEEE Hot Chips 30 Symposium (HCS). IEEE Computer Society, California, USA, 22. http:\/\/www.fujitsu.com\/jp\/Images\/20180821hotchips30.pdf"},{"key":"e_1_3_2_120_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00036"},{"key":"e_1_3_2_121_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.micpro.2014.03.007"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3629520","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3629520","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:36:01Z","timestamp":1750178161000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3629520"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12,14]]},"references-count":120,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,12,31]]}},"alternative-id":["10.1145\/3629520"],"URL":"https:\/\/doi.org\/10.1145\/3629520","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,12,14]]},"assertion":[{"value":"2022-12-21","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2023-10-13","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-12-14","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}