{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,25]],"date-time":"2026-03-25T14:39:14Z","timestamp":1774449554800,"version":"3.50.1"},"reference-count":25,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2022,1,31]],"date-time":"2022-01-31T00:00:00Z","timestamp":1643587200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Office of Science of the U.S. Department of Energy","award":["DE-AC05-00OR22725"],"award-info":[{"award-number":["DE-AC05-00OR22725"]}]},{"name":"National Science Foundation","award":["1814609"],"award-info":[{"award-number":["1814609"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Parallel Comput."],"published-print":{"date-parts":[[2022,3,31]]},"abstract":"<jats:p>Due to the recent announcement of the Frontier supercomputer, many scientific application developers are working to make their applications compatible with AMD (CPU-GPU) architectures, which means moving away from the traditional CPU and NVIDIA-GPU systems. Due to the current limitations of profiling tools for AMD GPUs, this shift leaves a void in how to measure application performance on AMD GPUs. In this article, we design an instruction roofline model for AMD GPUs using AMD\u2019s ROCProfiler and a benchmarking tool, BabelStream (the HIP implementation), as a way to measure an application\u2019s performance in instructions and memory transactions on new AMD hardware. Specifically, we create instruction roofline models for a case study scientific application, PIConGPU, an open source particle-in-cell simulations application used for plasma and laser-plasma physics on the NVIDIA V100, AMD Radeon Instinct MI60, and AMD Instinct MI100 GPUs. 
When looking at the performance of multiple kernels of interest in PIConGPU we find that although the AMD MI100 GPU achieves a similar, or better, execution time compared to the NVIDIA V100 GPU, profiling tool differences make comparing performance of these two architectures hard. When looking at execution time, GIPS, and instruction intensity, the AMD MI60 achieves the worst performance out of the three GPUs used in this work.<\/jats:p>","DOI":"10.1145\/3505285","type":"journal-article","created":{"date-parts":[[2022,1,31]],"date-time":"2022-01-31T16:35:52Z","timestamp":1643646952000},"page":"1-14","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":17,"title":["Metrics and Design of an Instruction Roofline Model for AMD GPUs"],"prefix":"10.1145","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2914-1483","authenticated-orcid":false,"given":"Matthew","family":"Leinhauser","sequence":"first","affiliation":[{"name":"Center for Advanced Systems Understanding, and University of Delaware, Newark, Delaware, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1642-0459","authenticated-orcid":false,"given":"Ren\u00e9","family":"Widera","sequence":"additional","affiliation":[{"name":"Helmholtz-Zentrum Dresden-Rossendorf Laboratory, Dresden, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3396-6154","authenticated-orcid":false,"given":"Sergei","family":"Bastrakov","sequence":"additional","affiliation":[{"name":"Helmholtz-Zentrum Dresden-Rossendorf Laboratory, Dresden, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3844-3697","authenticated-orcid":false,"given":"Alexander","family":"Debus","sequence":"additional","affiliation":[{"name":"Helmholtz-Zentrum Dresden-Rossendorf Laboratory, Dresden, 
Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8258-3881","authenticated-orcid":false,"given":"Michael","family":"Bussmann","sequence":"additional","affiliation":[{"name":"Center for Advanced Systems Understanding, and Helmholtz-Zentrum Dresden-Rossendorf Laboratory, Dresden, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3560-9428","authenticated-orcid":false,"given":"Sunita","family":"Chandrasekaran","sequence":"additional","affiliation":[{"name":"University of Delaware, Newark, Delaware, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,1,31]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Frontier Center for Accelerated Application Readiness (CAAR). 2019. Retrieved 11 Mar 2021 from https:\/\/www.olcf.ornl.gov\/caar\/frontier-caar\/."},{"key":"e_1_3_1_3_2","unstructured":"The Top500 List. 2020. Retrieved from https:\/\/www.top500.org\/."},{"key":"e_1_3_1_4_2","unstructured":"Alpaka Code Repository. 2021. Retrieved from https:\/\/github.com\/alpaka-group\/alpaka."},{"key":"e_1_3_1_5_2","unstructured":"AMD-Instruction-Roofline-using-rocProf-Metrics Code Repository. 2021. Retrieved from https:\/\/github.com\/Techercise\/AMD-Instruction-Roofline-using-rocProf-Metrics."},{"key":"e_1_3_1_6_2","unstructured":"NERSC Roofline-on-NVIDIA-GPUs Code Repository. 2021. Retrieved from https:\/\/gitlab.com\/NERSC\/roofline-on-nvidia-gpus."},{"key":"e_1_3_1_7_2","unstructured":"NVIDIA Nsight Compute. 2021. Retrieved from https:\/\/developer.nvidia.com\/nsight-compute."},{"key":"e_1_3_1_8_2","unstructured":"NVIDIA Nsight Systems. 2021. Retrieved from https:\/\/developer.nvidia.com\/nsight-systems."},{"key":"e_1_3_1_9_2","unstructured":"NVIDIA NVProf. 2021. 
Retrieved from https:\/\/docs.nvidia.com\/cuda\/profiler-users-guide\/index.html#nvprof-overview."},{"key":"e_1_3_1_10_2","unstructured":"ROC-profiler. 2021. Retrieved from https:\/\/github.com\/ROCm-Developer-Tools\/rocprofiler."},{"key":"e_1_3_1_11_2","unstructured":"Paul Bauman Noel Chalmers Nick Curtis Chip Freitag Joe Greathouse Nicholas Malaya Damon McDougall Scott Moe Ren\u00e9 van Oostrum and Noah Wolfe. 2019. Intro to AMD GPU Programming with HIP."},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/2503210.2504564"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46079-6_34"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevX.9.031044"},{"key":"e_1_3_1_15_2","unstructured":"AM Devices. 2012. AMD Graphics Cores Next (GCN) Architecture (whitepaper)."},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/PMBS49563.2019.00007"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/L-CA.2013.6"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/PDP.2016.56"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.2172\/1761619"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-67630-2_36"},{"key":"e_1_3_1_21_2","doi-asserted-by":"crossref","unstructured":"Neil A. Mehta Rahulkumar Gayatri Yasaman Ghadar Christopher Knight and Jack Deslippe. 2020. Evaluating performance portability of OpenMP for SNAP on NVIDIA Intel and AMD GPUs using the Roofline Methodology. In Proceedings of the WACCPD@ SC .","DOI":"10.1007\/978-3-030-74224-9_1"},{"key":"e_1_3_1_22_2","volume-title":"Quantitative Performance Assessment of Proxy Apps and Parentsreport for ecp Proxy App Project Milestone adcd-504-9","author":"Richards D. F.","year":"2020","unstructured":"D. F. Richards, O. Aaziz, J. Cook, S. Moore, D. Pruitt, and C. Vaughan. 2020. 
Quantitative Performance Assessment of Proxy Apps and Parentsreport for ecp Proxy App Project Milestone adcd-504-9. Technical Report. Lawrence Livermore National Lab.(LLNL), Livermore, CA."},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/1498765.1498785"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1201\/b10509-10"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.5547"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46079-6_21"}],"container-title":["ACM Transactions on Parallel Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3505285","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3505285","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:24Z","timestamp":1750182564000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3505285"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,1,31]]},"references-count":25,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2022,3,31]]}},"alternative-id":["10.1145\/3505285"],"URL":"https:\/\/doi.org\/10.1145\/3505285","relation":{},"ISSN":["2329-4949","2329-4957"],"issn-type":[{"value":"2329-4949","type":"print"},{"value":"2329-4957","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,1,31]]},"assertion":[{"value":"2021-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-09-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-01-31","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication 
History"}}]}}