{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,29]],"date-time":"2026-05-29T21:09:08Z","timestamp":1780088948550,"version":"3.54.0"},"reference-count":81,"publisher":"Association for Computing Machinery (ACM)","issue":"2","funder":[{"DOI":"10.13039\/100004358","name":"Samsung Electronics Company Ltd.","doi-asserted-by":"crossref","award":["IO201209-07887-01"],"award-info":[{"award-number":["IO201209-07887-01"]}],"id":[{"id":"10.13039\/100004358","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100003725","name":"National Research Foundation of Korea","doi-asserted-by":"publisher","award":["RS-2024-00414964"],"award-info":[{"award-number":["RS-2024-00414964"]}],"id":[{"id":"10.13039\/501100003725","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Meas. Anal. Comput. Syst."],"published-print":{"date-parts":[[2026,5,29]]},"abstract":"<jats:p>The demands of large-scale workloads have driven the evolution of GPUs, placing them in the mainstream scope of computing architectures. To design an effective GPU for large-scale workloads, a trustworthy simulator is required to evaluate performance and explore the design space. Additionally, GPU simulators must be fast enough to evaluate architectural modifications for large-scale workloads within a reasonable time. However, existing GPU simulators suffer from long execution times due to detailed component simulation, limiting their utility for evaluating the effects of architectural modifications on large-scale workloads. It is necessary to improve the performance of a GPU simulator such that quick architecture exploration and evaluation for large-scale workloads are available, at the expense of accuracy.<\/jats:p>\n                  <jats:p>This paper presents LPGSim, a trace-driven and cycle-level GPU simulator. 
LPGSim aims to provide fast and accurate GPU simulation. To this end, LPGSim first eliminates instruction metadata that has minimal impact on simulation accuracy. This approach enables coalescing compute instructions while preserving the dependencies to the preceding memory instruction. As a consequence, the core pipeline is simplified, improving simulation speed at a marginal cost in accuracy. Next, LPGSim parallelizes GPU simulation. LPGSim partitions a GPU architecture into three parallelizable subsystems and introduces local-clock-based parallelization to reduce the overhead of global synchronization and sequential execution paths in existing methods. LPGSim further employs parallelization methods to achieve scalability on NUMA systems with an acceptable trade-off in accuracy. Our evaluation shows a modest decrease in simulation accuracy in return for a substantial improvement in simulation speed. In terms of accuracy, LPGSim achieves 21.4%, 23.3%, and 22.7% errors across three different GPU architectures. In terms of speed, the single-threaded LPGSim simulation yields a 9.97x speedup over the state-of-the-art GPU simulator. Parallelization achieves a 19.8x speedup on a two-socket 56-thread system, for a total 197.4x speedup over the state-of-the-art simulator. 
Parallelization incurs an additional accuracy loss, but our experiments indicate that simulation is still reliable for large-scale workloads.<\/jats:p>","DOI":"10.1145\/3805642","type":"journal-article","created":{"date-parts":[[2026,5,29]],"date-time":"2026-05-29T20:34:18Z","timestamp":1780086858000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["LPGSim: A Lightweight Parallel GPU Simulator Maximizing Speed with Trustworthy Simulation"],"prefix":"10.1145","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-7774-8028","authenticated-orcid":false,"given":"Hyunwoo","family":"Nam","sequence":"first","affiliation":[{"name":"Yonsei University, Seoul, Republic of Korea"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1975-2363","authenticated-orcid":false,"given":"Jay Hwan","family":"Lee","sequence":"additional","affiliation":[{"name":"Yonsei University, Seoul, Republic of Korea"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9437-4665","authenticated-orcid":false,"given":"Yeonsoo","family":"Kim","sequence":"additional","affiliation":[{"name":"Yonsei University, Seoul, Republic of Korea"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1500-9652","authenticated-orcid":false,"given":"Mengzhao","family":"Zhang","sequence":"additional","affiliation":[{"name":"Yonsei University, Seoul, Republic of Korea"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0264-0979","authenticated-orcid":false,"given":"Jeonggeun","family":"Kim","sequence":"additional","affiliation":[{"name":"Kyungpook National University, Daegu, Republic of 
Korea"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0374-8853","authenticated-orcid":false,"given":"Bernd","family":"Burgstaller","sequence":"additional","affiliation":[{"name":"Yonsei University, Seoul, Republic of Korea"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2026,5,29]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/2016604.2016637"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ispass.2016.7482092"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/1465482.1465560"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476221"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3466752.3480100"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ispass.2009.4919648"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3559009.3569666"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2024716.2024718"},{"key":"e_1_2_1_9_1","volume-title":"USIMM: the Utah SImulated Memory Module","author":"Chatterjee Niladrish","year":"2012","unstructured":"Niladrish Chatterjee, Rajeev Balasubramonian, Manjunath Shevgoor, Seth Pugsley, Aniruddha Udipi, Ali Shafiee, Kshitij Sudan, Manu Awasthi, and Zeshan Chishti. 2012. USIMM: the Utah SImulated Memory Module. University of Utah, Tech. 
Rep (2012), 1-24."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/iiswc.2009.5306797"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/mm.2021.3061394"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/mascots.2010.43"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-46077-7_12"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ispass.2006.1620807"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/pmbs49563.2019.00007"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/hpca.2019.00061"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3307650.3322224"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/mm.2024.3373763"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/iccd.2017.84"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ispass.2017.7975298"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/inpar.2012.6339595"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ispass57527.2023.00013"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3068281"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1555754.1555775"},{"key":"e_1_2_1_25_1","volume-title":"Characterization of Large Language Model Development in the Datacenter. In 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024","author":"Hu Qinghao","year":"2024","unstructured":"Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, and Tianwei Zhang. 2024. Characterization of Large Language Model Development in the Datacenter. In 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024, Santa Clara, CA, April 15-17, 2024, Laurent Vanbever and Irene Zhang (Eds.). 
USENIX Association, 709-729. https:\/\/www.usenix.org\/conference\/nsdi24\/presentation\/hu"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/micro.2014.59"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3589236.3589244"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ispass64960.2025.00054"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3224430"},{"key":"e_1_2_1_30_1","volume-title":"HIGH BANDWIDTH MEMORY (HBM) DRAM. https:\/\/www.jedec.org\/standards-documents\/docs\/jesd235a\/. [Online","author":"D.","year":"2026","unstructured":"JESD235D. 2021. HIGH BANDWIDTH MEMORY (HBM) DRAM. https:\/\/www.jedec.org\/standards-documents\/docs\/jesd235a\/. [Online; accessed 7-January-2026]."},{"key":"e_1_2_1_31_1","volume-title":"Graphics Double Data Rate (GDDR6) SGRAM Standard. https:\/\/www.jedec.org\/standards-documents\/docs\/jesd250d\/. [Online","author":"D.","year":"2026","unstructured":"JESD250D. 2023. Graphics Double Data Rate (GDDR6) SGRAM Standard. https:\/\/www.jedec.org\/standards-documents\/docs\/jesd250d\/. [Online; accessed 7-January-2026]."},{"key":"e_1_2_1_32_1","volume-title":"Volta GPU, Architecture via Microbenchmarking. CoRR","author":"Jia Zhe","year":"2018","unstructured":"Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele Paolo Scarpazza. 2018. Dissecting the NVIDIA, Volta GPU, Architecture via Microbenchmarking. CoRR, Vol. abs\/1804.06826 (2018). arXiv:1804.06826 http:\/\/arxiv.org\/abs\/1804.06826"},{"key":"e_1_2_1_33_1","volume-title":"Architecture Through Microbenchmarking. GTC 2021","author":"Jia Zhe","year":"2025","unstructured":"Zhe Jia, Peter Van Sandt, Marco Maggioni, Jeffrey Smith, and Daniele P. Scarpazza. 2021. Dissecting the Ampere GPU, Architecture Through Microbenchmarking. GTC 2021. 
https:\/\/www.nvidia.com\/en-us\/on-demand\/session\/gtcspring21-s33322\/ Accessed: 2025-07-09."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3466752.3480063"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/isca45697.2020.00047"},{"key":"e_1_2_1_36_1","volume-title":"MacSim: A CPU-GPU Heterogeneous Simulation Framework User Guide","author":"Kim Hyesoon","year":"2012","unstructured":"Hyesoon Kim, Jaekyu Lee, Nagesh B Lakshminarayana, Jaewoong Sim, Jieun Lim, and Tri Pho. 2012. MacSim: A CPU-GPU Heterogeneous Simulation Framework User Guide. Georgia Institute of Technology (2012), 1-57."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3470496.3527384"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/ispass.2013.6557151"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/2508148.2485964"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/ispass.2019.00028"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/lca.2020.2973991"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/lca.2023.3333759"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/hipc.2014.7116897"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/511334.511349"},{"key":"e_1_2_1_45_1","unstructured":"Meta. 2026. Zstandard. https:\/\/github.com\/facebook\/zstd. 
Fast real-time lossless compression algorithm Accessed: 2026-03-20."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.sysarc.2019.101635"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.23919\/date54114.2022.9774614"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/ispass57527.2023.00030"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.5281\/zenodo.19260722"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/hpca.2014.6835955"},{"key":"e_1_2_1_51_1","unstructured":"NVIDIA. 2014. CUDA Deep neural network library (cuDNN). https:\/\/developer.nvidia.com\/cuDNN Accessed: 2026-01-10."},{"key":"e_1_2_1_52_1","unstructured":"NVIDIA. 2017. Nvidia Tesla V100 Gpu Architecture. https:\/\/images.nvidia.com\/content\/volta-architecture\/pdf\/volta-architecture-whitepaper.pdf Accessed: 2025-07-09."},{"key":"e_1_2_1_53_1","unstructured":"NVIDIA. 2026 a. Tuning CUDA Applications for Ampere. https:\/\/docs.nvidia.com\/cuda\/ampere-tuning-guide\/index.html. Accessed: 2026-03-26."},{"key":"e_1_2_1_54_1","unstructured":"NVIDIA. 2026 b. Tuning CUDA Applications for Volta. https:\/\/docs.nvidia.com\/cuda\/volta-tuning-guide\/index.html. Accessed: 2026-03-26."},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/3563697"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/3656012"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/2024724.2024954"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/lca.2015.2402435"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/lca.2014.2299539"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/ispass64960.2025.00022"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/ispass.2019.00016"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1109\/isca45697.2020.00045"},{"key":"e_1_2_1_63_1","unstructured":"Baidu Research. 2016. 
DeepBench. https:\/\/github.com\/baidu-research\/DeepBench Accessed: 2025-07-09."},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/2508148.2485963"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1145\/191995.192072"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/pact.2001.953283"},{"key":"e_1_2_1_67_1","unstructured":"John A Stratton Christopher Rodrigues I-Jui Sung Nady Obeid Li-Wen Chang Nasser Anssari Geng Daniel Liu and Wen-mei W Hwu. 2012. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. Technical Report 7.2. University of Illinois at Urbana-Champaign Center for Reliable and High-Performance Computing."},{"key":"e_1_2_1_68_1","volume-title":"Architecture. GTC 2020","author":"Thomas-Collignon Guillaume","year":"2020","unstructured":"Guillaume Thomas-Collignon and Vishal Mehta. 2020. Optimizing Applications for NVIDIA, Ampere GPU, Architecture. GTC 2020. https:\/\/developer.nvidia.com\/gtc\/2020\/video\/s21819 Accessed: 2025-07-09."},{"key":"e_1_2_1_69_1","volume-title":"TOP500 Supercomputer Site. https:\/\/www.top500.org\/. [Online","year":"2025","unstructured":"TOP500.org. 2025. TOP500 Supercomputer Site. https:\/\/www.top500.org\/. 
[Online; accessed 6-January-2025]."},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1109\/hpca51647.2021.00077"},{"key":"e_1_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358307"},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1109\/ispass.2014.6844466"},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1109\/micro50266.2020.00085"},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1109\/mm.2006.79"},{"key":"e_1_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1109\/mm.2021.3097287"},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1109\/hpca.2015.7056063"},{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1109\/micro61859.2024.00022"},{"key":"e_1_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1145\/3638757"},{"key":"e_1_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.1109\/tpds.2017.2700307"},{"key":"e_1_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.1109\/hpca.2011.5749745"},{"key":"e_1_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.23919\/date56975.2023.10137178"}],"container-title":["Proceedings of the ACM on Measurement and Analysis of Computing 
Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3805642","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,29]],"date-time":"2026-05-29T20:39:23Z","timestamp":1780087163000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3805642"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,5,29]]},"references-count":81,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,5,29]]}},"alternative-id":["10.1145\/3805642"],"URL":"https:\/\/doi.org\/10.1145\/3805642","relation":{},"ISSN":["2476-1249"],"issn-type":[{"value":"2476-1249","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,5,29]]},"assertion":[{"value":"2026-05-29","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}