{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,24]],"date-time":"2025-08-24T01:43:53Z","timestamp":1755999833957,"version":"3.41.0"},"reference-count":49,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2018,6,13]],"date-time":"2018-06-13T00:00:00Z","timestamp":1528848000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Meas. Anal. Comput. Syst."],"published-print":{"date-parts":[[2018,6,13]]},"abstract":"<jats:p>Contemporary Graphics Processing Units (GPUs) are used to accelerate highly parallel compute workloads. For the last decade, researchers in academia and industry have used cycle-level GPU architecture simulators to evaluate future designs. This paper performs an in-depth analysis of commonly accepted GPU simulation methodology, examining the effect both the workload and the choice of instruction set architecture have on the accuracy of a widely-used simulation infrastructure, GPGPU-Sim. We analyze numerous aspects of the architecture, validating the simulation results against real hardware. Based on a characterized set of over 1700 GPU kernels, we demonstrate that while the relative accuracy of compute-intensive workloads is high, inaccuracies in modeling the memory system result in much higher error when memory performance is critical. We then perform a case study using a recently proposed GPU architecture modification, Cache-Conscious Wavefront Scheduling. 
The case study demonstrates that the cross-product of workload characteristics and instruction set architecture choice can affect the predicted efficacy of the technique.<\/jats:p>","DOI":"10.1145\/3224430","type":"journal-article","created":{"date-parts":[[2018,6,13]],"date-time":"2018-06-13T18:21:15Z","timestamp":1528914075000},"page":"1-28","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["A Quantitative Evaluation of Contemporary GPU Simulation Methodology"],"prefix":"10.1145","volume":"2","author":[{"given":"Akshay","family":"Jain","sequence":"first","affiliation":[{"name":"Purdue University, West Lafayette, IN, USA"}]},{"given":"Mahmoud","family":"Khairy","sequence":"additional","affiliation":[{"name":"Purdue University, West Lafayette, IN, USA"}]},{"given":"Timothy G.","family":"Rogers","sequence":"additional","affiliation":[{"name":"Purdue University, West Lafayette, IN, USA"}]}],"member":"320","published-online":{"date-parts":[[2018,6,13]]},"reference":[
{"key":"e_1_2_1_1_1","unstructured":"2011. GPGPU-Sim 3.x manual. http:\/\/gpgpu-sim.org\/manual\/index.php\/Main_Page"},
{"key":"e_1_2_1_2_1","unstructured":"2017. CUDA C Programming Guide. http:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html"},
{"key":"e_1_2_1_3_1","unstructured":"2018. CORREL function. https:\/\/support.office.com\/en-us\/article\/CORREL-function-995dcef7-0c0a-4bed-a3fb-239d7b68ca92"},
{"key":"e_1_2_1_4_1","unstructured":"2018. PTX ISA :: CUDA Toolkit Documentation. http:\/\/docs.nvidia.com\/cuda\/parallel-thread-execution\/index.html"},
{"key":"e_1_2_1_5_1","unstructured":"AMD. 2015. The AMD gem5 APU Simulator: Modeling Heterogeneous Systems in gem5. http:\/\/www.gem5.org\/wiki\/images\/f\/fd\/AMD_gem5_APU_simulator_micro_2015_final.pptx"},
{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2009.4919648"},
{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/2024716.2024718"},
{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2024716.2024718"},
{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2013.6522302"},
{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/268806.268810"},
{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},
{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/MASCOTS.2010.43"},
{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1735688.1735702"},
{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1854273.1854318"},
{"key":"e_1_2_1_15_1","first-page":"464","article-title":"Cache and associated method with frame buffer managed dirty data pull and high-priority clean mechanism","volume":"8","author":"Edmondson John H","year":"2013","journal-title":"US Patent"},
{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2017.37"},
{"key":"e_1_2_1_17_1","unstructured":"HSA Foundation. 2016. HSA Standards to Bring About the Next Level of Innovation. http:\/\/www.hsafoundation.com\/standards\/"},
{"key":"e_1_2_1_18_1","unstructured":"Wilson W. L. Fung, Ivan Sham, George Yuan, and Tor M. Aamodt. 2007. Dynamic warp formation and scheduling for efficient GPU control flow. In MICRO."},
{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2017.7975298"},
{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/InPar.2012.6339595"},
{"volume-title":"24th IEEE International Symposium on High-Performance Computer Architecture (HPCA)","year":"2018","author":"Gutierrez Anthony","key":"e_1_2_1_21_1"},
{"key":"e_1_2_1_22_1","first-page":"929","article-title":"Analysis of x86 instruction set usage for DOS\/Windows applications and its implication on superscalar design","volume":"85","author":"Huang Jer","year":"2002","journal-title":"IEICE Transactions on Information and Systems"},
{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2304576.2304582"},
{"key":"e_1_2_1_24_1","unstructured":"Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. 2018. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. arXiv preprint arXiv:1804.06826 (2018)."},
{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485964"},
{"key":"e_1_2_1_26_1","first-page":"834","article-title":"Operand collector architecture","volume":"7","author":"Liu Samuel","year":"2010","journal-title":"US Patent"},
{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2017.7975297"},
{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2016.2549523"},
{"key":"e_1_2_1_29_1","unstructured":"Paulius Micikevicius. 2011. Local memory and register spilling. NVIDIA Corporation (2011)."},
{"volume-title":"High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on. IEEE, 37--48","author":"Nugteren Cedric","key":"e_1_2_1_30_1"},
{"key":"e_1_2_1_31_1","unstructured":"NVIDIA. 2009. NVIDIA's Next Generation CUDA Compute Architecture: Fermi. http:\/\/www.nvidia.com\/content\/PDF\/fermi_white_papers\/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf."},
{"key":"e_1_2_1_32_1","unstructured":"NVIDIA. 2011. CUDA C\/C++ SDK Code Samples. http:\/\/developer.nvidia.com\/cuda-cc-sdk-code-samples."},
{"key":"e_1_2_1_33_1","unstructured":"NVIDIA. 2012. NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110. nvidia.com\/content\/PDF\/kepler\/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf. (2012)."},
{"key":"e_1_2_1_34_1","unstructured":"NVIDIA. 2015. Pascal L1 cache. https:\/\/devtalk.nvidia.com\/default\/topic\/1006066\/pascal-l1-cache\/."},
{"key":"e_1_2_1_35_1","unstructured":"NVIDIA. 2016. Pascal P100. https:\/\/images.nvidia.com\/content\/pdf\/tesla\/whitepaper\/pascal-architecture-whitepaper.pdf."},
{"key":"e_1_2_1_36_1","unstructured":"NVIDIA. 2016. Pascal P102. https:\/\/international.download.nvidia.com\/geforce-com\/international\/pdfs\/GeForce_GTX_1080_Whitepaper_FINAL.pdf."},
{"key":"e_1_2_1_37_1","unstructured":"NVIDIA. 2017. Pascal Titan X. https:\/\/www.nvidia.com\/en-us\/geforce\/products\/10series\/titan-x-pascal\/."},
{"key":"e_1_2_1_38_1","unstructured":"NVIDIA. 2017. Pascal Tuning. https:\/\/www.olcf.ornl.gov\/wp-content\/uploads\/2017\/01\/SummitDev_Pascal-Tuning.pdf."},
{"key":"e_1_2_1_39_1","unstructured":"University of British Columbia. 2018. GPGPU-Sim Public Github. https:\/\/github.com\/gpgpu-sim\/gpgpu-sim_distribution\/tree\/dev."},
{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2012.16"},
{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485963"},
{"key":"e_1_2_1_42_1","unstructured":"JEDEC Standard. 2013. GDDR5X. JESD232A (2013)."},
{"key":"e_1_2_1_43_1","unstructured":"John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen-mei W. Hwu. 2012. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Center for Reliable and High-Performance Computing 127 (2012)."},
{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/2370816.2370865"},
{"key":"e_1_2_1_45_1","unstructured":"Purdue University. 2018. GPGPU-Sim Correlation Project. https:\/\/engineering.purdue.edu\/tgrogers\/group\/correlator.html."},
{"key":"e_1_2_1_46_1","unstructured":"Purdue University. 2018. GPGPU-Sim Simulations Github Repository. https:\/\/github.com\/tgrogers\/gpgpu-sim_simulations."},
{"key":"e_1_2_1_47_1","unstructured":"W.J. van der Laan. 2010. Decuda and cudasm, the CUDA binary utilities package. https:\/\/github.com\/laanwj\/decuda"},
{"volume-title":"Proceedings of the 2008 ACM\/IEEE Conference on Supercomputing (SC '08)","author":"Volkov Vasily","key":"e_1_2_1_48_1"},
{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2010.5452013"}],"container-title":["Proceedings of the ACM on Measurement and Analysis of Computing 
Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3224430","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3224430","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T01:39:06Z","timestamp":1750210746000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3224430"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,6,13]]},"references-count":49,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2018,6,13]]}},"alternative-id":["10.1145\/3224430"],"URL":"https:\/\/doi.org\/10.1145\/3224430","relation":{},"ISSN":["2476-1249"],"issn-type":[{"type":"electronic","value":"2476-1249"}],"subject":[],"published":{"date-parts":[[2018,6,13]]},"assertion":[{"value":"2018-06-13","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}