{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:31:26Z","timestamp":1750221086212,"version":"3.41.0"},"reference-count":46,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2018,9,24]],"date-time":"2018-09-24T00:00:00Z","timestamp":1537747200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2018,9,30]]},"abstract":"<jats:p>Throughput architectures such as GPUs require substantial hardware resources to hold the state of a massive number of simultaneously executing threads. While GPU register files are already enormous, reaching capacities of 256KB per streaming multiprocessor (SM), we find that nearly half of real-world applications we examined are register-bound and would benefit from a larger register file to enable more concurrent threads. This article seeks to increase the thread occupancy and improve performance of these register-bound applications by making more efficient use of the existing register file capacity. Our first technique eagerly deallocates register resources during execution. We show that releasing register resources based on value liveness as proposed in prior states of the art leads to unreliable performance and undue design complexity. To address these deficiencies, our article presents a novel compiler-driven approach that identifies and exploits last use of a register name (instead of the value contained within) to eagerly release register resources. 
Furthermore, while previous works have leveraged \u201cscalar\u201d and \u201cnarrow\u201d operand properties of a program for various optimizations, their impact on thread occupancy has been relatively unexplored. Our article evaluates the effectiveness of these techniques in improving thread occupancy and demonstrates that while any one approach may fail to free very many registers, together they synergistically free enough registers to launch additional parallel work. An in-depth evaluation on a large suite of applications shows that just our early register technique outperforms previous work on dynamic register allocation, and together these approaches, on average, provide 12% performance speedup (23% higher thread occupancy) on register bound applications not already saturating other GPU resources.<\/jats:p>","DOI":"10.1145\/3243905","type":"journal-article","created":{"date-parts":[[2018,9,24]],"date-time":"2018-09-24T12:05:57Z","timestamp":1537790757000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["Software-Directed Techniques for Improved GPU Register File Utilization"],"prefix":"10.1145","volume":"15","author":[{"given":"Dani","family":"Voitsechov","sequence":"first","affiliation":[{"name":"Technion, Haifa, Israel"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Arslan","family":"Zulfiqar","sequence":"additional","affiliation":[{"name":"NVIDIA, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mark","family":"Stephenson","sequence":"additional","affiliation":[{"name":"NVIDIA, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mark","family":"Gebhart","sequence":"additional","affiliation":[{"name":"NVIDIA, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6338-4297","authenticated-orcid":false,"given":"Stephen 
W.","family":"Keckler","sequence":"additional","affiliation":[{"name":"NVIDIA, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2018,9,24]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"AMD. 2012. AMD Graphics Cores Next (GCN) Architecture. Retrieved from https:\/\/www.amd.com\/Documents\/GCN_Architecture_whitepaper.pdf"},{"key":"e_1_2_1_2_1","unstructured":"AMD. 2016. Dissecting the Polaris Architecture. Retrieved from http:\/\/radeon.wpengine.netdna-cdn.com\/wp-content\/uploads\/2016\/08\/Polaris-Architecture-Whitepaper-Final-08042016.pdf."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.5555\/520549.822763"},{"volume-title":"Proceedings of the European Conference on Parallel Processing. 969--979","author":"Budiu Mihai","key":"e_1_2_1_4_1"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2012.6402918"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_2_1_7_1","unstructured":"Sharan Chetlur Cliff Woolley Philippe Vandermersch Jonathan Cohen John Tran Bryan Catanzaro and Evan Shelhamer. 2014. cuDNN: Efficient Primitives for Deep Learning. https:\/\/arxiv.org\/abs\/1410.0759. (2014)."},{"key":"e_1_2_1_8_1","unstructured":"R. Collobert J. Weston L. Bottou M. Karlen K. Kavukcuoglu and P. Kuksa. 2011. Natural language processing (almost) from scratch. 
https:\/\/arxiv.org\/abs\/1103.0398. (2011)."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2004.29"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000064.2000093"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155675"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2012.18"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2013.6522330"},{"volume-title":"Proceedings of the International Symposium on Computer Architecture (ISCA\u201904)","author":"Gonzalez R.","key":"e_1_2_1_14_1"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neunet.2005.06.042"},{"volume-title":"James D. Foley, and Steven K. Feiner.","year":"2014","author":"Hughes John F.","key":"e_1_2_1_16_1"},{"key":"e_1_2_1_17_1","unstructured":"Itseez. 2015. Open Source Computer Vision Library. Retrieved from https:\/\/github.com\/itseez\/opencv."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2830772.2830784"},{"volume-title":"Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED\u201913)","author":"Jing N.","key":"e_1_2_1_19_1"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.5555\/3195638.3195655"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2005.14"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.5555\/290940.290985"},{"volume-title":"Design and Evaluation of Register Allocation on GPUs. 
Master\u2019s thesis","author":"Kalra Charu","key":"e_1_2_1_23_1"},{"volume-title":"Proceedings of the International Conference on Neural Information Processing Systems (NIPS\u201916)","author":"Krizhevsky A.","key":"e_1_2_1_24_1"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750417"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2013.6494995"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2807591.2807606"},{"volume-title":"Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA\u201915)","author":"Li D.","key":"e_1_2_1_28_1"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2008.31"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.5555\/998680.1006728"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/71.798316"},{"volume-title":"Proceedings of the International Symposium on Microarchitecture (MICRO). 125--135","author":"Martin Milo M.","key":"e_1_2_1_32_1"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.5555\/850943.853097"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.5555\/255235.255288"},{"volume-title":"Advanced Compiler Design and Implementation","author":"Muchnick Steven S.","key":"e_1_2_1_35_1"},{"key":"e_1_2_1_36_1","unstructured":"NVIDIA. 2012. Kepler GK110. Retrieved from http:\/\/www.nvidia.co.uk\/content\/PDF\/kepler\/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf."},{"key":"e_1_2_1_37_1","unstructured":"NVIDIA. 2016. CUDA Occupancy Calculator. Retrieved from http:\/\/developer.download.nvidia.com\/compute\/cuda\/CUDA_Occupancy_calculator.xls."},
{"key":"e_1_2_1_38_1","unstructured":"NVIDIA. 2016. NVIDIA GeForce GTX 1080 Whitepaper. Retrieved from http:\/\/international.download.nvidia.com\/geforce-com\/international\/pdfs\/GeForce_GTX_1080_Whitepaper_FINAL.pdf."},{"key":"e_1_2_1_39_1","unstructured":"NVIDIA. 2016. NVIDIA Tesla P100. Retrieved from https:\/\/images.nvidia.com\/content\/pdf\/tesla\/whitepaper\/pascal-architecture-whitepaper.pdf."},{"key":"e_1_2_1_40_1","unstructured":"NVIDIA. 2016. Visual Profiler User\u2019s Guide. Retrieved from http:\/\/docs.nvidia.com\/cuda\/profiler-users-guide."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/349299.349317"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750375"},{"volume-title":"Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. 
Technical Report IMPACT-12-01","year":"2012","author":"Stratton John A.","key":"e_1_2_1_43_1"},{"volume-title":"Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA\u201916)","author":"Wong D.","key":"e_1_2_1_44_1"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/360128.360143"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/1120725.1120979"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3243905","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3243905","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T00:57:48Z","timestamp":1750208268000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3243905"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,9,24]]},"references-count":46,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2018,9,30]]}},"alternative-id":["10.1145\/3243905"],"URL":"https:\/\/doi.org\/10.1145\/3243905","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2018,9,24]]},"assertion":[{"value":"2018-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-07-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-09-24","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}