{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:26:12Z","timestamp":1750307172082,"version":"3.41.0"},"reference-count":56,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2012,4,1]],"date-time":"2012-04-01T00:00:00Z","timestamp":1333238400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000143","name":"Division of Computing and Communication Foundations","doi-asserted-by":"publisher","award":["CCF-0936700"],"award-info":[{"award-number":["CCF-0936700"]}],"id":[{"id":"10.13039\/100000143","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000185","name":"Defense Advanced Research Projects Agency","doi-asserted-by":"publisher","award":["HR0011-10-9-0008"],"award-info":[{"award-number":["HR0011-10-9-0008"]}],"id":[{"id":"10.13039\/100000185","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Comput. Syst."],"published-print":{"date-parts":[[2012,4]]},"abstract":"<jats:p>Modern graphics processing units (GPUs) employ a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complex thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency. We present two complementary techniques for reducing energy on massively-threaded processors such as GPUs. First, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency. Reducing the number of threads that the scheduler must consider each cycle improves the scheduler\u2019s energy efficiency. Second, we propose replacing the monolithic register file found on modern designs with a hierarchical register file. We explore various trade-offs for the hierarchy including the number of levels in the hierarchy and the number of entries at each level. We consider both a hardware-managed caching scheme and a software-managed scheme, where the compiler is responsible for orchestrating all data movement within the register file hierarchy. Combined with a hierarchical register file, our two-level thread scheduler provides a further reduction in energy by only allocating entries in the upper levels of the register file hierarchy for active threads. Averaging across a variety of real world graphics and compute workloads, the active thread count can be reduced by a factor of 4 with minimal impact on performance and our most efficient three-level software-managed register file hierarchy reduces register file energy by 54%.<\/jats:p>","DOI":"10.1145\/2166879.2166882","type":"journal-article","created":{"date-parts":[[2012,5,1]],"date-time":"2012-05-01T13:43:38Z","timestamp":1335879818000},"page":"1-38","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":27,"title":["A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors"],"prefix":"10.1145","volume":"30","author":[{"given":"Mark","family":"Gebhart","sequence":"first","affiliation":[{"name":"The University of Texas at Austin"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Daniel R.","family":"Johnson","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"David","family":"Tarjan","sequence":"additional","affiliation":[{"name":"NVIDIA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Stephen W.","family":"Keckler","sequence":"additional","affiliation":[{"name":"NVIDIA and The University of Texas at Austin"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"William J.","family":"Dally","sequence":"additional","affiliation":[{"name":"NVIDIA and Stanford University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Erik","family":"Lindholm","sequence":"additional","affiliation":[{"name":"NVIDIA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kevin","family":"Skadron","sequence":"additional","affiliation":[{"name":"University of Virginia"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2012,4]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/325164.325119"},{"key":"e_1_2_1_2_1","unstructured":"AMD. 2010. ATI Stream Computing OpenCL Programming Guide. http:\/\/developer.amd.com\/gpu\/ATIStreamSDK\/assets\/ATI_Stream_SDK_OpenCL_Programming_Guide.pdf. AMD . 2010. ATI Stream Computing OpenCL Programming Guide. http:\/\/developer.amd.com\/gpu\/ATIStreamSDK\/assets\/ATI_Stream_SDK_OpenCL_Programming_Guide.pdf."},{"key":"e_1_2_1_3_1","unstructured":"AMD. 2011. HD 6900 series instruction set architecture. http:\/\/developer.amd.com\/gpu\/amdappsdk\/assets\/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf. AMD . 2011. HD 6900 series instruction set architecture. http:\/\/developer.amd.com\/gpu\/amdappsdk\/assets\/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1023\/B:IJPP.0000004510.66751.2e"},{"volume-title":"Proceedings of the International Symposium on Performance Analysis of Systems and Software. 163--174","author":"Bakhoda A.","key":"e_1_2_1_5_1","unstructured":"Bakhoda , A. , Yuan , G. L. , Fung , W. W. L. , Wong , H. , and Aamodt , T. M . 2009. Analyzing CUDA workloads using a detailed gpu simulator . In Proceedings of the International Symposium on Performance Analysis of Systems and Software. 163--174 . Bakhoda, A., Yuan, G. L., Fung, W. W. L., Wong, H., and Aamodt, T. M. 2009. Analyzing CUDA workloads using a detailed gpu simulator. In Proceedings of the International Symposium on Performance Analysis of Systems and Software. 163--174."},{"volume-title":"Proceedings of the International Symposium on Microarchitecture. 237--248","author":"Balasubramonian R.","key":"e_1_2_1_6_1","unstructured":"Balasubramonian , R. , Dwarkadas , S. , and Albonesi , D. H . 2001. Reducing the complexity of the register file in dynamic superscalar processors . In Proceedings of the International Symposium on Microarchitecture. 237--248 . Balasubramonian, R., Dwarkadas, S., and Albonesi, D. H. 2001. Reducing the complexity of the register file in dynamic superscalar processors. In Proceedings of the International Symposium on Microarchitecture. 237--248."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/L-CA.2009.45"},{"volume-title":"Proceedings of the International Symposium on High Performance Computer Architecture. 299--310","author":"Borch E.","key":"e_1_2_1_8_1","unstructured":"Borch , E. , Tune , E. , Manne , S. , and Emer , J . 2002. Loose loops sink chips . In Proceedings of the International Symposium on High Performance Computer Architecture. 299--310 . Borch, E., Tune, E., Manne, S., and Emer, J. 2002. Loose loops sink chips. In Proceedings of the International Symposium on High Performance Computer Architecture. 299--310."},{"volume-title":"Proceedings of the International Symposium on Microarchitecture. 27--36","author":"Brekelbaum E.","key":"e_1_2_1_9_1","unstructured":"Brekelbaum , E. , Rupley , J. , Wilkerson , C. , and Black , B . 2002. Hierarchical scheduling windows . In Proceedings of the International Symposium on Microarchitecture. 27--36 . Brekelbaum, E., Rupley, J., Wilkerson, C., and Black, B. 2002. Hierarchical scheduling windows. In Proceedings of the International Symposium on Microarchitecture. 27--36."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1375527.1375541"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/291069.291010"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000064.2000079"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/339647.339708"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1048935.1050187"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/1854273.1854318"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/859618.859647"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1400181.1400197"},{"volume-title":"Proceedings of the International Symposium on Microarchitecture. 236--245","author":"Franklin M.","key":"e_1_2_1_19_1","unstructured":"Franklin , M. and Sohi , G. S . 1992. Register traffic analysis for streamlining inter-operation communication in fine-grain parallel processors . In Proceedings of the International Symposium on Microarchitecture. 236--245 . Franklin, M. and Sohi, G. S. 1992. Register traffic analysis for streamlining inter-operation communication in fine-grain parallel processors. In Proceedings of the International Symposium on Microarchitecture. 236--245."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2010.121"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/266021.266192"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1815961.1815998"},{"volume-title":"Proceedings of the Workshop on Complexity-Effective Design.","author":"Hu Z.","key":"e_1_2_1_23_1","unstructured":"Hu , Z. and Martonosi , M . 2000. Reducing register file power consumption by exploiting value lifetime characteristics . In Proceedings of the Workshop on Complexity-Effective Design. Hu, Z. and Martonosi, M. 2000. Reducing register file power consumption by exploiting value lifetime characteristics. In Proceedings of the Workshop on Complexity-Effective Design."},{"key":"e_1_2_1_24_1","unstructured":"ITRS. 2009. International Technology Roadmap for Semiconductors. http:\/\/itrs.net\/links\/2009ITRS\/Home2009.htm. ITRS . 2009. International Technology Roadmap for Semiconductors. http:\/\/itrs.net\/links\/2009ITRS\/Home2009.htm."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1596510.1596511"},{"volume-title":"2008. ExaScale computing study: Technology challenges in achieving exascale systems. Tech. rep. TR-2008-13","author":"Kogge P.","key":"e_1_2_1_26_1","unstructured":"Kogge , P. , Ed. 2008. ExaScale computing study: Technology challenges in achieving exascale systems. Tech. rep. TR-2008-13 , University of Notre Dame . Kogge, P., Ed. 2008. ExaScale computing study: Technology challenges in achieving exascale systems. Tech. rep. TR-2008-13, University of Notre Dame."},{"volume-title":"Proceedings of the International Symposium on Computer Architecture. 59--70","author":"Lebeck A. R.","key":"e_1_2_1_27_1","unstructured":"Lebeck , A. R. , Koppanalil , J. , Li , T. , Patwardhan , J. , and Rotenberg , E . 2002. A large, fast instruction window for tolerating cache misses . In Proceedings of the International Symposium on Computer Architecture. 59--70 . Lebeck, A. R., Koppanalil, J., Li, T., Patwardhan, J., and Rotenberg, E. 2002. A large, fast instruction window for tolerating cache misses. In Proceedings of the International Symposium on Computer Architecture. 59--70."},{"volume-title":"Proceedings of the IEEE Custom Integrated Circuits Conference. 555--562","author":"Leon A. S.","key":"e_1_2_1_28_1","unstructured":"Leon , A. S. , Langley , B. , and Shin , J. L . 2007. The UltraSPARC T1 processor: CMT reliability . In Proceedings of the IEEE Custom Integrated Circuits Conference. 555--562 . Leon, A. S., Langley, B., and Shin, J. L. 2007. The UltraSPARC T1 processor: CMT reliability. In Proceedings of the IEEE Custom Integrated Circuits Conference. 555--562."},{"key":"e_1_2_1_29_1","unstructured":"MAGMA. MAGMA: Matrix Algebra for GPU and Multicore Architectures. http:\/\/icl.eecs.utk.edu\/magma. MAGMA. MAGMA: Matrix Algebra for GPU and Multicore Architectures. http:\/\/icl.eecs.utk.edu\/magma."},{"key":"e_1_2_1_30_1","unstructured":"Muralimanohar N. Balasubramonian R. and Jouppi N. P. 2009. CACTI 6.0: A tool to model large caches. Tech. rep. HP Laboratories. Muralimanohar N. Balasubramonian R. and Jouppi N. P. 2009. CACTI 6.0: A tool to model large caches. Tech. rep. HP Laboratories."},{"volume-title":"Proceedings of the International Conference on Computer Design on VLSI in Computer & Processors. 301--304","author":"Nuth P. R.","key":"e_1_2_1_31_1","unstructured":"Nuth , P. R. and Dally , W. J . 1991. A mechanism for efficient context switching . In Proceedings of the International Conference on Computer Design on VLSI in Computer & Processors. 301--304 . Nuth, P. R. and Dally, W. J. 1991. A mechanism for efficient context switching. In Proceedings of the International Conference on Computer Design on VLSI in Computer & Processors. 301--304."},{"volume-title":"Proceedings of the International Symposium on High Performance Computer Architecture. 4--13","author":"Nuth P. R.","key":"e_1_2_1_32_1","unstructured":"Nuth , P. R. and Dally , W. J . 1995. The named-state register file: Implementation and performance . In Proceedings of the International Symposium on High Performance Computer Architecture. 4--13 . Nuth, P. R. and Dally, W. J. 1995. The named-state register file: Implementation and performance. In Proceedings of the International Symposium on High Performance Computer Architecture. 4--13."},{"key":"e_1_2_1_33_1","unstructured":"NVIDIA. 2008. Compute Unified Device Architecture Programming Guide Version 2.0. http:\/\/developer.download.nvidia.com\/compute\/cuda\/2_0\/docs\/NVIDIA_CUDA_Programming_Guide_2.0.pdf. NVIDIA. 2008. Compute Unified Device Architecture Programming Guide Version 2.0. http:\/\/developer.download.nvidia.com\/compute\/cuda\/2_0\/docs\/NVIDIA_CUDA_Programming_Guide_2.0.pdf."},{"key":"e_1_2_1_34_1","unstructured":"NVIDIA. 2009. NVIDIA\u2019s Next Generation CUDA Compute Architecture: Fermi. http:\/\/nvidia.com\/content\/PDF\/fermi_white_papers\/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf. NVIDIA. 2009. NVIDIA\u2019s Next Generation CUDA Compute Architecture: Fermi. http:\/\/nvidia.com\/content\/PDF\/fermi_white_papers\/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf."},{"key":"e_1_2_1_35_1","volume-title":"PTX: Parallel Thread Execution ISA Version 2.3","author":"NVIDIA.","year":"2011","unstructured":"NVIDIA. 2011 . PTX: Parallel Thread Execution ISA Version 2.3 . http:\/\/developer.download.nvidia.com\/compute\/cuda\/4_0_rc2\/toolkit\/ docs\/ptx_isa_2.3.pdf. NVIDIA. 2011. PTX: Parallel Thread Execution ISA Version 2.3. http:\/\/developer.download.nvidia.com\/compute\/cuda\/4_0_rc2\/toolkit\/ docs\/ptx_isa_2.3.pdf."},{"key":"e_1_2_1_36_1","unstructured":"Parboil. Parboil Benchmark Suite. http:\/\/impact.crhc.illinois.edu\/parboil.php. Parboil . Parboil Benchmark Suite. http:\/\/impact.crhc.illinois.edu\/parboil.php."},{"volume-title":"Proceedings of the International Symposium on Microarchitecture. 171--182","author":"Park I.","key":"e_1_2_1_37_1","unstructured":"Park , I. , Powell , M. D. , and Vijaykumar , T. N . 2002. Reducing register ports for higher speed and lower energy . In Proceedings of the International Symposium on Microarchitecture. 171--182 . Park, I., Powell, M. D., and Vijaykumar, T. N. 2002. Reducing register ports for higher speed and lower energy. In Proceedings of the International Symposium on Microarchitecture. 171--182."},{"key":"e_1_2_1_38_1","unstructured":"Park J. and Dally W. J. 2011. Guaranteeing Forward Progress of Unified Register Allocation and Instruction Scheduling. Tech. rep. Concurrent VLSI Architecture Group Memo 127 Stanford University. Park J. and Dally W. J. 2011. Guaranteeing Forward Progress of Unified Register Allocation and Instruction Scheduling. Tech. rep. Concurrent VLSI Architecture Group Memo 127 Stanford University."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/1134650.1134675"},{"volume-title":"Proceedings of the International Symposium on Computer Architecture. 318--329","author":"Raasch S. E.","key":"e_1_2_1_40_1","unstructured":"Raasch , S. E. , Binkert , N. L. , and Reinhardt , S. K . 2002. A scalable instruction queue design using dependence chains . In Proceedings of the International Symposium on Computer Architecture. 318--329 . Raasch, S. E., Binkert, N. L., and Reinhardt, S. K. 2002. A scalable instruction queue design using dependence chains. In Proceedings of the International Symposium on Computer Architecture. 318--329."},{"volume-title":"Proceedings of the International Symposium on High Performance Computer Architecture. 375--386","author":"Rixner S.","key":"e_1_2_1_41_1","unstructured":"Rixner , S. , Dally , W. , Khailany , B. , Mattson , P. , Kapasi , U. , and Owens , J . 2000. Register organization for media processing . In Proceedings of the International Symposium on High Performance Computer Architecture. 375--386 . Rixner, S., Dally, W., Khailany, B., Mattson, P., Kapasi, U., and Owens, J. 2000. Register organization for media processing. In Proceedings of the International Symposium on High Performance Computer Architecture. 375--386."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/359327.359336"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/1399504.1360617"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2010.43"},{"key":"e_1_2_1_45_1","unstructured":"Smith R. 2011. AMD Radeon HD 7970 review: 28nm and graphics core next together as one. www.anandtech.com\/show\/5261\/amd-radeon-hd-7970-review. Smith R. 2011. AMD Radeon HD 7970 review: 28nm and graphics core next together as one. www.anandtech.com\/show\/5261\/amd-radeon-hd-7970-review."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/55364.55398"},{"volume-title":"Proceedings of the Symposium on Integrated Circuits and Systems Design. 377--382","author":"Tseng J. H.","key":"e_1_2_1_47_1","unstructured":"Tseng , J. H. and Asanovic , K . 2000. Energy-efficient register access . In Proceedings of the Symposium on Integrated Circuits and Systems Design. 377--382 . Tseng, J. H. and Asanovic, K. 2000. Energy-efficient register access. In Proceedings of the Symposium on Integrated Circuits and Systems Design. 377--382."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2004.8"},{"volume-title":"Proceedings of the International Symposium on Performance Analysis of Systems and Software. 235--246","author":"Wong H.","key":"e_1_2_1_49_1","unstructured":"Wong , H. , Papadopoulou , M.-M. , Sadooghi-Alvandi , M. , and Moshovos , A . 2010. Demystifying GPU microarchitecture through microbenchmarking . In Proceedings of the International Symposium on Performance Analysis of Systems and Software. 235--246 . Wong, H., Papadopoulou, M.-M., Sadooghi-Alvandi, M., and Moshovos, A. 2010. Demystifying GPU microarchitecture through microbenchmarking. In Proceedings of the International Symposium on Performance Analysis of Systems and Software. 235--246."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/377792.377861"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000064.2000094"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/360128.360143"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1023\/B:IJPP.0000042082.31819.6d"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/1165573.1165633"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/1120725.1120979"},{"volume-title":"Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. 269--278","author":"Zhuang X.","key":"e_1_2_1_56_1","unstructured":"Zhuang , X. and Pande , S . 2003. Resolving register bank conflicts for a network processor . In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. 269--278 . Zhuang, X. and Pande, S. 2003. Resolving register bank conflicts for a network processor. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. 269--278."}],"container-title":["ACM Transactions on Computer Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2166879.2166882","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2166879.2166882","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T09:54:47Z","timestamp":1750240487000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2166879.2166882"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,4]]},"references-count":56,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2012,4]]}},"alternative-id":["10.1145\/2166879.2166882"],"URL":"https:\/\/doi.org\/10.1145\/2166879.2166882","relation":{},"ISSN":["0734-2071","1557-7333"],"issn-type":[{"type":"print","value":"0734-2071"},{"type":"electronic","value":"1557-7333"}],"subject":[],"published":{"date-parts":[[2012,4]]},"assertion":[{"value":"2011-08-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2011-10-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2012-04-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}