{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,24]],"date-time":"2025-10-24T16:42:04Z","timestamp":1761324124076,"version":"3.41.0"},"reference-count":37,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2017,11,7]],"date-time":"2017-11-07T00:00:00Z","timestamp":1510012800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100004663","name":"Ministry of Science and Technology of Taiwan","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100004663","id-type":"DOI","asserted-by":"crossref"}]},{"name":"MediaTek Inc., Hsinchu, Taiwan"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Des. Autom. Electron. Syst."],"published-print":{"date-parts":[[2018,3,31]]},"abstract":"<jats:p>\n            A modern GPU can simultaneously process thousands of hardware threads. These threads are grouped into fixed-size SIMD batches that execute the same instruction on vectors of data in lockstep to achieve high throughput and performance. The register files are huge because each SIMD group accesses a dedicated set of vector registers for fast context switching, and consequently the power consumption of register files has become an important issue. One proposed solution is to replace some of the vector registers with scalar registers, as different threads in the same SIMD group operate on scalar values and so the redundant computations and accesses of these scalar values can be eliminated. 
However, it has been observed that a significant number of registers containing affine vectors \u03c5 such that \u03c5[\n            <jats:italic>i<\/jats:italic>\n            ] =\n            <jats:italic>b<\/jats:italic>\n            +\n            <jats:italic>i<\/jats:italic>\n            \u00d7\n            <jats:italic>s<\/jats:italic>\n            can be represented by base\n            <jats:italic>b<\/jats:italic>\n            and stride\n            <jats:italic>s<\/jats:italic>\n            . Therefore, this article proposes an affine register file design for GPUs that is energy efficient because it reduces redundant execution for both uniform and affine vectors. This design uses a pair of registers to store the base and stride of each affine vector and provides specific affine ALUs to execute affine instructions. A compiler analysis has been developed to detect scalars and affine vectors and to annotate instructions to facilitate the corresponding scalar and affine computations. Furthermore, a priority-based register allocation scheme has been implemented to assign scalars and affine vectors to appropriate scalar and affine register files. Experimental results show that this design was able to dispatch 43.56% of the computations to scalar and affine ALUs when using eight scalar and four affine registers per warp. 
As a result, the design also reduced the energy consumption of the register files and ALUs to 21.86% and 26.54%, respectively, and it reduced the overall energy consumption of the GPU by an average of 5.18%.\n          <\/jats:p>","DOI":"10.1145\/3133218","type":"journal-article","created":{"date-parts":[[2017,11,8]],"date-time":"2017-11-08T13:20:33Z","timestamp":1510147233000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":12,"title":["Architecture and Compiler Support for GPUs Using Energy-Efficient Affine Register Files"],"prefix":"10.1145","volume":"23","author":[{"given":"Shao-Chung","family":"Wang","sequence":"first","affiliation":[{"name":"National Tsing-Hua University, Hsinchu, Taiwan"}]},{"given":"Li-Chen","family":"Kan","sequence":"additional","affiliation":[{"name":"MediaTek Inc., Hsinchu, Taiwan"}]},{"given":"Chao-Lin","family":"Lee","sequence":"additional","affiliation":[{"name":"National Tsing-Hua University, Hsinchu, Taiwan"}]},{"given":"Yuan-Shin","family":"Hwang","sequence":"additional","affiliation":[{"name":"National Taiwan University of Science and Technology, Taipei, Taiwan"}]},{"given":"Jenq-Kuen","family":"Lee","sequence":"additional","affiliation":[{"name":"National Tsing-Hua University, Hsinchu, Taiwan"}]}],"member":"320","published-online":{"date-parts":[[2017,11,7]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2013.6522337"},{"volume-title":"White Paper. 
Retrieved","year":"2012","author":"AMD.","key":"e_1_2_1_2_1"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2009.4919648"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2013.6557173"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/88616.88621"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-14122-5_8"},{"volume-title":"Technical Report. Retrieved","year":"2011","author":"Collange Sylvain","key":"e_1_2_1_8_1"},{"volume-title":"Proceedings of Google Summer of Code (GSoC\u201909)","year":"2009","author":"Dominguez Rodrigo","key":"e_1_2_1_9_1"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000064.2000093"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155675"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2013.6522330"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1815961.1815998"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485952"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485934"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2004.1281665"},{"volume-title":"Proceedings of the 2013 IEEE\/ACM International Symposium on Code Generation and Optimization (CGO\u201913)","year":"2013","author":"Lee Yunsup","key":"e_1_2_1_17_1"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485964"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2744769.2744785"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2611758"},{"key":"e_1_2_1_21_1","unstructured":"Daniel Moth. 2012. A code-based introduction to C++ AMP. 
MSDN Magazine-Louisville (April) 28."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1365490.1365500"},{"volume-title":"Whitepaper. Retrieved","year":"2009","author":"NVIDIA.","key":"e_1_2_1_23_1"},{"volume-title":"Whitepaper. Retrieved","year":"2012","key":"e_1_2_1_24_1"},{"volume-title":"Retrieved","year":"2016","author":"NVIDIA.","key":"e_1_2_1_25_1"},{"volume-title":"Retrieved","year":"2016","author":"NVIDIA.","key":"e_1_2_1_26_1"},{"key":"e_1_2_1_27_1","doi-asserted-by":"crossref","unstructured":"Phil Rogers. 2013. Heterogeneous system architecture overview. In Hot Chips Tutorial. 1--41. Available at https:\/\/www.hotchips.org\/.","DOI":"10.1109\/HOTCHIPS.2013.7478286"},{"key":"e_1_2_1_28_1","article-title":"Divergence analysis","volume":"35","author":"Sampaio Diogo","year":"2014","journal-title":"ACM Trans. Program. Lang. Syst."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2010.69"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2827697"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-32820-6_85"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/4.509850"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2464996.2465022"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2014.21"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2014.22"},{"volume-title":"Proceedings of the 16th Workshop on Compilers for Parallel Computing (CPC\u201912)","year":"2012","author":"You Yi-Ping","key":"e_1_2_1_36_1"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000064.2000094"}],"container-title":["ACM Transactions on Design Automation of Electronic 
Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3133218","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3133218","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T02:10:59Z","timestamp":1750212659000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3133218"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,11,7]]},"references-count":37,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2018,3,31]]}},"alternative-id":["10.1145\/3133218"],"URL":"https:\/\/doi.org\/10.1145\/3133218","relation":{},"ISSN":["1084-4309","1557-7309"],"issn-type":[{"type":"print","value":"1084-4309"},{"type":"electronic","value":"1557-7309"}],"subject":[],"published":{"date-parts":[[2017,11,7]]},"assertion":[{"value":"2017-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-08-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-11-07","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}