{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:11:22Z","timestamp":1750306282113,"version":"3.41.0"},"reference-count":25,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2016,3,7]],"date-time":"2016-03-07T00:00:00Z","timestamp":1457308800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2016,4,5]]},"abstract":"<jats:p>Over the last decade, Graphics Processing Unit (GPU) architectures have evolved from a fixed-function graphics pipeline to a programmable, energy-efficient compute accelerator for massively parallel applications. The compute power arises from the GPU\u2019s Single Instruction\/Multiple Threads architecture: concurrently running many threads and executing them as Single Instruction\/Multiple Data--style vectors. However, compute power is still lost due to cycles spent on data movement and control instructions instead of data computations. Even more cycles are lost on pipeline stalls resulting from long latency (memory) operations.<\/jats:p>\n          <jats:p>To improve not only performance but also energy efficiency, we introduce R-GPU: a reconfigurable GPU architecture with communicating cores. R-GPU is an addition to a GPU, which can still be used as such, but also has the ability to reorganize the cores of a GPU in a reconfigurable network. In R-GPU data movement and control is implicit in the configuration of the network. Each core executes a fixed instruction, reducing instruction decode count and increasing energy efficiency. On a number of benchmarks we show an average performance improvement of 2.1 \u00d7 over the same GPU without modifications. 
We further make a conservative power estimation of R-GPU which shows that power consumption can be reduced by 6%, leading to an energy consumption reduction of 55%, while area only increases by a mere 4%.<\/jats:p>","DOI":"10.1145\/2890506","type":"journal-article","created":{"date-parts":[[2016,3,8]],"date-time":"2016-03-08T13:33:07Z","timestamp":1457443987000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["R-GPU"],"prefix":"10.1145","volume":"13","author":[{"given":"Gert-Jan Van Den","family":"Braak","sequence":"first","affiliation":[{"name":"Eindhoven University of Technology, Eindhoven, The Netherlands"}]},{"given":"Henk","family":"Corporaal","sequence":"additional","affiliation":[{"name":"Eindhoven University of Technology, Eindhoven, The Netherlands"}]}],"member":"320","published-online":{"date-parts":[[2016,3,7]]},"reference":[{"doi-asserted-by":"publisher","key":"e_1_2_1_1_1","DOI":"10.1109\/ISPASS.2009.4919648"},{"doi-asserted-by":"publisher","key":"e_1_2_1_2_1","DOI":"10.1145\/2063384.2063400"},{"doi-asserted-by":"publisher","key":"e_1_2_1_3_1","DOI":"10.1145\/2555243.2555258"},{"doi-asserted-by":"publisher","key":"e_1_2_1_4_1","DOI":"10.1109\/IISWC.2009.5306797"},{"doi-asserted-by":"publisher","key":"e_1_2_1_5_1","DOI":"10.1109\/MC.2011.15"},{"doi-asserted-by":"publisher","key":"e_1_2_1_6_1","DOI":"10.1145\/2000064.2000093"},{"doi-asserted-by":"publisher","key":"e_1_2_1_7_1","DOI":"10.1007\/s00138-012-0443-3"},{"doi-asserted-by":"publisher","key":"e_1_2_1_8_1","DOI":"10.1145\/1815961.1815998"},{"doi-asserted-by":"publisher","key":"e_1_2_1_9_1","DOI":"10.1145\/2485922.2485964"},{"doi-asserted-by":"publisher","key":"e_1_2_1_10_1","DOI":"10.1109\/MM.2008.31"},{"doi-asserted-by":"publisher","key":"e_1_2_1_11_1","DOI":"10.1109\/ISPASS.2013.6557150"},{"volume-title":"ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable 
matrix","year":"2003","author":"Mei Bingfeng","key":"e_1_2_1_12_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_13_1","DOI":"10.1109\/MICRO.2007.30"},{"doi-asserted-by":"publisher","key":"e_1_2_1_14_1","DOI":"10.1145\/2155620.2155656"},{"key":"e_1_2_1_15_1","first-page":"680","article-title":"Single interconnect providing read and write access to a memory shared by concurrent threads. (16 2010)","volume":"7","author":"Nickolls J. R.","year":"2010","journal-title":"US Patent"},{"unstructured":"NVIDIA Corporation. 2009. NVIDIA\u2019s Next Generation CUDA Compute Architecture: Fermi. (2009). http:\/\/www.nvidia.com\/content\/pdf\/fermi_white_papers\/nvidia_fermi_compute_architecture_whitepaper.pdf.","key":"e_1_2_1_16_1"},{"volume-title":"CUDA C Programming Guide 5.0. (Oct","year":"2012","author":"NVIDIA Corporation","key":"e_1_2_1_17_1"},{"unstructured":"NVIDIA Corporation. 2012b. NVIDIA\u2019s Next Generation CUDA Compute Architecture: Kepler GK110. (2012). http:\/\/www.nvidia.com\/content\/pdf\/kepler\/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf.","key":"e_1_2_1_18_1"},{"unstructured":"NVIDIA Corporation. 2013. NVIDIA Tegra K1: A New Era in Mobile Computing. 
http:\/\/www.nvidia.com\/content\/pdf\/tegra_white_papers\/tegra_k1_whitepaper_v1.0.pdf.","key":"e_1_2_1_19_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_20_1","DOI":"10.1109\/SASP.2009.5226333"},{"doi-asserted-by":"publisher","key":"e_1_2_1_21_1","DOI":"10.1109\/12.859540"},{"volume-title":"Parboil: A Revised Benchmark Suite for Scientific and Commercial throughput Computing. Technical Report IMPACT-12-01","year":"2012","author":"Stratton John A.","key":"e_1_2_1_22_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_23_1","DOI":"10.1145\/2463596.2486153"},{"doi-asserted-by":"publisher","key":"e_1_2_1_24_1","DOI":"10.5555\/2665671.2665703"},{"doi-asserted-by":"publisher","key":"e_1_2_1_25_1","DOI":"10.1109\/TVLSI.2010.2047415"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2890506","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2890506","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:38:55Z","timestamp":1750221535000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2890506"}},"subtitle":["A Reconfigurable GPU Architecture"],"short-title":[],"issued":{"date-parts":[[2016,3,7]]},"references-count":25,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2016,4,5]]}},"alternative-id":["10.1145\/2890506"],"URL":"https:\/\/doi.org\/10.1145\/2890506","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2016,3,7]]},"assertion":[{"value":"2015-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2016-02-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2016-03-07","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}