{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:19:57Z","timestamp":1750306797954,"version":"3.41.0"},"reference-count":35,"publisher":"Association for Computing Machinery (ACM)","issue":"3s","license":[{"start":{"date-parts":[[2014,3,1]],"date-time":"2014-03-01T00:00:00Z","timestamp":1393632000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"School of Computer Science at the Institute of Research in Fundamental Science"},{"DOI":"10.13039\/501100000038","name":"Natural Sciences and Engineering Research Council of Canada","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100000038","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Embed. Comput. Syst."],"published-print":{"date-parts":[[2014,3]]},"abstract":"<jats:p>SIMT accelerators are equipped with thousands of computational resources. Conventional accelerators, however, fail to fully utilize available resources due to branch and memory divergences. This underutilization is manifested in two underlying inefficiencies: pipeline width underutilization and pipeline depth underutilization. Width underutilization occurs when SIMD execution units are not entirely utilized due to branch divergences. This affects lane activity and results in SIMD inefficiency. Depth underutilization takes place when the pipeline runs out of active threads and is forced to leave pipeline stages idle. This work addresses both inefficiencies by harnessing inactive threads available to the pipeline. We introduce Harnessing inActive thReads in many-core Processors (or simply HARP) to improve width and depth utilization in accelerators. We show how using inactive yet ready threads can enhance performance. 
Moreover, we investigate implementation details and study microarchitectural changes needed to build a HARP-enhanced accelerator. Furthermore, we evaluate HARP under a variety of microarchitectural design points. We measure the area overhead associated with HARP and compare to conventional alternatives. Under Fermi-like GPUs, we show that HARP provides 10% speedup on average (maximum of 1.6X) at the cost of 3.5% area overhead. Our analysis shows that HARP performs better under narrower SIMD and shorter pipelines.<\/jats:p>","DOI":"10.1145\/2567938","type":"journal-article","created":{"date-parts":[[2014,3,25]],"date-time":"2014-03-25T13:34:12Z","timestamp":1395754452000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["HARP"],"prefix":"10.1145","volume":"13","author":[{"given":"Ahmad","family":"Lashgar","sequence":"first","affiliation":[{"name":"University of Tehran"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ahmad","family":"Khonsari","sequence":"additional","affiliation":[{"name":"University of Tehran"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Amirali","family":"Baniasadi","sequence":"additional","affiliation":[{"name":"University of Victoria"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2014,3,28]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"AMD Inc. 2013. AMD accelerated parallel processing opencl programming guide. 
http:\/\/developer.amd.com\/wordpress\/media\/2013\/08\/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf."},{"volume-title":"Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'09)","author":"Bakhoda Ali","key":"e_1_2_1_2_1","unstructured":"Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. 2009. Analyzing cuda workloads using a detailed gpu simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'09). 163--174."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.5555\/2337159.2337166"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_2_1_5_1","unstructured":"Sylvain Collange. 2011. Stack-less simt reconvergence at low cost. Tech. rep. http:\/\/hal.archives-ouvertes.fr\/docs\/00\/62\/26\/54\/PDF\/collange_sympa2011_en.pdf."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155676"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2007.12"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1543753.1543756"},{"volume-title":"Proceedings of the 17th IEEE International Symposium on High Performance Computer Architecture (HPCA'11)","author":"Wilson W.","key":"e_1_2_1_9_1","unstructured":"Wilson W. L. Fung and Tor M. Aamodt. 2011. Thread block compaction for efficient simt control flow. 
In Proceedings of the 17th IEEE International Symposium on High Performance Computer Architecture (HPCA'11). IEEE Computer Society, 25--36."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2010.51"},{"key":"e_1_2_1_11_1","unstructured":"Hynix Semiconductor. 2009. 1Gb (32mx32) gddr5 sgram h5gq1h24afr. http:\/\/www.hynix.com\/datasheet\/pdf\/graphics\/H5GQ1H24AFR(Rev1.0).pdf."},{"key":"e_1_2_1_12_1","unstructured":"Imagination Technologies. 2012. PowerVR series 5 architecture guide for developers. http:\/\/www.imgtec.com\/powervr\/insider\/docs\/PowerVR%20Series%205.Architecture%20Guide%20for%20Developers.pdf."},{"volume-title":"The Art of Computer Systems Performance Analysis","author":"Jain Raj","key":"e_1_2_1_13_1","unstructured":"Raj Jain. 1991. The Art of Computer Systems Performance Analysis. Vol. 182, John Wiley and Sons."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2451116.2451158"},{"key":"e_1_2_1_15_1","unstructured":"Khronos Group. 2013. OpenCL - The open standard for parallel programming of heterogeneous systems. 
http:\/\/www.khronos.org\/opencl\/."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2008.31"},{"key":"e_1_2_1_17_1","unstructured":"Roberto Mijat. 2012. Take gpu processing power beyond graphics with mali gpu computing. http:\/\/malideveloper.arm.com\/downloads\/WhitePaper_GPU_Computing_on_Mali.pdf."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1816038.1815992"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2007.30"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155656"},{"key":"e_1_2_1_21_1","unstructured":"NVIDIA Corp. 2008. NVIDIA cuda sdk 2.3. https:\/\/developer.nvidia.com\/cuda-toolkit-23-downloads."},{"key":"e_1_2_1_22_1","unstructured":"NVIDIA Corp. 2012a. CUDA c programming guide. http:\/\/docs.nvidia.com\/cuda\/cuda-cprogramming-guide\/index.html."},{"key":"e_1_2_1_23_1","unstructured":"NVIDIA Corp. 2012b. Kepler gk110 architecture. http:\/\/www.nvidia.com\/content\/PDF\/kepler\/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf."},{"key":"e_1_2_1_24_1","unstructured":"NVIDIA Corp. 2012c. CUDA c programming guide compute capability section. http:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/#compute-capabilities"},{"key":"e_1_2_1_25_1","unstructured":"NVIDIA Corp. 
2012d. CUDA gpus. https:\/\/developer.nvidia.com\/cuda-gpus."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.5555\/2337159.2337167"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2013.6522352"},{"key":"e_1_2_1_28_1","volume-title":"Geng Daniel Liu, and Wen-Mei W. Hwu","author":"Stratton John A.","year":"2012","unstructured":"John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen-Mei W. Hwu. 2012. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Tech. rep., IMPACT. http:\/\/impact.crhc.illinois.edu\/shared\/report\/impact-12-01.parboil.pdf."},{"volume-title":"Proceedings of the ACM\/IEEE Conference on Supercomputing (SC'08)","author":"Volkov Vasily","key":"e_1_2_1_29_1","unstructured":"Vasily Volkov and James W. Demmel. 2008. Benchmarking gpus to tune dense linear algebra. In Proceedings of the ACM\/IEEE Conference on Supercomputing (SC'08)."},{"key":"e_1_2_1_30_1","unstructured":"Wikipedia. 2013. GeForce 400 series. 
http:\/\/en.wikipedia.org\/wiki\/GeForce_400_Series."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2011.24"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2010.5452013"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342011434814"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1007\/s02011-011-1137-8"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/1669112.1669119"}],"container-title":["ACM Transactions on Embedded Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2567938","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2567938","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T07:34:40Z","timestamp":1750232080000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2567938"}},"subtitle":["Harnessing inactive threads in many-core processors"],"short-title":[],"issued":{"date-parts":[[2014,3]]},"references-count":35,"journal-issue":{"issue":"3s","published-print":{"date-parts":[[2014,3]]}},"alternative-id":["10.1145\/2567938"],"URL":"https:\/\/doi.org\/10.1145\/2567938","relation":{},"ISSN":["1539-9087","1558-3465"],"issn-type":[{"type":"print","value":"1539-9087"},{"type":"electronic","value":"1558-3465"}],"subject":[],"published":{"date-parts":[[2014,3]]},"assertion":[{"value":"2012-12-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2013-10-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2014-03-28","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}