{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:35:12Z","timestamp":1750221312714,"version":"3.41.0"},"reference-count":29,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2017,12,5]],"date-time":"2017-12-05T00:00:00Z","timestamp":1512432000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"YESS","award":["20150090"],"award-info":[{"award-number":["20150090"]}]},{"DOI":"10.13039\/100000001","name":"NSF","doi-asserted-by":"publisher","award":["61433019, 61472435, 61202129, 61572058, 61472431, 61402501, and U14352217"],"award-info":[{"award-number":["61433019, 61472435, 61202129, 61572058, 61472431, 61402501, and U14352217"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2017,12,31]]},"abstract":"<jats:p>\n            The architecture and programming model of current GPGPUs are best suited for applications that are dominated by structured control and data flows across large regular datasets. Parallel workloads with irregular control and data structures cannot easily harness the processing power of the GPGPU. One approach for mapping these irregular-parallel workloads to GPGPUs is using work-queues. The work-queue approach improves the utilization of SIMD units by only processing useful works that are dynamically generated during execution. As current GPGPUs lack necessary supports for work-queues, a software-based work-queue implementation often suffers from memory contention and load balancing issues. 
In this article, we present a novel hardware work-queue design named\n            <jats:italic>DaQueue<\/jats:italic>\n            , which incorporates three data-aware features to improve the efficiency of work-queues on GPGPUs. We evaluate our proposal on the irregular-parallel workloads and carry out a case study on a path tracing pipeline with a cycle-level simulator. Experimental results show that for the tested workloads, DaQueue improves performance by 1.53\u00d7 on average and up to 1.91\u00d7. Compared to a hardware worklist approach that is the state-of-the-art prior work, DaQueue can achieve an average of 33.92% extra speedup with less hardware area cost.\n          <\/jats:p>","DOI":"10.1145\/3151035","type":"journal-article","created":{"date-parts":[[2017,12,6]],"date-time":"2017-12-06T21:23:15Z","timestamp":1512595395000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Improving the Efficiency of GPGPU Work-Queue Through Data Awareness"],"prefix":"10.1145","volume":"14","author":[{"given":"Libo","family":"Huang","sequence":"first","affiliation":[{"name":"National University of Defense Technology, Changsha, Hunan, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yashuai","family":"L\u00fc","sequence":"additional","affiliation":[{"name":"Space Engineering University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Li","family":"Shen","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, Changsha, Hunan, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhiying","family":"Wang","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, Changsha, Hunan, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2017,12,5]]},"reference":[{"volume-title":"Dynamic Parallelism in 
CUDA. White Paper","author":"NVIDIA.","key":"e_1_2_1_1_1","unstructured":"NVIDIA. 2012. Dynamic Parallelism in CUDA. White Paper . NVIDIA , Santa Clara, CA . NVIDIA. 2012. Dynamic Parallelism in CUDA. White Paper. NVIDIA, Santa Clara, CA."},{"volume-title":"The OpenCL C Specification Version: 2.0","author":"Khronos Group","key":"e_1_2_1_2_1","unstructured":"Khronos Group . 2015. The OpenCL C Specification Version: 2.0 . Khronos Group , Beaverton, OR . Khronos Group. 2015. The OpenCL C Specification Version: 2.0. Khronos Group, Beaverton, OR."},{"volume-title":"CUDA C Programming Guide v7.5","author":"NVIDIA.","key":"e_1_2_1_3_1","unstructured":"NVIDIA. 2016. CUDA C Programming Guide v7.5 . NVIDIA , Santa Clara, CA . NVIDIA. 2016. CUDA C Programming Guide v7.5. NVIDIA, Santa Clara, CA."},{"volume-title":"Parallel Thread Execution ISA v4.3","author":"NVIDIA.","key":"e_1_2_1_4_1","unstructured":"NVIDIA. 2016. Parallel Thread Execution ISA v4.3 . NVIDIA , Santa Clara, CA . NVIDIA. 2016. Parallel Thread Execution ISA v4.3. NVIDIA, Santa Clara, CA."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.5555\/1921479.1921497"},{"volume-title":"Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201909)","author":"Bakhoda A.","key":"e_1_2_1_6_1","unstructured":"A. Bakhoda , G. L. Yuan , W. W. L. Fung , H. Wong , and T. M. Aamodt . 2009. Analyzing CUDA workloads using a detailed GPU simulator . In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201909) . 163--174. A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201909). 
163--174."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063400"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2012.6402918"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2018323.2018333"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485951"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/15922.15902"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.24"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2492045.2492060"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2010.44"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2145816.2145832"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2013.28"},{"volume-title":"Proceeding of the 41st Annual International Symposium on Computer Architecture (ISCA\u201914)","author":"Orr Marc S.","key":"e_1_2_1_17_1","unstructured":"Marc S. Orr , Bradford M. Beckmann , Steven K. Reinhardt , and David A. Wood . 2014. Fine-grain task aggregation and coordination on GPUs . In Proceeding of the 41st Annual International Symposium on Computer Architecture (ISCA\u201914) . IEEE, Los Alamitos, CA, 181--192. http:\/\/dl.acm.org\/citation.cfm?id&equals;2665671.2665701 Marc S. Orr, Bradford M. Beckmann, Steven K. Reinhardt, and David A. Wood. 2014. Fine-grain task aggregation and coordination on GPUs. In Proceeding of the 41st Annual International Symposium on Computer Architecture (ISCA\u201914). IEEE, Los Alamitos, CA, 181--192. http:\/\/dl.acm.org\/citation.cfm?id&equals;2665671.2665701"},{"volume-title":"Physically Based Rendering","author":"Pharr Matt","key":"e_1_2_1_18_1","unstructured":"Matt Pharr and Greg Humphreys . 2010. Physically Based Rendering , Second Edition : From Theory to Implementation. Morgan Kaufmann . Matt Pharr and Greg Humphreys. 2010. 
Physically Based Rendering, Second Edition: From Theory to Implementation. Morgan Kaufmann."},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT\u201913)","author":"Sethia Ankit","year":"2013","unstructured":"Ankit Sethia , Ganesh Dasika , Mehrzad Samadi , and Scott Mahlke . 2013 . APOGEE: Adaptive prefetching on GPUs for energy efficiency . In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT\u201913) . IEEE, Los Alamitos, CA, 73--82. http:\/\/dl.acm.org\/citation.cfm?id&equals;2523721.2523735 Ankit Sethia, Ganesh Dasika, Mehrzad Samadi, and Scott Mahlke. 2013. APOGEE: Adaptive prefetching on GPUs for energy efficiency. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT\u201913). IEEE, Los Alamitos, CA, 73--82. http:\/\/dl.acm.org\/citation.cfm?id&equals;2523721.2523735"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2010.45"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/2366145.2366180"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2661229.2661250"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/1477926.1477930"},{"key":"e_1_2_1_24_1","volume-title":"Proceedings of the Conference on High Performance Graphics (HPG\u201910)","author":"Tzeng Stanley","year":"2010","unstructured":"Stanley Tzeng , Anjul Patney , and John D. Owens . 2010. Task management for irregular-parallel workloads on the GPU . In Proceedings of the Conference on High Performance Graphics (HPG\u201910) . 29--37. http:\/\/dl.acm.org\/citation.cfm?id&equals;1921479.1921485 Stanley Tzeng, Anjul Patney, and John D. Owens. 2010. Task management for irregular-parallel workloads on the GPU. In Proceedings of the Conference on High Performance Graphics (HPG\u201910). 29--37. 
http:\/\/dl.acm.org\/citation.cfm?id&equals;1921479.1921485"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2018323.2018330"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/2018323.2018331"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750393"},{"volume-title":"Proceedings of the 2014 IEEE International Symposium on Workload Characterization (HSWC\u201914)","author":"Wang J.","key":"e_1_2_1_28_1","unstructured":"J. Wang and S. Yalamanchili . 2014. Characterization and analysis of dynamic parallelism in unstructured GPU applications . In Proceedings of the 2014 IEEE International Symposium on Workload Characterization (HSWC\u201914) . 51--60. J. Wang and S. Yalamanchili. 2014. Characterization and analysis of dynamic parallelism in unstructured GPU applications. In Proceedings of the 2014 IEEE International Symposium on Workload Characterization (HSWC\u201914). 51--60."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/2018323.2018335"}],"container-title":["ACM Transactions on Architecture and Code 
Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3151035","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3151035","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3151035","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T02:13:39Z","timestamp":1750212819000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3151035"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,12,5]]},"references-count":29,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2017,12,31]]}},"alternative-id":["10.1145\/3151035"],"URL":"https:\/\/doi.org\/10.1145\/3151035","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2017,12,5]]},"assertion":[{"value":"2017-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-10-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-12-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}