{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T06:35:11Z","timestamp":1767854111860,"version":"3.49.0"},"reference-count":33,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2019,4,18]],"date-time":"2019-04-18T00:00:00Z","timestamp":1555545600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2019,6,30]]},"abstract":"<jats:p>Graphics Processing Units (GPUs) have become an attractive platform for accelerating challenging applications on a range of platforms, from High Performance Computing (HPC) to full-featured smartphones. They can overcome computational barriers in a wide range of data-parallel kernels. GPUs hide pipeline stalls and memory latency by utilizing efficient thread preemption. But given the demands on the memory hierarchy due to the growth in the number of computing cores on-chip, it has become increasingly difficult to hide all of these stalls.<\/jats:p>\n          <jats:p>In this article, we propose a novel Hint-Assisted Wavefront Scheduler (HAWS) to bypass long-latency stalls. HAWS starts by enhancing a compiler infrastructure to identify potential opportunities that can bypass memory stalls. HAWS includes a wavefront scheduler that can continue to execute instructions in the shadow of a memory stall, executing instructions speculatively, guided by compiler-generated hints. HAWS increases utilization of GPU resources by aggressively fetching\/executing speculatively. 
Based on our simulation results on the AMD Southern Islands GPU architecture, at an estimated cost of 0.4% total chip area, HAWS can improve application performance by 14.6% on average for memory intensive applications.<\/jats:p>","DOI":"10.1145\/3291050","type":"journal-article","created":{"date-parts":[[2019,4,19]],"date-time":"2019-04-19T16:56:23Z","timestamp":1555692983000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":13,"title":["HAWS"],"prefix":"10.1145","volume":"16","author":[{"given":"Xun","family":"Gong","sequence":"first","affiliation":[{"name":"Northeastern University, Boston, MA, USA"}]},{"given":"Xiang","family":"Gong","sequence":"additional","affiliation":[{"name":"Northeastern University, Boston, MA, USA"}]},{"given":"Leiming","family":"Yu","sequence":"additional","affiliation":[{"name":"Northeastern University, Boston, MA, USA"}]},{"given":"David","family":"Kaeli","sequence":"additional","affiliation":[{"name":"Northeastern University, Boston, MA, USA"}]}],"member":"320","published-online":{"date-parts":[[2019,4,18]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the 2nd Workshop on Explicitly Parallel Instruction Computing Architecture and Compilers (EPIC\u201902)","author":"Beyls Kristof","year":"2002","unstructured":"Kristof Beyls and Erik D\u2019Hollander. 2002. Compile-time cache hint generation for EPIC architectures. In Proceedings of the 2nd Workshop on Explicitly Parallel Instruction Computing Architecture and Compilers (EPIC\u201902)."},{"key":"e_1_2_1_2_1","volume-title":"cuDNN: Efficient primitives for deep learning. 
Arxiv Preprint Arxiv:1410.0759","author":"Chetlur Sharan","year":"2014","unstructured":"Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. Arxiv Preprint Arxiv:1410.0759 (2014)."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/2024723.2000093"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.5555\/3049832.3049838"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2017.7975298"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/DICTA.2008.82"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/2830772.2830784"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2508148.2485952"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2499368.2451158"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.5555\/1148882.1148891"},{"key":"e_1_2_1_11_1","volume-title":"Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques. IEEE Press, 157--166","author":"Kay\u0131ran Onur","year":"2013","unstructured":"Onur Kay\u0131ran, Adwait Jog, Mahmut Taylan Kandemir, and Chita Ranjan Das. 2013. Neither more nor less: Optimizing thread-level parallelism for GPGPUs. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques. 
IEEE Press, 157--166."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2016.7446062"},{"key":"e_1_2_1_13_1","volume-title":"Heterogeneous system architecture: A technical review. AMD Fusion Dev. Summit","author":"Kyriazis George","year":"2012","unstructured":"George Kyriazis. 2012. Heterogeneous system architecture: A technical review. AMD Fusion Dev. Summit (2012), 21."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2008.31"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/HOTCHIPS.2012.7476485"},{"key":"e_1_2_1_16_1","volume-title":"An energy-efficient GPGPU register file architecture using racetrack memory","author":"Mao Mengjie","year":"2017","unstructured":"Mengjie Mao, Wujie Wen, Yaojun Zhang, Yiran Chen, and Hai Li. 2017. An energy-efficient GPGPU register file architecture using racetrack memory. IEEE Trans. Comput. (2017)."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/977091.977115"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2003.1196114"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/1816038.1815992"},{"key":"e_1_2_1_20_1","volume-title":"Jouppi","author":"Muralimanohar Naveen","year":"2009","unstructured":"Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P. Jouppi. 2009. CACTI 6.0: A tool to model large caches. HP Lab. 
(2009), 22--31."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155656"},{"key":"e_1_2_1_22_1","unstructured":"CUDA Nvidia. 2007. Compute unified device architecture programming guide."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2010.115"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2013.6522352"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750410"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2012.16"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/2.820037"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2015.7056031"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2010.69"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2370816.2370865"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2967938.2967947"},{"key":"e_1_2_1_32_1","volume-title":"Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE, 199--208","author":"Wang Zhenlin","unstructured":"Zhenlin Wang, Kathryn S. McKinley, Arnold L. Rosenberg, and Charles C. Weems. 2002. Using the compiler to improve cache replacement decisions. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. 
IEEE, 199--208."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2751205.2751234"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3291050","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3291050","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T01:01:52Z","timestamp":1750208512000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3291050"}},"subtitle":["Accelerating GPU Wavefront Execution through Selective Out-of-order Execution"],"short-title":[],"issued":{"date-parts":[[2019,4,18]]},"references-count":33,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2019,6,30]]}},"alternative-id":["10.1145\/3291050"],"URL":"https:\/\/doi.org\/10.1145\/3291050","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,4,18]]},"assertion":[{"value":"2018-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-02-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-04-18","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}