{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,21]],"date-time":"2025-06-21T11:27:08Z","timestamp":1750505228355,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":32,"publisher":"ACM","license":[{"start":{"date-parts":[[2016,6,1]],"date-time":"2016-06-01T00:00:00Z","timestamp":1464739200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2016,6]]},"DOI":"10.1145\/2925426.2926267","type":"proceedings-article","created":{"date-parts":[[2016,6,10]],"date-time":"2016-06-10T13:04:07Z","timestamp":1465563847000},"page":"1-12","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":19,"title":["Barrier-Aware Warp Scheduling for Throughput Processors"],"prefix":"10.1145","author":[{"given":"Yuxi","family":"Liu","sequence":"first","affiliation":[{"name":"Peking University and Shenzhen Institute of Advanced Technology, CAS and Ghent University"}]},{"given":"Zhibin","family":"Yu","sequence":"additional","affiliation":[{"name":"Shenzhen Institute of Advanced Technology, CAS"}]},{"given":"Lieven","family":"Eeckhout","sequence":"additional","affiliation":[{"name":"Ghent University"}]},{"given":"Vijay Janapa","family":"Reddi","sequence":"additional","affiliation":[{"name":"University of Texas at Austin"}]},{"given":"Yingwei","family":"Luo","sequence":"additional","affiliation":[{"name":"Peking University"}]},{"given":"Xiaolin","family":"Wang","sequence":"additional","affiliation":[{"name":"Peking University"}]},{"given":"Zhenlin","family":"Wang","sequence":"additional","affiliation":[{"name":"Michigan Tech University"}]},{"given":"Chengzhong","family":"Xu","sequence":"additional","affiliation":[{"name":"Shenzhen Institute of Advanced Technology, CAS and Wayne State University"}]}],"member":"320","published-online":{"date-parts":[[2016,6]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"version 3.0. NVIDIA CORPORATION","author":"CUDA","year":"2010","unstructured":"CUDA programming guide , version 3.0. NVIDIA CORPORATION , 2010 . CUDA programming guide, version 3.0. NVIDIA CORPORATION, 2010."},{"key":"e_1_3_2_1_2_1","volume-title":"Advanced Micro Devices","author":"ATI","year":"2011","unstructured":"ATI stream technology. Advanced Micro Devices , Inc . http:\/\/www.amd.com\/stream., 2011 . ATI stream technology. Advanced Micro Devices,Inc. http:\/\/www.amd.com\/stream., 2011."},{"key":"e_1_3_2_1_3_1","volume-title":"http:\/\/www.khronos.org\/opencl","author":"OpenCL. Khronos Group","year":"2012","unstructured":"OpenCL. Khronos Group . http:\/\/www.khronos.org\/opencl ., 2012 . OpenCL. Khronos Group. http:\/\/www.khronos.org\/opencl., 2012."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2009.4919648"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.5555\/1413957.1413967"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_3_2_1_7_1","first-page":"3801","volume-title":"Proceedings of the International Symposium on Circuits and Systems (ISCAS)","author":"Feng Wu-Chun","year":"2010","unstructured":"Wu-Chun Feng and Shucai Xiao . To GPU synchronize or not GPU synchronize ? In Proceedings of the International Symposium on Circuits and Systems (ISCAS) , pages 3801 -- 3804 , 2010 . Wu-Chun Feng and Shucai Xiao. To GPU synchronize or not GPU synchronize? In Proceedings of the International Symposium on Circuits and Systems (ISCAS), pages 3801--3804, 2010."},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000064.2000093"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2304576.2304583"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2011.62"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1454115.1454152"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2014.6835938"},{"key":"e_1_3_2_1_13_1","first-page":"395","volume-title":"Proceedings of the International Symposium on Computer Architecture (ISCA)","author":"Jog Adwait","year":"2013","unstructured":"Adwait Jog , Onur Kayiran , Nachiappan Chidambaram Nachiappan , Asit K Mishra , Mahmut T Kandemir , Onur Mutlu , Ravishankar Iyer , and Chita R Das . OWL : Cooperative thread array aware scheduling techniques for improving GPGPU performance . In Proceedings of the International Symposium on Computer Architecture (ISCA) , pages 395 -- 406 , 2013 . Adwait Jog, Onur Kayiran, Nachiappan Chidambaram Nachiappan, Asit K Mishra, Mahmut T Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R Das. OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance. In Proceedings of the International Symposium on Computer Architecture (ISCA), pages 395--406, 2013."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485951"},{"key":"e_1_3_2_1_15_1","first-page":"157","volume-title":"Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT)","author":"Kay\u0131ran Onur","year":"2013","unstructured":"Onur Kay\u0131ran , Adwait Jog , Mahmut Taylan Kandemir , and Chita Ranjan Das . Neither more nor less: Optimizing thread-level parallelism for GPGPUs . In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT) , pages 157 -- 166 , 2013 . Onur Kay\u0131ran, Adwait Jog, Mahmut Taylan Kandemir, and Chita Ranjan Das. Neither more nor less: Optimizing thread-level parallelism for GPGPUs. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 157--166, 2013."},{"key":"e_1_3_2_1_16_1","volume-title":"Workshop on Language, Compiler, and Architecture Support for GPGPU","author":"Lakshminarayana Nagesh B","year":"2010","unstructured":"Nagesh B Lakshminarayana and Hyesoon Kim . Effect of instruction fetch and memory scheduling on GPU performance . In Workshop on Language, Compiler, and Architecture Support for GPGPU , 2010 . Nagesh B Lakshminarayana and Hyesoon Kim. Effect of instruction fetch and memory scheduling on GPU performance. In Workshop on Language, Compiler, and Architecture Support for GPGPU, 2010."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2014.6835937"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750418"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2628071.2628107"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2015.7056024"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/2830772.2830822"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1815961.1815992"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155656"},{"key":"e_1_3_2_1_24_1","unstructured":"CUDA Nvidia. CUDA SDK code samples.  CUDA Nvidia. CUDA SDK code samples."},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2807591.2807598"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2012.16"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2540708.2540718"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2015.7056031"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2013.6522351"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2014.6835939"},{"key":"e_1_3_2_1_32_1","first-page":"1","volume-title":"Proceedings of the International Symposium on Parallel and Distributed Processing (IPDPS)","author":"Xiao Shucai","year":"2010","unstructured":"Shucai Xiao and Wu-Chun Feng . Inter-block GPU communication via fast barrier synchronization . In Proceedings of the International Symposium on Parallel and Distributed Processing (IPDPS) , pages 1 -- 12 , 2010 . Shucai Xiao and Wu-Chun Feng. Inter-block GPU communication via fast barrier synchronization. In Proceedings of the International Symposium on Parallel and Distributed Processing (IPDPS), pages 1--12, 2010."},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2013.82"}],"event":{"name":"ICS '16: 2016 International Conference on Supercomputing","sponsor":["SIGARCH ACM Special Interest Group on Computer Architecture"],"location":"Istanbul Turkey","acronym":"ICS '16"},"container-title":["Proceedings of the 2016 International Conference on Supercomputing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2925426.2926267","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2925426.2926267","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T19:04:25Z","timestamp":1750273465000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2925426.2926267"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,6]]},"references-count":32,"alternative-id":["10.1145\/2925426.2926267","10.1145\/2925426"],"URL":"https:\/\/doi.org\/10.1145\/2925426.2926267","relation":{},"subject":[],"published":{"date-parts":[[2016,6]]},"assertion":[{"value":"2016-06-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}