{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:16:46Z","timestamp":1750220206598,"version":"3.41.0"},"reference-count":48,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2022,9,16]],"date-time":"2022-09-16T00:00:00Z","timestamp":1663286400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Beijing Natural Science Foundation","award":["4214063"],"award-info":[{"award-number":["4214063"]}]},{"name":"Beijing Municipal Education Commission","award":["KM202110028011"],"award-info":[{"award-number":["KM202110028011"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2022,12,31]]},"abstract":"<jats:p>As more emerging applications are moving to GPUs, fine-grained synchronization has become imperative. However, their performance can be severely impaired in case of frequent synchronization failures caused by high data contention. Differently from CPUs, GPUs own thousands of hardware threads and adopt single instruction multiple threads paradigm, making it impractical to deploy the CPU contention management mechanisms directly on GPUs. In this article, we design a Software Warp Controlling Framework (SWCF), which employs producer-consumer execution model and leverages GPU hardware barriers to dynamically control the execution of warps at runtime. On the basis of SWCF, we propose a contention management strategy to decrease frequent synchronization failures while avoiding the over-reducing of parallelism. We evaluate SWCF and the proposed strategy on commodity GPUs using a set of applications with fine-grained synchronization. The results show that on V100 GPU our contention management achieves a 4.7X speedup and outperforms the conventional GPU software backoff solution by 42% on average.<\/jats:p>","DOI":"10.1145\/3547301","type":"journal-article","created":{"date-parts":[[2022,7,11]],"date-time":"2022-07-11T11:26:22Z","timestamp":1657538782000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Adaptive Contention Management for Fine-Grained Synchronization on Commodity GPUs"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5637-9417","authenticated-orcid":false,"given":"Lan","family":"Gao","sequence":"first","affiliation":[{"name":"Capital Normal University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3653-7013","authenticated-orcid":false,"given":"Jing","family":"Wang","sequence":"additional","affiliation":[{"name":"Renmin University of China, Beijing"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3969-5607","authenticated-orcid":false,"given":"Weigong","family":"Zhang","sequence":"additional","affiliation":[{"name":"Capital Normal University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,9,16]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.87"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063400"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/2555243.2555258"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485966"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2016.7446071"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080204"},{"key":"e_1_3_1_8_2","volume-title":"Inside Volta: The Worlds Most Advanced Data Center GPU","author":"Durant Luke","year":"2017","unstructured":"Luke Durant, Olivier Giroux, Mark Harris, and Nick Stam. 2017. Inside Volta: The Worlds Most Advanced Data Center GPU. Retrieved March 7, 2022 from https:\/\/developer.nvidia.com\/blog\/inside-volta\/."},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2016.7783714"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2018.00040"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/2540708.2540743"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155655"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2019.2955705"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/2000064.2000093"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2014.6835938"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/2451116.2451158"},{"key":"e_1_3_1_17_2","volume-title":"Proceedings of the Workshop on Language, Compiler, and Architecture Support for GPGPU","author":"Lakshminarayana Nagesh B.","year":"2010","unstructured":"Nagesh B. Lakshminarayana and Hyesoon Kim. 2010. Effect of instruction fetch and memory scheduling on GPU performance. In Proceedings of the Workshop on Language, Compiler, and Architecture Support for GPGPU."},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2014.6835937"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750418"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/2628071.2628107"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/2751205.2751232"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/2751205.2751232"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2015.7056024"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2006.78"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750396"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/2830772.2830822"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPADS.2016.0112"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/2925426.2926267"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/1693453.1693465"},{"key":"e_1_3_1_30_2","volume-title":"GPU Computing Gems Emerald Edition","author":"Martin Burtscher","year":"2011","unstructured":"Burtscher Martin and Keshav Pingali. 2011. An efficient CUDA implementation of the tree-based barnes hut n-body algorithm. In GPU Computing Gems Emerald Edition. Morgan Kaufmann."},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/1815961.1815992"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2008.4636089"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155656"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2013.01.012"},{"volume-title":"CUDA Programming Guide","year":"2022","key":"e_1_3_1_35_2","unstructured":"NVIDIA. 2022. CUDA Programming Guide. Retrieved March 7, 2022 from http:\/\/docs.nvidia.com\/cuda\/pdf\/CUDA_C_Programming_Guide.pdf."},{"volume-title":"PTX ISA","year":"2022","key":"e_1_3_1_36_2","unstructured":"NVIDIA. 2022. PTX ISA. Retrieved March 7, 2022 from https:\/\/docs.nvidia.com\/cuda\/parallel-thread-execution\/index.html."},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2018.00029"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2012.16"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/2540708.2540718"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/1073814.1073861"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2015.7056031"},{"key":"e_1_3_1_42_2","volume-title":"SmallBank Benchmark","author":"Team The H-Store","year":"2013","unstructured":"The H-Store Team. 2013. SmallBank Benchmark. Retrieved March 7, 2022 from https:\/\/hstore.cs.brown.edu\/documentation\/deployment\/benchmarks\/smallbank\/."},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2010.12"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2017.2776908"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304055"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/2903150.2903155"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/2544137.2544139"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2013.82"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/1250662.1250668"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3547301","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3547301","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:02:56Z","timestamp":1750186976000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3547301"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,9,16]]},"references-count":48,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2022,12,31]]}},"alternative-id":["10.1145\/3547301"],"URL":"https:\/\/doi.org\/10.1145\/3547301","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2022,9,16]]},"assertion":[{"value":"2022-03-22","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-06-30","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-09-16","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}