{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T16:41:27Z","timestamp":1773247287973,"version":"3.50.1"},"reference-count":41,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T00:00:00Z","timestamp":1770681600000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"NSF","award":["2235276, 2349144, 2349143, 2349582, and 2349141"],"award-info":[{"award-number":["2235276, 2349144, 2349143, 2349582, and 2349141"]}]},{"name":"InnoHK initiative of the Innovation and Technology Commission of the Hong Kong Special Administrative Region Government"},{"name":"Hong Kong Jockey Club Charities Trust"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Des. Autom. Electron. Syst."],"published-print":{"date-parts":[[2026,5,31]]},"abstract":"<jats:p>\n                    Graph partitioning is important for the design of many CAD algorithms. However, as the graph size continues to grow, graph partitioning becomes increasingly time-consuming. Recent research has introduced parallel graph partitioners using either multi-core CPUs or GPUs. However, the speedup of existing CPU graph partitioners is typically limited to a few cores, while the performance of GPU-based solutions is algorithmically limited by available GPU memory. To overcome these challenges, we propose G-kway, an efficient multilevel GPU-accelerated\n                    <jats:italic toggle=\"yes\">k<\/jats:italic>\n                    -way graph partitioner. G-kway introduces an effective union find-based coarsening and a novel independent set-based refinement algorithm to significantly accelerate both the coarsening and uncoarsening stages. Furthermore, when kernel launch overhead becomes substantial in the refinement algorithm, G-kway employs CUDA Graph-based uncoarsening to reduce the overhead and improve performance. Experimental results have shown that G-kway outperforms both the state-of-the-art CPU-based and GPU-based parallel partitioners with an average speedup of 8.6\u00d7 and 3.8\u00d7, respectively, while achieving comparable partitioning quality. Additionally, G-kway with CUDA Graph-based uncoarsening can further accelerate graph partitioning, achieving up to 1.93\u00d7 speedup over the default G-kway.\n                  <\/jats:p>","DOI":"10.1145\/3734522","type":"journal-article","created":{"date-parts":[[2025,5,3]],"date-time":"2025-05-03T07:42:28Z","timestamp":1746258148000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["G-kway: Multilevel GPU-Accelerated k-way Graph Partitioner using Task Graph Parallelism"],"prefix":"10.1145","volume":"31","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-2156-2765","authenticated-orcid":false,"given":"Wan Luan","family":"Lee","sequence":"first","affiliation":[{"name":"University of Wisconsin-Madison","place":["Madison, United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3075-7437","authenticated-orcid":false,"given":"Dian-Lun","family":"Lin","sequence":"additional","affiliation":[{"name":"University of Wisconsin-Madison","place":["Madison, United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0718-1003","authenticated-orcid":false,"given":"Shui","family":"Jiang","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong","place":["Ma Liu Shui, Hong Kong"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0406-885X","authenticated-orcid":false,"given":"Cheng-Hsiang","family":"Chiu","sequence":"additional","affiliation":[{"name":"University of Wisconsin-Madison","place":["Madison, United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0977-2774","authenticated-orcid":false,"given":"Yibo","family":"Lin","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6406-4810","authenticated-orcid":false,"given":"Bei","family":"Yu","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong","place":["Ma Liu Shui, Hong Kong"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7348-5625","authenticated-orcid":false,"given":"Tsung-Yi","family":"Ho","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong","place":["Ma Liu Shui, Hong Kong"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9768-3378","authenticated-orcid":false,"given":"Tsung-Wei","family":"Huang","sequence":"additional","affiliation":[{"name":"University of Wisconsin-Madison","place":["Madison, United States"]}]}],"member":"320","published-online":{"date-parts":[[2026,2,10]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2020.3001645"},{"key":"e_1_3_1_3_2","volume-title":"Recent Advances in Graph Partitioning","author":"Bulu\u00e7 Ayd\u0131n","year":"2016","unstructured":"Ayd\u0131n Bulu\u00e7, Henning Meyerhenke, Ilya Safro, Peter Sanders, and Christian Schulz. 2016. Recent Advances in Graph Partitioning. Springer."},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPEC62836.2024.10938506"},{"key":"e_1_3_1_5_2","volume-title":"Multilevel Optimization in VLSICAD","author":"Cong Jingsheng Jason","year":"2013","unstructured":"Jingsheng Jason Cong and Joseph R. Shinnerl. 2013. Multilevel Optimization in VLSICAD. Vol. 14. Springer Science & Business Media."},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.5555\/800263.809204"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.21136\/CMJ.1975.101357"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/800119.803884"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW.2016.16"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCS48598.2019.9188120"},{"key":"e_1_3_1_11_2","series-title":"LIPIcs","first-page":"48:1\u201348:17","volume-title":"29th Annual European Symposium on Algorithms, ESA 2021","author":"Gottesb\u00fcren Lars","year":"2021","unstructured":"Lars Gottesb\u00fcren, Tobias Heuer, Peter Sanders, Christian Schulz, and Daniel Seemaier. 2021. Deep multilevel graph partitioning. In 29th Annual European Symposium on Algorithms, ESA 2021(LIPIcs, Vol. 204). Schloss Dagstuhl-Leibniz-Zentrum f\u00fcr Informatik, 48:1\u201348:17."},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/3400302.3415631"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2020.3007319"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2019.00105"},{"issue":"6","key":"e_1_3_1_15_2","first-page":"1303","article-title":"Taskflow: A lightweight parallel and heterogeneous task graph computing system","volume":"33","author":"Huang Tsung-Wei","year":"2022","unstructured":"Tsung-Wei Huang, Dian-Lun Lin, Chun-Xun Lin, and Yibo Lin. 2022. Taskflow: A lightweight parallel and heterogeneous task graph computing system. IEEE TPDS 33, 6 (2022), 1303\u20131320.","journal-title":"IEEE TPDS"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD.2015.7372666"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-47789-6_4"},{"key":"e_1_3_1_18_2","unstructured":"Laurent Hyafil and Ronald L. Rivest. 1973. Graph partitioning and constructing optimal decision trees are polynomial complete problems. Vol. 1. IRIA Research Report."},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3676641.3715984"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-90-481-9591-6"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1137\/S1064827595287997"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2015.15"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2013.50"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/DAC63849.2025.11132904"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3658617.3697551"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3649329.3656238"},{"key":"e_1_3_1_27_2","volume-title":"Task-Parallel Heterogeneous Programming System for Logic Simulation","author":"Lin Dian-Lun","year":"2024","unstructured":"Dian-Lun Lin. 2024. Task-Parallel Heterogeneous Programming System for Logic Simulation. Ph. D. Dissertation. The University of Wisconsin-Madison."},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPEC43674.2020.9286218"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-85665-6_27"},{"issue":"11","key":"e_1_3_1_30_2","first-page":"3041","article-title":"Accelerating large sparse neural network inference using GPU task graph parallelism","volume":"33","author":"Lin Dian-Lun","year":"2022","unstructured":"Dian-Lun Lin and Tsung-Wei Huang. 2022. Accelerating large sparse neural network inference using GPU task graph parallelism. IEEE Transactions on Parallel and Distributed Systems 33, 11 (2022), 3041\u20133052.","journal-title":"IEEE Transactions on Parallel and Distributed Systems"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-69583-4_11"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/3545008.3545091"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/DAC56929.2023.10247942"},{"key":"e_1_3_1_34_2","volume-title":"Graph Partitioning and Graph Clustering: 10th DIMACS Implementation Challenge Workshop","author":"Meyerhenke Henning","year":"2013","unstructured":"Henning Meyerhenke and David A. Bader. 2013. Graph Partitioning and Graph Clustering: 10th DIMACS Implementation Challenge Workshop. AMS."},{"key":"e_1_3_1_35_2","first-page":"1","article-title":"Approximation algorithms for multilevel graph partitioning","volume":"10","author":"Monien Burkhard","year":"2007","unstructured":"Burkhard Monien, Robert Preis, Stefan Schamberger, and T. Gonzalez. 2007. Approximation algorithms for multilevel graph partitioning. Handbook of Approximation Algorithms and Metaheuristics 10 (2007), 1\u201360.","journal-title":"Handbook of Approximation Algorithms and Metaheuristics"},{"key":"e_1_3_1_36_2","volume-title":"IEEE\/ACM Asia and South Pacific Design Automation Conference (ASP-DAC)","author":"Morchdi Chedi","year":"2024","unstructured":"Chedi Morchdi, Cheng-Hsiang Chiu, Yi Zhou, and Tsung-Wei Huang. 2024. A resource-efficient task scheduling system using reinforcement learning. In IEEE\/ACM Asia and South Pacific Design Automation Conference (ASP-DAC)."},{"key":"e_1_3_1_37_2","unstructured":"NVIDIA. 2019. CUDA Graphs: Accelerating Your GPU Workloads. Retrieved from https:\/\/developer.nvidia.com\/blog\/cuda-graphs\/. Accessed: 2024-08-23."},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/SUPERC.1992.236711"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1016\/0956-0521(91)90014-V"},{"key":"e_1_3_1_40_2","unstructured":"Yuan Lin and Vinod Grover. 2018. Using CUDA Warp-Level Primitives. Nvidia Technical Blog: https:\/\/developer.nvidia.com\/blog\/using-cuda-warp-level-primitives\/"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/3658617.3697738"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3649329.3656230"}],"container-title":["ACM Transactions on Design Automation of Electronic Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3734522","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3734522","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T13:55:07Z","timestamp":1770731707000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3734522"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,10]]},"references-count":41,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2026,5,31]]}},"alternative-id":["10.1145\/3734522"],"URL":"https:\/\/doi.org\/10.1145\/3734522","relation":{},"ISSN":["1084-4309","1557-7309"],"issn-type":[{"value":"1084-4309","type":"print"},{"value":"1557-7309","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,10]]},"assertion":[{"value":"2024-08-30","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-04-23","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-02-10","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}