{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,5]],"date-time":"2026-05-05T16:43:41Z","timestamp":1777999421904,"version":"3.51.4"},"reference-count":52,"publisher":"Association for Computing Machinery (ACM)","issue":"2","funder":[{"name":"Strategic Priority Research Program of the Chinese Academy of Sciences","award":["XDB0500102"],"award-info":[{"award-number":["XDB0500102"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2025,6,30]]},"abstract":"<jats:p>Graph Neural Networks (GNNs) have emerged as powerful tools for graph-based machine learning tasks, but their performance is often constrained by inefficient sparse operators and limited hardware utilization during multi-operator workflows. This article presents GNNPilot, a holistic optimization framework that addresses these challenges through three key innovations. First, we introduce two packing strategies for gather operators, including neighbor packing for load balancing in sparser graphs, and bin packing with a new sparse format for enhanced data locality in denser graphs. Second, we propose dynamic parallelization methods and a novel row panel-based kernel fusion technique to optimize complex multi-operator GNN models. Third, we develop a lightweight sampling-based auto-tuning mechanism that adapts the framework\u2019s optimization strategies to varying input characteristics. Built upon tensor expression-based intermediate representations, GNNPilot maintains the flexibility to optimize both popular and customized GNN models. Extensive experiments across diverse GNN models and graph datasets demonstrate that GNNPilot achieves substantial speedups over state-of-the-art implementations in both the performance of single operators and the efficiency of end-to-end inference. 
These results establish GNNPilot as an efficient and adaptive solution for accelerating GNN computations on modern GPU architectures.<\/jats:p>","DOI":"10.1145\/3730586","type":"journal-article","created":{"date-parts":[[2025,4,22]],"date-time":"2025-04-22T11:09:15Z","timestamp":1745320155000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["GNNPilot: A Holistic Framework for High-Performance Graph Neural Network Computations on GPUs"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-8500-6173","authenticated-orcid":false,"given":"Zhengding","family":"Hu","sequence":"first","affiliation":[{"name":"Computer Science and Technology, University of Science and Technology of China","place":["Hefei, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5098-1503","authenticated-orcid":false,"given":"Jingwei","family":"Sun","sequence":"additional","affiliation":[{"name":"Computer Science and Technology, University of Science and Technology of China","place":["Hefei, China"]},{"name":"China and Laoshan Laboratory","place":["Hefei, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0794-7681","authenticated-orcid":false,"given":"Guangzhong","family":"Sun","sequence":"additional","affiliation":[{"name":"Computer Science and Technology, University of Science and Technology of China","place":["Hefei, China"]}]}],"member":"320","published-online":{"date-parts":[[2025,7,2]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"2021. dgSPARSE. 
Retrieved April 29 2025 from https:\/\/dgsparse.github.io\/"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2014.125"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2016.110"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1609\/icwsm.v14i1.7347"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1177\/1094342007083801"},{"key":"e_1_3_2_7_2","first-page":"578","volume-title":"Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et\u00a0al. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 578\u2013594. Retrieved from https:\/\/www.usenix.org\/conference\/osdi18\/presentation\/chen"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476182"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3400302.3415610"},{"key":"e_1_3_2_10_2","unstructured":"Sharan Chetlur Cliff Woolley Philippe Vandermersch Jonathan Cohen John Tran Bryan Catanzaro and Evan Shelhamer. 2014. Cudnn: Efficient primitives for deep learning. arXiv:1410.0759. Retrieved April 29 2025 from https:\/\/arxiv.org\/abs\/1410.0759"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3489517.3530508"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/2049662.2049663"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41404.2022.00071"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3620666.3651378"},{"key":"e_1_3_2_15_2","unstructured":"Matthias Fey and Jan Eric Lenssen. 2019. Fast graph representation learning with PyTorch geometric. arXiv:1903.02428. 
Retrieved April 29 2025 from https:\/\/arxiv.org\/abs\/1903.02428"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.physrep.2009.11.002"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581784.3607040"},{"key":"e_1_3_2_18_2","first-page":"5119","article-title":"Learning graph representations with embedding propagation","volume":"30","author":"Duran Alberto Garcia","year":"2017","unstructured":"Alberto Garcia Duran and Mathias Niepert. 2017. Learning graph representations with embedding propagation. Advances in Neural Information Processing Systems 30, 1 (2017), 5119\u20135130. Retrieved April 29, 2025 from https:\/\/proceedings.neurips.cc\/paper\/2017\/hash\/e0688d13958a19e087e123148555e4b4-Abstract.html","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3620666.3651351"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41404.2022.00077"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00020"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3293883.3295712"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3293883.3295712"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/77726.255144"},{"key":"e_1_3_2_25_2","first-page":"22118","article-title":"Open graph benchmark: Datasets for machine learning on graphs","volume":"33","author":"Hu Weihua","year":"2020","unstructured":"Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems 33, 1 (2020), 22118\u201322133. 
Retrieved April 29, 2025 from https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2020\/file\/fb60d411a5c5b72b2e7d3527cfc84fd0-Paper.pdf","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00075"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS57955.2024.00010"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00076"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3437801.3441585"},{"key":"e_1_3_2_30_2","unstructured":"Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv:1609.02907. Retrieved April 29 2025 from https:\/\/arxiv.org\/abs\/1609.02907"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/1553374.1553447"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/2751205.2751209"},{"key":"e_1_3_2_33_2","first-page":"443","volume-title":"Proceedings of the 2019 USENIX Annual Technical Conference (USENIX ATC 19)","author":"Ma Lingxiao","year":"2019","unstructured":"Lingxiao Ma, Zhi Yang, Youshan Miao, Jilong Xue, Ming Wu, Lidong Zhou, and Yafei Dai. 2019. NeuGraph: Parallel deep neural network computation on large graphs. In Proceedings of the 2019 USENIX Annual Technical Conference (USENIX ATC 19). 443\u2013458. Retrieved from https:\/\/www.usenix.org\/conference\/atc19\/presentation\/ma"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.576"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS49936.2021.00016"},{"key":"e_1_3_2_36_2","unstructured":"NVIDIA. [n.d.]. cuSPARSE API Reference. Retrieved April 29 2025 from https:\/\/docs.nvidia.com\/cuda\/cusparse\/index.html"},{"key":"e_1_3_2_37_2","unstructured":"NVIDIA. 2018. Nsight Compute. 
Retrieved April 29 2025 from https:\/\/docs.nvidia.com\/nsight-compute\/NsightCompute\/index.html"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS49936.2021.00034"},{"key":"e_1_3_2_39_2","volume-title":"SPARSKIT: A Basic Tool Kit for Sparse Matrix Computations","author":"Saad Youcef","year":"1990","unstructured":"Youcef Saad. 1990. SPARSKIT: A Basic Tool Kit for Sparse Matrix Computations. Technical Report. Retrieved from https:\/\/ntrs.nasa.gov\/citations\/19910023551"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1093\/NAR\/GKY1131"},{"key":"e_1_3_2_41_2","unstructured":"Petar Veli\u010dkovi\u0107 Guillem Cucurull Arantxa Casanova Adriana Romero Pietro Lio and Yoshua Bengio. 2017. Graph attention networks. arXiv:1710.10903. Retrieved April 29 2025 from https:\/\/arxiv.org\/abs\/1710.10903"},{"key":"e_1_3_2_42_2","volume-title":"Proceedings of the ICLR Workshop on Representation Learning on Graphs and Manifolds","author":"Wang Minjie Yu","year":"2019","unstructured":"Minjie Yu Wang. 2019. Deep graph library: Towards efficient and scalable deep learning on graphs. In Proceedings of the ICLR Workshop on Representation Learning on Graphs and Manifolds. Retrieved from https:\/\/par.nsf.gov\/biblio\/10311680"},{"key":"e_1_3_2_43_2","first-page":"515","volume-title":"Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21)","author":"Wang Yuke","year":"2021","unstructured":"Yuke Wang, Boyuan Feng, Gushu Li, Shuangchen Li, Lei Deng, Yuan Xie, and Yufei Ding. 2021. GNNAdvisor: An adaptive and efficient runtime system for GNN acceleration on GPUs. In Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21). 515\u2013531. 
Retrieved from https:\/\/www.usenix.org\/conference\/osdi21\/presentation\/wang-yuke"},{"key":"e_1_3_2_44_2","first-page":"149","volume-title":"Proceedings of the 2023 USENIX Annual Technical Conference (USENIX ATC 23)","author":"Wang Yuke","year":"2023","unstructured":"Yuke Wang, Boyuan Feng, Zheng Wang, Guyue Huang, and Yufei Ding. 2023. TC-GNN: Bridging sparse GNN computation and dense tensor cores on GPUs. In Proceedings of the 2023 USENIX Annual Technical Conference (USENIX ATC 23). 149\u2013164. Retrieved from https:\/\/www.usenix.org\/conference\/atc23\/presentation\/wang-yuke"},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3620666.3651322"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/3447786.3456247"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1039\/C7SC02664A"},{"key":"e_1_3_2_48_2","unstructured":"Keyulu Xu Weihua Hu Jure Leskovec and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv:1810.00826. Retrieved April 29 2025 from https:\/\/arxiv.org\/abs\/1810.00826"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/2555243.2555255"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-96983-1_48"},{"key":"e_1_3_2_51_2","first-page":"467","article-title":"Understanding gnn computational graph: A coordinated computation, io, and memory perspective","volume":"4","author":"Zhang Hengrui","year":"2022","unstructured":"Hengrui Zhang, Zhongming Yu, Guohao Dai, Guyue Huang, Yufei Ding, Yuan Xie, and Yu Wang. 2022. Understanding gnn computational graph: A coordinated computation, io, and memory perspective. Proceedings of Machine Learning and Systems 4, 1 (2022), 467\u2013484. 
Retrieved April 29, 2025 from https:\/\/proceedings.mlsys.org\/paper_files\/paper\/2022\/hash\/b559156047e50cf316207249d0b5a6c5-Abstract.html","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1145\/3575693.3575723"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPEC.2013.6670336"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3730586","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,2]],"date-time":"2025-07-02T12:20:20Z","timestamp":1751458820000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3730586"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,30]]},"references-count":52,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2025,6,30]]}},"alternative-id":["10.1145\/3730586"],"URL":"https:\/\/doi.org\/10.1145\/3730586","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,6,30]]},"assertion":[{"value":"2024-12-22","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-03-22","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-07-02","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}