{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,14]],"date-time":"2026-02-14T10:25:50Z","timestamp":1771064750475,"version":"3.50.1"},"reference-count":59,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2024,6,8]],"date-time":"2024-06-08T00:00:00Z","timestamp":1717804800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Parallel Comput."],"published-print":{"date-parts":[[2024,6,30]]},"abstract":"<jats:p>Graph Neural Networks (GNNs) are an emerging class of deep learning models specifically designed for graph-structured data. They have been effectively employed in a variety of real-world applications, including recommendation systems, drug development, and analysis of social networks. The GNN computation includes regular neural network operations and general graph convolution operations, which take most of the total computation time. Though several recent works have been proposed to accelerate the computation for GNNs, they face the limitations of heavy pre-processing, low efficiency atomic operations, and unnecessary kernel launches. In this article, we design<jats:sc>TLPGNN<\/jats:sc>, a lightweight two-level parallelism paradigm for GNN computation. First, we conduct a systematic analysis of the hardware resource usage of GNN workloads to understand the characteristics of GNN workloads deeply. With the insightful observations, we then divide the GNN computation into two levels, i.e.,<jats:italic>vertex parallelism<\/jats:italic>for the first level and<jats:italic>feature parallelism<\/jats:italic>for the second. Next, we employ a novel hybrid dynamic workload assignment to address the imbalanced workload distribution. 
Furthermore, we fuse the kernels to reduce the number of kernel launches and cache the frequently accessed data into registers to avoid unnecessary memory traffic. To scale <jats:sc>TLPGNN<\/jats:sc> to multi-GPU environments, we propose an edge-aware row-wise 1-D partition method to ensure a balanced workload distribution across different GPU devices. Experimental results on various benchmark datasets demonstrate the superiority of our approach, achieving substantial performance improvement over state-of-the-art GNN computation systems, including Deep Graph Library (DGL), GNNAdvisor, and FeatGraph, with speedups of 6.1\u00d7, 7.7\u00d7, and 3.0\u00d7, respectively, on average. Evaluations of multiple-GPU <jats:sc>TLPGNN<\/jats:sc> also demonstrate that our solution achieves both linear scalability and a well-balanced workload distribution.<\/jats:p>","DOI":"10.1145\/3644712","type":"journal-article","created":{"date-parts":[[2024,2,9]],"date-time":"2024-02-09T11:54:40Z","timestamp":1707479680000},"page":"1-28","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["TLPGNN: A Lightweight Two-level Parallelism Paradigm for Graph Neural Network Computation on Single and Multiple GPUs"],"prefix":"10.1145","volume":"11","author":[{"given":"Qiang","family":"Fu","sequence":"first","affiliation":[{"name":"Advanced Micro Devices Inc Austin, Austin, USA"}]},{"given":"Yuede","family":"Ji","sequence":"additional","affiliation":[{"name":"Computer Science and Engineering, University of North Texas, Denton, USA"}]},{"given":"Thomas","family":"Rolinger","sequence":"additional","affiliation":[{"name":"NVIDIA, Austin, USA"}]},{"given":"H. 
Howie","family":"Huang","sequence":"additional","affiliation":[{"name":"The George Washington University, Washington, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,6,8]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"Mart\u00edn Abadi Paul Barham Jianmin Chen Zhifeng Chen Andy Davis Jeffrey Dean Matthieu Devin Sanjay Ghemawat Geoffrey Irving Michael Isard Manjunath Kudlur Josh Levenberg Rajat Monga Sherry Moore Derek G. Murray Benoit Steiner Paul Tucker Vijay Vasudevan Pete Warden Martin Wicke Yuan Yu and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201916). USENIX Association Savannah GA 265\u2013283."},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2016.110"},{"key":"e_1_3_2_4_2","unstructured":"Siddhant Arora. 2020. A survey on graph neural networks for knowledge graph completion. arXiv preprint arXiv:2007.12374 (2020)."},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2015.12"},{"key":"e_1_3_2_6_2","unstructured":"Alaa Bessadok Mohamed Ali Mahjoub and Islem Rekik. 2021. Graph neural networks in network neuroscience. arXiv preprint arXiv:2106.03535 (2021)."},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/3078597.3078616"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/CGO51591.2021.9370321"},{"key":"e_1_3_2_9_2","unstructured":"Xavier Bresson and Thomas Laurent. 2017. Residual gated graph convnets. arXiv preprint arXiv:1711.07553 (2017)."},{"key":"e_1_3_2_10_2","unstructured":"Tianqi Chen Mu Li Yutian Li Min Lin Naiyan Wang Minjie Wang Tianjun Xiao Bing Xu Chiyuan Zhang and Zheng Zhang. 2015. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. 
arXiv preprint arXiv:1512.01274 (2015)."},{"key":"e_1_3_2_11_2","unstructured":"Tianqi Chen Thierry Moreau Ziheng Jiang Lianmin Zheng Eddie Yan Haichen Shen Meghan Cowan Leyuan Wang Yuwei Hu Luis Ceze Carlos Guestrin and Arvind Krishnamurthy. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918). USENIX Association Carlsbad CA 578\u2013594."},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/3292500.3330925"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/2365952.2365968"},{"key":"e_1_3_2_14_2","unstructured":"Vijay Prakash Dwivedi Chaitanya K. Joshi Anh Tuan Luu Thomas Laurent Yoshua Bengio and Xavier Bresson. 2023. Benchmarking graph neural networks. Journal of Machine Learning Research 24 43 (2023) 1\u201348."},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3308558.3313488"},{"key":"e_1_3_2_16_2","unstructured":"Matthias Fey and Jan Eric Lenssen. 2019. Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428 (2019)."},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/3472456.3473511"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3366423.3380297"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41524-021-00554-0"},{"key":"e_1_3_2_20_2","unstructured":"Yang Gao Yi-Fan Li Yu Lin Hang Gao and Latifur Khan. 2020. Deep learning on knowledge graph for recommender system: A survey. arXiv preprint arXiv:2004.00387 (2020)."},{"key":"e_1_3_2_21_2","first-page":"432","volume-title":"Proceedings of the SAI Intelligent Systems Conference","author":"Hu Dichao","year":"2019","unstructured":"Dichao Hu. 2019. An introductory survey on attention mechanisms in NLP problems. In Proceedings of the SAI Intelligent Systems Conference. 
Springer, 432\u2013448."},{"key":"e_1_3_2_22_2","unstructured":"Weihua Hu Matthias Fey Marinka Zitnik Yuxiao Dong Hongyu Ren Bowen Liu Michele Catasta and Jure Leskovec. 2020. Open graph benchmark: Datasets for machine learning on graphs. In Advances in Neural Information Processing Systems Vol. 33 22118\u201322133."},{"key":"e_1_3_2_23_2","unstructured":"Yuwei Hu Zihao Ye Minjie Wang Jiali Yu Da Zheng Mu Li Zheng Zhang Zhiru Zhang and Yida Wang. 2020. Featgraph: A flexible and efficient backend for graph neural network systems. In SC20: International Conference for High Performance Computing Networking Storage and Analysis IEEE 1\u201313."},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3437801.3441585"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3369583.3392690"},{"key":"e_1_3_2_26_2","first-page":"731","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Ji Yuede","year":"2018","unstructured":"Yuede Ji, Hang Liu, and H Howie Huang. 2018. ispan: Parallel identification of strongly connected components with spanning trees. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 731\u2013742."},{"key":"e_1_3_2_27_2","first-page":"187","article-title":"Improving the accuracy, scalability, and performance of graph neural networks with roc","volume":"2","author":"Jia Zhihao","year":"2020","unstructured":"Zhihao Jia, Sina Lin, Mingyu Gao, Matei Zaharia, and Alex Aiken. 2020. Improving the accuracy, scalability, and performance of graph neural networks with roc. 
Proceedings of Machine Learning and Systems 2 (2020), 187\u2013198.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/2600212.2600227"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00010"},{"key":"e_1_3_2_30_2","unstructured":"Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)."},{"key":"e_1_3_2_31_2","first-page":"411","volume-title":"Proceedings of the 2019 USENIX Annual Technical Conference","author":"Liu Hang","year":"2019","unstructured":"Hang Liu and H. Howie Huang. 2019. Simd-x: Programming and processing of graph algorithms on gpus. In Proceedings of the 2019 USENIX Annual Technical Conference. 411\u2013428."},{"key":"e_1_3_2_32_2","doi-asserted-by":"crossref","unstructured":"Qingsong Lv Ming Ding Qiang Liu Yuxiang Chen Wenzheng Feng Siming He Chang Zhou Jianguo Jiang Yuxiao Dong and Jie Tang. 2021. Are we really making much progress? revisiting benchmarking and refining heterogeneous graph neural networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 1150\u20131160.","DOI":"10.1145\/3447548.3467350"},{"key":"e_1_3_2_33_2","first-page":"443","volume-title":"Proceedings of the 2019 USENIX Annual Technical Conference","author":"Ma Lingxiao","year":"2019","unstructured":"Lingxiao Ma, Zhi Yang, Youshan Miao, Jilong Xue, Ming Wu, Lidong Zhou, and Yafei Dai. 2019. Neugraph: Parallel deep neural network computation on large graphs. In Proceedings of the 2019 USENIX Annual Technical Conference. 
443\u2013458."},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807184"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1023\/A:1009953814988"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/2818185"},{"key":"e_1_3_2_37_2","unstructured":"Paulius Micikevicius. 2012. Retrieved from https:\/\/on-demand.gputechconf.com\/gtc\/2012\/presentations\/S0514-GTC2012-GPU-Performance-Analysis.pdf. Accessed 25-August-2022."},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/2567948.2576939"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/2517349.2522739"},{"key":"e_1_3_2_40_2","article-title":"Cuda C++ Programming Guide","year":"2021","unstructured":"NVIDIA. 2021. Cuda C++ Programming Guide. Retrieved from https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html. Accessed 20-August-2021.","journal-title":"R"},{"key":"e_1_3_2_41_2","unstructured":"NVIDIA. 2021. cuSPARSE. Retrieved from https:\/\/developer.nvidia.com\/cusparse. Accessed 25-August-2021."},{"key":"e_1_3_2_42_2","unstructured":"NVIDIA. 2021. Nvidia Nsight Compute. Retrieved from https:\/\/developer.nvidia.com\/nsight-compute"},{"key":"e_1_3_2_43_2","unstructured":"NVIDIA. 2023. Overview of NCCL. Retrieved from https:\/\/docs.nvidia.com\/deeplearning\/nccl\/user-guide\/docs\/overview.html. Accessed 25-August-2021."},{"key":"e_1_3_2_44_2","unstructured":"Adam Paszke Sam Gross Francisco Massa Adam Lerer James Bradbury Gregory Chanan Trevor Killeen Zeming Lin Natalia Gimelshein Luca Antiga Alban Desmaison Andreas Kopf Edward Yang Zachary DeVito Martin Raison Alykhan Tejani Sasank Chilamkurthy Benoit Steiner Lu Fang Junjie Bai and Soumith Chintala. 2019. PyTorch: An imperative style high-performance deep learning library. In Advances in Neural Information Processing Systems Vol. 
32."},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS49936.2021.00034"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/2517349.2522740"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/2442516.2442530"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS47924.2020.00100"},{"key":"e_1_3_2_49_2","unstructured":"Petar Veli\u010dkovi\u0107 Guillem Cucurull Arantxa Casanova Adriana Romero Pietro Lio and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)."},{"key":"e_1_3_2_50_2","unstructured":"Minjie Wang Da Zheng Zihao Ye Quan Gan Mufei Li Xiang Song Jinjing Zhou Chao Ma Lingfan Yu Yu Gai Tianjun Xiao Tong He George Karypis Jinyang Li and Zheng Zhang. 2019. Deep graph library: A graph-centric highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315 (2019)."},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1145\/2851141.2851145"},{"key":"e_1_3_2_52_2","unstructured":"Yuke Wang Boyuan Feng Gushu Li Shuangchen Li Lei Deng Yuan Xie and Yufei Ding. 2021. GNNAdvisor: An adaptive and efficient runtime system for GNN acceleration on GPUs. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201921) 515\u2013531."},{"key":"e_1_3_2_53_2","article-title":"Biological network \u2014 Wikipedia, The Free Encyclopedia","author":"contributors Wikipedia","year":"2021","unstructured":"Wikipedia contributors. 2021. Biological network \u2014 Wikipedia, The Free Encyclopedia. Retrieved August 25, 2021 from https:\/\/en.wikipedia.org\/w\/index.php?title=Biological_network&oldid=1039989954","journal-title":"R"},{"key":"e_1_3_2_54_2","article-title":"Graph (discrete mathematics) \u2014 Wikipedia, The Free Encyclopedia","author":"contributors Wikipedia","year":"2021","unstructured":"Wikipedia contributors. 2021. Graph (discrete mathematics) \u2014 Wikipedia, The Free Encyclopedia. 
Retrieved August 25, 2021 from https:\/\/en.wikipedia.org\/w\/index.php?title=Graph_(discrete_mathematics)&oldid=1017809268","journal-title":"R"},{"key":"e_1_3_2_55_2","article-title":"Molecular graph \u2014 Wikipedia, The Free Encyclopedia","author":"contributors Wikipedia","year":"2021","unstructured":"Wikipedia contributors. 2021. Molecular graph \u2014 Wikipedia, The Free Encyclopedia. Retrieved August 25, 2021 from https:\/\/en.wikipedia.org\/w\/index.php?title=Molecular_graph&oldid=1032100381","journal-title":"R"},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/3447786.3456247"},{"key":"e_1_3_2_57_2","unstructured":"Keyulu Xu Weihua Hu Jure Leskovec and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv:1810.00826. Retrieved from https:\/\/arxiv.org\/abs\/1810.00826"},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3219890"},{"key":"e_1_3_2_59_2","first-page":"5165","article-title":"Link prediction based on graph neural networks","volume":"31","author":"Zhang Muhan","year":"2018","unstructured":"Muhan Zhang and Yixin Chen. 2018. Link prediction based on graph neural networks. 
Advances in Neural Information Processing Systems 31 (2018), 5165\u20135175.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/BigData.2017.8257937"}],"container-title":["ACM Transactions on Parallel Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3644712","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3644712","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T22:50:00Z","timestamp":1750287000000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3644712"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,6,8]]},"references-count":59,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,6,30]]}},"alternative-id":["10.1145\/3644712"],"URL":"https:\/\/doi.org\/10.1145\/3644712","relation":{},"ISSN":["2329-4949","2329-4957"],"issn-type":[{"value":"2329-4949","type":"print"},{"value":"2329-4957","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,6,8]]},"assertion":[{"value":"2023-02-07","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-01-31","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-06-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}