{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T06:23:47Z","timestamp":1770704627277,"version":"3.49.0"},"reference-count":39,"publisher":"Springer Science and Business Media LLC","issue":"14","license":[{"start":{"date-parts":[[2024,5,30]],"date-time":"2024-05-30T00:00:00Z","timestamp":1717027200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,5,30]],"date-time":"2024-05-30T00:00:00Z","timestamp":1717027200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Major Science and Technology Special Projects in Henan Province","award":["221100210600"],"award-info":[{"award-number":["221100210600"]}]},{"name":"Major Science and Technology Special Projects in Henan Province","award":["221100210600"],"award-info":[{"award-number":["221100210600"]}]},{"name":"Major Science and Technology Special Projects in Henan Province","award":["221100210600"],"award-info":[{"award-number":["221100210600"]}]},{"name":"Major Science and Technology Special Projects in Henan Province","award":["221100210600"],"award-info":[{"award-number":["221100210600"]}]},{"name":"Major Science and Technology Special Projects in Henan Province","award":["221100210600"],"award-info":[{"award-number":["221100210600"]}]},{"name":"Major Science and Technology Special Projects in Henan Province","award":["221100210600"],"award-info":[{"award-number":["221100210600"]}]},{"name":"Major Science and Technology Special Projects in Henan Province","award":["221100210600"],"award-info":[{"award-number":["221100210600"]}]},{"name":"Major Science and Technology Special Projects in Henan Province","award":["221100210600"],"award-info":[{"award-number":["221100210600"]}]},{"name":"Major Science and Technology Special Projects in Henan Province","award":["221100210600"],"award-info":[{"award-number":["221100210600"]}]},{"name":"Major Science and Technology Special Projects in Henan Province","award":["221100210600"],"award-info":[{"award-number":["221100210600"]}]},{"name":"Major Science and Technology Special Projects in Henan Province","award":["221100210600"],"award-info":[{"award-number":["221100210600"]}]},{"name":"Major Science and Technology Special Projects in Henan Province","award":["221100210600"],"award-info":[{"award-number":["221100210600"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Supercomput"],"published-print":{"date-parts":[[2024,9]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Sparse general matrix\u2013matrix multiplication (SpGEMM) is a crucial and complex computational task in many practical applications. Improving the performance of SpGEMM on SIMT processors like modern GPUs is challenging due to the unpredictable sparsity of sparse matrices. Although existing GPU solutions have made progress in improving performance through advanced algorithm design, they ignore some optimizations related to specific processor architectures. This can result in a partially inefficient implementation of their algorithms. This paper focuses on optimizing four inefficient parts of the NSparse algorithm on DCU (a GPU-like accelerator). The optimizations include: 1) setting parameters to improve the load balance of the second matrix by extracting maximum row information at runtime; 2) reducing overhead of binning operations by making full use of registers and shared memory effectively; 3) improving numerical SpGEMM performance by adjusting its calculation mode; and 4) enhancing global load balance through finer-grained grouping and kernel configurations. Experiment results demonstrate that when compared to five state-of-the-art SpGEMM algorithms (bhSparse, KokkosKernels, NSparse, rocSparse, and spECK), our optimized method achieves an average of 7.99x (up to 18.2x), 8.01x (up to 20.83x), 2.37x (up to 6.16x), 1.82x (up to 4.20x), and 1.63x (up to 5.01x) speedups on 29 sparse matrices with different sparse structures, respectively.<\/jats:p>","DOI":"10.1007\/s11227-024-06234-2","type":"journal-article","created":{"date-parts":[[2024,5,30]],"date-time":"2024-05-30T17:02:14Z","timestamp":1717088534000},"page":"20176-20200","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Optimizing sparse general matrix\u2013matrix multiplication for DCUs"],"prefix":"10.1007","volume":"80","author":[{"given":"Hengliang","family":"Guo","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Haolei","family":"Wang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wanting","family":"Chen","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Congxiang","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yubo","family":"Han","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shengguang","family":"Zhu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Dujuan","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yang","family":"Guo","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jiandong","family":"Shang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tao","family":"Wan","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Qingyang","family":"Li","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Gang","family":"Wu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2024,5,30]]},"reference":[{"issue":"4","key":"6234_CR1","doi-asserted-by":"publisher","first-page":"C123","DOI":"10.1137\/110838844","volume":"34","author":"N Bell","year":"2012","unstructured":"Bell N, Dalton S, Olson LN (2012) Exposing fine-grained parallelism in algebraic multigrid methods. SIAM J Sci Comput 34(4):C123\u2013C152. https:\/\/doi.org\/10.1137\/110838844","journal-title":"SIAM J Sci Comput"},{"issue":"3","key":"6234_CR2","doi-asserted-by":"publisher","first-page":"C203","DOI":"10.1137\/15M1028807","volume":"38","author":"G Ballard","year":"2016","unstructured":"Ballard G, Siefert C, Hu J (2016) Reducing communication costs for sparse matrix multiplication within algebraic multigrid. SIAM J Sci Comput 38(3):C203\u2013C231. https:\/\/doi.org\/10.1137\/15M1028807","journal-title":"SIAM J Sci Comput"},{"key":"6234_CR3","doi-asserted-by":"publisher","unstructured":"Then M, Kaufmann M, Chirigati F, et\u00a0al (2014) The more the merrier: efficient multi-source graph traversal. Proc VLDB Endow 8(4):449\u2013460. https:\/\/doi.org\/10.14778\/2735496.2735507","DOI":"10.14778\/2735496.2735507"},{"key":"6234_CR4","doi-asserted-by":"publisher","unstructured":"Bulu\u00e7 A, Madduri K (2011) Parallel breadth-first search on distributed memory systems. In: Conference on High Performance Computing Networking, Storage and Analysis, pp 65:1\u201365:12. https:\/\/doi.org\/10.1145\/2063384.2063471","DOI":"10.1145\/2063384.2063471"},{"key":"6234_CR5","doi-asserted-by":"publisher","unstructured":"Kaplan H, Sharir M, Verbin E (2006) Colored intersection searching via sparse rectangular matrix multiplication. In: Proceedings of the 22nd ACM Symposium on Computational Geometry, pp 52\u201360. https:\/\/doi.org\/10.1145\/1137856.1137866","DOI":"10.1145\/1137856.1137866"},{"key":"6234_CR6","doi-asserted-by":"publisher","unstructured":"Davis TA (2018) Graph algorithms via suitesparse: graphblas: triangle counting and k-truss. In: 2018 IEEE High Performance Extreme Computing Conference, pp 1\u20136. https:\/\/doi.org\/10.1109\/HPEC.2018.8547538","DOI":"10.1109\/HPEC.2018.8547538"},{"key":"6234_CR7","doi-asserted-by":"publisher","unstructured":"Azad A, Bulu\u00e7 A, Gilbert JR (2015) Parallel triangle counting and enumeration using matrix algebra. In: 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, pp 804\u2013811. https:\/\/doi.org\/10.1109\/IPDPSW.2015.75","DOI":"10.1109\/IPDPSW.2015.75"},{"issue":"4","key":"6234_CR8","doi-asserted-by":"publisher","first-page":"496","DOI":"10.1177\/1094342011403516","volume":"25","author":"A Bulu\u00e7","year":"2011","unstructured":"Bulu\u00e7 A, Gilbert JR (2011) The combinatorial blas: design, implementation, and applications. Int J High Perform Comput Appl 25(4):496\u2013509. https:\/\/doi.org\/10.1177\/1094342011403516","journal-title":"Int J High Perform Comput Appl"},{"key":"6234_CR9","doi-asserted-by":"publisher","unstructured":"Niu Q, Lai PW, Faisal SM, et\u00a0al (2014) A fast implementation of MLR-MCL algorithm on multi-core processors. In: 2014 21st International Conference on High Performance Computing (HiPC), pp 1\u201310. https:\/\/doi.org\/10.1109\/HiPC.2014.7116888","DOI":"10.1109\/HiPC.2014.7116888"},{"key":"6234_CR10","doi-asserted-by":"publisher","unstructured":"Bustamam A, Burrage K, Hamilton NA (2010) A GPU implementation of fast parallel Markov clustering in bioinformatics using ellpack-r sparse data format. In: 2010 Second International Conference on Advances in Computing, Control, and Telecommunication Technologies, pp 173\u2013175. https:\/\/doi.org\/10.1109\/ACT.2010.10","DOI":"10.1109\/ACT.2010.10"},{"key":"6234_CR11","doi-asserted-by":"publisher","unstructured":"Nagasaka Y, Nukada A, Matsuoka S (2017) High-performance and memory-saving sparse general matrix-matrix multiplication for nvidia pascal gpu. In: Proceedings of the 46th International Conference on Parallel Process. (ICPP), pp 101\u2013110. https:\/\/doi.org\/10.1109\/ICPP.2017.19","DOI":"10.1109\/ICPP.2017.19"},{"key":"6234_CR12","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-024-05996-z","author":"P Han","year":"2024","unstructured":"Han P, Hua H, Wang H et al (2024) A universal parallel simulation framework for energy pipeline networks on high-performance computers. J Supercomput. https:\/\/doi.org\/10.1007\/s11227-024-05996-z","journal-title":"J Supercomput"},{"issue":"2","key":"6234_CR13","doi-asserted-by":"publisher","first-page":"2381","DOI":"10.1007\/s11227-023-05422-w","volume":"80","author":"H Guo","year":"2024","unstructured":"Guo H, Zhang L, Zhang Y et al (2024) Openmp offloading data transfer optimization for DCUs. J Supercomput 80(2):2381\u20132402. https:\/\/doi.org\/10.1007\/s11227-023-05422-w","journal-title":"J Supercomput"},{"key":"6234_CR14","doi-asserted-by":"crossref","unstructured":"Niu J, Gao W, Han L, et\u00a0al (2023) A DCU code generation and optimization method based on polyhedral model. In: International Conference on Cloud Computing, Performance Computing, and Deep Learning (CCPCDL 2023), SPIE, pp 416\u2013428","DOI":"10.1117\/12.2678907"},{"key":"6234_CR15","doi-asserted-by":"publisher","unstructured":"Zhou QW, Li JN, Zhao RC, et\u00a0al (2023) Compilation optimization of DCU-oriented openMP thread scheduling. In: Journal of Physics: Conference Series, IOP Publishing, p 012003. https:\/\/doi.org\/10.1088\/1742-6596\/2558\/1\/012003","DOI":"10.1088\/1742-6596\/2558\/1\/012003"},{"key":"6234_CR16","doi-asserted-by":"publisher","unstructured":"Hua H, Jin Q, Zhang Y, et\u00a0al (2023) Immersed boundary method of two-phase flow based on DCU parallel acceleration. In: International Conference on Computer, Artificial Intelligence, and Control Engineering (CAICE 2023), SPIE, pp 265\u2013274. https:\/\/doi.org\/10.1117\/12.2681641","DOI":"10.1117\/12.2681641"},{"key":"6234_CR17","doi-asserted-by":"publisher","first-page":"47","DOI":"10.1016\/j.jpdc.2015.06.010","volume":"85","author":"W Liu","year":"2015","unstructured":"Liu W, Vinter B (2015) A framework for general sparse matrix-matrix multiplication on GPUs and heterogeneous processors. J Parallel Distrib Comput 85:47\u201361. https:\/\/doi.org\/10.1016\/j.jpdc.2015.06.010","journal-title":"J Parallel Distrib Comput"},{"key":"6234_CR18","doi-asserted-by":"publisher","first-page":"33","DOI":"10.1016\/j.parco.2018.06.009","volume":"78","author":"M Deveci","year":"2018","unstructured":"Deveci M, Trott C, Rajamanickam S (2018) Multithreaded sparse matrix-matrix multiplication for many-core and GPU architectures. Parallel Comput 78:33\u201346. https:\/\/doi.org\/10.1016\/j.parco.2018.06.009","journal-title":"Parallel Comput"},{"key":"6234_CR19","unstructured":"AMD (2023) Rocsparse documentation. https:\/\/rocsparse.readthedocs.io\/en\/master. Accessed 22 December 2023"},{"key":"6234_CR20","doi-asserted-by":"publisher","unstructured":"Parger M, Winter M, Mlakar D, et\u00a0al (2020) Speck: Accelerating GPU sparse matrix-matrix multiplication through lightweight analysis. In: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp 362\u2013375. https:\/\/doi.org\/10.1145\/3332466.3374521","DOI":"10.1145\/3332466.3374521"},{"key":"6234_CR21","doi-asserted-by":"publisher","first-page":"85960","DOI":"10.1109\/ACCESS.2022.3196940","volume":"10","author":"Z Du","year":"2022","unstructured":"Du Z et al (2022) OpSparse: a highly optimized framework for sparse general matrix multiplication on GPUs. IEEE Access 10:85960\u201385974. https:\/\/doi.org\/10.1109\/ACCESS.2022.3196940","journal-title":"IEEE Access"},{"issue":"3","key":"6234_CR22","doi-asserted-by":"publisher","first-page":"250","DOI":"10.1145\/355791.355796","volume":"4","author":"FG Gustavson","year":"1978","unstructured":"Gustavson FG (1978) Two fast algorithms for sparse matrices: multiplication and permuted transposition. ACM Trans Math Softw 4(3):250\u2013269. https:\/\/doi.org\/10.1145\/355791.355796","journal-title":"ACM Trans Math Softw"},{"key":"6234_CR23","unstructured":"Demouth J (2012) Sparse matrix-matrix multiplication on the GPU. In: Proceedings of the GPU Technology Conference (GTC), pp 1\u201321"},{"key":"6234_CR24","doi-asserted-by":"publisher","unstructured":"Niu Y, Lu Z, Ji H, et\u00a0al (2022) Tilespgemm: a tiled algorithm for parallel sparse general matrix-matrix multiplication on GPUs. In: Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp 90\u2013106. https:\/\/doi.org\/10.1145\/3503221.3508431","DOI":"10.1145\/3503221.3508431"},{"key":"6234_CR25","doi-asserted-by":"publisher","unstructured":"Winter M, Mlakar D, Zayer R, et\u00a0al (2019) Adaptive sparse matrix-matrix multiplication on the GPU. In: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, pp 68\u201381. https:\/\/doi.org\/10.1145\/3293883.3295701","DOI":"10.1145\/3293883.3295701"},{"issue":"1","key":"6234_CR26","doi-asserted-by":"publisher","first-page":"C54","DOI":"10.1137\/130948811","volume":"37","author":"F Gremse","year":"2015","unstructured":"Gremse F, Hofter A, Schwen LO et al (2015) Gpu-accelerated sparse matrix-matrix multiplication by iterative row merging. SIAM J Sci Comput 37(1):C54\u2013C71. https:\/\/doi.org\/10.1137\/130948811","journal-title":"SIAM J Sci Comput"},{"key":"6234_CR27","doi-asserted-by":"publisher","unstructured":"Niu APNQ, Fan R, Wen Y (2016) Balanced hashing and efficient GPU sparse general matrix-matrix multiplication. In: Proceedings of the 2016 International Conference on Supercomputing, pp 1\u201312. https:\/\/doi.org\/10.1145\/2925426.2926273","DOI":"10.1145\/2925426.2926273"},{"issue":"3","key":"6234_CR28","doi-asserted-by":"publisher","first-page":"441","DOI":"10.1006\/jcss.1997.1534","volume":"55","author":"E Cohen","year":"1997","unstructured":"Cohen E (1997) Size-estimation framework with applications to transitive closure and reachability. J Comput Syst Sci 55(3):441\u2013453. https:\/\/doi.org\/10.1006\/jcss.1997.1534","journal-title":"J Comput Syst Sci"},{"key":"6234_CR29","doi-asserted-by":"publisher","unstructured":"Du Z, et\u00a0al (2023) Predicting the output structure of sparse matrix multiplication with sampled compression ratio. In: 2022 IEEE 28th International Conference on Parallel and Distributed Systems (ICPADS), pp 483\u2013490. https:\/\/doi.org\/10.1109\/ICPADS56603.2022.00069","DOI":"10.1109\/ICPADS56603.2022.00069"},{"issue":"3","key":"6234_CR30","doi-asserted-by":"publisher","first-page":"403","DOI":"10.1007\/s10766-018-0604-8","volume":"47","author":"J Liu","year":"2019","unstructured":"Liu J, He X, Liu W et al (2019) Register-aware optimizations for parallel sparse matrix-matrix multiplication. Int J Parallel Prog 47(3):403\u2013417. https:\/\/doi.org\/10.1007\/s10766-018-0604-8","journal-title":"Int J Parallel Prog"},{"key":"6234_CR31","doi-asserted-by":"publisher","unstructured":"Shah V, Gilbert JR (2010) Sparse matrices in matlab*p: design and implementation. In: High Performance Computing-HiPC 2004: 11th International Conference, Bangalore, India, December 19\u201322, 2004. Proceedings 11, pp 144\u2013155. https:\/\/doi.org\/10.1007\/978-3-540-30474-6_20","DOI":"10.1007\/978-3-540-30474-6_20"},{"key":"6234_CR32","unstructured":"AMD (2024) Rocm documentation. https:\/\/rocm.docs.amd.com\/projects\/HIP\/en\/latest\/index.html. Accessed 24 April 2024"},{"key":"6234_CR33","unstructured":"AMD (2023) Hip documentation. https:\/\/rocm.docs.amd.com\/projects\/HIP\/en\/latest\/index.html. Accessed 22 December 2023"},{"key":"6234_CR34","doi-asserted-by":"publisher","unstructured":"Kurt SE, Thumma V, Hong C, et\u00a0al (2017) Characterization of data movement requirements for sparse matrix computations on GPUs. In: 2017 IEEE 24th International Conference on High Performance Computing (HiPC), pp 283\u2013293.https:\/\/doi.org\/10.1109\/HiPC.2017.00040","DOI":"10.1109\/HiPC.2017.00040"},{"key":"6234_CR35","unstructured":"NVIDIA (2023) Thrust documentation. https:\/\/thrust.github.io\/doc\/index.html. Accessed 22 December 2023"},{"issue":"1","key":"6234_CR36","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/2049662.2049663","volume":"38","author":"TA Davis","year":"2011","unstructured":"Davis TA, Hu Y (2011) The university of florida sparse matrix collection. ACM Trans Math Softw 38(1):1\u201325. https:\/\/doi.org\/10.1145\/2049662.2049663","journal-title":"ACM Trans Math Softw"},{"issue":"4","key":"6234_CR37","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/2699470","volume":"41","author":"S Dalton","year":"2015","unstructured":"Dalton S, Olson L, Bell N (2015) Optimizing sparse matrix-matrix multiplication for the GPU. ACM Trans Math Softw (TOMS) 41(4):1\u201320. https:\/\/doi.org\/10.1145\/2699470","journal-title":"ACM Trans Math Softw (TOMS)"},{"key":"6234_CR38","unstructured":"NVIDIA (2023) Cuda documentation. https:\/\/docs.nvidia.com\/cuda\/. Accessed 22 December 2023"},{"key":"6234_CR39","unstructured":"AMD (2024) Hipify documentation. https:\/\/rocm.docs.amd.com\/projects\/HIPIFY\/en\/latest\/index.html. Accessed 21 March 2024"}],"container-title":["The Journal of Supercomputing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11227-024-06234-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11227-024-06234-2\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11227-024-06234-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,2]],"date-time":"2024-08-02T13:48:39Z","timestamp":1722606519000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11227-024-06234-2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,30]]},"references-count":39,"journal-issue":{"issue":"14","published-print":{"date-parts":[[2024,9]]}},"alternative-id":["6234"],"URL":"https:\/\/doi.org\/10.1007\/s11227-024-06234-2","relation":{},"ISSN":["0920-8542","1573-0484"],"issn-type":[{"value":"0920-8542","type":"print"},{"value":"1573-0484","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,5,30]]},"assertion":[{"value":"14 May 2024","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 May 2024","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no Conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethical approval"}}]}}