{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,30]],"date-time":"2026-01-30T05:44:31Z","timestamp":1769751871099,"version":"3.49.0"},"reference-count":42,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2022,4,11]],"date-time":"2022-04-11T00:00:00Z","timestamp":1649635200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Key R&D Programs of China","award":["2020YFB2104000"],"award-info":[{"award-number":["2020YFB2104000"]}]},{"DOI":"10.13039\/501100001809","name":"Programs of National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62172157, 61860206011 and 61806077"],"award-info":[{"award-number":["62172157, 61860206011 and 61806077"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Programs of Hunan Province, China","award":["2020RC2032, 2021RC3062, 2021JJ40109 and 2021JJ40121"],"award-info":[{"award-number":["2020RC2032, 2021RC3062, 2021JJ40109 and 2021JJ40121"]}]},{"name":"Programs of China Postdoctoral Council","award":["PC2020025, 2021M701153"],"award-info":[{"award-number":["PC2020025, 2021M701153"]}]},{"name":"Program of Zhejiang Lab","award":["2022RC0AB03"],"award-info":[{"award-number":["2022RC0AB03"]}]},{"name":"General Program of Fundamental Research of Shen Zhen","award":["JCYJ20210324135409026"],"award-info":[{"award-number":["JCYJ20210324135409026"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Parallel Comput."],"published-print":{"date-parts":[[2022,6,30]]},"abstract":"<jats:p>Sparse matrix-sparse vector (SpMSpV) multiplication is one of the fundamental and important operations in many high-performance scientific and engineering applications. The inherent irregularity and poor data locality lead to two main challenges to scaling SpMSpV over high-performance computing (HPC) systems: (i) a large amount of redundant data limits the utilization of bandwidth and parallel resources; (ii) the irregular access pattern limits the exploitation of computing resources. This paper proposes a fine-grained parallel SpMSpV (<jats:sans-serif>fgSpMSpV<\/jats:sans-serif>) framework on Sunway TaihuLight supercomputer to alleviate the challenges for large-scale real-world applications. First,<jats:sans-serif>fgSpMSpV<\/jats:sans-serif>adopts an MPI<jats:inline-formula content-type=\"math\/tex\"><jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\( + \\)<\/jats:tex-math><\/jats:inline-formula>OpenMP<jats:inline-formula content-type=\"math\/tex\"><jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\( +X \\)<\/jats:tex-math><\/jats:inline-formula>parallelization model to exploit the multi-stage and hybrid parallelism of heterogeneous HPC architectures and accelerate both pre-\/post-processing and main SpMSpV computation. Second,<jats:sans-serif>fgSpMSpV<\/jats:sans-serif>utilizes an adaptive parallel execution to reduce the pre-processing, adapt to the parallelism and memory hierarchy of the Sunway system, while still tame redundant and random memory accesses in SpMSpV, including a set of techniques like the fine-grained partitioner, re-collection method, and Compressed Sparse Column Vector (CSCV) matrix format. Third,<jats:sans-serif>fgSpMSpV<\/jats:sans-serif>uses several optimization techniques to further utilize the computing resources.<jats:sans-serif>fgSpMSpV<\/jats:sans-serif>on the Sunway TaihuLight gains a noticeable performance improvement from the key optimization techniques with various sparsity of the input. Additionally,<jats:sans-serif>fgSpMSpV<\/jats:sans-serif>is implemented on an NVIDIA Tesal P100 GPU and applied to the breath-first-search (BFS) application.<jats:sans-serif>fgSpMSpV<\/jats:sans-serif>on a P100 GPU obtains the speedup of up to<jats:inline-formula content-type=\"math\/tex\"><jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\( 134.38\\times \\)<\/jats:tex-math><\/jats:inline-formula>over the state-of-the-art SpMSpV algorithms, and the BFS application using<jats:sans-serif>fgSpMSpV<\/jats:sans-serif>achieves the speedup of up to<jats:inline-formula content-type=\"math\/tex\"><jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\( 21.68\\times \\)<\/jats:tex-math><\/jats:inline-formula>over the state-of-the-arts.<\/jats:p>","DOI":"10.1145\/3512770","type":"journal-article","created":{"date-parts":[[2022,4,11]],"date-time":"2022-04-11T10:19:12Z","timestamp":1649672352000},"page":"1-29","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":13,"title":["fgSpMSpV: A Fine-grained Parallel SpMSpV Framework on HPC Platforms"],"prefix":"10.1145","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5665-268X","authenticated-orcid":false,"given":"Yuedan","family":"Chen","sequence":"first","affiliation":[{"name":"College of Computer Science and Electronic Engineering, Hunan University, and National Supercomputing Center in Changsha, Changsha, Hunan, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5008-4829","authenticated-orcid":false,"given":"Guoqing","family":"Xiao","sequence":"additional","affiliation":[{"name":"College of Computer Science and Electronic Engineering, Hunan University, and National Supercomputing Center in Changsha, Changsha, Hunan, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2635-7716","authenticated-orcid":false,"given":"Kenli","family":"Li","sequence":"additional","affiliation":[{"name":"College of Computer Science and Electronic Engineering, Hunan University, and National Supercomputing Center in Changsha, Changsha, Hunan, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5179-2496","authenticated-orcid":false,"given":"Francesco","family":"Piccialli","sequence":"additional","affiliation":[{"name":"Department of Electrical Engineering and Information Technologies,University of Naples Federico II, Naples, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3090-1059","authenticated-orcid":false,"given":"Albert Y.","family":"Zomaya","sequence":"additional","affiliation":[{"name":"School of Information Technologies, University of Sydney, Sidney, NSW, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,4,11]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/3371275"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2017.2656893"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/3155292"},{"key":"e_1_3_1_5_2","doi-asserted-by":"crossref","unstructured":"Michael J. Anderson Narayanan Sundaram Nadathur Satish Md. Mostofa Ali Patwary Theodore L. Willke and Pradeep Dubey. 2016. GraphPad: Optimized graph primitives for parallel and distributed platforms. In Proceedings of the International Parallel and Distributed Processing Symposium . 313\u2013322.","DOI":"10.1109\/IPDPS.2016.86"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2016.103"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2017.76"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2019.00012"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1137\/18M1189348"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3015144"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3226228"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2015.2468218"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/TFUZZ.2015.2453400"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2013.52"},{"key":"e_1_3_1_15_2","doi-asserted-by":"crossref","unstructured":"Aydin Bulu\u00e7 and Kamesh Madduri. 2011. Parallel breadth-first search on distributed memory systems. In Proceedings of 2011 International Conference for High Performance Computing Networking Storage and Analysis . 65:1\u201365:12.","DOI":"10.1145\/2063384.2063471"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/TITS.2016.2565643"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2018.2871189"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2020.2990429"},{"key":"e_1_3_1_19_2","doi-asserted-by":"crossref","unstructured":"Yuedan Chen Guoqing Xiao Fan Wu and Zhuo Tang. 2019. Towards large-scale sparse matrix-vector multiplication on the SW26010 manycore architecture. In Proceedings of the 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems . IEEE 1469\u20131476.","DOI":"10.1109\/HPCC\/SmartCity\/DSS.2019.00203"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/2699470"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/2049662.2049663"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/355791.355796"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2018.2848618"},{"key":"e_1_3_1_24_2","doi-asserted-by":"crossref","unstructured":"Yong-Yeon Jo Sang-Wook Kim and Duck-Ho Bae. 2015. Efficient sparse matrix multiplication on GPU for large social network analysis. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management . 1261\u20131270.","DOI":"10.1145\/2806416.2806445"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2020.2984496"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2015.2401575"},{"issue":"7","key":"e_1_3_1_27_2","first-page":"1842","article-title":"Adaptive SpMV\/SpMSpV on GPUs for input vectors of varied sparsity","volume":"32","author":"Li Min","year":"2021","unstructured":"Min Li, Yulong Ao, and Chao Yang. 2021. Adaptive SpMV\/SpMSpV on GPUs for input vectors of varied sparsity. IEEE Trans. Parallel Distributed Syst. 32, 7 (2021), 1842\u20131853.","journal-title":"IEEE Trans. Parallel Distributed Syst."},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33014496"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCYB.2019.2894283"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2015.2415257"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASPDAC.2017.7858395"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2019.2906166"},{"key":"e_1_3_1_33_2","doi-asserted-by":"crossref","unstructured":"Ryan A. Rossi and Nesreen K. Ahmed. 2015. The network data repository with interactive graph analytics and visualization. In Proceedings of the 29th AAAI Conference on Artificial Intelligence . 4292\u20134293.","DOI":"10.1609\/aaai.v29i1.9277"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/1553374.1553499"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.14778\/2809974.2809983"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/JIOT.2019.2947257"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2019.2907537"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW.2015.77"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2017.2714667"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611976137.5"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/2994148"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2019.2932931"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.23919\/DATE.2019.8714836"}],"container-title":["ACM Transactions on Parallel Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3512770","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3512770","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:09:29Z","timestamp":1750183769000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3512770"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,4,11]]},"references-count":42,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2022,6,30]]}},"alternative-id":["10.1145\/3512770"],"URL":"https:\/\/doi.org\/10.1145\/3512770","relation":{},"ISSN":["2329-4949","2329-4957"],"issn-type":[{"value":"2329-4949","type":"print"},{"value":"2329-4957","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,4,11]]},"assertion":[{"value":"2020-10-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-12-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-04-11","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}