{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T19:26:56Z","timestamp":1777058816097,"version":"3.51.4"},"reference-count":31,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2021,3,29]],"date-time":"2021-03-29T00:00:00Z","timestamp":1616976000000},"content-version":"vor","delay-in-days":365,"URL":"http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001656","name":"Helmholtz Association","doi-asserted-by":"publisher","award":["VH-NG-1241"],"award-info":[{"award-number":["VH-NG-1241"]}],"id":[{"id":"10.13039\/501100001656","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000015","name":"U.S. Department of Energy","doi-asserted-by":"publisher","award":["DE-SC0016513"],"award-info":[{"award-number":["DE-SC0016513"]}],"id":[{"id":"10.13039\/100000015","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Parallel Comput."],"published-print":{"date-parts":[[2020,3,31]]},"abstract":"<jats:p>\n            Efficient processing of Irregular Matrices on Single Instruction, Multiple Data (SIMD)-type architectures is a persistent challenge. Resolving it requires innovations in the development of data formats, computational techniques, and implementations that strike a balance between thread divergence, which is inherent for Irregular Matrices, and padding, which alleviates the performance-detrimental thread divergence but introduces artificial overheads. To this end, in this article, we address the challenge of designing high performance sparse matrix-vector product (S\n            <jats:sc>p<\/jats:sc>\n            MV) kernels designed for Nvidia Graphics Processing Units (GPUs). We present a compressed sparse row (CSR) format suitable for unbalanced matrices. We also provide a load-balancing kernel for the coordinate (COO) matrix format and extend it to a hybrid algorithm that stores part of the matrix in SIMD-friendly Ellpack format (ELL) format. The ratio between the ELL- and the COO-part is determined using a theoretical analysis of the nonzeros-per-row distribution. For the over 2,800 test matrices available in the Suite Sparse matrix collection, we compare the performance against S\n            <jats:sc>p<\/jats:sc>\n            MV kernels provided by NVIDIA\u2019s cuSPARSE library and a heavily-tuned sliced ELL (SELL-P) kernel that prevents unnecessary padding by considering the irregular matrices as a combination of matrix blocks stored in ELL format.\n          <\/jats:p>","DOI":"10.1145\/3380930","type":"journal-article","created":{"date-parts":[[2020,3,29]],"date-time":"2020-03-29T05:47:31Z","timestamp":1585460851000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":43,"title":["Load-balancing Sparse Matrix Vector Product Kernels on GPUs"],"prefix":"10.1145","volume":"7","author":[{"given":"Hartwig","family":"Anzt","sequence":"first","affiliation":[{"name":"Karlsruhe Institute of Technology, Germany and University of Tennessee, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Terry","family":"Cojean","sequence":"additional","affiliation":[{"name":"Karlsruhe Institute of Technology, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chen","family":"Yen-Chen","sequence":"additional","affiliation":[{"name":"National Taiwan University, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jack","family":"Dongarra","sequence":"additional","affiliation":[{"name":"University of Tennessee, Oak Ridge National Lab, and University of Manchester, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Goran","family":"Flegar","sequence":"additional","affiliation":[{"name":"University of Jaume I, Spain"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Pratik","family":"Nayak","sequence":"additional","affiliation":[{"name":"Karlsruhe Institute of Technology, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Stanimire","family":"Tomov","sequence":"additional","affiliation":[{"name":"University of Tennessee, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yuhsiang M.","family":"Tsai","sequence":"additional","affiliation":[{"name":"National Taiwan University, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Weichung","family":"Wang","sequence":"additional","affiliation":[{"name":"National Taiwan University, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2020,3,29]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the 1990 ACM\/IEEE Conference on Supercomputing (Supercomputing\u201990)","author":"Anderson Edward","year":"1990","unstructured":"Edward Anderson, Zhaojun Bai, Jack Dongarra, Anne Greenbaum, Alan McKenney, Jeremy Du Croz, Sven Hammarling, James Demmel, Christian Bischof, and Danny Sorensen. 1990. LAPACK: A portable linear algebra library for high-performance computers. In Proceedings of the 1990 ACM\/IEEE Conference on Supercomputing (Supercomputing\u201990). IEEE Computer Society Press, Los Alamitos, CA, 2\u201311. http:\/\/dl.acm.org\/citation.cfm?id&equals;110382.110385."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3324989.3325719"},{"key":"e_1_2_1_3_1","unstructured":"Hartwig Anzt Edmond Chow and Jack Dongarra. 2016. On Block-asynchronous Execution on GPUs. Technical Report 291. LAPACK Working Note."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1016\/J.PARCO.2017.05.006"},{"key":"e_1_2_1_5_1","volume-title":"Implementing a Sparse Matrix Vector Product for the SELL-C\/SELL-C-&sigma","author":"Anzt Hartwig","unstructured":"Hartwig Anzt, Stanimire Tomov, and Jack Dongarra. 2014. Implementing a Sparse Matrix Vector Product for the SELL-C\/SELL-C-&sigma; Formats on NVIDIA GPUs. Technical Report ut-eecs-14-727. University of Tennessee."},{"key":"e_1_2_1_6_1","volume-title":"Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods","author":"Barrett Richard","unstructured":"Richard Barrett, Michael Berry, Tony F. Chan, James Demmel, June Donato, Jack Dongarra, Viktor Eijkhout, Roldan Pozo, Charles Romine, and Henk Van der Vorst. 1994. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. SIAM, Philadelphia, PA.","edition":"2"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1654059.1654078"},{"key":"e_1_2_1_8_1","volume-title":"Retrieved","author":"Better Scientific","year":"2018","unstructured":"Better Scientific Software (BSSw). Retrieved August 2018 from https:\/\/bssw.io\/."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.compeleceng.2013.11.024"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1137\/S0895479895296689"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2015.98"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3017994"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3149704.3149767"},{"key":"e_1_2_1_14_1","volume-title":"Quintana-Ort\u00ed","author":"Flegar Goran","year":"2017","unstructured":"Goran Flegar and Enrique S. Quintana-Ort\u00ed. 2017. Balanced CSR sparse matrix-vector product on graphics processors. In Euro-Par 2017: Parallel Processing, Francisco F. Rivera, Tom\u00e1s F. Pena, and Jos\u00e9 C. Cabaleiro (Eds.). Springer International Publishing, Cham, 697\u2013709."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2950048"},{"key":"e_1_2_1_16_1","volume-title":"A survey of sparse matrix-vector multiplication performance on large matrices. CoRR abs\/1608.00636","author":"Grossman Max","year":"2016","unstructured":"Max Grossman, Christopher Thiele, Mauricio Araya-Polo, Florian Frank, Faruk O. Alpak, and Vivek Sarkar. 2016. A survey of sparse matrix-vector multiplication performance on large matrices. CoRR abs\/1608.00636 (2016). arxiv:1608.00636 http:\/\/arxiv.org\/abs\/1608.00636"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","unstructured":"Desmond Higham and Nick Higham. 2005. Matlab Guide. Society for Industrial and Applied Mathematics. DOI:https:\/\/doi.org\/10.1137\/1.9780898717891 arXiv:https:\/\/epubs.siam.org\/doi\/pdf\/10.1137\/1.9780898717891","DOI":"10.1137\/1.9780898717891"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3293883.3295712"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/1941553.1941590"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1137\/130930352"},{"key":"e_1_2_1_21_1","volume-title":"Meyer","author":"Langville Amy N.","year":"2012","unstructured":"Amy N. Langville and Carl D. Meyer. 2012. Google\u2019s PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, Princeton, NJ."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2751205.2751209"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.5555\/3014904.3014982"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2717511"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-11515-8_10"},{"key":"e_1_2_1_26_1","unstructured":"NVIDIA Corp.2017. Whitepaper: NVIDIA TESLA V100 GPU ARCHITECTURE."},{"key":"e_1_2_1_27_1","volume-title":"NVIDIA CUDA Toolkit (9.2 ed.)","author":"[NVIDIA Corporation 2018] NVIDIA Corporation 2018.","unstructured":"[NVIDIA Corporation 2018] NVIDIA Corporation 2018. NVIDIA CUDA Toolkit (9.2 ed.). NVIDIA Corporation."},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the 7th International World Wide Web Conference","author":"Page Lawrence","year":"1998","unstructured":"Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1998. The PageRank citation ranking: Bringing order to the Web. In Proceedings of the 7th International World Wide Web Conference. Brisbane, Australia, 161\u2013172. citeseer.nj.nec.com\/page98pagerank.html."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW.2019.00088"},{"key":"e_1_2_1_30_1","volume-title":"Matrix Collection. Retrieved","year":"2018","unstructured":"SuiteSparse. 2018. Matrix Collection. Retrieved April 2018 from https:\/\/sparse.tamu.edu."},{"key":"e_1_2_1_31_1","volume-title":"Retrieved","author":"le Scientific Software Development SDK.","year":"2018","unstructured":"xSDK. Extreme-scale Scientific Software Development Kit. Retrieved August 2018 from https:\/\/xsdk.info\/."}],"container-title":["ACM Transactions on Parallel Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3380930","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3380930","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3380930","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T17:08:28Z","timestamp":1755882508000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3380930"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,3,29]]},"references-count":31,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2020,3,31]]}},"alternative-id":["10.1145\/3380930"],"URL":"https:\/\/doi.org\/10.1145\/3380930","relation":{},"ISSN":["2329-4949","2329-4957"],"issn-type":[{"value":"2329-4949","type":"print"},{"value":"2329-4957","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,3,29]]},"assertion":[{"value":"2018-12-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-10-01","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-03-29","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}