{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,11]],"date-time":"2026-06-11T22:05:40Z","timestamp":1781215540752,"version":"3.54.1"},"reference-count":52,"publisher":"Springer Science and Business Media LLC","issue":"10","license":[{"start":{"date-parts":[[2024,3,11]],"date-time":"2024-03-11T00:00:00Z","timestamp":1710115200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,3,11]],"date-time":"2024-03-11T00:00:00Z","timestamp":1710115200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Supercomput"],"published-print":{"date-parts":[[2024,7]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Sparse matrix-vector multiplication (SpMV) plays a critical role in a wide range of linear algebra computations, particularly in scientific and engineering disciplines. However, the irregular memory access patterns, extensive memory usage, high bandwidth requirements, and underutilization of parallelism hinder the computational efficiency of SpMV on GPUs. In this paper, we propose a novel approach called block-wise dynamic mixed-precision (BDMP) to address these challenges. Our methodology involves partitioning the original matrix into uniformly sized blocks, with each block\u2019s size determined by considering architectural characteristics and accuracy requirements. Additionally, we dynamically assign precision to each block using a precision selection method that takes into account the value distribution of the original sparse matrix. We develop two distinct SpMV computation algorithms for BDMP: BDMP-PBP (Precision-based partitioning) and BDMP-TCKI (Tailored compression and kernel implementation). BDMP-PBP partitions the matrix into two independent matrices for separate computations based on block precision, offering flexibility for integration with other optimization techniques. Meanwhile, BDMP-TCKI focuses on achieving significant thread-level parallelism and memory utilization by tailoring an appropriate compressed storage format and kernel implementation for each block. We compare BDMP with NVIDIA\u2019s cuSPARSE library and three state-of-the-art SpMV methods, including SELLP, MergeBase, and BalanceCSR, using matrices from the University of Florida\u2019s SuiteSparse dataset collection. BDMP-PBP and BDMP-TCKI show average speedups up to 2.64<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\times $$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mo>\u00d7<\/mml:mo>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula> and 2.91<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\times $$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mo>\u00d7<\/mml:mo>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula> on Turing RTX 2080Ti, and up to 2.99<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\times $$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mo>\u00d7<\/mml:mo>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula> and 3.22<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\times $$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mo>\u00d7<\/mml:mo>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula> on Ampere A100. The results demonstrate that BDMP enables the optimization of computation speed without compromising the precision necessary for reliable results.<\/jats:p>","DOI":"10.1007\/s11227-024-05949-6","type":"journal-article","created":{"date-parts":[[2024,3,11]],"date-time":"2024-03-11T10:01:51Z","timestamp":1710151311000},"page":"13681-13713","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Block-wise dynamic mixed-precision for sparse matrix-vector multiplication on GPUs"],"prefix":"10.1007","volume":"80","author":[{"given":"Zhixiang","family":"Zhao","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Guoyin","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yanxia","family":"Wu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ruize","family":"Hong","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yiqing","family":"Yang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yan","family":"Fu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2024,3,11]]},"reference":[{"key":"5949_CR1","doi-asserted-by":"publisher","unstructured":"Bell N, Garland M (2009) Implementing sparse matrix-vector multiplication on throughput-oriented processors. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. ACM, Portland Oregon, pp 1\u201311. https:\/\/doi.org\/10.1145\/1654059.1654078","DOI":"10.1145\/1654059.1654078"},{"key":"5949_CR2","doi-asserted-by":"publisher","unstructured":"Nisa I, Siegel C, Rajam AS, Vishnu A, Sadayappan P (2018) Effective machine learning based format selection and performance modeling for spmv on gpus. In: 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, Vancouver, BC, pp 1056\u20131065. https:\/\/doi.org\/10.1109\/IPDPSW.2018.00164","DOI":"10.1109\/IPDPSW.2018.00164"},{"issue":"4","key":"5949_CR3","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3017994","volume":"43","author":"S Filippone","year":"2017","unstructured":"Filippone S, Cardellini V, Barbieri D, Fanfarillo A (2017) Sparse matrix-vector multiplication on gpgpus. ACM Trans Math Softw 43(4):1\u201349. https:\/\/doi.org\/10.1145\/3017994","journal-title":"ACM Trans Math Softw"},{"key":"5949_CR4","doi-asserted-by":"publisher","unstructured":"Tang WT, Tan WJ, Ray R, Wong YW, Chen W, Kuo S-h, Goh RSM, Turner SJ, Wong W-F (2013) Accelerating sparse matrix-vector multiplication on gpus using bit-representation-optimized schemes. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, Denver Colorado, pp 1\u201312. https:\/\/doi.org\/10.1145\/2503210.2503234","DOI":"10.1145\/2503210.2503234"},{"issue":"5","key":"5949_CR5","doi-asserted-by":"publisher","first-page":"401","DOI":"10.1137\/130930352","volume":"36","author":"M Kreutzer","year":"2014","unstructured":"Kreutzer M, Hager G, Wellein G, Fehske H, Bishop AR (2014) A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide simd units. SIAM J Sci Comput 36(5):401\u2013423. https:\/\/doi.org\/10.1137\/130930352","journal-title":"SIAM J Sci Comput"},{"issue":"7","key":"5949_CR6","doi-asserted-by":"publisher","first-page":"2639","DOI":"10.1016\/j.jpdc.2014.03.002","volume":"74","author":"C Zheng","year":"2014","unstructured":"Zheng C, Gu S, Gu T-X, Yang B, Liu X-P (2014) Biell: a bisection ellpack-based storage format for optimizing spmv on gpus. J Parallel Distrib Comput 74(7):2639\u20132647. https:\/\/doi.org\/10.1016\/j.jpdc.2014.03.002","journal-title":"J Parallel Distrib Comput"},{"issue":"9","key":"5949_CR7","doi-asserted-by":"publisher","first-page":"2373","DOI":"10.1109\/TPDS.2014.2357437","volume":"26","author":"WT Tang","year":"2015","unstructured":"Tang WT, Tan WJ, Goh RSM, Turner SJ, Wong W-F (2015) A family of bit-representation-optimized formats for fast sparse matrix-vector multiplication on the gpu. IEEE Trans Parallel Distrib Syst 26(9):2373\u20132385. https:\/\/doi.org\/10.1109\/TPDS.2014.2357437","journal-title":"IEEE Trans Parallel Distrib Syst"},{"issue":"3","key":"5949_CR8","doi-asserted-by":"publisher","first-page":"431","DOI":"10.1007\/s11704-014-4127-1","volume":"9","author":"CC Yan","year":"2015","unstructured":"Yan CC, Yu H, Xu W, Zhang Y, Chen B, Tian Z, Wang Y, Yin J (2015) Memory bandwidth optimization of spmv on gpgpus. Front Comput Sci 9(3):431\u2013441. https:\/\/doi.org\/10.1007\/s11704-014-4127-1","journal-title":"Front Comput Sci"},{"issue":"1","key":"5949_CR9","doi-asserted-by":"publisher","first-page":"196","DOI":"10.1109\/TPDS.2014.2308221","volume":"26","author":"K Li","year":"2015","unstructured":"Li K, Yang W, Li K (2015) Performance analysis and optimization for spmv on gpu using probabilistic modeling. IEEE Trans Parallel Distrib Syst 26(1):196\u2013205. https:\/\/doi.org\/10.1109\/TPDS.2014.2308221","journal-title":"IEEE Trans Parallel Distrib Syst"},{"key":"5949_CR10","doi-asserted-by":"publisher","first-page":"66","DOI":"10.1016\/j.jpdc.2016.03.011","volume":"93\u201394","author":"M Maggioni","year":"2016","unstructured":"Maggioni M, Berger-Wolf T (2016) Optimization techniques for sparse matrix-vector multiplication on gpus. J Parallel Distrib Comput 93\u201394:66\u201386. https:\/\/doi.org\/10.1016\/j.jpdc.2016.03.011","journal-title":"J Parallel Distrib Comput"},{"key":"5949_CR11","doi-asserted-by":"publisher","unstructured":"Godwin J, Holewinski J, Sadayappan P (2012) High-performance sparse matrix-vector multiplication on gpus for structured grid computations. In: Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units. ACM, London United Kingdom, pp 47\u201356. https:\/\/doi.org\/10.1145\/2159430.2159436","DOI":"10.1145\/2159430.2159436"},{"key":"5949_CR12","doi-asserted-by":"publisher","first-page":"49","DOI":"10.1016\/j.jpdc.2016.12.023","volume":"104","author":"W Yang","year":"2017","unstructured":"Yang W, Li K, Li K (2017) A hybrid computing method of spmv on cpu-gpu heterogeneous computing systems. J Parallel Distrib Comput 104:49\u201360. https:\/\/doi.org\/10.1016\/j.jpdc.2016.12.023","journal-title":"J Parallel Distrib Comput"},{"issue":"3","key":"5949_CR13","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3134442","volume":"44","author":"A Elafrou","year":"2018","unstructured":"Elafrou A, Karakasis V, Gkountouvas T, Kourtis K, Goumas G, Koziris N (2018) Sparsex: a library for high-performance sparse matrix-vector multiplication on multicore platforms. ACM Trans Math Softw 44(3):1\u201332. https:\/\/doi.org\/10.1145\/3134442","journal-title":"ACM Trans Math Softw"},{"issue":"10","key":"5949_CR14","doi-asserted-by":"publisher","first-page":"1675","DOI":"10.3390\/electronics9101675","volume":"9","author":"S AlAhmadi","year":"2020","unstructured":"AlAhmadi S, Mohammed T, Albeshri A, Katib I, Mehmood R (2020) Performance analysis of sparse matrix-vector multiplication (spmv) on graphics processing units (gpus). Electronics 9(10):1675. https:\/\/doi.org\/10.3390\/electronics9101675","journal-title":"Electronics"},{"key":"5949_CR15","doi-asserted-by":"publisher","first-page":"279","DOI":"10.1016\/j.ins.2020.03.020","volume":"523","author":"Y Chen","year":"2020","unstructured":"Chen Y, Xiao G, Wu F, Tang Z, Li K (2020) tpspmv: a two-phase large-scale sparse matrix-vector multiplication kernel for manycore architectures. Inf Sci 523:279\u2013295. https:\/\/doi.org\/10.1016\/j.ins.2020.03.020","journal-title":"Inf Sci"},{"key":"5949_CR16","doi-asserted-by":"publisher","first-page":"287","DOI":"10.1016\/j.jpdc.2021.07.007","volume":"157","author":"J Gao","year":"2021","unstructured":"Gao J, Xia Y, Yin R, He G (2021) Adaptive diagonal sparse matrix-vector multiplication on gpu. J Parallel Distrib Comput 157:287\u2013302. https:\/\/doi.org\/10.1016\/j.jpdc.2021.07.007","journal-title":"J Parallel Distrib Comput"},{"issue":"12","key":"5949_CR17","doi-asserted-by":"publisher","first-page":"3977","DOI":"10.1109\/tpds.2022.3177291","volume":"33","author":"E Karimi","year":"2022","unstructured":"Karimi E, Agostini NB, Dong S, Kaeli D (2022) Vcsr: an efficient gpu memory-aware sparse format. IEEE Trans Parallel Distrib Syst 33(12):3977\u20133989. https:\/\/doi.org\/10.1109\/tpds.2022.3177291","journal-title":"IEEE Trans Parallel Distrib Syst"},{"issue":"4","key":"5949_CR18","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3371275","volume":"16","author":"K Ahmad","year":"2019","unstructured":"Ahmad K, Sundar H, Hall M (2019) Data-driven mixed precision sparse matrix vector multiplication for gpus. ACM Trans Archit Code Optim 16(4):1\u201324. https:\/\/doi.org\/10.1145\/3371275","journal-title":"ACM Trans Archit Code Optim"},{"issue":"3","key":"5949_CR19","doi-asserted-by":"publisher","first-page":"124","DOI":"10.1137\/19M1289546","volume":"42","author":"P Blanchard","year":"2020","unstructured":"Blanchard P, Higham NJ, Lopez F, Mary T, Pranesh S (2020) Mixed precision block fused multiply-add: error analysis and application to gpu tensor cores. SIAM J Sci Comput 42(3):124\u2013141","journal-title":"SIAM J Sci Comput"},{"issue":"1","key":"5949_CR20","doi-asserted-by":"publisher","DOI":"10.1088\/1742-6596\/2363\/1\/012008","volume":"2363","author":"J Liu","year":"2022","unstructured":"Liu J (2022) Accuracy controllable spmv optimization on gpu. J Phys Conf Ser 2363(1):012008. https:\/\/doi.org\/10.1088\/1742-6596\/2363\/1\/012008","journal-title":"J Phys Conf Ser"},{"key":"5949_CR21","doi-asserted-by":"publisher","unstructured":"Erhan Tezcan Torun T, Kosar F, Kaya K, Unat D (2022) Mixed and multi-precision spmv for gpus with row-wise precision selection. In: 2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, Bordeaux, France, pp 31\u201340. https:\/\/doi.org\/10.1109\/SBAC-PAD55451.2022.00014","DOI":"10.1109\/SBAC-PAD55451.2022.00014"},{"issue":"12","key":"5949_CR22","doi-asserted-by":"publisher","first-page":"3732","DOI":"10.1109\/TPDS.2022.3170501","volume":"33","author":"J Gao","year":"2022","unstructured":"Gao J, Ji W, Tan Z, Wang Y, Shi F (2022) Taichi: a hybrid compression format for binary sparse matrix-vector multiplication on gpu. IEEE Trans Parallel Distrib Syst 33(12):3732\u20133745. https:\/\/doi.org\/10.1109\/TPDS.2022.3170501","journal-title":"IEEE Trans Parallel Distrib Syst"},{"key":"5949_CR23","doi-asserted-by":"publisher","DOI":"10.1016\/j.jocs.2022.101609","volume":"61","author":"K Isupov","year":"2022","unstructured":"Isupov K (2022) Multiple-precision sparse matrix-vector multiplication on gpus. J Comput Sci 61:101609. https:\/\/doi.org\/10.1016\/j.jocs.2022.101609","journal-title":"J Comput Sci"},{"key":"5949_CR24","doi-asserted-by":"publisher","unstructured":"Simecek I (2009) Sparse matrix computations using the quadtree storage format. In: 2009 11th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing. IEEE, Timisoara, Romania, pp 168\u2013173. https:\/\/doi.org\/10.1109\/SYNASC.2009.55","DOI":"10.1109\/SYNASC.2009.55"},{"key":"5949_CR25","doi-asserted-by":"publisher","unstructured":"Simecek I, Langr D, Tvrdik P (2012) Minimal quadtree format for compression of sparse matrices storage. In: 2012 14th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing. IEEE, Timisoara, Romania, pp 359\u2013364. https:\/\/doi.org\/10.1109\/SYNASC.2012.30","DOI":"10.1109\/SYNASC.2012.30"},{"key":"5949_CR26","doi-asserted-by":"publisher","first-page":"490","DOI":"10.1016\/j.future.2015.03.005","volume":"54","author":"J Zhang","year":"2016","unstructured":"Zhang J, Wan J, Li F, Mao J, Zhuang L, Yuan J, Liu E, Yu Z (2016) Efficient sparse matrix-vector multiplication using cache oblivious extension quadtree storage format. Future Gener Comput Syst 54:490\u2013500. https:\/\/doi.org\/10.1016\/j.future.2015.03.005","journal-title":"Future Gener Comput Syst"},{"issue":"10\u201311","key":"5949_CR27","doi-asserted-by":"publisher","first-page":"552","DOI":"10.1016\/j.parco.2012.07.002","volume":"38","author":"M Verschoor","year":"2012","unstructured":"Verschoor M, Jalba AC (2012) Analysis and performance estimation of the conjugate gradient method on multiple gpus. Parallel Comput 38(10\u201311):552\u2013575. https:\/\/doi.org\/10.1016\/j.parco.2012.07.002","journal-title":"Parallel Comput"},{"issue":"5","key":"5949_CR28","doi-asserted-by":"publisher","first-page":"115","DOI":"10.1145\/1837853.1693471","volume":"45","author":"JW Choi","year":"2010","unstructured":"Choi JW, Singh A, Vuduc RW (2010) Model-driven autotuning of sparse matrix-vector multiply on gpus. ACM Sigplan Notices 45(5):115\u2013126","journal-title":"ACM Sigplan Notices"},{"issue":"3","key":"5949_CR29","doi-asserted-by":"publisher","first-page":"205","DOI":"10.1080\/17445760802337010","volume":"24","author":"L Buatois","year":"2009","unstructured":"Buatois L, Caumon G, L\u00e9vy B (2009) Concurrent number cruncher: a gpu implementation of a general sparse linear solver. Int J Parallel Emerg Distrib Syst 24(3):205\u2013223. https:\/\/doi.org\/10.1080\/17445760802337010","journal-title":"Int J Parallel Emerg Distrib Syst"},{"key":"5949_CR30","doi-asserted-by":"publisher","DOI":"10.1002\/9781119604570","volume-title":"An introduction to numerical methods and analysis","author":"JF Epperson","year":"2021","unstructured":"Epperson JF (2021) An introduction to numerical methods and analysis, 3rd edn. Wiley, Hoboken","edition":"3"},{"key":"5949_CR31","unstructured":"NVIDIA: Volta Architecture Whitepaper (2017). https:\/\/images.nvidia.com\/content\/volta-architecture\/pdf\/volta-architecture-whitepaper.pdf"},{"key":"5949_CR32","unstructured":"NVIDIA: Turing Architecture Whitepaper (2018). https:\/\/www.nvidia.com\/content\/dam\/en-zz\/Solutions\/design-visualization\/technologies\/turing-architecture\/NVIDIA-Turing-Architecture-Whitepaper.pdf"},{"key":"5949_CR33","unstructured":"NVIDIA: Ampere Architecture Whitepaper (2021). https:\/\/www.nvidia.com\/content\/dam\/en-zz\/Solutions\/Data-Center\/nvidia-ampere-architecture-whitepaper.pdf"},{"key":"5949_CR34","doi-asserted-by":"publisher","first-page":"111","DOI":"10.1007\/978-3-642-11515-8_10","volume-title":"High performance embedded architectures and compilers","author":"A Monakov","year":"2010","unstructured":"Monakov A, Lokhmotov A, Avetisyan A (2010) Automatically tuning sparse matrix-vector multiplication for gpu architectures. In: Hutchison D, Kanade T, Kittler J, Kleinberg JM, Mattern F, Mitchell JC, Naor M, Nierstrasz O, Pandu Rangan C, Steffen B, Sudan M, Terzopoulos D, Tygar D, Vardi MY, Weikum G, Patt YN, Foglia P, Duesterwald E, Faraboschi P, Martorell X (eds) High performance embedded architectures and compilers, vol 5952. Springer, Berlin, pp 111\u2013125. https:\/\/doi.org\/10.1007\/978-3-642-11515-8_10"},{"key":"5949_CR35","unstructured":"Anzt H, Tomov S, Dongarra J (2014) Implementing a sparse matrix vector product for the sell-c\/sell-c-$$\\sigma $$ formats on nvidia gpus. University of Tennessee, Tech. Rep. ut-eecs-14-727"},{"key":"5949_CR36","doi-asserted-by":"publisher","unstructured":"Yan S, Li C, Zhang Y, Zhou H (2014) yaspmv: yet another spmv framework on gpus. In: Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, Orlando Florida USA, pp 107\u2013118. https:\/\/doi.org\/10.1145\/2555243.2555255","DOI":"10.1145\/2555243.2555255"},{"key":"5949_CR37","doi-asserted-by":"publisher","unstructured":"Merrill D, Garland M (2016) Merge-based parallel sparse matrix-vector multiplication. In: SC16: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Salt Lake City, UT, USA, pp 678\u2013689. https:\/\/doi.org\/10.1109\/SC.2016.57","DOI":"10.1109\/SC.2016.57"},{"key":"5949_CR38","doi-asserted-by":"publisher","first-page":"697","DOI":"10.1007\/978-3-319-64203-1_50","volume-title":"Euro-Par 2017: parallel processing","author":"G Flegar","year":"2017","unstructured":"Flegar G, Quintana-Ort\u00ed ES (2017) Balanced csr sparse matrix-vector product on graphics processors. In: Rivera FF, Pena TF, Cabaleiro JC (eds) Euro-Par 2017: parallel processing, vol 10417. Springer, Cham, pp 697\u2013709. https:\/\/doi.org\/10.1007\/978-3-319-64203-1_50"},{"key":"5949_CR39","doi-asserted-by":"publisher","first-page":"73","DOI":"10.1007\/978-981-13-5910-1_7","volume-title":"Big scientific data benchmarks, architecture, and systems","author":"Y Xia","year":"2019","unstructured":"Xia Y, Gao J, He G (2019) A parallel solving algorithm on gpu for the time-domain linear system with diagonal sparse matrices. In: Ren R, Zheng C, Zhan J (eds) Big scientific data benchmarks, architecture, and systems, vol 911. Springer, Singapore, pp 73\u201384. https:\/\/doi.org\/10.1007\/978-981-13-5910-1_7"},{"key":"5949_CR40","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.6230","author":"G He","year":"2021","unstructured":"He G, Chen Q, Gao J (2021) A new diagonal storage for efficient implementation of sparse matrix-vector multiplication on graphics processing unit. Concurr Comput Pract Exp. https:\/\/doi.org\/10.1002\/cpe.6230","journal-title":"Concurr Comput Pract Exp"},{"issue":"2","key":"5949_CR41","doi-asserted-by":"publisher","first-page":"183","DOI":"10.1177\/1094342013501126","volume":"28","author":"W Yang","year":"2014","unstructured":"Yang W, Li K, Liu Y, Shi L, Wan L (2014) Optimization of quasi-diagonal matrix\u2013vector multiplication on gpu. Int J High Perform Comput Appl 28(2):183\u2013195. https:\/\/doi.org\/10.1177\/1094342013501126","journal-title":"Int J High Perform Comput Appl"},{"key":"5949_CR42","doi-asserted-by":"publisher","first-page":"152","DOI":"10.1016\/j.jcss.2017.09.010","volume":"92","author":"W Yang","year":"2018","unstructured":"Yang W, Li K, Li K (2018) A parallel computing method using blocked format with optimal partitioning for spmv on gpu. J Comput Syst Sci 92:152\u2013170. https:\/\/doi.org\/10.1016\/j.jcss.2017.09.010","journal-title":"J Comput Syst Sci"},{"key":"5949_CR43","doi-asserted-by":"publisher","unstructured":"Niu Y, Lu Z, Dong M, Jin Z, Liu W, Tan G (2021) Tilespmv: a tiled algorithm for sparse matrix-vector multiplication on gpus. In: 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, Portland, OR, USA, pp 68\u201378. https:\/\/doi.org\/10.1109\/ipdps49936.2021.00016","DOI":"10.1109\/ipdps49936.2021.00016"},{"key":"5949_CR44","doi-asserted-by":"publisher","unstructured":"Willcock J, Lumsdaine A (2006) Accelerating sparse matrix computations via data compression. In: Proceedings of the 20th Annual International Conference on Supercomputing. ACM, Cairns Queensland Australia, pp 307\u2013316. https:\/\/doi.org\/10.1145\/1183401.1183444","DOI":"10.1145\/1183401.1183444"},{"key":"5949_CR45","doi-asserted-by":"publisher","unstructured":"Kourtis K, Goumas G, Koziris N (2008) Optimizing sparse matrix-vector multiplication using index and value compression. In: Proceedings of the 5th Conference on Computing Frontiers. ACM, Ischia Italy, pp 87\u201396. https:\/\/doi.org\/10.1145\/1366230.1366244","DOI":"10.1145\/1366230.1366244"},{"key":"5949_CR46","doi-asserted-by":"publisher","first-page":"83","DOI":"10.1007\/978-3-030-71593-9_7","volume-title":"Euro-Par 2020: parallel processing workshops","author":"JI Aliaga","year":"2021","unstructured":"Aliaga JI, Anzt H, Quintana-Ort\u00ed ES, Tom\u00e1s AE, Tsai YM (2021) Balanced and compressed coordinate layout for the sparse matrix-vector product on gpus. In: Balis B, Heras DB, Antonelli L, Bracciali A, Gruber T, Hyun-Wook J, Kuhn M, Scott SL, Unat D, Wyrzykowski R (eds) Euro-Par 2020: parallel processing workshops, vol 12480. Springer, Cham, pp 83\u201395. https:\/\/doi.org\/10.1007\/978-3-030-71593-9_7"},{"key":"5949_CR47","doi-asserted-by":"publisher","DOI":"10.1016\/j.compeleceng.2020.106848","volume":"88","author":"O Zachariadis","year":"2020","unstructured":"Zachariadis O, Satpute N, G\u00f3mez-Luna J, Olivares J (2020) Accelerating sparse matrix-matrix multiplication with gpu tensor cores. Comput Electr Eng 88:106848","journal-title":"Comput Electr Eng"},{"issue":"4","key":"5949_CR48","doi-asserted-by":"publisher","first-page":"344","DOI":"10.1177\/10943420211003313","volume":"35","author":"A Abdelfattah","year":"2021","unstructured":"...Abdelfattah A, Anzt H, Boman EG, Carson E, Cojean T, Dongarra J, Fox A, Gates M, Higham NJ, Li XS, Loe J, Luszczek P, Pranesh S, Rajamanickam S, Ribizel T, Smith BF, Swirydowicz K, Thomas S, Tomov S, Tsai YM, Yang UM (2021) A survey of numerical linear algebra methods utilizing mixed-precision arithmetic. Int J High Perform Comput Appl 35(4):344\u2013369. https:\/\/doi.org\/10.1177\/10943420211003313","journal-title":"Int J High Perform Comput Appl"},{"key":"5949_CR49","doi-asserted-by":"publisher","first-page":"347","DOI":"10.1017\/S0962492922000022","volume":"31","author":"NJ Higham","year":"2022","unstructured":"Higham NJ, Mary T (2022) Mixed precision algorithms in numerical linear algebra. Acta Numer 31:347\u2013414. https:\/\/doi.org\/10.1017\/S0962492922000022","journal-title":"Acta Numer"},{"key":"5949_CR50","unstructured":"NVIDIA: cuSPARSE Library (2021). https:\/\/docs.nvidia.com\/cuda\/archive\/11.2.1\/cusparse\/index.html"},{"key":"5949_CR51","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.6515","author":"JI Aliaga","year":"2022","unstructured":"Aliaga JI, Anzt H, Gr\u00fctzmacher T, Quintana-Ort\u00ed ES, Tom\u00e1s AE (2022) Compression and load balancing for efficient sparse matrix-vector product on multicore processors and graphics processing units. Concurr Comput Pract Exp. https:\/\/doi.org\/10.1002\/cpe.6515","journal-title":"Concurr Comput Pract Exp"},{"issue":"1","key":"5949_CR52","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/2049662.2049663","volume":"38","author":"TA Davis","year":"2011","unstructured":"Davis TA, Hu Y (2011) The University of Florida sparse matrix collection. ACM Trans Math Softw 38(1):1\u201325. https:\/\/doi.org\/10.1145\/2049662.2049663","journal-title":"ACM Trans Math Softw"}],"container-title":["The Journal of Supercomputing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11227-024-05949-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11227-024-05949-6\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11227-024-05949-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,6,10]],"date-time":"2024-06-10T11:14:01Z","timestamp":1718018041000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11227-024-05949-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,11]]},"references-count":52,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2024,7]]}},"alternative-id":["5949"],"URL":"https:\/\/doi.org\/10.1007\/s11227-024-05949-6","relation":{},"ISSN":["0920-8542","1573-0484"],"issn-type":[{"value":"0920-8542","type":"print"},{"value":"1573-0484","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,3,11]]},"assertion":[{"value":"29 January 2024","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 March 2024","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}