{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T10:20:04Z","timestamp":1775038804584,"version":"3.50.1"},"reference-count":23,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2026,1,23]],"date-time":"2026-01-23T00:00:00Z","timestamp":1769126400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,1,23]],"date-time":"2026-01-23T00:00:00Z","timestamp":1769126400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62172389"],"award-info":[{"award-number":["62172389"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["CCF Trans. HPC"],"published-print":{"date-parts":[[2026,4]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    We have optimized the parallel threshold ILU algorithm (ParILUT) for GPUs. The optimizations are for three building blocks: candidate search and ILU residual computation, adding and removing elements, and threshold selection. Firstly, we fuse candidate search and ILU residual computation by modifying the ParILUT algorithm and extending the register-aware SpGEMM algorithm to calculate it. At the same time, we developed a GPU bin search algorithm to make the register-aware SpGEMM algorithm perform better in ParILUT. Secondly, we adopt a warp-row-parallel approach to add elements to new\n                    <jats:italic>L<\/jats:italic>\n                    and\n                    <jats:italic>U<\/jats:italic>\n                    and remove elements from candidates instead of the thread-row-parallel approach. And used the efficient GPU instructions to locate the positions of elements. Thirdly, we proposed a balanced classification tree in the threshold selection to balance the buckets\u2019 data, when a large number of elements with the same value. Finally,we experimented with the performance of each optimization and the whole ParILUT. And verified the correctness of the optimized ParILUT. The result indicates that the optimized ParILUT average speedup is 4.03 times over the original version, and the speedup increases with the amount of fill-in.\n                  <\/jats:p>","DOI":"10.1007\/s42514-025-00269-4","type":"journal-article","created":{"date-parts":[[2026,1,23]],"date-time":"2026-01-23T07:13:07Z","timestamp":1769152387000},"page":"196-209","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Optimization of the ParILUT-GPU algorithm"],"prefix":"10.1007","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-1083-7894","authenticated-orcid":false,"given":"Shaofeng","family":"Yang","sequence":"first","affiliation":[]},{"given":"Zhi","family":"Li","sequence":"additional","affiliation":[]},{"given":"Yunting","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Xin","family":"He","sequence":"additional","affiliation":[]},{"given":"Guangming","family":"Tan","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2026,1,23]]},"reference":[{"key":"269_CR1","doi-asserted-by":"crossref","unstructured":"Anzt, H., et al.: ParILUT-a parallel threshold ILU for GPUs. In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, (2019)","DOI":"10.1109\/IPDPS.2019.00033"},{"issue":"4","key":"269_CR2","doi-asserted-by":"publisher","first-page":"C503","DOI":"10.1137\/16M1079506","volume":"40","author":"H Anzt","year":"2018","unstructured":"Anzt, H., Chow, E., Dongarra, J.: ParILUT\u2013a new parallel threshold ILU factorization. SIAM J. Sci. Comput. 40(4), C503\u2013C519 (2018)","journal-title":"SIAM J. Sci. Comput."},{"issue":"7\u20138","key":"269_CR3","doi-asserted-by":"publisher","first-page":"635","DOI":"10.1002\/1099-1506(200010\/12)7:7\/8<635::AID-NLA216>3.0.CO;2-B","volume":"7","author":"A Basermann","year":"2000","unstructured":"Basermann, A.: Parallel block ILUT\/ILDLT preconditioning for sparse eigenproblems and sparse linear systems\u2019\u2019. Numer. Linear Algebra Appl. 7(7\u20138), 635\u2013648 (2000)","journal-title":"Numer. Linear Algebra Appl."},{"key":"269_CR4","first-page":"88","volume":"8","author":"M Benzi","year":"1999","unstructured":"Benzi, M., Joubert, W., Mateescu, G.: Numerical experiments with parallel orderings for ILU preconditioners\u2019\u2019. Electron. Trans. Numer. Anal. 8, 88\u2013114 (1999)","journal-title":"Electron. Trans. Numer. Anal."},{"key":"269_CR5","doi-asserted-by":"crossref","unstructured":"Chan, T.F., Van Der Vorst, H.A.: Approximate and incomplete factorizations. In: Parallel numerical algorithms. Dordrecht: Springer Netherlands, 167-202 (1997)","DOI":"10.1007\/978-94-011-5412-3_6"},{"key":"269_CR6","doi-asserted-by":"crossref","unstructured":"Chow, E., Anzt, H., Dongarra, J.: Asynchronous iterative algorithm for computing incomplete factorizations on GPUs. High Performance Computing: 30th International Conference, ISC High Performance 2015, Frankfurt, Germany, July 12-16, 2015, Proceedings 30. Springer International Publishing, (2015)","DOI":"10.1007\/978-3-319-20119-1_1"},{"issue":"2","key":"269_CR7","doi-asserted-by":"publisher","first-page":"C169","DOI":"10.1137\/140968896","volume":"37","author":"E Chow","year":"2015","unstructured":"Chow, E., Patel, A.: Fine-grained parallel incomplete LU factorization. SIAM J. Sci. Comput. 37(2), C169\u2013C193 (2015)","journal-title":"SIAM J. Sci. Comput."},{"issue":"7","key":"269_CR8","doi-asserted-by":"publisher","first-page":"739","DOI":"10.1002\/(SICI)1097-0363(19971015)25:73.0.CO;2-Y","volume":"25","author":"E Chow","year":"1997","unstructured":"Chow, E., Saad, Y.: ILUS: An incomplete LU preconditioner in sparse skyline format. Int. J. Numer. Methods Fluids 25(7), 739\u2013748 (1997). https:\/\/doi.org\/10.1002\/(SICI)1097-0363(19971015)25:73.0.CO;2-Y","journal-title":"Int. J. Numer. Methods Fluids"},{"issue":"1","key":"269_CR9","first-page":"1","volume":"38","author":"TA Davis","year":"2011","unstructured":"Davis, T.A., Yifan, H.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. (TOMS) 38(1), 1\u201325 (2011)","journal-title":"ACM Trans. Math. Softw. (TOMS)"},{"issue":"5","key":"269_CR10","doi-asserted-by":"publisher","first-page":"417","DOI":"10.1016\/0168-9274(91)90011-N","volume":"7","author":"S Doi","year":"1991","unstructured":"Doi, S.: On parallelism and convergence of incomplete LU factorizations\u2019\u2019. Appl. Numer. Math. 7(5), 417\u2013436 (1991)","journal-title":"Appl. Numer. Math."},{"key":"269_CR11","unstructured":"https:\/\/docs.nvidia.com\/cuda\/cusparse\/index.html"},{"key":"269_CR12","unstructured":"Hysom, D., Pothen, A.: Level-based incomplete LU factorization: Graph model and algorithms. Preprint UCRL-JC-150789, US Department of Energy, 17 (2002)"},{"issue":"6","key":"269_CR13","doi-asserted-by":"publisher","first-page":"2194","DOI":"10.1137\/S1064827500376193","volume":"22","author":"D Hysom","year":"2001","unstructured":"Hysom, D., Pothen, A.: A scalable parallel algorithm for incomplete factor preconditioning\u2019\u2019. SIAM J. Sci. Comput. 22(6), 2194\u20132215 (2001)","journal-title":"SIAM J. Sci. Comput."},{"key":"269_CR14","unstructured":"Innovative Computing Lab: Software distribution of MAGMA, (2023). http:\/\/icl.cs.utk.edu\/magma\/"},{"issue":"1","key":"269_CR15","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1145\/200979.200981","volume":"21","author":"MT Jones","year":"1995","unstructured":"Jones, M.T., Plassmann, P.E.: An improved incomplete Cholesky factorization. ACM Trans. Math. Softw. (TOMS) 21(1), 5\u201317 (1995)","journal-title":"ACM Trans. Math. Softw. (TOMS)"},{"key":"269_CR16","doi-asserted-by":"crossref","unstructured":"Karypis, G., Kumar, V.: Parallel threshold-based ILU factorization, in 1997 ACM\/IEEE Conference on Supercomputing, , pp. 1\u201324 (1997)","DOI":"10.1145\/509593.509621"},{"issue":"2","key":"269_CR17","doi-asserted-by":"publisher","first-page":"716","DOI":"10.1137\/S1064827502405094","volume":"25","author":"N Li","year":"2003","unstructured":"Li, N., Saad, Y., Chow, E.: Crout versions of ILU for general sparse matrices. SIAM J. Sci. Comput. 25(2), 716\u2013728 (2003)","journal-title":"SIAM J. Sci. Comput."},{"key":"269_CR18","doi-asserted-by":"crossref","unstructured":"Liesen, J., Strakos, Z.: Krylov subspace methods: principles and analysis. Numer. Math. Scie., (2013)","DOI":"10.1093\/acprof:oso\/9780199655410.001.0001"},{"issue":"3","key":"269_CR19","doi-asserted-by":"publisher","first-page":"403","DOI":"10.1007\/s10766-018-0604-8","volume":"47","author":"J Liu","year":"2019","unstructured":"Liu, J., et al.: Register-aware optimizations for parallel sparse matrix-matrix multiplication. Int. J. Parallel Program. 47(3), 403\u2013417 (2019)","journal-title":"Int. J. Parallel Program."},{"key":"269_CR20","unstructured":"Lukarski, D.: Parallel sparse linear algebra for multi-core and manycore platforms - parallel solvers and preconditioners, Ph.D. dissertation, Karlsruhe Institute of Technology (KIT), Germany, (2012)"},{"key":"269_CR21","doi-asserted-by":"publisher","first-page":"1394","DOI":"10.1137\/0724090","volume":"24","author":"EL Poole","year":"1987","unstructured":"Poole, E.L., Ortega, J.M.: Multicolor ICCG methods for vector computers\u2019\u2019. SIAM J. Numer. Anal. 24, 1394\u20131417 (1987)","journal-title":"SIAM J. Numer. Anal."},{"issue":"155","key":"269_CR22","doi-asserted-by":"publisher","first-page":"105","DOI":"10.1090\/S0025-5718-1981-0616364-6","volume":"37","author":"Y Saad","year":"1981","unstructured":"Saad, Y.: Krylov subspace methods for solving large unsymmetric linear systems. Math. Comput. 37(155), 105\u2013126 (1981)","journal-title":"Math. Comput."},{"issue":"4","key":"269_CR23","doi-asserted-by":"publisher","first-page":"387","DOI":"10.1002\/nla.1680010405","volume":"1","author":"Y Saad","year":"1994","unstructured":"Saad, Y.: ILUT: A dual threshold incomplete LU factorization. Numer. Linear Algebra Appl. 1(4), 387\u2013402 (1994)","journal-title":"Numer. Linear Algebra Appl."}],"container-title":["CCF Transactions on High Performance Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s42514-025-00269-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s42514-025-00269-4","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s42514-025-00269-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T08:08:33Z","timestamp":1775030913000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s42514-025-00269-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,23]]},"references-count":23,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,4]]}},"alternative-id":["269"],"URL":"https:\/\/doi.org\/10.1007\/s42514-025-00269-4","relation":{},"ISSN":["2524-4922","2524-4930"],"issn-type":[{"value":"2524-4922","type":"print"},{"value":"2524-4930","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,23]]},"assertion":[{"value":"4 November 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"15 November 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 January 2026","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}