{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,12]],"date-time":"2025-07-12T22:44:13Z","timestamp":1752360253328,"version":"3.38.0"},"reference-count":33,"publisher":"SAGE Publications","issue":"1","license":[{"start":{"date-parts":[[2024,9,30]],"date-time":"2024-09-30T00:00:00Z","timestamp":1727654400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"funder":[{"name":"Exascale Computing Project","award":["17-SC-20-S"],"award-info":[{"award-number":["17-SC-20-S"]}]},{"DOI":"10.13039\/100006132","name":"Office of Science of the U.S. DOE","doi-asserted-by":"crossref","award":["DE-AC02-06CH11357, DE-AC05-00OR22725"],"award-info":[{"award-number":["DE-AC02-06CH11357, DE-AC05-00OR22725"]}],"id":[{"id":"10.13039\/100006132","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2025,1]]},"abstract":"<jats:p> We present the GPU implementation efforts and challenges of the sparse solver package STRUMPACK. The code is made publicly available on github with a permissive BSD license. STRUMPACK implements an approximate multifrontal solver, a sparse LU factorization which makes use of compression methods to accelerate time to solution and reduce memory usage. Multiple compression schemes based on rank-structured and hierarchical matrix approximations are supported, including hierarchically semi-separable, hierarchically off-diagonal butterfly, and block low rank. In this paper, we present the GPU implementation of the block low rank (BLR) compression method within a multifrontal solver. Our GPU implementation relies on highly optimized vendor libraries such as cuBLAS and cuSOLVER for NVIDIA GPUs, rocBLAS and rocSOLVER for AMD GPUs and the Intel oneAPI Math Kernel Library (oneMKL) for Intel GPUs. Additionally, we rely on external open source libraries such as SLATE (Software for Linear Algebra Targeting Exascale), MAGMA (Matrix Algebra on GPU and Multi-core Architectures), and KBLAS (KAUST BLAS). SLATE is used as a GPU-capable ScaLAPACK replacement. From MAGMA we use variable sized batched dense linear algebra operations such as GEMM, TRSM and LU with partial pivoting. KBLAS provides efficient (batched) low rank matrix compression for NVIDIA GPUs using an adaptive randomized sampling scheme. The resulting sparse solver and preconditioner runs on NVIDIA, AMD and Intel GPUs. Interfaces are available from PETSc, Trilinos and MFEM, or the solver can be used directly in user code. We report results for a range of benchmark applications, using the Perlmutter system from NERSC, Frontier from ORNL, and Aurora from ALCF. For a high frequency wave equation on a regular mesh, using 32 Perlmutter compute nodes, the factorization phase of the exact GPU solver is about 6.5\u00d7 faster compared to the CPU-only solver. The BLR-enabled GPU solver is about 13.8\u00d7 faster than the CPU exact solver. For a collection of SuiteSparse matrices, the STRUMPACK exact factorization on a single GPU is on average 1.9\u00d7 faster than NVIDIA\u2019s cuDSS solver. <\/jats:p>","DOI":"10.1177\/10943420241288567","type":"journal-article","created":{"date-parts":[[2024,9,30]],"date-time":"2024-09-30T08:38:44Z","timestamp":1727685524000},"page":"18-31","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":1,"title":["A graphics processing unit accelerated sparse direct solver and preconditioner with block low rank compression"],"prefix":"10.1177","volume":"39","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5854-157X","authenticated-orcid":false,"given":"Lisa","family":"Claus","sequence":"first","affiliation":[{"name":"National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory, Berkeley, CA, USA"}]},{"given":"Pieter","family":"Ghysels","sequence":"additional","affiliation":[{"name":"Applied Mathematics and Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA"}]},{"given":"Wajih Halim","family":"Boukaram","sequence":"additional","affiliation":[{"name":"Applied Mathematics and Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0747-698X","authenticated-orcid":false,"given":"Xiaoye Sherry","family":"Li","sequence":"additional","affiliation":[{"name":"Applied Mathematics and Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA"}]}],"member":"179","published-online":{"date-parts":[[2024,9,30]]},"reference":[{"key":"bibr1-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1145\/2818311"},{"key":"bibr2-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1137\/S0895479899358194"},{"key":"bibr3-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1145\/3242094"},{"key":"bibr4-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1002\/nme.5196"},{"key":"bibr5-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1016\/j.camwa.2020.06.009"},{"key":"bibr6-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1137\/18M1189348"},{"key":"bibr7-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2021.3094091"},{"key":"bibr8-10943420241288567","unstructured":"Balay S, Abhyankar S, Adams MF, et al. (2023) PETSc Web page. https:\/\/petsc.org\/."},{"key":"bibr9-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1137\/18M1210101"},{"key":"bibr10-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1145\/1391989.1391995"},{"key":"bibr11-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2007.12.001"},{"key":"bibr12-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1145\/3611662"},{"key":"bibr13-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1145\/992200.992206"},{"key":"bibr14-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1145\/2049662.2049670"},{"key":"bibr15-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1137\/S0895479897317661"},{"key":"bibr16-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1093\/acprof:oso\/9780198508380.001.0001"},{"key":"bibr17-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1145\/3295500.3356223"},{"key":"bibr18-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2022.102897"},{"key":"bibr19-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1137\/15M1010117"},{"key":"bibr20-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2017.21"},{"volume-title":"WSMP: Watson Sparse Matrix Package Part II \u2013 Direct Solution of General Systems","year":"2000","author":"Gupta A","key":"bibr21-10943420241288567"},{"key":"bibr22-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1016\/S0167-8191(01)00141-7"},{"key":"bibr23-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1145\/1089014.1089021"},{"key":"bibr24-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1093\/imanum\/drab020"},{"key":"bibr25-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1137\/S1064827595287997"},{"key":"bibr26-10943420241288567","unstructured":"Li XS, Demmel JW (1998) Making sparse Gaussian elimination scalable by static pivoting Proceedings of SC98: High Performance Networking and Computing Conference, Orlando, Florida, 07-13 November 1998."},{"key":"bibr27-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1145\/779359.779361"},{"key":"bibr28-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1137\/20M1349667"},{"volume-title":"Block Low-Rank Multifrontal Solvers: Complexity, Performance, and Scalability","year":"2017","author":"Mary T","key":"bibr29-10943420241288567"},{"key":"bibr30-10943420241288567","unstructured":"Nies R, Hoelzl M (2019) Testing performance with and without block low rank compression in MUMPS and the new PaStiX 6.0 for JOREK nonlinear MHD simulations. arXiv e-prints : arXiv:1907.13442."},{"key":"bibr31-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1190\/1.2759835"},{"key":"bibr32-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-61142-8_588"},{"key":"bibr33-10943420241288567","doi-asserted-by":"publisher","DOI":"10.1006\/jpdc.1997.1410"}],"container-title":["The International Journal of High Performance Computing Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/10943420241288567","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/10943420241288567","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/10943420241288567","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,2]],"date-time":"2025-03-02T04:56:19Z","timestamp":1740891379000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/10943420241288567"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,9,30]]},"references-count":33,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,1]]}},"alternative-id":["10.1177\/10943420241288567"],"URL":"https:\/\/doi.org\/10.1177\/10943420241288567","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"type":"print","value":"1094-3420"},{"type":"electronic","value":"1741-2846"}],"subject":[],"published":{"date-parts":[[2024,9,30]]}}}