{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T00:15:36Z","timestamp":1775607336350,"version":"3.50.1"},"reference-count":24,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2025,5,13]],"date-time":"2025-05-13T00:00:00Z","timestamp":1747094400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"ETH Research Commission","award":["ETH-03 21-1"],"award-info":[{"award-number":["ETH-03 21-1"]}]},{"name":"European Research Council","award":["833848-UEMHP"],"award-info":[{"award-number":["833848-UEMHP"]}]},{"name":"Swiss National Supercomputer Centre (CSCS) under project ETHZ-CSCS-LP01","award":["465000728"],"award-info":[{"award-number":["465000728"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Parallel Comput."],"published-print":{"date-parts":[[2025,6,30]]},"abstract":"<jats:p>In this article, we present PfSolve\u2014a new, performant, cross-platform, and open-source implementation of tridiagonal and bidiagonal matrix solvers for the GPU architecture. Released as a stand-alone library, PfSolve can solve systems of arbitrary size that fit into the memory of a single GPU with a potential extension to multi-GPU support in the future. The code works in single, double, and double-double emulation of quad precision using only 0.1% of the original system size as additional memory. PfSolve is based on the in-house implementation of the Parallel Thomas algorithm optimized for GPU execution by using warp-level instructions and occupancy optimizations, which are discussed in detail in the article. This work also presents an accuracy analysis of the Parallel Thomas algorithm for tridiagonal matrices with various dominance factors (approximately, the ratio of the off-diagonal to diagonal terms) and demonstrates that PfSolve achieves a considerable speedup over vendor solutions on modern HPC GPUs like Nvidia H100 and AMD MI210. The source code for PfSolve is available on GitHub.<\/jats:p>","DOI":"10.1145\/3716171","type":"journal-article","created":{"date-parts":[[2025,2,4]],"date-time":"2025-02-04T06:44:01Z","timestamp":1738651441000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["High Performance Solution of Tridiagonal Systems on the GPU"],"prefix":"10.1145","volume":"12","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5494-7983","authenticated-orcid":false,"given":"Dmitrii","family":"Tolmachev","sequence":"first","affiliation":[{"name":"Institute of Geophysics, ETH Zurich, Zurich, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3936-1503","authenticated-orcid":false,"given":"Philippe","family":"Marti","sequence":"additional","affiliation":[{"name":"Institute of Geophysics, ETH Zurich, Zurich, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0278-6951","authenticated-orcid":false,"given":"Giacomo","family":"Castiglioni","sequence":"additional","affiliation":[{"name":"Institute of Geophysics, ETH Zurich, Zurich, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1821-4114","authenticated-orcid":false,"given":"Andrew","family":"Jackson","sequence":"additional","affiliation":[{"name":"Institute of Geophysics, ETH Zurich, Zurich, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,5,13]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"AMD. 2015. OpenCL Performance and Optimization. Retrieved April 18 2024 from https:\/\/www.amd.com\/content\/dam\/amd\/en\/documents\/radeon-tech-docs\/programmer-references\/AMD_OpenCL_Programming_Optimization_Guide2.pdf"},{"key":"e_1_3_1_3_2","unstructured":"AMD. 2018. rocSPARSE Library. Retrieved April 18 2024 from https:\/\/rocm.docs.amd.com\/projects\/rocSPARSE\/en\/latest\/"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2021.3130544"},{"key":"e_1_3_1_5_2","volume-title":"Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC\u201912)","author":"Chang Li-Wen","year":"2012","unstructured":"Li-Wen Chang, John A. Stratton, Hee-Seok Kim, and Wen-Mei W. Hwu. 2012. A scalable, numerically stable, high-performance tridiagonal solver using GPUs. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC\u201912). IEEE Computer Society Press, Washington, DC, USA, Article 27, 11 pages."},{"key":"e_1_3_1_6_2","unstructured":"chipsandcheese.com. 2021. Measuring GPU Memory Latency. Retrieved April 18 2024 from https:\/\/chipsandcheese.com\/2021\/04\/16\/measuring-gpu-memory-latency\/"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2017.2723879"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/5289.911175"},{"key":"e_1_3_1_9_2","first-page":"104","article-title":"Book review: Parallel computers: Architecture, programming and algorithms. R.W. Hockney and C.R. Jesshope, Adam-Hilger, Bristol, 1981. xii + 416 pages. \u00a322.50","volume":"27","author":"Eastwood J. W.","year":"1982","unstructured":"J. W. Eastwood. 1982. Book review: Parallel computers: Architecture, programming and algorithms. R.W. Hockney and C.R. Jesshope, Adam-Hilger, Bristol, 1981. xii + 416 pages. \u00a322.50. Computer Physics Communications 27, 1 (1982), 104\u2013104. Retrieved from https:\/\/api.semanticscholar.org\/CorpusID:61108248","journal-title":"Computer Physics Communications"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1137\/20M1311053"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/ARITH.2001.930115"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cpc.2020.107722"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3580373"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/2830568"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2015.10.056"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3468267.3470620"},{"key":"e_1_3_1_17_2","unstructured":"NVIDIA. 2017. cuSPARSE Library. Retrieved April 18 2024 from https:\/\/docs.nvidia.com\/cuda\/cusparse"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1017\/S0962492920000045"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.2307\/2004053"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1051\/meca\/2020013"},{"key":"e_1_3_1_21_2","volume-title":"Elliptic Problems in Linear Difference Equations Over a Network, Watson Scientific Computing Laboratory Report","author":"Thomas L.","year":"1949","unstructured":"L. Thomas. 1949. Elliptic Problems in Linear Difference Equations Over a Network, Watson Scientific Computing Laboratory Report. Technical Report. Columbia University, New York."},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2023.3242240"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3585341.3585357"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2015.03.008"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/355945.355947"}],"container-title":["ACM Transactions on Parallel Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3716171","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3716171","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:18:49Z","timestamp":1750295929000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3716171"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,13]]},"references-count":24,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2025,6,30]]}},"alternative-id":["10.1145\/3716171"],"URL":"https:\/\/doi.org\/10.1145\/3716171","relation":{},"ISSN":["2329-4949","2329-4957"],"issn-type":[{"value":"2329-4949","type":"print"},{"value":"2329-4957","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,5,13]]},"assertion":[{"value":"2024-05-03","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-01-20","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-05-13","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}