{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,4]],"date-time":"2026-05-04T23:29:23Z","timestamp":1777937363267,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":38,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,8,9]],"date-time":"2021-08-09T00:00:00Z","timestamp":1628467200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"U.S. DOE Office of Science, Office of Advanced Scientific Computing Research, CENATE project","award":["66150"],"award-info":[{"award-number":["66150"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,8,9]]},"DOI":"10.1145\/3472456.3472478","type":"proceedings-article","created":{"date-parts":[[2021,10,5]],"date-time":"2021-10-05T18:39:57Z","timestamp":1633459197000},"page":"1-11","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures"],"prefix":"10.1145","author":[{"given":"CHENHAO","family":"XIE","sequence":"first","affiliation":[{"name":"Pacific Northwest National Laboratory, United States of America"}]},{"given":"Jieyang","family":"Chen","sequence":"additional","affiliation":[{"name":"Oak Ridge National Laboratory, United States of America"}]},{"given":"Jesun","family":"Firoz","sequence":"additional","affiliation":[{"name":"Pacific Northwest National Laboratory, United States of America"}]},{"given":"Jiajia","family":"Li","sequence":"additional","affiliation":[{"name":"Pacific Northwest National Laboratory, United States of America"}]},{"given":"Shuaiwen Leon","family":"Song","sequence":"additional","affiliation":[{"name":"University of Sydney, 
Australia"}]},{"given":"Kevin","family":"Barker","sequence":"additional","affiliation":[{"name":"Pacific Northwest National Laboratory, United States of America"}]},{"given":"Mark","family":"Raugas","sequence":"additional","affiliation":[{"name":"Pacific Northwest National Laboratory, United States of America"}]},{"given":"Ang","family":"Li","sequence":"additional","affiliation":[{"name":"Pacific Northwest National Laboratory, United States of America"}]}],"member":"320","published-online":{"date-parts":[[2021,10,5]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-662-48096-0_50"},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080231"},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611974690.ch2"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/2020373.2020375"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/2931088.2931091"},{"key":"e_1_3_2_1_6_1","volume-title":"Davis and Yifan Hu","author":"A.","year":"2011","unstructured":"Timothy\u00a0A. Davis and Yifan Hu. 2011. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw. (2011)."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611976137.9"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/567806.567810"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"crossref","unstructured":"I.\u00a0S. Duff and J.\u00a0K. Reid. 1996. The Design of MA48: A Code for the Direct Solution of Sparse Unsymmetric Linear Systems of Equations. ACM Trans. Math. 
Softw. (1996).","DOI":"10.1145\/229473.229476"},{"key":"e_1_3_2_1_10_1","volume-title":"2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS).","author":"Dufrechou E.","unstructured":"E. Dufrechou and P. Ezzatti. 2018. A New GPU Algorithm to Compute a Level Set-Based Analysis for the Parallel Solution of Sparse Triangular Systems. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS)."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2807591.2807667"},{"key":"e_1_3_2_1_12_1","unstructured":"STFC Rutherford\u00a0Appleton Laboratory. 2019. HSL. A collection of Fortran codes for large scale scientific computation. http:\/\/www.hsl.rl.ac.uk\/."},{"key":"e_1_3_2_1_13_1","volume-title":"2018 IEEE International Symposium on Workload Characterization.","author":"Li Ang","unstructured":"Ang Li, Shuaiwen\u00a0Leon Song, Jieyang Chen, Xu Liu, Nathan Tallent, and Kevin Barker. [n.d.]. Tartan: evaluating modern GPU interconnect via a multi-GPU benchmark suite. 
In 2018 IEEE International Symposium on Workload Characterization."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2491956.2462181"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CCGrid.2015.105"},{"key":"e_1_3_2_1_16_1","volume-title":"Fast synchronization-free algorithms for parallel sparse triangular solves with multiple right-hand sides. Concurrency and Computation: Practice and Experience","author":"Liu Weifeng","year":"2017","unstructured":"Weifeng Liu, Ang Li, Jonathan\u00a0D Hogg, Iain\u00a0S Duff, and Brian Vinter. 2017. Fast synchronization-free algorithms for parallel sparse triangular solves with multiple right-hand sides. Concurrency and Computation: Practice and Experience (2017)."},{"key":"e_1_3_2_1_17_1","volume-title":"Efficient Block Algorithms for Parallel Sparse Triangular Solve. In 49th International Conference on Parallel Processing.","author":"Lu Zhengyang","year":"2020","unstructured":"Zhengyang Lu, Yuyao Niu, and Weifeng Liu. 2020. Efficient Block Algorithms for Parallel Sparse Triangular Solve. In 49th International Conference on Parallel Processing."},{"key":"e_1_3_2_1_18_1","volume-title":"Parallel algorithms for solving linear systems with sparse triangular matrices. Computing","author":"Mayer Jan","year":"2009","unstructured":"Jan Mayer. 2009. Parallel algorithms for solving linear systems with sparse triangular matrices. 
Computing (2009)."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3124534"},{"key":"e_1_3_2_1_20_1","unstructured":"Maxim Naumov. 2011. Parallel solution of sparse triangular linear systems in the preconditioned iterative methods on the GPU. (2011)."},{"key":"e_1_3_2_1_21_1","unstructured":"NVIDIA. 2020. NVIDIA DGX. https:\/\/www.nvidia.com\/en-us\/data-center\/dgx-systems\/."},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3126908.3126914"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"crossref","unstructured":"Jongsoo Park, Mikhail Smelyanskiy, Narayanan Sundaram, and Pradeep Dubey. 2014. Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver. In ISC.","DOI":"10.1007\/978-3-319-07518-1_8"},{"key":"e_1_3_2_1_24_1","volume-title":"GPU-Centric Communication on NVIDIA GPU Clusters with InfiniBand: A Case Study with OpenSHMEM. In 2017 IEEE 24th International Conference on High Performance Computing.","author":"Potluri S.","unstructured":"S. Potluri, A. Goswami, D. Rossetti, C.\u00a0J. Newburn, M.\u00a0G. Venkata, and N. Imam. 2017. GPU-Centric Communication on NVIDIA GPU Clusters with InfiniBand: A Case Study with OpenSHMEM. 
In 2017 IEEE 24th International Conference on High Performance Computing."},{"key":"e_1_3_2_1_25_1","volume-title":"Simplifying Multi-GPU Communication with NVSHMEM. In GPU Technology Conference.","author":"Potluri Sreeram","year":"2016","unstructured":"Sreeram Potluri, Nathan Luehr, and Nikolay Sakharnykh. 2016. Simplifying Multi-GPU Communication with NVSHMEM. In GPU Technology Conference."},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"crossref","unstructured":"I.\u00a0Z. Reguly, G.\u00a0R. Mudalige, C. Bertolli, M.\u00a0B. Giles, A. Betts, P.\u00a0H.\u00a0J. Kelly, and D. Radford. 2016. Acceleration of a Full-Scale Industrial CFD Application with OP2. IEEE Transactions on Parallel and Distributed Systems (2016).","DOI":"10.1109\/TPDS.2015.2453972"},{"key":"e_1_3_2_1_27_1","unstructured":"Nikolay Sakharnykh. 2016. Beyond GPU Memory Limits with Unified Memory on Pascal. https:\/\/devblogs.nvidia.com\/beyond-gpu-memory-limits-unified-memory-pascal\/."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3330345.3330357"},{"key":"e_1_3_2_1_29_1","volume-title":"CapelliniSpTRSV: A Thread-Level Synchronization-Free Sparse Triangular Solve on GPUs. In 49th International Conference on Parallel Processing.","author":"Su Jiya","year":"2020","unstructured":"Jiya Su, Feng Zhang, Weifeng Liu, Bingsheng He, Ruofan Wu, Xiaoyong Du, and Rujia Wang. 2020. 
CapelliniSpTRSV: A Thread-Level Synchronization-Free Sparse Triangular Solve on GPUs. In 49th International Conference on Parallel Processing."},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"crossref","unstructured":"Ehsan Totoni, Michael\u00a0T Heath, and Laxmikant\u00a0V Kale. 2014. Structure-adaptive parallel solution of sparse triangular linear systems. Parallel Comput. (2014).","DOI":"10.1016\/j.parco.2014.06.006"},{"key":"e_1_3_2_1_31_1","unstructured":"Richard Vuduc, Shoaib Kamil, Jen Hsu, Rajesh Nishtala, James\u00a0W Demmel, and Katherine\u00a0A Yelick. 2002. Automatic performance tuning and analysis of sparse triangular solve."},{"key":"e_1_3_2_1_32_1","volume-title":"Automatic performance tuning of sparse matrix kernels","author":"Vuduc Richard\u00a0Wilson","unstructured":"Richard\u00a0Wilson Vuduc. 2003. Automatic performance tuning of sparse matrix kernels. 
University of California, Berkeley, Berkeley, CA."},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178487.3178513"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3225058.3225071"},{"key":"e_1_3_2_1_35_1","volume-title":"International Conference on High Performance Computing for Computational Science.","author":"Wolf M","year":"2010","unstructured":"Michael\u00a0M Wolf, Michael\u00a0A Heroux, and Erik\u00a0G Boman. 2010. Factors impacting performance of multithreaded sparse triangular solve. In International Conference on High Performance Computing for Computational Science."},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"crossref","unstructured":"Qi Yu, Bruce Childers, Libo Huang, Cheng Qian, and Zhiying Wang. 2019. A quantitative evaluation of unified memory in GPUs. 
The Journal of Supercomputing (2019).","DOI":"10.1007\/s11227-019-03079-y"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/2429384.2429468"},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178487.3178495"}],"event":{"name":"ICPP 2021: 50th International Conference on Parallel Processing","location":"Lemont IL USA","acronym":"ICPP 2021"},"container-title":["50th International Conference on Parallel Processing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3472456.3472478","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3472456.3472478","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:48:11Z","timestamp":1750193291000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3472456.3472478"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,8,9]]},"references-count":38,"alternative-id":["10.1145\/3472456.3472478","10.1145\/3472456"],"URL":"https:\/\/doi.org\/10.1145\/3472456.3472478","relation":{},"subject":[],"published":{"date-parts":[[2021,8,9]]},"assertion":[{"value":"2021-10-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}