{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,25]],"date-time":"2026-02-25T00:24:57Z","timestamp":1771979097053,"version":"3.50.1"},"reference-count":38,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2024,11,19]],"date-time":"2024-11-19T00:00:00Z","timestamp":1731974400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Strategic Priority Research Program of Chinese Academy of Sciences","award":["XDB0500102"],"award-info":[{"award-number":["XDB0500102"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2024,12,31]]},"abstract":"<jats:p>\n            Sparse Triangular Solve (SpTRSV) has long been an essential kernel in the field of scientific computing. Due to its low computational intensity and internal data dependencies, SpTRSV is hard to implement and optimize on graphics processing units (GPUs). Based on our experimental observations, existing implementations on GPUs fail to achieve the optimal performance due to their suboptimal parallelism setups and code implementations plus lack of consideration of the irregular data distribution. Moreover, their algorithm design lacks the adaptability to different input matrices, which may involve substantial manual efforts of algorithm redesigning and parameter tuning for performance consistency. In this work, we propose AG-SpTRSV, an automatic framework to optimize SpTRSV on GPUs, which provides high performance on various matrices while eliminating the costs of manual design. AG-SpTRSV abstracts the procedures of optimizing an SpTRSV kernel as a\n            <jats:italic>scheme<\/jats:italic>\n            and constructs a comprehensive optimization space based on it. By defining a unified code template and preparing code variants, AG-SpTRSV enables fine-grained dynamic parallelism and adaptive code optimizations to handle various tasks. Through computation graph transformation and multi-hierarchy heuristic scheduling, AG-SpTRSV generates schemes for task partitioning and mapping, which effectively address the issues of irregular data distribution and internal data dependencies. AG-SpTRSV searches for the best scheme to optimize the target kernel for the specific matrix. A learned lightweight performance model is also introduced to reduce search costs and provide an efficient end-to-end solution. Experimental results with SuiteSparse Matrix Collection on NVIDIA Tesla A100 and RTX 3080 Ti show that AG-SpTRSV outperforms state-of-the-art implementations with geometric average speedups of 2.12x \u223c 3.99x. With the performance model enabled, AG-SpTRSV can provide an efficient end-to-end solution, with preprocessing times ranging from 3.4 to 245 times of the execution time.\n          <\/jats:p>","DOI":"10.1145\/3674911","type":"journal-article","created":{"date-parts":[[2024,6,25]],"date-time":"2024-06-25T11:48:21Z","timestamp":1719316101000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["AG-SpTRSV: An Automatic Framework to Optimize Sparse Triangular Solve on GPUs"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-8500-6173","authenticated-orcid":false,"given":"Zhengding","family":"Hu","sequence":"first","affiliation":[{"name":"Computer Science and Technology, University of Science and Technology of China, Hefei, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5098-1503","authenticated-orcid":false,"given":"Jingwei","family":"Sun","sequence":"additional","affiliation":[{"name":"Computer Science and Technology, University of Science and Technology of China, Hefei, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-5436-2319","authenticated-orcid":false,"given":"Zhongyang","family":"Li","sequence":"additional","affiliation":[{"name":"Computer Science and Technology, University of Science and Technology of China, Hefei, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0794-7681","authenticated-orcid":false,"given":"Guangzhong","family":"Sun","sequence":"additional","affiliation":[{"name":"Computer Science and Technology, University of Science and Technology of China, Hefei, China"}]}],"member":"320","published-online":{"date-parts":[[2024,11,19]]},"reference":[{"key":"e_1_3_2_2_2","first-page":"529","volume-title":"European Conference on Parallel Processing","author":"Ahmad Najeeb","year":"2020","unstructured":"Najeeb Ahmad, Buse Yilmaz, and Didem Unat. 2020. A prediction framework for fast sparse triangular solves. In European Conference on Parallel Processing. Springer, 529\u2013545."},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2021.3074501"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1142\/S0129053389000056"},{"key":"e_1_3_2_5_2","first-page":"578","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et\u00a0al. 2018. \\(\\lbrace\\) TVM \\(\\rbrace\\) : An automated \\(\\lbrace\\) End-to-End \\(\\rbrace\\) optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 578\u2013594."},{"issue":"1","key":"e_1_3_2_6_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2049662.2049663","article-title":"The University of Florida sparse matrix collection","volume":"38","author":"Davis Timothy A.","year":"2011","unstructured":"Timothy A. Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS) 38, 1 (2011), 1\u201325.","journal-title":"ACM Transactions on Mathematical Software (TOMS)"},{"key":"e_1_3_2_7_2","doi-asserted-by":"crossref","DOI":"10.1201\/9781420003154","volume-title":"Electronics, Power Electronics, Optoelectronics, Microwaves, Electromagnetics, and Radar","author":"Dorf Richard C.","year":"2018","unstructured":"Richard C. Dorf. 2018. Electronics, Power Electronics, Optoelectronics, Microwaves, Electromagnetics, and Radar. CRC press."},{"key":"e_1_3_2_8_2","article-title":"AlphaSparse: Generating high performance SpMV codes directly from sparse matrices","author":"Du Zhen","year":"2022","unstructured":"Zhen Du, Jiajia Li, Yinshan Wang, Xueqi Li, Guangming Tan, and Ninghui Sun. 2022. AlphaSparse: Generating high performance SpMV codes directly from sparse matrices. arXiv preprint arXiv:2212.10432 (2022).","journal-title":"arXiv preprint arXiv:2212.10432"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/PDP2018.2018.00034"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2021.07.013"},{"key":"e_1_3_2_11_2","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1109\/SBAC-PAD.2019.00020","volume-title":"2019 31st International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","author":"Dufrechou Ernesto","year":"2019","unstructured":"Ernesto Dufrechou, Pablo Ezzatti, and Enrique S. Quintana-Ort\u00ed. 2019. Automatic selection of sparse triangular linear system solvers on GPUs through machine learning techniques. In 2019 31st International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, 41\u201347."},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.5555\/2039367"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3126908.3126931"},{"key":"e_1_3_2_14_2","first-page":"106","volume-title":"Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing","author":"Li Ruipeng","year":"2020","unstructured":"Ruipeng Li and Chaoyu Zhang. 2020. Efficient parallel implementations of sparse triangular solves for GPU architectures. In Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing. SIAM, 106\u2013117."},{"key":"e_1_3_2_15_2","first-page":"617","volume-title":"Euro-Par 2016: Parallel Processing: 22nd International Conference on Parallel and Distributed Computing, Grenoble, France, August 24-26, 2016, Proceedings 22","author":"Liu Weifeng","year":"2016","unstructured":"Weifeng Liu, Ang Li, Jonathan Hogg, Iain S. Duff, and Brian Vinter. 2016. A synchronization-free algorithm for parallel sparse triangular solves. In Euro-Par 2016: Parallel Processing: 22nd International Conference on Parallel and Distributed Computing, Grenoble, France, August 24-26, 2016, Proceedings 22. Springer, 617\u2013630."},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/2751205.2751209"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2015.06.010"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2015.04.004"},{"key":"e_1_3_2_19_2","volume-title":"Conference on Uncertainty in Artificial Intelligence (UAI)","volume":"20","author":"Low Yucheng","year":"2010","unstructured":"Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. 2010. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), Vol. 20."},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1007\/s42514-023-00151-1"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/3404397.3404413"},{"key":"e_1_3_2_22_2","first-page":"1","volume-title":"2012 19th International Conference on High Performance Computing","author":"Matam Kiran","year":"2012","unstructured":"Kiran Matam, Siva Rama Krishna Bharadwaj Indarapu, and Kishore Kothapalli. 2012. Sparse matrix-matrix multiplication on modern architectures. In 2012 19th International Conference on High Performance Computing. IEEE, 1\u201310."},{"key":"e_1_3_2_23_2","first-page":"1024","volume-title":"Proceedings of the 2nd MIT Conference on Computational Fluid and Solid Mechanics","volume":"2","author":"McInnes L.","year":"2003","unstructured":"L. McInnes, B. Norris, Sanjukta Bhowmick, and P. Raghavan. 2003. Adaptive sparse linear solvers for implicit CFD using Newton-Krylov algorithms. In Proceedings of the 2nd MIT Conference on Computational Fluid and Solid Mechanics, Vol. 2. 1024\u20131028."},{"key":"e_1_3_2_24_2","volume-title":"GPU Technology Conference","author":"Naumov Maxim","year":"2010","unstructured":"Maxim Naumov, L. Chien, Philippe Vandermersch, and Ujval Kapasi. 2010. Cusparse library. In GPU Technology Conference."},{"key":"e_1_3_2_25_2","first-page":"124","volume-title":"International Supercomputing Conference","author":"Park Jongsoo","year":"2014","unstructured":"Jongsoo Park, Mikhail Smelyanskiy, Narayanan Sundaram, and Pradeep Dubey. 2014. Sparsifying synchronization for high-performance shared-memory sparse triangular solver. In International Supercomputing Conference. Springer, 124\u2013140."},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/2499370.2462176"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.5555\/829576"},{"key":"e_1_3_2_28_2","doi-asserted-by":"crossref","first-page":"400","DOI":"10.1145\/1835698.1835791","volume-title":"Proceedings of the 29th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing","author":"Sala Alessandra","year":"2010","unstructured":"Alessandra Sala, Haitao Zheng, Ben Y. Zhao, Sabrina Gaito, and Gian Paolo Rossi. 2010. Brief announcement: Revisiting the power-law degree distribution for social graph analysis. In Proceedings of the 29th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing. 400\u2013401."},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.5555\/3037529.3037535"},{"key":"e_1_3_2_30_2","article-title":"Vectorizing the conjugate gradient method","author":"Schreiber Robert","year":"1982","unstructured":"Robert Schreiber and W. Tang. 1982. Vectorizing the conjugate gradient method. Symposium on CYBER 205 Applications (1982).","journal-title":"Symposium on CYBER 205 Applications"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3404397.3404400"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPPW.2012.23"},{"key":"e_1_3_2_33_2","first-page":"480","volume-title":"SC\u201916: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Venkat Anand","year":"2016","unstructured":"Anand Venkat, Mahdi Soltan Mohammadi, Jongsoo Park, Hongbo Rong, Rajkishore Barik, Michelle Mills Strout, and Mary Hall. 2016. Automating wavefront parallelization for sparse matrix computations. In SC\u201916: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 480\u2013491."},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/2925426.2926291"},{"key":"e_1_3_2_35_2","doi-asserted-by":"crossref","first-page":"338","DOI":"10.1145\/3178487.3178513","volume-title":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","author":"Wang Xinliang","year":"2018","unstructured":"Xinliang Wang, Weifeng Liu, Wei Xue, and Li Wu. 2018. swSpTRSV: A fast sparse triangular solve with sparse level tile layout on sunway architectures. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 338\u2013353."},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3404397.3404428"},{"key":"e_1_3_2_37_2","article-title":"Graph transformation and specialized code generation for sparse triangular solve (SpTRSV)","author":"Yilmaz Buse","year":"2021","unstructured":"Buse Yilmaz. 2021. Graph transformation and specialized code generation for sparse triangular solve (SpTRSV). arXiv preprint arXiv:2103.11445 (2021).","journal-title":"arXiv preprint arXiv:2103.11445"},{"issue":"24","key":"e_1_3_2_38_2","doi-asserted-by":"crossref","first-page":"e7761","DOI":"10.1002\/cpe.7761","article-title":"A novel graph transformation strategy for optimizing SpTRSV on CPUs","volume":"35","author":"Y\u0131lmaz Buse","year":"2023","unstructured":"Buse Y\u0131lmaz. 2023. A novel graph transformation strategy for optimizing SpTRSV on CPUs. Concurrency and Computation: Practice and Experience 35, 24 (2023), e7761.","journal-title":"Concurrency and Computation: Practice and Experience"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2021.3066635"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3674911","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3674911","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:05:56Z","timestamp":1750291556000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3674911"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,19]]},"references-count":38,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,12,31]]}},"alternative-id":["10.1145\/3674911"],"URL":"https:\/\/doi.org\/10.1145\/3674911","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,11,19]]},"assertion":[{"value":"2024-01-18","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-06-05","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-11-19","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}