{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:12:13Z","timestamp":1750219933985,"version":"3.41.0"},"reference-count":39,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,3,1]],"date-time":"2023-03-01T00:00:00Z","timestamp":1677628800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Special Project on High Performance Computing under the National Key R& D Program of China","award":["2020YFB0204601"],"award-info":[{"award-number":["2020YFB0204601"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2023,6,30]]},"abstract":"<jats:p>\n            Matrix factorization functions are used in many areas and often play an important role in the overall performance of the applications. In the LAPACK library, matrix factorization functions are implemented with blocked factorization algorithm, shifting most of the workload to the high-performance Level-3 BLAS functions. But the non-blocked part, the panel factorization, becomes the performance bottleneck, especially for small- and medium-size matrices that are the common cases in many real applications. On the new Sunway many-core platform, the performance bottleneck of panel factorization can be alleviated by keeping the panel in the LDM for the panel factorization. Therefore, we propose a new framework for implementing matrix factorization functions on the new Sunway many-core platform, facilitating the in-LDM panel factorization. The framework provides a template class with wrapper functions, which integrates inter-CPE communication for the Level-1 and Level-2 BLAS functions with flexible interfaces and can accommodate different partitioning schemes. 
With the framework, writing panel factorization code with data residing in the LDM space can be done with much higher productivity. We implemented three functions (\n            <jats:italic>dgetrf<\/jats:italic>\n            ,\n            <jats:italic>dgeqrf<\/jats:italic>\n            , and\n            <jats:italic>dpotrf<\/jats:italic>\n            ) based on the framework and compared our work with a\n            <jats:italic>CPE_BLAS<\/jats:italic>\n            version, which uses the original LAPACK implementation linked with optimized BLAS library that runs on the CPE mesh. Using the most favorable partitioning, the panel factorization part achieves speedup of up to 26.3, 19.1, and 18.2 for the three matrix factorization functions. For the whole function, our implementation is based on a carefully tuned recursion framework, and we added specific optimization to some subroutines used in the factorization functions. Overall, we obtained average speedup of 9.76 on\n            <jats:italic>dgetrf<\/jats:italic>\n            , 10.12 on\n            <jats:italic>dgeqrf<\/jats:italic>\n            , and 4.16 on\n            <jats:italic>dpotrf<\/jats:italic>\n            , compared to the\n            <jats:italic>CPE_BLAS<\/jats:italic>\n            version. 
Based on the current template class, our work can be extended to support more categories of linear algebra functions.\n          <\/jats:p>","DOI":"10.1145\/3571856","type":"journal-article","created":{"date-parts":[[2022,11,19]],"date-time":"2022-11-19T10:20:16Z","timestamp":1668853216000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["An Optimized Framework for Matrix Factorization on the New Sunway Many-core Platform"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1795-4498","authenticated-orcid":false,"given":"Wenjing","family":"Ma","sequence":"first","affiliation":[{"name":"Institute of Software, Chinese Academy of Sciences, China and State Key Laboratory of Computer Science, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7344-7493","authenticated-orcid":false,"given":"Fangfang","family":"Liu","sequence":"additional","affiliation":[{"name":"Institute of Software, Chinese Academy of Sciences, China and State Key Laboratory of Computer Science, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2823-7213","authenticated-orcid":false,"given":"Daokun","family":"Chen","sequence":"additional","affiliation":[{"name":"Institute of Software, Chinese Academy of Sciences, China and University of Chinese Academy of Sciences, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8105-147X","authenticated-orcid":false,"given":"Qinglin","family":"Lu","sequence":"additional","affiliation":[{"name":"Institute of Software, Chinese Academy of Sciences, China and University of Chinese Academy of Sciences, Beijing, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4163-6817","authenticated-orcid":false,"given":"Yi","family":"Hu","sequence":"additional","affiliation":[{"name":"Institute of Software, Chinese Academy of Sciences, China and University of Chinese Academy of Sciences, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4827-7442","authenticated-orcid":false,"given":"Hongsen","family":"Wang","sequence":"additional","affiliation":[{"name":"Institute of Software, Chinese Academy of Sciences, China and University of Chinese Academy of Sciences, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1375-6435","authenticated-orcid":false,"given":"Xinhui","family":"Yuan","sequence":"additional","affiliation":[{"name":"National Research Centre of Parallel Computer Engineering and Technology, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,3]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2017.05.250"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jocs.2016.12.009"},{"key":"e_1_3_2_4_2","volume-title":"IEEE High Performance Extreme Computing Conference (HPEC\u201919)","author":"Abdelfattah Ahmad","year":"2019","unstructured":"Ahmad Abdelfattah, Stanimire Tomov, and Jack Dongarra. 2019. Progressive optimization of batched LU factorization on GPUs. In IEEE High Performance Extreme Computing Conference (HPEC\u201919). IEEE, Waltham, MA."},{"key":"e_1_3_2_5_2","first-page":"60","volume-title":"International Conference on Computational Science","author":"Abdelfattah Ahmad","year":"2022","unstructured":"Ahmad Abdelfattah, Stan Tomov, and Jack Dongarra. 2022. Batch QR factorization on GPUs: Design, optimization, and tuning. 
In International Conference on Computational Science. Springer International Publishing, Cham, 60\u201374."},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.90"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1177\/1094342020938421"},{"key":"e_1_3_2_8_2","doi-asserted-by":"crossref","unstructured":"E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. 1999. LAPACK User\u2019s Guide (3rd ed.). https:\/\/netlib.org\/lapack\/lug\/.","DOI":"10.1137\/1.9780898719604"},{"key":"e_1_3_2_9_2","doi-asserted-by":"crossref","DOI":"10.1137\/1.9780898719642","volume-title":"ScaLAPACK Users\u2019 Guide.","author":"Blackford L. S.","year":"1997","unstructured":"L. S. Blackford, J. Choi, A. Cleary, E. D\u2019Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. 1997. ScaLAPACK Users\u2019 Guide. https:\/\/netlib.org\/scalapack\/slug\/."},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2008.10.002"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/2491491.2491492"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCC.2014.30"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1093\/nsr\/nww044"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2012.62"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/2132876.2132885"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3264491"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0377-0427(00)00400-3"},{"key":"e_1_3_2_18_2","doi-asserted-by":"crossref","unstructured":"Jack J. Dongarra and Stanimire Tomov. 2014. Matrix algebra for GPU and multicore architectures (MAGMA) for large petascale systems. Technical Report. 
https:\/\/www.osti.gov\/biblio\/1126489.","DOI":"10.2172\/1126489"},{"key":"e_1_3_2_19_2","doi-asserted-by":"crossref","first-page":"544","DOI":"10.1007\/978-3-319-46079-6_37","volume-title":"International Conference on High Performance Computing","author":"Dorris Joseph","year":"2016","unstructured":"Joseph Dorris, Jakub Kurzak, Piotr Luszczek, Asim YarKhan, and Jack Dongarra. 2016. Task-based Cholesky decomposition on Knights Corner using OpenMP. In International Conference on High Performance Computing. Springer International Publishing, Cham, 544\u2013562."},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1147\/rd.444.0605"},{"key":"e_1_3_2_21_2","doi-asserted-by":"crossref","first-page":"237","DOI":"10.1007\/978-3-319-09967-5_14","volume-title":"International Workshop on Languages and Compilers for Parallel Computing","author":"Garcia Elkin","year":"2014","unstructured":"Elkin Garcia, Jaime Arteaga, Robert Pavel, and Guang R. Gao. 2014. Optimizing the LU factorization for energy efficiency on a many-core architecture. In International Workshop on Languages and Compilers for Parallel Computing. Springer International Publishing, Cham, 237\u2013251."},{"key":"e_1_3_2_22_2","doi-asserted-by":"crossref","unstructured":"Krassimir Georgiev and Jerzy Wasniewski. 2000. Recursive version of LU decomposition. In Revised Papers from the Second International Conference on Numerical Analysis and Its Applications (NAA\u201900). Springer-Verlag Berlin Heidelberg, 325\u2013332.","DOI":"10.1007\/3-540-45262-1_38"},{"key":"e_1_3_2_23_2","volume-title":"International Conference on High Performance Computing","author":"Haidar Azzam","year":"2015","unstructured":"Azzam Haidar, Tingxing Dong, Stanimire Tomov, Piotr Luszczek, and Jack Dongarra. 2015. Framework for batched and GPU-resident factorization algorithms to block householder transformations. In International Conference on High Performance Computing. 
Springer, Frankfurt, Germany."},{"key":"e_1_3_2_24_2","volume-title":"IEEE High Performance Extreme Computing Conference (HPEC\u201916)","author":"Haidar Azzam","year":"2016","unstructured":"Azzam Haidar, Stanimire Tomov, Konstantin Arturov, Murat Guney, Shane Story, and Jack Dongarra. 2016. LU, QR, and Cholesky factorizations: Programming model, performance analysis and optimization techniques for the Intel Knights Landing Xeon Phi. In IEEE High Performance Extreme Computing Conference (HPEC\u201916). IEEE, Waltham, MA."},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPEC.2015.7322444"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/SAAHPC.2011.18"},{"key":"e_1_3_2_27_2","unstructured":"National Supercomputing Center in Wuxi. 2016. xMath User Manual v1.0 (in Chinese). Retrieved from http:\/\/www.nsccwx.cn:1337\/uploads\/595bce0bed1b4537994d927ef6be922d."},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3378176"},{"key":"e_1_3_2_29_2","doi-asserted-by":"crossref","first-page":"21","DOI":"10.1007\/11558958_3","volume-title":"Applied Parallel Computing. State of the Art in Scientific Computing","author":"K\u00e5gstr\u00f6m Bo","year":"2006","unstructured":"Bo K\u00e5gstr\u00f6m. 2006. Management of deep memory hierarchies\u2014Recursive blocked algorithms and hybrid data structures for dense matrix computations. In Applied Parallel Computing. State of the Art in Scientific Computing. Springer Berlin, 21\u201332."},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2012.242"},{"key":"e_1_3_2_31_2","first-page":"28","volume-title":"International Conference on High Performance Computing for Computational Science","author":"Kurzak Jakub","year":"2013","unstructured":"Jakub Kurzak, Piotr Luszczek, Mathieu Faverge, and Jack Dongarra. 2013. Programming the LU factorization for a multicore system with accelerators. 
In International Conference on High Performance Computing for Computational Science. Springer Berlin, 28\u201335."},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.sysarc.2017.06.005"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3487399"},{"key":"e_1_3_2_34_2","unstructured":"NVIDIA. 2017. NVIDIA TESLA V100 GPU ACCELERATOR. Retrieved from https:\/\/images.nvidia.cn\/content\/technologies\/volta\/pdf\/437317-Volta-V100-DS-NV-US-WEB.pdf."},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/3061664"},{"key":"e_1_3_2_36_2","unstructured":"A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary. 2018. HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-memory Computers. Retrieved from http:\/\/www.netlib.org\/benchmark\/hpl\/."},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476174"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2009.12.005"},{"key":"e_1_3_2_39_2","first-page":"813","volume-title":"European Conference on Parallel Processing","author":"Villa Oreste","year":"2013","unstructured":"Oreste Villa, Massimiliano Fatica, Nitin Gawande, and Antonino Tumeo. 2013. Power\/performance trade-offs of small batched LU based solvers on GPUs. In European Conference on Parallel Processing. Springer Berlin, 813\u2013825."},{"key":"e_1_3_2_40_2","volume-title":"International Conference on Computational Science","author":"Yamazaki Ichitaro","year":"2012","unstructured":"Ichitaro Yamazaki, Stanimire Tomov, and Jack Dongarra. 2012. One-sided dense matrix factorizations on a multicore with multiple GPU accelerators. 
In International Conference on Computational Science."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3571856","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3571856","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:48:48Z","timestamp":1750182528000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3571856"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,3]]},"references-count":39,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,6,30]]}},"alternative-id":["10.1145\/3571856"],"URL":"https:\/\/doi.org\/10.1145\/3571856","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2023,3]]},"assertion":[{"value":"2022-07-08","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-11-08","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-03-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}