{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,12]],"date-time":"2026-03-12T12:29:50Z","timestamp":1773318590927,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":225,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,11,16]]},"DOI":"10.1145\/3712285.3759895","type":"proceedings-article","created":{"date-parts":[[2025,11,12]],"date-time":"2025-11-12T16:04:47Z","timestamp":1762963487000},"page":"1572-1589","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["KAMI: Communication-Avoiding General Matrix Multiplication within a Single GPU"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-9879-5951","authenticated-orcid":false,"given":"Hemeng","family":"Wang","sequence":"first","affiliation":[{"name":"SSSLab, Dept. of CST, China University of Petroleum-Beijing, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-2799-2201","authenticated-orcid":false,"given":"Yang","family":"Du","sequence":"additional","affiliation":[{"name":"SSSLab, Dept. of CST, China University of Petroleum-Beijing, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-0175-2143","authenticated-orcid":false,"given":"Sidu","family":"Li","sequence":"additional","affiliation":[{"name":"SSSLab, Dept. of CST, China University of Petroleum-Beijing, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-1950-6481","authenticated-orcid":false,"given":"Xiaowen","family":"Tian","sequence":"additional","affiliation":[{"name":"SSSLab, Dept. of CST, China University of Petroleum-Beijing, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2927-362X","authenticated-orcid":false,"given":"Qingxiao","family":"Sun","sequence":"additional","affiliation":[{"name":"SSSLab, Dept. 
of CST, China University of Petroleum-Beijing, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2150-5759","authenticated-orcid":false,"given":"Weifeng","family":"Liu","sequence":"additional","affiliation":[{"name":"SSSLab, Dept. of CST, China University of Petroleum-Beijing, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,11,15]]},"reference":[{"key":"e_1_3_3_2_2_2","doi-asserted-by":"crossref","unstructured":"Ahmad Abdelfattah Hartwig Anzt Erik\u00a0G Boman Erin Carson Terry Cojean Jack Dongarra Alyson Fox Mark Gates Nicholas\u00a0J Higham Xiaoye\u00a0S Li et\u00a0al. 2021. A survey of numerical linear algebra methods utilizing mixed-precision arithmetic. The International Journal of High Performance Computing Applications 35 4 (2021) 344\u2013369.","DOI":"10.1177\/10943420211003313"},{"key":"e_1_3_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-58667-0_5"},{"key":"e_1_3_3_2_4_2","doi-asserted-by":"crossref","unstructured":"Kadir Akbudak and Cevdet Aykanat. 2017. Exploiting locality in sparse matrix-matrix multiplication on many-core architectures. IEEE Transactions on Parallel and Distributed Systems 28 8 (2017) 2258\u20132271.","DOI":"10.1109\/TPDS.2017.2656893"},{"key":"e_1_3_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2014.125"},{"key":"e_1_3_3_2_6_2","doi-asserted-by":"crossref","unstructured":"Hussam Al\u00a0Daas Grey Ballard Laura Grigori Suraj Kumar and Kathryn Rouse. 2024. Communication lower bounds and optimal algorithms for multiple tensor-times-matrix computation. SIAM J. Matrix Anal. Appl. 45 1 (2024) 450\u2013477.","DOI":"10.1137\/22M1510443"},{"key":"e_1_3_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/3558481.3591072"},{"key":"e_1_3_3_2_8_2","doi-asserted-by":"crossref","unstructured":"Jos\u00e9\u00a0I Aliaga Hartwig Anzt Enrique\u00a0S Quintana-Ort\u00ed and Andr\u00e9s\u00a0E Tom\u00e1s. 2023. 
Sparse matrix-vector and matrix-multivector products for the truncated SVD on graphics processors. Concurrency and Computation: Practice and Experience 35 28 (2023) e7871.","DOI":"10.1002\/cpe.7871"},{"key":"e_1_3_3_2_9_2","doi-asserted-by":"crossref","unstructured":"Patrick Amestoy Olivier Boiteau Alfredo Buttari Matthieu Gerest Fabienne J\u00e9z\u00e9quel Jean-Yves L\u2019excellent and Theo Mary. 2024. Communication avoiding block low-rank parallel multifrontal triangular solve with many right-hand sides. SIAM J. Matrix Anal. Appl. 45 1 (2024) 148\u2013166.","DOI":"10.1137\/23M1568600"},{"key":"e_1_3_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.15"},{"key":"e_1_3_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2017.18"},{"key":"e_1_3_3_2_12_2","doi-asserted-by":"crossref","unstructured":"Hartwig Anzt Blake Haugen Jakub Kurzak Piotr Luszczek and Jack Dongarra. 2015. Experiences in autotuning matrix multiplication for energy minimization on GPUs. Concurrency and Computation: Practice and Experience 27 17 (2015) 5096\u20135113.","DOI":"10.1002\/cpe.3516"},{"key":"e_1_3_3_2_13_2","doi-asserted-by":"crossref","unstructured":"Hartwig Anzt Axel Huebl and Xiaoye\u00a0S. Li. 2024. Then and Now: Improving Software Portability Productivity and 100\u00d7 Performance. Computing in Science & Engineering 26 1 (2024) 61\u201370.","DOI":"10.1109\/MCSE.2024.3387302"},{"key":"e_1_3_3_2_14_2","doi-asserted-by":"crossref","unstructured":"Mochamad Asri Dhairya Malhotra Jiajun Wang George Biros Lizy\u00a0K. John and Andreas Gerstlauer. 2021. Hardware Accelerator Integration Tradeoffs for High-Performance Computing: A Case Study of GEMM Acceleration in N-Body Methods. IEEE Transactions on Parallel and Distributed Systems 32 8 (2021) 2035\u20132048.","DOI":"10.1109\/TPDS.2021.3056045"},{"key":"e_1_3_3_2_15_2","doi-asserted-by":"crossref","unstructured":"Marc Baboulin Simplice Donfack Jack Dongarra Laura Grigori Adrien R\u00e9my and Stanimire Tomov. 
2012. A class of communication-avoiding algorithms for solving general dense linear systems on CPU\/GPU parallel machines. Procedia Computer Science 9 (2012) 17\u201326.","DOI":"10.1016\/j.procs.2012.04.003"},{"key":"e_1_3_3_2_16_2","doi-asserted-by":"crossref","unstructured":"Grey Ballard Dulceneia Becker James Demmel Jack Dongarra Alex Druinsky Inon Peled Oded Schwartz Sivan Toledo and Ichitaro Yamazaki. 2014. Communication-avoiding symmetric-indefinite factorization. SIAM J. Matrix Anal. Appl. 35 4 (2014) 1364\u20131406.","DOI":"10.1137\/130929060"},{"key":"e_1_3_3_2_17_2","doi-asserted-by":"crossref","unstructured":"Grey Ballard Austin\u00a0R Benson Alex Druinsky Benjamin Lipshitz and Oded Schwartz. 2016. Improving the numerical stability of fast matrix multiplication. SIAM J. Matrix Anal. Appl. 37 4 (2016) 1382\u20131418.","DOI":"10.1137\/15M1032168"},{"key":"e_1_3_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.21236\/ADA580140"},{"key":"e_1_3_3_2_19_2","doi-asserted-by":"crossref","unstructured":"Grey Ballard Erin Carson James Demmel Mark Hoemmen Nicholas Knight and Oded Schwartz. 2014. Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numerica 23 (2014) 1\u2013155.","DOI":"10.1017\/S0962492914000038"},{"key":"e_1_3_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/3210377.3210415"},{"key":"e_1_3_3_2_21_2","volume-title":"ACM Symposium on Parallelism in Algorithms and Architectures (SPAA)","author":"Ballard Grey","year":"2012","unstructured":"Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz. 2012. Communication-optimal parallel algorithm for strassen\u2019s matrix multiplication. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA)."},{"key":"e_1_3_3_2_22_2","doi-asserted-by":"crossref","unstructured":"Grey Ballard James Demmel Olga Holtz and Oded Schwartz. 2010. Communication-optimal Parallel and Sequential Cholesky Decomposition. 
SIAM Journal on Scientific Computing 32 6 (2010) 3495\u20133523.","DOI":"10.1137\/090760969"},{"key":"e_1_3_3_2_23_2","doi-asserted-by":"crossref","unstructured":"Grey Ballard James Demmel Olga Holtz and Oded Schwartz. 2011. Minimizing communication in numerical linear algebra. SIAM J. Matrix Anal. Appl. 32 3 (2011) 866\u2013901.","DOI":"10.1137\/090769156"},{"key":"e_1_3_3_2_24_2","doi-asserted-by":"crossref","unstructured":"Grey Ballard James Demmel Olga Holtz and Oded Schwartz. 2012. Graph expansion and communication costs of fast matrix multiplication. J. ACM 59 6 (2012) 1\u201323.","DOI":"10.1145\/2395116.2395121"},{"key":"e_1_3_3_2_25_2","doi-asserted-by":"crossref","unstructured":"Grey Ballard James Demmel Olga Holtz and Oded Schwartz. 2014. Communication costs of Strassen\u2019s matrix multiplication. Commun. ACM 57 2 (2014) 107\u2013114.","DOI":"10.1145\/2556647.2556660"},{"key":"e_1_3_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/2145816.2145822"},{"key":"e_1_3_3_2_27_2","doi-asserted-by":"crossref","unstructured":"Grey Ballard James Demmel and Nicholas Knight. 2015. Avoiding communication in successive band reduction. ACM Transactions on Parallel Computing 1 2 (2015) 1\u201337.","DOI":"10.1145\/2686877"},{"key":"e_1_3_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.21236\/ADA580196"},{"key":"e_1_3_3_2_29_2","doi-asserted-by":"crossref","unstructured":"Grey Ballard Christopher Siefert and Jonathan Hu. 2016. Reducing communication costs for sparse matrix multiplication within algebraic multigrid. SIAM Journal on Scientific Computing 38 3 (2016) C203\u2013C231.","DOI":"10.1137\/15M1028807"},{"key":"e_1_3_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/HiPC.2013.6799131"},{"key":"e_1_3_3_2_31_2","doi-asserted-by":"crossref","unstructured":"O. Beaumont V. Boudet F. Rastello and Y. Robert. 2001. Matrix multiplication on heterogeneous platforms. 
IEEE Transactions on Parallel and Distributed Systems 12 10 (2001) 1033\u20131051.","DOI":"10.1109\/71.963416"},{"key":"e_1_3_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-43659-3_13"},{"key":"e_1_3_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/2600212.2600223"},{"key":"e_1_3_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/2688500.2688513"},{"key":"e_1_3_3_2_35_2","unstructured":"J\u00e9r\u00e9my Berthomieu Stef Graillat Dimitri Lesnoff and Theo Mary. 2025. Multiword matrix multiplication over large finite fields in floating-point arithmetic. HAL preprint hal-04917201 (2025)."},{"key":"e_1_3_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS47924.2020.00118"},{"key":"e_1_3_3_2_37_2","doi-asserted-by":"crossref","unstructured":"Paolo Bientinesi John\u00a0A Gunnels Margaret\u00a0E Myers Enrique\u00a0S Quintana-Ort\u00ed and Robert A van\u00a0de Geijn. 2005. The science of deriving dense linear algebra algorithms. ACM Trans. Math. Software 31 1 (2005) 1\u201326.","DOI":"10.1145\/1055531.1055532"},{"key":"e_1_3_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/263580.263662"},{"key":"e_1_3_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611976229.11"},{"key":"e_1_3_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611975215.3"},{"key":"e_1_3_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640427"},{"key":"e_1_3_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3650200.3656623"},{"key":"e_1_3_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3650200.3656632"},{"key":"e_1_3_3_2_44_2","volume-title":"ACM Symposium on Parallelism in Algorithms and Architectures (SPAA)","author":"Bulu\u00e7 Ayd\u0131n","year":"2009","unstructured":"Ayd\u0131n Bulu\u00e7, Jeremy\u00a0T Fineman, Matteo Frigo, John\u00a0R Gilbert, and Charles\u00a0E Leiserson. 2009. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. 
In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA)."},{"key":"e_1_3_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2004.40"},{"key":"e_1_3_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-61763-8_3"},{"key":"e_1_3_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS49936.2021.00017"},{"key":"e_1_3_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2016.114"},{"key":"e_1_3_3_2_49_2","doi-asserted-by":"crossref","unstructured":"Erin Carson Nicholas Knight and James Demmel. 2013. Avoiding communication in nonsymmetric Lanczos-based Krylov subspace methods. SIAM Journal on Scientific Computing 35 5 (2013) S42\u2013S61.","DOI":"10.1137\/120881191"},{"key":"e_1_3_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581784.3607066"},{"key":"e_1_3_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1145\/1345206.1345227"},{"key":"e_1_3_3_2_52_2","unstructured":"Lorenzo Chelini Henrik Barthels Paolo Bientinesi Marcin Copik Tobias Grosser and Daniele\u00a0G Spampinato. 2022. MOM: Matrix Operations in MLIR. arXiv preprint arXiv:2208.10391 (2022)."},{"key":"e_1_3_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2016.56"},{"key":"e_1_3_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1145\/3330345.3330355"},{"key":"e_1_3_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41404.2022.00055"},{"key":"e_1_3_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/3605573.3605611"},{"key":"e_1_3_3_2_57_2","doi-asserted-by":"crossref","unstructured":"Jaeyoung Choi James Demmel Inderjit Dhillon Jack Dongarra Susan Ostrouchov Antoine Petitet Ken Stanley David Walker and R\u00a0Clinton Whaley. 1996. ScaLAPACK: A portable linear algebra library for distributed memory computers\u2014Design issues and performance. 
Computer Physics Communications 97 1-2 (1996) 1\u201315.","DOI":"10.1016\/0010-4655(96)00017-3"},{"key":"e_1_3_3_2_58_2","doi-asserted-by":"crossref","unstructured":"Jack Choquette. 2023. NVIDIA Hopper H100 GPU: Scaling Performance. IEEE Micro 43 3 (2023) 9\u201317.","DOI":"10.1109\/MM.2023.3256796"},{"key":"e_1_3_3_2_59_2","doi-asserted-by":"crossref","unstructured":"R. Clint Whaley Antoine Petitet and Jack\u00a0J. Dongarra. 2001. Automated empirical optimizations of software and the ATLAS project. Parallel Comput. 27 1 (2001) 3\u201335.","DOI":"10.1016\/S0167-8191(00)00087-9"},{"key":"e_1_3_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1145\/2304576.2304604"},{"key":"e_1_3_3_2_61_2","doi-asserted-by":"crossref","unstructured":"Swapnil Das James Demmel Kimon Fountoulakis Laura Grigori Michael\u00a0W Mahoney and Shenghao Yang. 2021. Parallel and communication avoiding least angle regression. SIAM Journal on Scientific Computing 43 2 (2021) C154\u2013C176.","DOI":"10.1137\/19M1305720"},{"key":"e_1_3_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/3192366.3192404"},{"key":"e_1_3_3_2_63_2","doi-asserted-by":"crossref","unstructured":"Gunduz\u00a0Vehbi Demirci and Cevdet Aykanat. 2020. Cartesian partitioning models for 2d and 3d parallel spgemm algorithms. IEEE Transactions on Parallel and Distributed Systems 31 12 (2020) 2763\u20132775.","DOI":"10.1109\/TPDS.2020.3000708"},{"key":"e_1_3_3_2_64_2","doi-asserted-by":"crossref","unstructured":"James Demmel. 1991. LAPACK: A portable linear algebra library for high-performance computers. 
Concurrency: Practice and Experience 3 6 (1991) 655\u2013666.","DOI":"10.1002\/cpe.4330030610"},{"key":"e_1_3_3_2_65_2","doi-asserted-by":"publisher","DOI":"10.1109\/HOTCHIPS.2011.7477498"},{"key":"e_1_3_3_2_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.Companion.2012.351"},{"key":"e_1_3_3_2_67_2","doi-asserted-by":"crossref","unstructured":"Jim Demmel Jack Dongarra Victor Eijkhout Erika Fuentes Antoine Petitet Rich Vuduc R\u00a0Clint Whaley and Katherine Yelick. 2005. Self-adapting linear algebra algorithms and software. Proc. IEEE 93 2 (2005) 293\u2013312.","DOI":"10.1109\/JPROC.2004.840848"},{"key":"e_1_3_3_2_68_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2013.80"},{"key":"e_1_3_3_2_69_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2013.32"},{"key":"e_1_3_3_2_70_2","doi-asserted-by":"crossref","unstructured":"James Demmel Laura Grigori Ming Gu and Hua Xiang. 2015. Communication avoiding rank revealing QR factorization with column pivoting. SIAM J. Matrix Anal. Appl. 36 1 (2015) 55\u201389.","DOI":"10.1137\/13092157X"},{"key":"e_1_3_3_2_71_2","doi-asserted-by":"crossref","unstructured":"James Demmel Laura Grigori Mark Hoemmen and Julien Langou. 2012. Communication-optimal parallel and sequential QR and LU factorizations. SIAM Journal on Scientific Computing 34 1 (2012) A206\u2013A239.","DOI":"10.1137\/080731992"},{"key":"e_1_3_3_2_72_2","doi-asserted-by":"crossref","unstructured":"James Demmel and Nicholas\u00a0J Higham. 1992. Stability of block algorithms with fast level-3 BLAS. ACM Trans. Math. Software 18 3 (1992) 274\u2013291.","DOI":"10.1145\/131766.131769"},{"key":"e_1_3_3_2_73_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2008.4536305"},{"key":"e_1_3_3_2_74_2","doi-asserted-by":"crossref","unstructured":"Aditya Devarakonda Kimon Fountoulakis James Demmel and Michael\u00a0W Mahoney. 2019. Avoiding communication in primal and dual block coordinate descent methods. 
SIAM Journal on Scientific Computing 41 1 (2019) C1\u2013C27.","DOI":"10.1137\/17M1134433"},{"key":"e_1_3_3_2_75_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2010.5470348"},{"key":"e_1_3_3_2_76_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-32820-6_55"},{"key":"e_1_3_3_2_77_2","doi-asserted-by":"crossref","unstructured":"Jack Dongarra Sven Hammarling Nicholas\u00a0J Higham Samuel\u00a0D Relton Pedro Valero-Lara and Mawussi Zounon. 2017. The design and performance of batched BLAS on modern high-performance computing systems. Procedia Computer Science 108 (2017) 495\u2013504.","DOI":"10.1016\/j.procs.2017.05.138"},{"key":"e_1_3_3_2_78_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2013.108"},{"key":"e_1_3_3_2_79_2","doi-asserted-by":"publisher","DOI":"10.1145\/3472456.3472517"},{"key":"e_1_3_3_2_80_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476157"},{"key":"e_1_3_3_2_81_2","doi-asserted-by":"crossref","unstructured":"Oliver Fortmeier H\u00a0Martin B\u00fccker BO\u00a0Fagginger Auer and Rob\u00a0H Bisseling. 2013. A new metric enabling an exact hypergraph model for the communication volume in distributed-memory parallel applications. Parallel Comput. 39 8 (2013) 319\u2013335.","DOI":"10.1016\/j.parco.2013.05.003"},{"key":"e_1_3_3_2_82_2","doi-asserted-by":"publisher","DOI":"10.1145\/263764.263789"},{"key":"e_1_3_3_2_83_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581784.3607050"},{"key":"e_1_3_3_2_84_2","doi-asserted-by":"publisher","DOI":"10.1145\/3295500.3356154"},{"key":"e_1_3_3_2_85_2","doi-asserted-by":"publisher","DOI":"10.1145\/3295500.3356223"},{"key":"e_1_3_3_2_86_2","doi-asserted-by":"publisher","DOI":"10.5555\/2388996.2389132"},{"key":"e_1_3_3_2_87_2","doi-asserted-by":"publisher","DOI":"10.1145\/3210377.3210394"},{"key":"e_1_3_3_2_88_2","doi-asserted-by":"crossref","unstructured":"John\u00a0R. Gilbert Cleve Moler and Robert Schreiber. 1992. Sparse Matrices in MATLAB: Design and Implementation. SIAM J. 
Matrix Anal. Appl. 13 1 (1992) 333\u2013356.","DOI":"10.1137\/0613024"},{"key":"e_1_3_3_2_89_2","doi-asserted-by":"crossref","unstructured":"Kazushige Goto and Robert A van\u00a0de Geijn. 2008. Anatomy of high-performance matrix multiplication. ACM Trans. Math. Software 34 3 (2008) 1\u201325.","DOI":"10.1145\/1356052.1356053"},{"key":"e_1_3_3_2_90_2","doi-asserted-by":"crossref","unstructured":"Kazushige Goto and Robert Van De\u00a0Geijn. 2008. High-performance implementation of the level-3 BLAS. ACM Trans. Math. Software 35 1 (2008) 1\u201314.","DOI":"10.1145\/1377603.1377607"},{"key":"e_1_3_3_2_91_2","doi-asserted-by":"crossref","unstructured":"Laura Grigori Sebastien Cayrols and James\u00a0W Demmel. 2018. Low rank approximation of a sparse matrix based on LU factorization with column and row tournament pivoting. SIAM Journal on Scientific Computing 40 2 (2018) C181\u2013C209.","DOI":"10.1137\/16M1074527"},{"key":"e_1_3_3_2_92_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2008.5214287"},{"key":"e_1_3_3_2_93_2","doi-asserted-by":"crossref","unstructured":"Laura Grigori James Demmel and Hua Xiang. 2011. CALU: a communication optimal LU factorization algorithm. SIAM J. Matrix Anal. Appl. 32 4 (2011) 1317\u20131350.","DOI":"10.1137\/100788926"},{"key":"e_1_3_3_2_94_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-07518-1_5"},{"key":"e_1_3_3_2_95_2","doi-asserted-by":"crossref","unstructured":"Laura Grigori Bernard Philippe Ahmed\u00a0H. Sameh Damien Tromeur-Dervout and Mari\u00e1n Vajtersic. 2008. Parallel matrix algorithms and applications. Parallel Comput. 
34 6-8 (2008) 293\u2013295.","DOI":"10.1016\/j.parco.2008.05.001"},{"key":"e_1_3_3_2_96_2","doi-asserted-by":"publisher","DOI":"10.5555\/3433701.3433722"},{"key":"e_1_3_3_2_97_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2018.00050"},{"key":"e_1_3_3_2_98_2","doi-asserted-by":"publisher","DOI":"10.1109\/SUPERC.1990.129999"},{"key":"e_1_3_3_2_99_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2016.83"},{"key":"e_1_3_3_2_100_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41406.2024.00053"},{"key":"e_1_3_3_2_101_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476177"},{"key":"e_1_3_3_2_102_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00076"},{"key":"e_1_3_3_2_103_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41404.2022.00033"},{"key":"e_1_3_3_2_104_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS49936.2021.00018"},{"key":"e_1_3_3_2_105_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2019.00020"},{"key":"e_1_3_3_2_106_2","doi-asserted-by":"crossref","unstructured":"Dror Irony Sivan Toledo and Alexander Tiskin. 2004. Communication lower bounds for distributed-memory matrix multiplication. J. Parallel and Distrib. Comput. 64 9 (2004) 1017\u20131026.","DOI":"10.1016\/j.jpdc.2004.03.021"},{"key":"e_1_3_3_2_107_2","doi-asserted-by":"publisher","DOI":"10.1145\/3627535.3638489"},{"key":"e_1_3_3_2_108_2","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2005.10"},{"key":"e_1_3_3_2_109_2","doi-asserted-by":"publisher","DOI":"10.1145\/3225058.3225127"},{"key":"e_1_3_3_2_110_2","doi-asserted-by":"crossref","unstructured":"Enver Kayaaslan Cevdet Aykanat and Bora U\u00e7ar. 2018. 1.5 D parallel sparse matrix-vector multiply. SIAM Journal on Scientific Computing 40 1 (2018) C25\u2013C46.","DOI":"10.1137\/16M1105591"},{"key":"e_1_3_3_2_111_2","doi-asserted-by":"crossref","unstructured":"Amal Khabou James Demmel Laura Grigori and Ming Gu. 2013. 
LU factorization with panel rank revealing pivoting and its communication avoiding version. SIAM J. Matrix Anal. Appl. 34 3 (2013) 1401\u20131429.","DOI":"10.1137\/120863691"},{"key":"e_1_3_3_2_112_2","doi-asserted-by":"publisher","DOI":"10.1145\/3126908.3126941"},{"key":"e_1_3_3_2_113_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2016.117"},{"key":"e_1_3_3_2_114_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2014.35"},{"key":"e_1_3_3_2_115_2","doi-asserted-by":"crossref","unstructured":"Thomas Koopman and Rob\u00a0H Bisseling. 2023. Minimizing communication in the multidimensional FFT. SIAM Journal on Scientific Computing 45 6 (2023) C330\u2013C347.","DOI":"10.1137\/22M1487242"},{"key":"e_1_3_3_2_116_2","doi-asserted-by":"publisher","DOI":"10.1145\/3337821.3337921"},{"key":"e_1_3_3_2_117_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476166"},{"key":"e_1_3_3_2_118_2","doi-asserted-by":"crossref","unstructured":"Jakub Kurzak Hartwig Anzt Mark Gates and Jack Dongarra. 2016. Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs. IEEE Transactions on Parallel and Distributed Systems 27 7 (2016) 2036\u20132048.","DOI":"10.1109\/TPDS.2015.2481890"},{"key":"e_1_3_3_2_119_2","doi-asserted-by":"crossref","unstructured":"Jakub Kurzak Stanimire Tomov and Jack Dongarra. 2012. Autotuning GEMM Kernels for the Fermi GPU. 
IEEE Transactions on Parallel and Distributed Systems 23 11 (2012) 2045\u20132057.","DOI":"10.1109\/TPDS.2011.311"},{"key":"e_1_3_3_2_120_2","doi-asserted-by":"publisher","DOI":"10.1145\/3295500.3356181"},{"key":"e_1_3_3_2_121_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-57675-2_39"},{"key":"e_1_3_3_2_122_2","doi-asserted-by":"publisher","DOI":"10.1145\/582034.582089"},{"key":"e_1_3_3_2_123_2","doi-asserted-by":"publisher","DOI":"10.1145\/3126908.3126931"},{"key":"e_1_3_3_2_124_2","doi-asserted-by":"publisher","DOI":"10.1145\/2807591.2807671"},{"key":"e_1_3_3_2_125_2","doi-asserted-by":"publisher","DOI":"10.1145\/2304576.2304626"},{"key":"e_1_3_3_2_126_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2018.00022"},{"key":"e_1_3_3_2_127_2","doi-asserted-by":"publisher","DOI":"10.5555\/3571885.3571934"},{"key":"e_1_3_3_2_128_2","doi-asserted-by":"publisher","DOI":"10.1145\/3293883.3295734"},{"key":"e_1_3_3_2_129_2","doi-asserted-by":"crossref","unstructured":"Xiaoye\u00a0S Li James Demmel David\u00a0H Bailey Greg Henry Yozo Hida Jimmy Iskandar William Kahan Suh\u00a0Y Kang Anil Kapur Michael\u00a0C Martin et\u00a0al. 2002. Design implementation and testing of extended and mixed precision BLAS. ACM Trans. Math. Software 28 2 (2002) 152\u2013205.","DOI":"10.1145\/567806.567808"},{"key":"e_1_3_3_2_130_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2012.33"},{"key":"e_1_3_3_2_131_2","doi-asserted-by":"publisher","DOI":"10.1145\/3178487.3178529"},{"key":"e_1_3_3_2_132_2","doi-asserted-by":"crossref","unstructured":"Weifeng Liu and Brian Vinter. 2015. A framework for general sparse matrix\u2013matrix multiplication on GPUs and heterogeneous processors. J. Parallel and Distrib. Comput. 
85 (2015) 47\u201361.","DOI":"10.1016\/j.jpdc.2015.06.010"},{"key":"e_1_3_3_2_133_2","volume-title":"International Conference on Parallel Processing (ICPP)","author":"L\u00f3pez Francisco","year":"2023","unstructured":"Francisco L\u00f3pez, Lars Karlsson, and Paolo Bientinesi. 2023. FLOPs as a Discriminant for Dense Linear Algebra Algorithms. In International Conference on Parallel Processing (ICPP)."},{"key":"e_1_3_3_2_134_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476225"},{"key":"e_1_3_3_2_135_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581784.3607051"},{"key":"e_1_3_3_2_136_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41406.2024.00058"},{"key":"e_1_3_3_2_137_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS57955.2024.00064"},{"key":"e_1_3_3_2_138_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-74466-5_79"},{"key":"e_1_3_3_2_139_2","doi-asserted-by":"publisher","DOI":"10.1145\/2807591.2807613"},{"key":"e_1_3_3_2_140_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2018.00021"},{"key":"e_1_3_3_2_141_2","doi-asserted-by":"publisher","DOI":"10.1109\/HiPC56025.2022.00042"},{"key":"e_1_3_3_2_142_2","doi-asserted-by":"publisher","DOI":"10.1145\/1654059.1654096"},{"key":"e_1_3_3_2_143_2","volume-title":"ACM Symposium on Parallelism in Algorithms and Architectures (SPAA)","author":"Moran Yoav","year":"2023","unstructured":"Yoav Moran and Oded Schwartz. 2023. Multiplying 2 \u00d7 2 Sub-Blocks Using 4 Multiplications. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA)."},{"key":"e_1_3_3_2_144_2","volume-title":"A computer oriented geodetic data base and a new technique in file sequencing","author":"Morton Guy\u00a0M","year":"1966","unstructured":"Guy\u00a0M Morton. 1966. A computer oriented geodetic data base and a new technique in file sequencing. 
International Business Machines Company."},{"key":"e_1_3_3_2_145_2","doi-asserted-by":"publisher","DOI":"10.1145\/3673038.3673152"},{"key":"e_1_3_3_2_146_2","volume-title":"International Conference on Parallel Processing and Applied Mathematics (PPAM)","author":"Mukunoki Daichi","year":"2019","unstructured":"Daichi Mukunoki, Takeshi Ogita, and Katsuhisa Ozaki. 2019. Reproducible BLAS routines with tunable accuracy using ozaki scheme for many-core architectures. In International Conference on Parallel Processing and Applied Mathematics (PPAM)."},{"key":"e_1_3_3_2_147_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-50743-5_12"},{"key":"e_1_3_3_2_148_2","doi-asserted-by":"publisher","DOI":"10.1145\/3472456.3472493"},{"key":"e_1_3_3_2_149_2","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063392"},{"key":"e_1_3_3_2_150_2","doi-asserted-by":"crossref","unstructured":"Rajib Nath Stanimire Tomov and Jack Dongarra. 2010. An Improved Magma Gemm For Fermi Graphics Processing Units. The International Journal of High Performance Computing Applications 24 4 (2010) 511\u2013515.","DOI":"10.1177\/1094342010385729"},{"key":"e_1_3_3_2_151_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2019.00058"},{"key":"e_1_3_3_2_152_2","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611977714.12"},{"key":"e_1_3_3_2_153_2","doi-asserted-by":"publisher","DOI":"10.1145\/3626183.3659961"},{"key":"e_1_3_3_2_154_2","doi-asserted-by":"publisher","DOI":"10.1145\/3710848.3710859"},{"key":"e_1_3_3_2_155_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503221.3508431"},{"key":"e_1_3_3_2_156_2","unstructured":"NVIDIA. 2025. cuBLAS: Basic Linear Algebra on NVIDIA GPUs. Retrieved April 7 2025 from https:\/\/developer.nvidia.com\/cublas"},{"key":"e_1_3_3_2_157_2","unstructured":"NVIDIA. 2025. cuBLASDx: The cuBLAS Device Extensions. Retrieved April 7 2025 from https:\/\/docs.nvidia.com\/cuda\/cublasdx\/index.html"},{"key":"e_1_3_3_2_158_2","unstructured":"NVIDIA. 2025. 
CUTLASS: CUDA Templates for Linear Algebra Subroutines. Retrieved April 7 2025 from https:\/\/github.com\/NVIDIA\/cutlass"},{"key":"e_1_3_3_2_159_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-32041-5_14"},{"key":"e_1_3_3_2_160_2","doi-asserted-by":"crossref","unstructured":"Hiroyuki Ootomo Katsuhisa Ozaki and Rio Yokota. 2024. DGEMM on integer matrix multiplication unit. The International Journal of High Performance Computing Applications 38 4 (2024) 297\u2013313.","DOI":"10.1177\/10943420241239588"},{"key":"e_1_3_3_2_161_2","doi-asserted-by":"crossref","unstructured":"Elmar Peise and Paolo Bientinesi. 2019. The ELAPS framework: Experimental Linear Algebra Performance Studies. The International Journal of High Performance Computing Applications 33 2 (2019) 353\u2013365.","DOI":"10.1177\/1094342018763042"},{"key":"e_1_3_3_2_162_2","doi-asserted-by":"publisher","DOI":"10.1145\/3712285.3759898"},{"key":"e_1_3_3_2_163_2","doi-asserted-by":"crossref","unstructured":"Gregorio Quintana-Ort\u00ed Enrique\u00a0S Quintana-Ort\u00ed Robert A Van\u00a0De Geijn Field G\u00a0Van Zee and Ernie Chan. 2009. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Trans. Math. Software 36 3 (2009) 1\u201326.","DOI":"10.1145\/1527286.1527288"},{"key":"e_1_3_3_2_164_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41406.2024.00052"},{"key":"e_1_3_3_2_165_2","doi-asserted-by":"publisher","DOI":"10.1145\/3330345.3330357"},{"key":"e_1_3_3_2_166_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2018.00100"},{"key":"e_1_3_3_2_167_2","doi-asserted-by":"crossref","unstructured":"Piyush Sao Xiaoye\u00a0S Li and Richard Vuduc. 2019. A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems. J. Parallel and Distrib. Comput. 
131 (2019) 218\u2013234.","DOI":"10.1016\/j.jpdc.2019.03.004"},{"key":"e_1_3_3_2_168_2","doi-asserted-by":"publisher","DOI":"10.1145\/3447818.3461472"},{"key":"e_1_3_3_2_169_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS57527.2023.00023"},{"key":"e_1_3_3_2_170_2","volume-title":"Software Automatic Tuning: From Concepts to State-of-the-Art Results","author":"Shin Jaewook","year":"2010","unstructured":"Jaewook Shin, Mary\u00a0W Hall, Jacqueline Chame, Chun Chen, and Paul\u00a0D Hovland. 2010. Autotuning and specialization: Speeding up matrix multiply for small matrices with compiler technology. In Software Automatic Tuning: From Concepts to State-of-the-Art Results. Springer New York."},{"key":"e_1_3_3_2_171_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2014.110"},{"key":"e_1_3_3_2_172_2","doi-asserted-by":"publisher","DOI":"10.1145\/3087556.3087561"},{"key":"e_1_3_3_2_173_2","doi-asserted-by":"publisher","DOI":"10.1145\/3126908.3126971"},{"key":"e_1_3_3_2_174_2","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063487"},{"key":"e_1_3_3_2_175_2","doi-asserted-by":"publisher","DOI":"10.21236\/ADA580350"},{"key":"e_1_3_3_2_176_2","doi-asserted-by":"publisher","DOI":"10.1145\/2612669.2612671"},{"key":"e_1_3_3_2_177_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-23397-5_10"},{"key":"e_1_3_3_2_178_2","doi-asserted-by":"crossref","unstructured":"Edgar Solomonik James Demmel and Torsten Hoefler. 2021. Communication lower bounds of bilinear algorithms for symmetric tensor contractions. SIAM Journal on Scientific Computing 43 5 (2021) A3328\u2013A3356.","DOI":"10.1137\/20M1338599"},{"key":"e_1_3_3_2_179_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2010.48"},{"key":"e_1_3_3_2_180_2","doi-asserted-by":"publisher","DOI":"10.1145\/3225058.3225131"},{"key":"e_1_3_3_2_181_2","doi-asserted-by":"crossref","unstructured":"Paul Springer and Paolo Bientinesi. 2018. Design of a High-Performance GEMM-like Tensor\u2013Tensor Multiplication. 
ACM Trans. Math. Software 44 3 (2018).","DOI":"10.1145\/3157733"},{"key":"e_1_3_3_2_182_2","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063431"},{"key":"e_1_3_3_2_183_2","doi-asserted-by":"crossref","unstructured":"Stanimire Tomov Jack Dongarra and Marc Baboulin. 2010. Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parallel Comput. 36 5-6 (2010) 232\u2013240.","DOI":"10.1016\/j.parco.2009.12.005"},{"key":"e_1_3_3_2_184_2","doi-asserted-by":"publisher","DOI":"10.5555\/3433701.3433794"},{"key":"e_1_3_3_2_185_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581784.3607046"},{"key":"e_1_3_3_2_186_2","doi-asserted-by":"crossref","unstructured":"Yuki Uchino Katsuhisa Ozaki and Toshiyuki Imamura. 2025. Performance enhancement of the Ozaki Scheme on integer matrix multiplication unit. The International Journal of High Performance Computing Applications 39 3 (2025) 462\u2013476.","DOI":"10.1177\/10943420241313064"},{"key":"e_1_3_3_2_187_2","doi-asserted-by":"publisher","DOI":"10.1109\/CCGRID.2019.00057"},{"key":"e_1_3_3_2_188_2","doi-asserted-by":"crossref","unstructured":"Didem Unat Anshu Dubey Torsten Hoefler John Shalf Mark Abraham Mauro Bianco Bradford\u00a0L Chamberlain Romain Cledat H\u00a0Carter Edwards Hal Finkel et\u00a0al. 2017. Trends in data locality abstractions for HPC systems. IEEE Transactions on Parallel and Distributed Systems 28 10 (2017) 3007\u20133020.","DOI":"10.1109\/TPDS.2017.2703149"},{"key":"e_1_3_3_2_189_2","doi-asserted-by":"crossref","unstructured":"Field\u00a0G Van\u00a0Zee. 2020. Implementing high-performance complex matrix multiplication via the 1m method. SIAM Journal on Scientific Computing 42 5 (2020) C221\u2013C244.","DOI":"10.1137\/19M1282040"},{"key":"e_1_3_3_2_190_2","doi-asserted-by":"crossref","unstructured":"Field\u00a0G Van\u00a0Zee Devangi\u00a0N Parikh and Robert A Van\u00a0De Geijn. 2021. Supporting mixed-domain mixed-precision matrix multiplication within the BLIS framework. ACM Trans. 
Math. Software 47 2 (2021) 1\u201326.","DOI":"10.1145\/3402225"},{"key":"e_1_3_3_2_191_2","doi-asserted-by":"crossref","unstructured":"Field\u00a0G Van\u00a0Zee and Tyler\u00a0M Smith. 2017. Implementing high-performance complex matrix multiplication via the 3m and 4m methods. ACM Trans. Math. Software 44 1 (2017) 1\u201336.","DOI":"10.1145\/3086466"},{"key":"e_1_3_3_2_192_2","doi-asserted-by":"crossref","unstructured":"Field\u00a0G Van\u00a0Zee Tyler\u00a0M Smith Bryan Marker Tze\u00a0Meng Low Robert A Van\u00a0De Geijn Francisco\u00a0D Igual Mikhail Smelyanskiy Xianyi Zhang Michael Kistler Vernon Austel et\u00a0al. 2016. The BLIS framework: Experiments in portability. ACM Trans. Math. Software 42 2 (2016) 1\u201319.","DOI":"10.1145\/2755561"},{"key":"e_1_3_3_2_193_2","doi-asserted-by":"crossref","unstructured":"Field\u00a0G Van\u00a0Zee and Robert\u00a0A Van De\u00a0Geijn. 2015. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Trans. Math. Software 41 3 (2015) 1\u201333.","DOI":"10.1145\/2764454"},{"key":"e_1_3_3_2_194_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2008.5214359"},{"key":"e_1_3_3_2_195_2","doi-asserted-by":"crossref","unstructured":"Hemeng Wang Wenqing Lin Qingxiao Sun and Weifeng Liu. 2025. \u03bd GNN: Non-Uniformly partitioned full-graph GNN training on mixed GPUs. 
CCF Transactions on High Performance Computing (2025) 1\u201318.","DOI":"10.1007\/s42514-025-00224-3"},{"key":"e_1_3_3_2_196_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581784.3607107"},{"key":"e_1_3_3_2_197_2","doi-asserted-by":"publisher","DOI":"10.1145\/2503210.2503219"},{"key":"e_1_3_3_2_198_2","doi-asserted-by":"publisher","DOI":"10.1109\/DAC56929.2023.10247767"},{"key":"e_1_3_3_2_199_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCC-DSS-SmartCity-DependSys57074.2022.00042"},{"key":"e_1_3_3_2_200_2","doi-asserted-by":"publisher","DOI":"10.1145\/3545008.3545032"},{"key":"e_1_3_3_2_201_2","doi-asserted-by":"crossref","unstructured":"Cunyang Wei Haipeng Jia Yunquan Zhang Jianyu Yao Chendi Li and Wenxuan Cao. 2024. IrGEMM: An Input-Aware Tuning Framework for Irregular GEMM on ARM and X86 CPUs. IEEE Transactions on Parallel and Distributed Systems 35 9 (2024) 1672\u20131689.","DOI":"10.1109\/TPDS.2024.3432579"},{"key":"e_1_3_3_2_202_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2017.104"},{"key":"e_1_3_3_2_203_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41406.2024.00027"},{"key":"e_1_3_3_2_204_2","doi-asserted-by":"publisher","DOI":"10.1145\/3225058.3225140"},{"key":"e_1_3_3_2_205_2","doi-asserted-by":"publisher","DOI":"10.1145\/3330345.3330354"},{"key":"e_1_3_3_2_206_2","doi-asserted-by":"crossref","unstructured":"Zhen Xie Guangming Tan Weifeng Liu and Ninghui Sun. 2021. A pattern-based spgemm library for multi-core and many-core architectures. 
IEEE Transactions on Parallel and Distributed Systems 33 1 (2021) 159\u2013175.","DOI":"10.1109\/TPDS.2021.3090328"},{"key":"e_1_3_3_2_207_2","doi-asserted-by":"publisher","DOI":"10.1145\/3577193.3593707"},{"key":"e_1_3_3_2_208_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2014.81"},{"key":"e_1_3_3_2_209_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS49936.2021.00019"},{"key":"e_1_3_3_2_210_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476217"},{"key":"e_1_3_3_2_211_2","doi-asserted-by":"crossref","unstructured":"Weiling Yang Jianbin Fang Dezun Dong Xing Su and Zheng Wang. 2024. Optimizing Full-Spectrum Matrix Multiplications on ARMv8 Multi-Core CPUs. IEEE Transactions on Parallel and Distributed Systems 35 3 (2024) 439\u2013454.","DOI":"10.1109\/TPDS.2024.3350368"},{"key":"e_1_3_3_2_212_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPADS53394.2021.00118"},{"key":"e_1_3_3_2_213_2","doi-asserted-by":"publisher","DOI":"10.1145\/3126908.3126912"},{"key":"e_1_3_3_2_214_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2017.54"},{"key":"e_1_3_3_2_215_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2015.117"},{"key":"e_1_3_3_2_216_2","doi-asserted-by":"crossref","unstructured":"Yang You James Demmel Kent Czechowski Le Song and Rich Vuduc. 2016. Design and implementation of a communication-optimal classifier for distributed kernel support vector machines. IEEE Transactions on Parallel and Distributed Systems 28 4 (2016) 974\u2013988.","DOI":"10.1109\/TPDS.2016.2608823"},{"key":"e_1_3_3_2_217_2","doi-asserted-by":"publisher","DOI":"10.1145\/3295500.3356137"},{"key":"e_1_3_3_2_218_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS57955.2024.00090"},{"key":"e_1_3_3_2_219_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2025.acl-long.1126"},{"key":"e_1_3_3_2_220_2","doi-asserted-by":"crossref","unstructured":"Albert-Jan\u00a0Nicholas Yzelman and Rob\u00a0H Bisseling. 2009. 
Cache-oblivious sparse matrix\u2013vector multiplication by using sparse matrix partitioning methods. SIAM Journal on Scientific Computing 31 4 (2009) 3128\u20133154.","DOI":"10.1137\/080733243"},{"key":"e_1_3_3_2_221_2","volume-title":"European Consortium of Mathematics in Industry (ECMI)","author":"Yzelman Albert-Jan\u00a0Nicholas","year":"2012","unstructured":"Albert-Jan\u00a0Nicholas Yzelman and Rob\u00a0H Bisseling. 2012. A cache-oblivious sparse matrix\u2013vector multiplication scheme based on the Hilbert curve. In European Consortium of Mathematics in Industry (ECMI). Springer."},{"key":"e_1_3_3_2_222_2","doi-asserted-by":"crossref","unstructured":"Albert-Jan\u00a0Nicholas Yzelman and Dirk Roose. 2013. High-level strategies for parallel shared-memory sparse matrix-vector multiplication. IEEE Transactions on Parallel and Distributed Systems 25 1 (2013) 116\u2013125.","DOI":"10.1109\/TPDS.2013.31"},{"key":"e_1_3_3_2_223_2","doi-asserted-by":"publisher","DOI":"10.1145\/3673038.3673108"},{"key":"e_1_3_3_2_224_2","unstructured":"Xianyi Zhang. 2016. OpenBLAS: An optimized BLAS library. Retrieved April 7 2025 from http:\/\/www.openmathlib.org\/OpenBLAS\/"},{"key":"e_1_3_3_2_225_2","doi-asserted-by":"publisher","DOI":"10.1145\/3018743.3018755"},{"key":"e_1_3_3_2_226_2","doi-asserted-by":"publisher","DOI":"10.1145\/3650200.3656593"}],"event":{"name":"SC '25: The International Conference for High Performance Computing, Networking, Storage and Analysis","location":"St. 
Louis MO USA","acronym":"SC '25","sponsor":["SIGHPC ACM Special Interest Group on High Performance Computing, Special Interest Group on High Performance Computing"]},"container-title":["Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3712285.3759895","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T18:45:58Z","timestamp":1773254758000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3712285.3759895"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,15]]},"references-count":225,"alternative-id":["10.1145\/3712285.3759895","10.1145\/3712285"],"URL":"https:\/\/doi.org\/10.1145\/3712285.3759895","relation":{},"subject":[],"published":{"date-parts":[[2025,11,15]]},"assertion":[{"value":"2025-11-15","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}