{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T22:42:18Z","timestamp":1769640138226,"version":"3.49.0"},"reference-count":49,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2024,11,8]],"date-time":"2024-11-08T00:00:00Z","timestamp":1731024000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100004543","name":"China Scholarship Council","doi-asserted-by":"crossref","award":["202106380059"],"award-info":[{"award-number":["202106380059"]}],"id":[{"id":"10.13039\/501100004543","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Parallel Comput."],"published-print":{"date-parts":[[2024,12,31]]},"abstract":"<jats:p>\n            In this article, we explore the acceleration of tensor product operations in finite element methods, leveraging the computational power of the NVIDIA A100 GPU Tensor Cores. We provide an accessible overview of the necessary mathematical background and discuss our implementation strategies. Our study focuses on two common programming approaches for NVIDIA Tensor Cores: the C++ Warp Matrix Functions in\n            <jats:monospace>nvcuda::wmma<\/jats:monospace>\n            and the inline Parallel Thread Execution (PTX) instructions\n            <jats:monospace>mma.sync.aligned<\/jats:monospace>\n            . A significant focus is placed on the adoption of the versatile inline PTX instructions combined with a conflict-free shared memory access pattern, a key to unlocking superior performance. When benchmarked against traditional CUDA Cores, our approach yields a remarkable 2.3-fold increase in double-precision performance, achieving 8 TFLOPS\/s\u201445% of the theoretical maximum. Furthermore, in half-precision computations, numerical experiments demonstrate a fourfold enhancement in solving the Poisson equation using the flexible GMRES (FGMRES) method, preconditioned by a multigrid method in 3D. This is achieved while maintaining the same discretization error as observed in double-precision computations. These results highlight the considerable benefits of using Tensor Cores for finite element operators with tensor products, achieving an optimal balance between computational speed and precision.\n          <\/jats:p>","DOI":"10.1145\/3695466","type":"journal-article","created":{"date-parts":[[2024,9,9]],"date-time":"2024-09-09T11:01:41Z","timestamp":1725879701000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":12,"title":["Acceleration of Tensor-Product Operations with Tensor Cores"],"prefix":"10.1145","volume":"11","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0341-4447","authenticated-orcid":false,"given":"Cu","family":"Cui","sequence":"first","affiliation":[{"name":"IWR, Heidelberg University, Heidelberg, Germany"}]}],"member":"320","published-online":{"date-parts":[[2024,11,8]]},"reference":[{"key":"e_1_3_4_2_2","doi-asserted-by":"crossref","first-page":"102841","DOI":"10.1016\/j.parco.2021.102841","article-title":"GPU algorithms for efficient exascale discretizations","volume":"108","author":"Abdelfattah Ahmad","year":"2021","unstructured":"Ahmad Abdelfattah, Valeria Barra, Natalie Beams, Ryan Bleile, Jed Brown, Jean-Sylvain Camier, Robert Carson, Noel Chalmers, Veselin Dobrev, Yohann Dudouit, Paul Fischer, Ali Karakus, Stefan Kerkemeier, Tzanio Kolev, Yu-Hsiang Lan, Elia Merzari, Misun Min, Malachi Phillips, Thilina Rathnayake, Robert Rieben, Thomas Stitt, Ananias Tomboulides, Stanimire Tomov, Vladimir Tomov, Arturo Vargas, Tim Warburton, and Kenneth Weiss. 2021. GPU algorithms for efficient exascale discretizations. Parallel Comput. 108 (2021), 102841.","journal-title":"Parallel Comput."},{"issue":"4","key":"e_1_3_4_3_2","doi-asserted-by":"crossref","first-page":"742","DOI":"10.1137\/0719052","article-title":"An interior penalty finite element method with discontinuous elements","volume":"19","author":"Arnold Douglas N.","year":"1982","unstructured":"Douglas N. Arnold. 1982. An interior penalty finite element method with discontinuous elements. SIAM J. Numer. Anal. 19, 4 (1982), 742\u2013760.","journal-title":"SIAM J. Numer. Anal."},{"issue":"5","key":"e_1_3_4_4_2","doi-asserted-by":"crossref","first-page":"1749","DOI":"10.1137\/S0036142901384162","article-title":"Unified analysis of discontinuous Galerkin methods for elliptic problems","volume":"39","author":"Arnold Douglas N.","year":"2002","unstructured":"Douglas N. Arnold, Franco Brezzi, Bernardo Cockburn, and L. Donatella Marini. 2002. Unified analysis of discontinuous Galerkin methods for elliptic problems. SIAM J. Numer. Anal. 39, 5 (2002), 1749\u20131779.","journal-title":"SIAM J. Numer. Anal."},{"key":"e_1_3_4_5_2","series-title":"(Pitman Research Notes in Mathematics Series, Vol. 294)","volume-title":"Multigrid Methods","author":"Bramble James H.","year":"1993","unstructured":"James H. Bramble. 1993. Multigrid Methods. (Pitman Research Notes in Mathematics Series, Vol. 294). Longman Scientific, New York."},{"key":"e_1_3_4_6_2","doi-asserted-by":"crossref","DOI":"10.1137\/1.9781611970753","volume-title":"Multigrid Techniques","author":"Brandt Achi","year":"2011","unstructured":"Achi Brandt and Oren E. Livne. 2011. Multigrid Techniques. Society for Industrial and Applied Mathematics."},{"issue":"1","key":"e_1_3_4_7_2","doi-asserted-by":"crossref","first-page":"18","DOI":"10.1145\/225545.225548","article-title":"Efficient vector and parallel manipulation of tensor products","volume":"22","author":"Buis Paul E.","year":"1996","unstructured":"Paul E. Buis and Wayne R. Dyksen. 1996. Efficient vector and parallel manipulation of tensor products. ACM Trans. Math. Softw. 22, 1 (1996), 18\u201323.","journal-title":"ACM Trans. Math. Softw."},{"issue":"1","key":"e_1_3_4_8_2","doi-asserted-by":"crossref","first-page":"23","DOI":"10.1016\/j.compfluid.2010.08.012","article-title":"From h to p efficiently: Strategy selection for operator evaluation on hexahedral and tetrahedral elements","volume":"43","author":"Cantwell Chris D.","year":"2011","unstructured":"Chris D. Cantwell, Spencer J. Sherwin, Robert M. Kirby, and Paul H. J. Kelly. 2011. From h to p efficiently: Strategy selection for operator evaluation on hexahedral and tetrahedral elements. Comput. Fluids 43, 1 (2011), 23\u201328.","journal-title":"Comput. Fluids"},{"key":"e_1_3_4_9_2","unstructured":"Cu Cui Paul Grosse-Bley Guido Kanschat and Robert Strzodka. 2024. An implementation of tensor product patch smoothers on GPU. arxiv:2405.19004"},{"key":"e_1_3_4_10_2","unstructured":"Cu Cui and Guido Kanschat. 2024. Multilevel Interior Penalty Methods on GPUs. arxiv:2405.18982"},{"key":"e_1_3_4_11_2","doi-asserted-by":"crossref","first-page":"46","DOI":"10.1145\/3330345.3331057","volume-title":"Proceedings of the ACM International Conference on Supercomputing (ICS\u201919)","author":"Dakkak Abdul","year":"2019","unstructured":"Abdul Dakkak, Cheng Li, Jinjun Xiong, Isaac Gelado, and Wen-Mei Hwu. 2019. Accelerating reduction and scan using tensor core units. In Proceedings of the ACM International Conference on Supercomputing (ICS\u201919). Association for Computing Machinery, New York, NY, USA, 46\u201357."},{"issue":"7","key":"e_1_3_4_12_2","doi-asserted-by":"crossref","first-page":"4255","DOI":"10.1021\/acs.jctc.2c00274","article-title":"Quantum perturbation theory using tensor cores and a deep neural network","volume":"18","author":"Finkelstein Joshua","year":"2022","unstructured":"Joshua Finkelstein, Emanuel H. Rubensson, Susan M. Mniszewski, Christian F. A. Negre, and Anders M. N. Niklasson. 2022. Quantum perturbation theory using tensor cores and a deep neural network. J. Chem. Theor. Comput. 18, 7 (2022), 4255\u20134268.","journal-title":"J. Chem. Theor. Comput."},{"key":"e_1_3_4_13_2","doi-asserted-by":"crossref","first-page":"104541","DOI":"10.1016\/j.compfluid.2020.104541","article-title":"High-order matrix-free incompressible flow solvers with GPU acceleration and low-order refined preconditioners","volume":"203","author":"Franco Michael","year":"2020","unstructured":"Michael Franco, Jean-Sylvain Camier, Julian Andrej, and Will Pazner. 2020. High-order matrix-free incompressible flow solvers with GPU acceleration and low-order refined preconditioners. Comput. Fluids 203 (2020), 104541.","journal-title":"Comput. Fluids"},{"key":"e_1_3_4_14_2","first-page":"135","volume-title":"Proceedings of the IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC\u201922)","author":"Gallet Benoit","year":"2022","unstructured":"Benoit Gallet and Michael Gowanlock. 2022. Leveraging GPU tensor cores for double precision Euclidean distance calculations. In Proceedings of the IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC\u201922). 135\u2013144."},{"key":"e_1_3_4_15_2","first-page":"12","volume-title":"Proceedings of the International Conference on High Performance Computing & Simulation","author":"Goddeke Dominik","year":"2009","unstructured":"Dominik Goddeke, Sven H. M. Buijssen, Hilmar Wobker, and Stefan Turek. 2009. GPU acceleration of an unmodified parallel finite element Navier-Stokes solver. In Proceedings of the International Conference on High Performance Computing & Simulation. IEEE, 12\u201321."},{"issue":"10","key":"e_1_3_4_16_2","doi-asserted-by":"crossref","first-page":"685","DOI":"10.1016\/j.parco.2007.09.002","article-title":"Exploring weak scalability for FEM calculations on a GPU-enhanced cluster","volume":"33","author":"G\u00f6ddeke Dominik","year":"2007","unstructured":"Dominik G\u00f6ddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Sven H. M. Buijssen, Matthias Grajewski, and Stefan Turek. 2007. Exploring weak scalability for FEM calculations on a GPU-enhanced cluster. Parallel Comput. 33, 10-11 (2007), 685\u2013699.","journal-title":"Parallel Comput."},{"key":"e_1_3_4_17_2","volume-title":"Accelerating Double Precision FEM Simulations with GPUs","author":"G\u00f6ddeke Dominik","year":"2005","unstructured":"Dominik G\u00f6ddeke, Robert Strzodka, and Stefan Turek. 2005. Accelerating Double Precision FEM Simulations with GPUs. Heidelberg University."},{"issue":"4","key":"e_1_3_4_18_2","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1080\/17445760601122076","article-title":"Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations","volume":"22","author":"G\u00f6ddeke Dominik","year":"2007","unstructured":"Dominik G\u00f6ddeke, Robert Strzodka, and Stefan Turek. 2007. Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations. Int. J. Parallel, Emerg. Distrib. Syst. 22, 4 (2007), 221\u2013256.","journal-title":"Int. J. Parallel, Emerg. Distrib. Syst."},{"key":"e_1_3_4_19_2","first-page":"1737","volume-title":"Proceedings of the 32nd International Conference on International Conference on Machine Learning (ICML\u201915)","author":"Gupta Suyog","year":"2015","unstructured":"Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on International Conference on Machine Learning (ICML\u201915). JMLR.org, 1737\u20131746."},{"key":"e_1_3_4_20_2","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-662-02427-0","volume-title":"Multi-grid Methods and Applications","author":"Hackbusch Wolfgang","year":"1985","unstructured":"Wolfgang Hackbusch. 1985. Multi-grid Methods and Applications. Springer."},{"key":"e_1_3_4_21_2","first-page":"603","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201918)","author":"Haidar Azzam","year":"2018","unstructured":"Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Nicholas J. Higham. 2018. Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201918). IEEE, 603\u2013613."},{"key":"e_1_3_4_22_2","volume-title":"Proceedings of the 50th International Conference on Parallel Processing (ICPP\u201921)","author":"Ji Zhuoran","year":"2021","unstructured":"Zhuoran Ji and Cho-Li Wang. 2021. Accelerating DBSCAN algorithm with AI chips for large datasets. In Proceedings of the 50th International Conference on Parallel Processing (ICPP\u201921). Association for Computing Machinery, New York, NY, USA."},{"key":"e_1_3_4_23_2","volume-title":"Proceedings of the 36th ACM International Conference on Supercomputing (ICS\u201922)","author":"Ji Zhuoran","year":"2022","unstructured":"Zhuoran Ji and Cho-Li Wang. 2022. Efficient exact k-nearest neighbor graph construction for billion-scale datasets using GPUs with tensor cores. In Proceedings of the 36th ACM International Conference on Supercomputing (ICS\u201922). Association for Computing Machinery, New York, NY, USA."},{"issue":"1","key":"e_1_3_4_24_2","doi-asserted-by":"crossref","first-page":"40","DOI":"10.1145\/363707.363723","article-title":"Pracniques: Further remarks on reducing truncation errors","volume":"8","author":"Kahan William","year":"1965","unstructured":"William Kahan. 1965. Pracniques: Further remarks on reducing truncation errors. Commun. ACM 8, 1 (1965), 40.","journal-title":"Commun. ACM"},{"key":"e_1_3_4_25_2","doi-asserted-by":"crossref","first-page":"380","DOI":"10.1016\/j.jcp.2019.04.010","article-title":"A GPU accelerated discontinuous Galerkin incompressible flow solver","volume":"390","author":"Karakus A.","year":"2019","unstructured":"A. Karakus, N. Chalmers, K. \u015awirydowicz, and T. Warburton. 2019. A GPU accelerated discontinuous Galerkin incompressible flow solver. J. Comput. Phys. 390 (2019), 380\u2013404.","journal-title":"J. Comput. Phys."},{"issue":"21","key":"e_1_3_4_26_2","doi-asserted-by":"crossref","first-page":"7863","DOI":"10.1016\/j.jcp.2009.06.041","article-title":"Nodal discontinuous Galerkin methods on graphics processors","volume":"228","author":"Kl\u00f6ckner Andreas","year":"2009","unstructured":"Andreas Kl\u00f6ckner, Tim Warburton, Jeff Bridge, and Jan S. Hesthaven. 2009. Nodal discontinuous Galerkin methods on graphics processors. J. Comput. Phys. 228, 21 (2009), 7863\u20137882.","journal-title":"J. Comput. Phys."},{"issue":"6","key":"e_1_3_4_27_2","doi-asserted-by":"crossref","first-page":"527","DOI":"10.1177\/10943420211020803","article-title":"Efficient exascale discretizations: High-order finite element methods","volume":"35","author":"Kolev Tzanio","year":"2021","unstructured":"Tzanio Kolev, Paul Fischer, Misun Min, Jack Dongarra, Jed Brown, Veselin Dobrev, Tim Warburton, Stanimire Tomov, Mark S. Shephard, Ahmad Abdelfattah, Valeria Barra, Natalie Beams, Jean-Sylvain Camier, Noel Chalmers, Yohann Dudouit, Ali Karakus, Ian Karlin, Stefan Kerkemeier, Yu-Hsiang Lan, David Medina, Elia Merzari, Aleksandr Obabko, Will Pazner, Thilina Rathnayake, Cameron W. Smith, Lukas Spies, Kasia Swirydowicz, Jeremy Thompson, Ananias Tomboulides, and Vladimir Tomov. 2021. Efficient exascale discretizations: High-order finite element methods. Int. J. High Perform. Comput. Applic. 35, 6 (2021), 527\u2013552.","journal-title":"Int. J. High Perform. Comput. Applic."},{"key":"e_1_3_4_28_2","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1016\/j.compfluid.2012.04.012","article-title":"A generic interface for parallel cell-based finite element operator application","volume":"63","author":"Kronbichler Martin","year":"2012","unstructured":"Martin Kronbichler and Katharina Kormann. 2012. A generic interface for parallel cell-based finite element operator application. Comput. Fluids 63 (2012), 135\u2013147.","journal-title":"Comput. Fluids"},{"issue":"3","key":"e_1_3_4_29_2","first-page":"29","article-title":"Fast matrix-free evaluation of discontinuous Galerkin finite element operators","volume":"45","author":"Kronbichler Martin","year":"2019","unstructured":"Martin Kronbichler and Katharina Kormann. 2019. Fast matrix-free evaluation of discontinuous Galerkin finite element operators. ACM Trans. Math. Softw. 45, 3 (2019), 29.","journal-title":"ACM Trans. Math. Softw."},{"issue":"1","key":"e_1_3_4_30_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3322813","article-title":"Multigrid for matrix-free high-order finite element computations on graphics processors","volume":"6","author":"Kronbichler Martin","year":"2019","unstructured":"Martin Kronbichler and Karl Ljungkvist. 2019. Multigrid for matrix-free high-order finite element computations on graphics processors. ACM Trans. Parallel Comput. 6, 1 (2019), 1\u201332.","journal-title":"ACM Trans. Parallel Comput."},{"key":"e_1_3_4_31_2","article-title":"tcFFT: Accelerating half-precision FFT through tensor cores","author":"Li Binrui","year":"2021","unstructured":"Binrui Li, Shenggan Cheng, and James Lin. 2021. tcFFT: Accelerating half-precision FFT through tensor cores. arXiv preprint arXiv:2104.11471 (2021).","journal-title":"arXiv preprint arXiv:2104.11471"},{"key":"e_1_3_4_32_2","volume-title":"Proceedings of the 25th High Performance Computing Symposium (HPC\u201917)","author":"Ljungkvist Karl","year":"2017","unstructured":"Karl Ljungkvist. 2017. Matrix-free finite-element computations on graphics processors with adaptively refined unstructured meshes. In Proceedings of the 25th High Performance Computing Symposium (HPC\u201917). Society for Computer Simulation International."},{"key":"e_1_3_4_33_2","first-page":"522","volume-title":"Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201918)","author":"Markidis S.","year":"2018","unstructured":"S. Markidis, S. Chien, E. Laure, I. Peng, and J. S. Vetter. 2018. NVIDIA tensor core programmability, performance & precision. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201918). IEEE Computer Society, Los Alamitos, CA, USA, 522\u2013531."},{"key":"e_1_3_4_34_2","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201923)","author":"Merzari Elia","year":"2023","unstructured":"Elia Merzari, Steven Hamilton, Thomas Evans, Misun Min, Paul Fischer, Stefan Kerkemeier, Jun Fang, Paul Romano, Yu-Hsiang Lan, Malachi Phillips, Elliott Biondo, Katherine Royston, Tim Warburton, Noel Chalmers, and Thilina Rathnayake. 2023. Exascale multiphysics nuclear reactor simulations for advanced designs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201923). Association for Computing Machinery, New York, NY, USA."},{"key":"e_1_3_4_35_2","unstructured":"NVIDIA Corporation. 2020. Nvidia Ampere architecture white paper. Retrieved from https:\/\/resources.nvidia.com\/en-us-genomicsep\/ampere-architecture-white-paper"},{"key":"e_1_3_4_36_2","unstructured":"NVIDIA Corporation. 2020. Nvidia Volta architecture white paper. Retrieved from https:\/\/images.nvidia.com\/content\/volta-architecture\/pdf\/volta-architecture-whitepaper.pdf"},{"key":"e_1_3_4_37_2","unstructured":"NVIDIA Corporation. 2022. cuBlas. Retrieved from https:\/\/docs.nvidia.com\/cuda\/cublas\/index.htm"},{"key":"e_1_3_4_38_2","unstructured":"NVIDIA Corporation. 2023. Nsight Compute. Retrieved from https:\/\/docs.nvidia.com\/nsight-compute\/index.html"},{"key":"e_1_3_4_39_2","unstructured":"NVIDIA Corporation. 2023. Parallel Thread Execution ISA. Retrieved from https:\/\/docs.nvidia.com\/cuda\/parallel-thread-execution\/index.html"},{"key":"e_1_3_4_40_2","first-page":"1","volume-title":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (HPCAsia\u201923)","author":"Ootomo Hiroyuki","year":"2023","unstructured":"Hiroyuki Ootomo and Rio Yokota. 2023. Reducing shared memory footprint to leverage high throughput on tensor cores and its flexible API extension library. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (HPCAsia\u201923). Association for Computing Machinery, New York, NY, USA, 1\u20138."},{"issue":"1","key":"e_1_3_4_41_2","doi-asserted-by":"crossref","first-page":"168","DOI":"10.1145\/3200691.3178500","article-title":"Register optimizations for stencils on GPUs","volume":"53","author":"Rawat Prashant Singh","year":"2018","unstructured":"Prashant Singh Rawat, Fabrice Rastello, Aravind Sukumaran-Rajam, Louis-No\u00ebl Pouchet, Atanas Rountev, and P. Sadayappan. 2018. Register optimizations for stencils on GPUs. SIGPLAN Not. 53, 1 (Feb.2018), 168\u2013182.","journal-title":"SIGPLAN Not."},{"key":"e_1_3_4_42_2","doi-asserted-by":"crossref","first-page":"246","DOI":"10.1016\/j.jcp.2016.08.005","article-title":"GPU accelerated spectral finite elements on all-hex meshes","volume":"324","author":"Remacle J.-F.","year":"2016","unstructured":"J.-F. Remacle, Rajesh Gandham, and Tim Warburton. 2016. GPU accelerated spectral finite elements on all-hex meshes. J. Comput. Phys. 324 (2016), 246\u2013257.","journal-title":"J. Comput. Phys."},{"issue":"4","key":"e_1_3_4_43_2","doi-asserted-by":"crossref","first-page":"459","DOI":"10.1177\/10943420221084657","article-title":"Very fast finite element poisson solvers on lower precision accelerator hardware: A proof of concept study for Nvidia Tesla V100","volume":"36","author":"Ruda Dustin","year":"2022","unstructured":"Dustin Ruda, Stefan Turek, Dirk Ribbrock, and Peter Zajac. 2022. Very fast finite element poisson solvers on lower precision accelerator hardware: A proof of concept study for Nvidia Tesla V100. Int. J. High Perform. Comput. Applic. 36, 4 (2022), 459\u2013474.","journal-title":"Int. J. High Perform. Comput. Applic."},{"issue":"2","key":"e_1_3_4_44_2","doi-asserted-by":"crossref","first-page":"461","DOI":"10.1137\/0914028","article-title":"A flexible inner-outer preconditioned GMRES algorithm","volume":"14","author":"Saad Youcef","year":"1993","unstructured":"Youcef Saad. 1993. A flexible inner-outer preconditioned GMRES algorithm. SIAM J. Scient. Comput. 14, 2 (1993), 461\u2013469.","journal-title":"SIAM J. Scient. Comput."},{"key":"e_1_3_4_45_2","unstructured":"Erich Strohmaier Jack Dongarra Horst Simon and Martin Meuer. 2023. June 2023 | TOP500. Retrieved from https:\/\/www.top500.org\/lists\/top500\/2023\/06\/"},{"issue":"1","key":"e_1_3_4_46_2","doi-asserted-by":"crossref","first-page":"246","DOI":"10.1109\/TPDS.2022.3217824","article-title":"Dissecting tensor cores via microbenchmarks: Latency, throughput and numeric behaviors","volume":"34","author":"Sun Wei","year":"2022","unstructured":"Wei Sun, Ang Li, Tong Geng, Sander Stuijk, and Henk Corporaal. 2022. Dissecting tensor cores via microbenchmarks: Latency, throughput and numeric behaviors. IEEE Trans. Parallel Distrib. Syst. 34, 1 (2022), 246\u2013261.","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"issue":"4","key":"e_1_3_4_47_2","doi-asserted-by":"crossref","first-page":"735","DOI":"10.1177\/1094342018816368","article-title":"Acceleration of tensor-product operations for high-order finite element methods","volume":"33","author":"\u015awirydowicz Kasia","year":"2019","unstructured":"Kasia \u015awirydowicz, Noel Chalmers, Ali Karakus, and Tim Warburton. 2019. Acceleration of tensor-product operations for high-order finite element methods. Int. J. High Perform. Comput. Applic. 33, 4 (2019), 735\u2013757.","journal-title":"Int. J. High Perform. Comput. Applic."},{"key":"e_1_3_4_48_2","volume-title":"CUTLASS","author":"Thakkar Vijay","year":"2023","unstructured":"Vijay Thakkar, Pradeep Ramani, Cris Cecka, Aniket Shivam, Honghao Lu, Ethan Yan, Jack Kosaian, Mark Hoemmen, Haicheng Wu, Andrew Kerr, Matt Nicely, Duane Merrill, Dustyn Blasig, Fengqi Qiao, Piotr Majcher, Paul Springer, Markus Hohnerbach, Jin Wang, and Manish Gupta. 2023. CUTLASS. Retrieved from https:\/\/github.com\/NVIDIA\/cutlass"},{"key":"e_1_3_4_49_2","first-page":"514","volume-title":"27th European Symposium on Research in Computer Security (ESORICS\u201922)","author":"Wan Lipeng","year":"2022","unstructured":"Lipeng Wan, Fangyu Zheng, Guang Fan, Rong Wei, Lili Gao, Yuewu Wang, Jingqiang Lin, and Jiankuo Dong. 2022. A novel high-performance implementation of CRYSTALS-Kyber with AI accelerator. In 27th European Symposium on Research in Computer Security (ESORICS\u201922). Springer-Verlag, Berlin, 514\u2013534."},{"issue":"3","key":"e_1_3_4_50_2","doi-asserted-by":"crossref","first-page":"709","DOI":"10.1515\/cmam-2020-0078","article-title":"Fast tensor product Schwarz smoothers for high-order discontinuous Galerkin methods","volume":"21","author":"Witte Julius","year":"2021","unstructured":"Julius Witte, Daniel Arndt, and Guido Kanschat. 2021. Fast tensor product Schwarz smoothers for high-order discontinuous Galerkin methods. Comput. Meth. Appl. Math. 21, 3 (2021), 709\u2013728.","journal-title":"Comput. Meth. Appl. Math."}],"container-title":["ACM Transactions on Parallel Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3695466","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3695466","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:58:11Z","timestamp":1750294691000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3695466"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,8]]},"references-count":49,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,12,31]]}},"alternative-id":["10.1145\/3695466"],"URL":"https:\/\/doi.org\/10.1145\/3695466","relation":{},"ISSN":["2329-4949","2329-4957"],"issn-type":[{"value":"2329-4949","type":"print"},{"value":"2329-4957","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,11,8]]},"assertion":[{"value":"2024-04-12","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-09-04","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-11-08","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}