{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,18]],"date-time":"2026-04-18T01:47:02Z","timestamp":1776476822368,"version":"3.51.2"},"publisher-location":"New York, NY, USA","reference-count":31,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,1,26]]},"DOI":"10.1145\/3773656.3773670","type":"proceedings-article","created":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T10:22:11Z","timestamp":1767954131000},"page":"91-101","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Guaranteed DGEMM Accuracy While Using Reduced Precision Tensor Cores Through Extensions of the Ozaki Scheme"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8444-6303","authenticated-orcid":false,"given":"Angelika","family":"Schwarz","sequence":"first","affiliation":[{"name":"NVIDIA Corporation, Stockholm, Sweden"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-6491-090X","authenticated-orcid":false,"given":"Anton","family":"Anders","sequence":"additional","affiliation":[{"name":"NVIDIA Corporation, Santa Clara, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-6520-8717","authenticated-orcid":false,"given":"Cole","family":"Brower","sequence":"additional","affiliation":[{"name":"NVIDIA Corporation, Santa Clara, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-8971-1899","authenticated-orcid":false,"given":"Harun","family":"Bayraktar","sequence":"additional","affiliation":[{"name":"NVIDIA Corporation, Santa Clara, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5110-190X","authenticated-orcid":false,"given":"John","family":"Gunnels","sequence":"additional","affiliation":[{"name":"NVIDIA Corporation, New York, 
USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5211-2002","authenticated-orcid":false,"given":"Kate","family":"Clark","sequence":"additional","affiliation":[{"name":"NVIDIA Corporation, Cambridge, United Kingdom"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5782-716X","authenticated-orcid":false,"given":"RuQing G.","family":"Xu","sequence":"additional","affiliation":[{"name":"NVIDIA Corporation, Minato, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2406-6499","authenticated-orcid":false,"given":"Samuel","family":"Rodriguez","sequence":"additional","affiliation":[{"name":"NVIDIA Corporation, Santa Clara, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3740-8985","authenticated-orcid":false,"given":"Sebastien","family":"Cayrols","sequence":"additional","affiliation":[{"name":"NVIDIA Corporation, Knoxville, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-5559-0444","authenticated-orcid":false,"given":"Pawel","family":"Tabaszewski","sequence":"additional","affiliation":[{"name":"NVIDIA Corporation, Warsaw, Poland"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-1417-547X","authenticated-orcid":false,"given":"Victor","family":"Podlozhnyuk","sequence":"additional","affiliation":[{"name":"NVIDIA Corporation, Reading, United Kingdom"}]}],"member":"320","published-online":{"date-parts":[[2026,1,25]]},"reference":[{"key":"e_1_3_3_1_2_2","doi-asserted-by":"publisher","unstructured":"Ahmad Abdelfattah Jack Dongarra Massimiliano Fasi Mantas Mikaitis and Fran\u00e7oise Tisseur. 2025. Analysis of Floating-Point Matrix Multiplication Computed via Integer Arithmetic. arXiv preprint (2025). 
10.48550\/arXiv.2506.11277 arxiv:https:\/\/arXiv.org\/abs\/2506.11277","DOI":"10.48550\/arXiv.2506.11277"},{"key":"e_1_3_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.5555\/323215"},{"key":"e_1_3_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.2172\/2337606"},{"key":"e_1_3_3_1_5_2","doi-asserted-by":"publisher","unstructured":"William Dawson Katsuhisa Ozaki Jens Domke and Takahito Nakajima. 2024. Reducing Numerical Precision Requirements in Quantum Chemistry Calculations. Journal of Chemical Theory and Computation 20 24 (2024) 10826\u201310837. 10.1021\/acs.jctc.4c00938","DOI":"10.1021\/acs.jctc.4c00938"},{"key":"e_1_3_3_1_6_2","unstructured":"Jim Demmel et\u00a0al. 2025. More aggressive (sparse) BLAS testing to identify aggressive optimizations. Private communication. Unpublished manuscript referenced with author approval. Citation details will be updated once published."},{"key":"e_1_3_3_1_7_2","unstructured":"Jim Demmel Xiaoye Li Julien Langou Weslley Pereira Mark Gates and Cindy\u00a0Rubio Gonzalez. 2024. How to grade the accuracy of an implementation of the BLAS. https:\/\/www.cs.utexas.edu\/~flame\/BLISRetreat2024\/slides\/Grading_BLAS.pdf"},{"key":"e_1_3_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS49936.2021.00114"},{"key":"e_1_3_3_1_9_2","doi-asserted-by":"publisher","unstructured":"Jack Dongarra John Gunnels Harun Bayraktar Azzam Haidar and Dan Ernst. 2024. Hardware Trends Impacting Floating-Point Computations In Scientific Applications. (Nov. 2024). 10.48550\/arXiv.2411.12090 arxiv:https:\/\/arXiv.org\/abs\/2411.12090\u00a0[math.NA]","DOI":"10.48550\/arXiv.2411.12090"},{"key":"e_1_3_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2018.00050"},{"key":"e_1_3_3_1_11_2","doi-asserted-by":"publisher","unstructured":"Michael Heroux Ahmad Abdelfattah Natalie Beams Robert Carson Pieter Ghysels Tzanio Kolev Thomas Stitt Arturo Vargas Stanimire Tomov and Jack Dongarra. 2024. 
MAGMA: Enabling exascale performance with accelerated BLAS and LAPACK for diverse GPU architectures. Int. J. High Perform. Comput. Appl. 38 5 (Sept. 2024) 468\u2013490. 10.1177\/10943420241261960","DOI":"10.1177\/10943420241261960"},{"key":"e_1_3_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1137\/1.9780898718027"},{"key":"e_1_3_3_1_13_2","doi-asserted-by":"publisher","unstructured":"Tsuyoshi Ichimura Kohei Fujita Muneo Hori and Maddegedara Lalith. 2025. Fast and power efficient GPU-based explicit elastic wave propagation analysis by low-ordered orthogonal voxel finite element with INT8 Tensor Cores. Journal of Computational Science 91 (Oct. 2025) 102659. 10.1016\/j.jocs.2025.102659","DOI":"10.1016\/j.jocs.2025.102659"},{"key":"e_1_3_3_1_14_2","doi-asserted-by":"publisher","unstructured":"Aditya Kashi Hao Lu Wesley Brewer David Rogers Michael Matheson Mallikarjun Shankar and Feiyi Wang. 2024. Mixed-precision numerics in scientific applications: survey and perspectives. (2024). 10.48550\/arXiv.2412.19322 arxiv:https:\/\/arXiv.org\/abs\/2412.19322","DOI":"10.48550\/arXiv.2412.19322"},{"key":"e_1_3_3_1_15_2","unstructured":"Andrew Kerr Duane Merrill Julien Demouth and John Tran. 2017. CUTLASS: Fast Linear Algebra in CUDA C++. https:\/\/developer.nvidia.com\/blog\/cutlass-linear-algebra-cuda\/"},{"key":"e_1_3_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3652032.3657567"},{"key":"e_1_3_3_1_17_2","doi-asserted-by":"publisher","unstructured":"Paulius Micikevicius Dusan Stosic Neil Burgess Marius Cornea Pradeep Dubey Richard Grisenthwaite Sangwon Ha Alexander Heinecke Patrick Judd John Kamalu Naveen Mellempudi Stuart Oberman Mohammad Shoeybi Michael Siu and Hao Wu. 2022. FP8 Formats for Deep Learning. 
10.48550\/arXiv.2209.05433 arxiv:https:\/\/arXiv.org\/abs\/2209.05433\u00a0[cs.LG]","DOI":"10.48550\/arXiv.2209.05433"},{"key":"e_1_3_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/800119.803910"},{"key":"e_1_3_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-50743-5_12"},{"key":"e_1_3_3_1_20_2","unstructured":"NVIDIA Corporation. 2024. cuSOLVER. https:\/\/docs.nvidia.com\/cuda\/cusolver\/index.html."},{"key":"e_1_3_3_1_21_2","unstructured":"NVIDIA Corporation. 2025. cuBLAS. https:\/\/github.com\/NVIDIA\/CUDALibrarySamples\/tree\/main\/cuBLAS\/Emulation."},{"key":"e_1_3_3_1_22_2","unstructured":"NVIDIA Corporation. 2025. cuBLAS. https:\/\/github.com\/NVIDIA\/CUDALibrarySamples\/tree\/main\/MathDx\/cuBLASDx\/16_dgemm_emulation."},{"key":"e_1_3_3_1_23_2","unstructured":"NVIDIA Corporation. 2025. cuBLAS. https:\/\/docs.nvidia.com\/cuda\/cublas\/index.html."},{"key":"e_1_3_3_1_24_2","doi-asserted-by":"publisher","unstructured":"Hiroyuki Ootomo Katsuhisa Ozaki and Rio Yokota. 2024. DGEMM on integer matrix multiplication unit. The International Journal of High Performance Computing Applications 38 4 (Mar. 2024) 297\u2013313. 10.1177\/10943420241239588","DOI":"10.1177\/10943420241239588"},{"key":"e_1_3_3_1_25_2","doi-asserted-by":"publisher","unstructured":"Katsuhisa Ozaki Takeshi Ogita Shin\u2019ichi Oishi and Siegfried\u00a0M. Rump. 2012. Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications. Numerical Algorithms 59 (2012) 95\u2013118. 10.1007\/s11075-011-9478-1","DOI":"10.1007\/s11075-011-9478-1"},{"key":"e_1_3_3_1_26_2","doi-asserted-by":"publisher","unstructured":"Katsuhisa Ozaki Yuki Uchino and Toshiyuki Imamura. 2025. Ozaki Scheme II: A GEMM-oriented emulation of floating-point matrix multiplication using an integer modular technique. arXiv preprint (2025). 
10.48550\/arXiv.2504.08009 arxiv:https:\/\/arXiv.org\/abs\/2504.08009","DOI":"10.48550\/arXiv.2504.08009"},{"key":"e_1_3_3_1_27_2","volume-title":"OCP Microscaling Formats (MX) Specification","author":"Rouhani Bita\u00a0Darvish","year":"2023","unstructured":"Bita\u00a0Darvish Rouhani, Nitin Garegrat, Tom Savell, Ankit More, Kyung-Nam Han, Ritchie Zhao, Mathew Hall, Jasmine Klar, Eric Chung, Yuan Yu, Michael Schulte, Ralph Wittig, Ian Bratt, Nigel Stephens, Jelena Milanovic, John Brothers, Pradeep Dubey, Marius Cornea, Alexander Heinecke, Andres Rodriguez, Martin Langhammer, Summer Deng, Maxim Naumov, Paulius Micikevicius, Michael Siu, and Colin Verrilli. 2023. OCP Microscaling Formats (MX) Specification. Technical Report. Open Compute Project. https:\/\/www.opencompute.org\/documents\/ocp-microscaling-formats-mx-v1-0-spec-final-pdf Revision 1.0."},{"key":"e_1_3_3_1_28_2","doi-asserted-by":"publisher","unstructured":"Robert Schreiber and Charles Van\u00a0Loan. 1989. A Storage-Efficient WY Representation for Products of Householder Transformations. SIAM J. Sci. Statist. Comput. 10 1 (Jan. 1989) 53\u201357. 10.1137\/0910005","DOI":"10.1137\/0910005"},{"key":"e_1_3_3_1_29_2","unstructured":"Ajay Tirumala Joe Eaton and Matt Tyrlik. 2022. Boosting Dynamic Programming Performance Using NVIDIA Hopper GPU DPX Instructions. https:\/\/developer.nvidia.com\/blog\/boosting-dynamic-programming-performance-using-nvidia-hopper-gpu-dpx-instructions\/"},{"key":"e_1_3_3_1_30_2","doi-asserted-by":"publisher","unstructured":"Yuki Uchino Katsuhisa Ozaki and Toshiyuki Imamura. 2025. High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines. (Nov. 2025) 1824\u20131831. 10.1145\/3731599.3767539 arxiv:https:\/\/arXiv.org\/abs\/2508.03984","DOI":"10.1145\/3731599.3767539"},{"key":"e_1_3_3_1_31_2","doi-asserted-by":"publisher","unstructured":"Yuki Uchino Katsuhisa Ozaki and Toshiyuki Imamura. 2025. 
Performance enhancement of the Ozaki Scheme on integer matrix multiplication unit. International Journal of High Performance Computing Applications 39 3 (Jan. 2025) 462\u2013476. 10.1177\/10943420241313064","DOI":"10.1177\/10943420241313064"},{"key":"e_1_3_3_1_32_2","doi-asserted-by":"publisher","unstructured":"Field\u00a0G. Van\u00a0Zee and Tyler\u00a0M. Smith. 2017. Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods. ACM Trans. Math. Softw. 44 1 Article 7 (July 2017) 36\u00a0pages. 10.1145\/3086466","DOI":"10.1145\/3086466"}],"event":{"name":"SCA\/HPCAsia 2026: Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region","location":"Osaka Japan","acronym":"SCA\/HPCAsia 2026"},"container-title":["Proceedings of the Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region"],"original-title":[],"deposited":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T10:23:47Z","timestamp":1767954227000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3773656.3773670"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,25]]},"references-count":31,"alternative-id":["10.1145\/3773656.3773670","10.1145\/3773656"],"URL":"https:\/\/doi.org\/10.1145\/3773656.3773670","relation":{},"subject":[],"published":{"date-parts":[[2026,1,25]]},"assertion":[{"value":"2026-01-25","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}