{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,23]],"date-time":"2026-01-23T21:27:20Z","timestamp":1769203640937,"version":"3.49.0"},"reference-count":11,"publisher":"SAGE Publications","issue":"3","license":[{"start":{"date-parts":[[2025,1,9]],"date-time":"2025-01-09T00:00:00Z","timestamp":1736380800000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"},{"start":{"date-parts":[[2025,1,9]],"date-time":"2025-01-09T00:00:00Z","timestamp":1736380800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"funder":[{"DOI":"10.13039\/501100001691","name":"Japan Society for the Promotion of Science","doi-asserted-by":"publisher","award":["22KJ2741"],"award-info":[{"award-number":["22KJ2741"]}],"id":[{"id":"10.13039\/501100001691","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001691","name":"Japan Society for the Promotion of Science","doi-asserted-by":"publisher","award":["23K28100"],"award-info":[{"award-number":["23K28100"]}],"id":[{"id":"10.13039\/501100001691","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001691","name":"Japan Society for the Promotion of Science","doi-asserted-by":"publisher","award":["24K23874"],"award-info":[{"award-number":["24K23874"]}],"id":[{"id":"10.13039\/501100001691","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2025,5]]},"abstract":"<jats:p>This study was aimed at simultaneously achieving sufficient accuracy and high performance for general matrix multiplications. 
Recent architectures, such as NVIDIA GPUs, feature high-performance units designed for low-precision matrix multiplications in machine learning models, and next-generation architectures are expected to follow the same design principle. The key to achieving superior performance is to fully leverage such architectures. The Ozaki scheme, a highly accurate matrix multiplication algorithm using error-free transformations, enables higher-precision matrix multiplication to be performed through multiple lower-precision matrix multiplications and higher-precision matrix additions. Ootomo et al. implemented the Ozaki scheme on high-performance matrix multiplication units with the aim of achieving both sufficient accuracy and high performance. This paper proposes alternative approaches to improving performance by reducing the number of lower-precision matrix multiplications and higher-precision matrix additions. Numerical experiments demonstrate the accuracy of the results and benchmark the performance of the proposed approaches. 
These approaches are expected to yield more efficient results in next-generation architectures.<\/jats:p>","DOI":"10.1177\/10943420241313064","type":"journal-article","created":{"date-parts":[[2025,1,9]],"date-time":"2025-01-09T10:07:59Z","timestamp":1736417279000},"page":"462-476","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":6,"title":["Performance enhancement of the Ozaki Scheme on integer matrix multiplication unit"],"prefix":"10.1177","volume":"39","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5906-6624","authenticated-orcid":false,"given":"Yuki","family":"Uchino","sequence":"first","affiliation":[{"name":"RIKEN Center for Computational Science"}]},{"given":"Katsuhisa","family":"Ozaki","sequence":"additional","affiliation":[{"name":"Shibaura Institute of Technology"}]},{"given":"Toshiyuki","family":"Imamura","sequence":"additional","affiliation":[{"name":"RIKEN Center for Computational Science"}]}],"member":"179","published-online":{"date-parts":[[2025,1,9]]},"reference":[{"key":"e_1_3_4_2_1","doi-asserted-by":"publisher","DOI":"10.1137\/18M1226312"},{"key":"e_1_3_4_3_1","volume-title":"IEEE Std 754-2019 (Revision of IEEE 754-2008)","author":"IEEE Computer Society","year":"2019","unstructured":"IEEE Computer Society (2019) IEEE standard for floating-point arithmetic. IEEE Std 754-2019 (Revision of IEEE 754-2008). NJ, USA: IEEE."},{"key":"e_1_3_4_4_1","doi-asserted-by":"publisher","DOI":"10.1137\/120894488"},{"key":"e_1_3_4_5_1","unstructured":"Minamihata A Ozaki K Ogita T et al. (2016) Improved extraction scheme for accurate floating-point summation. In: The 35th JSST Annual Conference International Conference on Simulation Technology Kyoto Japan."},{"key":"e_1_3_4_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-50743-5_12"},{"key":"e_1_3_4_7_1","unstructured":"NVIDIA Corporation (2024) NVIDIA tensor cores. 
URL: https:\/\/www.nvidia.com\/en-us\/data-center\/tensor-cores\/."},{"key":"e_1_3_4_8_1","unstructured":"Ootomo H (2024) ozIMMU - DGEMM on Int8 tensor Core. URL: https:\/\/github.com\/enp1s0\/ozIMMU."},{"key":"e_1_3_4_9_1","doi-asserted-by":"publisher","DOI":"10.1177\/10943420241239588"},{"key":"e_1_3_4_10_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11075-011-9478-1"},{"key":"e_1_3_4_11_1","doi-asserted-by":"publisher","DOI":"10.1587\/nolta.4.2"},{"key":"e_1_3_4_12_1","unstructured":"Uchino Y (2024) Accelerator for ozIMMU. R-CCS github repository. URL: https:\/\/github.com\/RIKEN-RCCS\/accelerator_for_ozIMMU."}],"container-title":["The International Journal of High Performance Computing Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/10943420241313064","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/10943420241313064","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/10943420241313064","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T18:46:21Z","timestamp":1767638781000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/10943420241313064"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1,9]]},"references-count":11,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,5]]}},"alternative-id":["10.1177\/10943420241313064"],"URL":"https:\/\/doi.org\/10.1177\/10943420241313064","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"value":"1094-3420","type":"print"},{"value":"1741-2846","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,1,9]]}}}