{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,2]],"date-time":"2026-07-02T23:48:37Z","timestamp":1783036117669,"version":"3.54.6"},"reference-count":12,"publisher":"SAGE Publications","issue":"4","license":[{"start":{"date-parts":[[2010,11,1]],"date-time":"2010-11-01T00:00:00Z","timestamp":1288569600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2010,11]]},"abstract":"<jats:p>We present an improved matrix\u2014matrix multiplication routine (General Matrix Multiply [GEMM]) in the MAGMA BLAS library that targets the NVIDIA Fermi graphics processing units (GPUs) using Compute Unified Data Architecture (CUDA). We show how to modify the previous MAGMA GEMM kernels in order to make a more efficient use of the Fermi\u2019s new architectural features, most notably their extended memory hierarchy and memory sizes. The improved kernels run at up to 300 GFlop\/s in double precision and up to 645 GFlop\/s in single precision arithmetic (on a C2050), which is correspondingly 58% and 63% of the theoretical peak. We compare the improved kernels with the currently available version in CUBLAS 3.1. 
Further, we show the effect of the new kernels on higher-level dense linear algebra (DLA) routines such as the one-sided matrix factorizations, and compare their performances with corresponding, currently available routines running on homogeneous multicore systems.<\/jats:p>","DOI":"10.1177\/1094342010385729","type":"journal-article","created":{"date-parts":[[2010,11,18]],"date-time":"2010-11-18T05:23:17Z","timestamp":1290057797000},"page":"511-515","source":"Crossref","is-referenced-by-count":140,"title":["An Improved MAGMA GEMM for Fermi Graphics Processing Units"],"prefix":"10.1177","volume":"24","author":[{"given":"Rajib","family":"Nath","sequence":"first","affiliation":[{"name":"University of Tennessee, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Stanimire","family":"Tomov","sequence":"additional","affiliation":[{"name":"University of Tennessee, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jack","family":"Dongarra","sequence":"additional","affiliation":[{"name":"University of Tennessee, USA, Oak Ridge National Laboratory, USA, University of Manchester, UK"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"179","published-online":{"date-parts":[[2010,11,18]]},"reference":[{"key":"atypb1","volume-title":"PLASMA users\u2019 guide","author":"Agullo, E.","year":"2009"},{"key":"atypb2","doi-asserted-by":"publisher","DOI":"10.1137\/1.9780898719604"},{"key":"atypb3","doi-asserted-by":"publisher","DOI":"10.1145\/1058129.1058148"},{"key":"atypb4","doi-asserted-by":"publisher","DOI":"10.1137\/0613043"},{"key":"atypb5","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-01970-8_89"},{"key":"atypb6","unstructured":"Nath, R., Tomov, S., and Dongarra, J. (2010). Accelerating GPU kernels for dense linear algebra. 
In Proceedings of VEC-PAR\u201910, Berkeley, CA , 22-25 June 2010."},{"key":"atypb7","volume-title":"NVIDIA\u2019s Next Generation CUDA Compute Architecture: Fermi","author":"Nvidia","year":"2009"},{"key":"atypb8","volume-title":"NVIDIA CUDA C Programming Guide, version 3.1.1","author":"Nvidia","year":"2010"},{"key":"atypb9","volume-title":"MAGMA version 0.2 Users\u2019 Guide","author":"Tomov, S.","year":"2009"},{"key":"atypb10","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2008.5214359"},{"key":"atypb11","doi-asserted-by":"publisher","DOI":"10.1016\/S0167-8191(00)00087-9"},{"key":"atypb12","volume-title":"Special-purpose hardware and algorithms for accelerating dense linear algebra","author":"Wolfe, M.","year":"2008"}],"container-title":["The International Journal of High Performance Computing Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342010385729","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342010385729","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T08:18:55Z","timestamp":1777450735000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/1094342010385729"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2010,11]]},"references-count":12,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2010,11]]}},"alternative-id":["10.1177\/1094342010385729"],"URL":"https:\/\/doi.org\/10.1177\/1094342010385729","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"value":"1094-3420","type":"print"},{"value":"1741-2846","type":"electronic"}],"subject":[],"published":{"date-parts":[[2010,11]]}}}