{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T06:27:10Z","timestamp":1770704830082,"version":"3.49.0"},"reference-count":15,"publisher":"SAGE Publications","issue":"2","license":[{"start":{"date-parts":[[2019,10,9]],"date-time":"2019-10-09T00:00:00Z","timestamp":1570579200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["2017YFB0202303"],"award-info":[{"award-number":["2017YFB0202303"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2020,3]]},"abstract":"<jats:p> Matrix\u2013matrix products of small matrices continue to play an important part in a range of scientific applications. The heterogeneous architecture, which is predicted to be a trend in the exascale supercomputing era, gives rise to challenges in porting and optimizing small matrix products. We present a method to accelerate and tune small matrix multiplications on the Sunway TaihuLight supercomputer, which has been ranked the most powerful supercomputer four times in the TOP500 list. Sunway TaihuLight is equipped with Shen-Wei hybrid manycore processors. We use Nek5000 as a case study to demonstrate our methods. Nek5000 is an open-source computational fluid dynamics (CFD) solver based on the spectral element method (SEM) for incompressible flow. The high-order SEM, whose computational kernel consists of small dense matrix products, is regarded as having the potential to overcome constraints of standard CFD software. 
Through vectorization, we gained a performance improvement of about 30% on the management processing element. We accelerated Nek5000 using computing processing elements (CPEs). The experimental results suggest that employing 32 CPEs delivers the best performance enhancement. We scaled Nek5000 to 16,384 core groups with 540,672 cores, achieving a performance improvement of about 30%. <\/jats:p>","DOI":"10.1177\/1094342019882246","type":"journal-article","created":{"date-parts":[[2019,10,10]],"date-time":"2019-10-10T04:58:49Z","timestamp":1570683529000},"page":"178-186","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":4,"title":["Accelerating and tuning small matrix multiplications on Sunway TaihuLight: A case study of spectral element CFD Code Nek5000"],"prefix":"10.1177","volume":"34","author":[{"given":"Xianmeng","family":"Wang","sequence":"first","affiliation":[{"name":"School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China"}]},{"given":"Zhifeng","family":"Zhou","sequence":"additional","affiliation":[{"name":"School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China"}]},{"given":"Changjun","family":"Hu","sequence":"additional","affiliation":[{"name":"School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China"}]},{"given":"Wen","family":"Yang","sequence":"additional","affiliation":[{"name":"China Institute of Atomic Energy, Beijing, China"}]},{"given":"Minfu","family":"Zhao","sequence":"additional","affiliation":[{"name":"China Institute of Atomic Energy, Beijing, China"}]},{"given":"Zhaoshun","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, 
China"}]},{"given":"Peng","family":"Shi","sequence":"additional","affiliation":[{"name":"National Center for Materials Service Safety, University of Science and Technology Beijing, Beijing, China"}]}],"member":"179","published-online":{"date-parts":[[2019,10,9]]},"reference":[{"key":"bibr1-1094342019882246","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2017.05.250"},{"key":"bibr2-1094342019882246","doi-asserted-by":"publisher","DOI":"10.1016\/j.jocs.2018.01.005"},{"issue":"1","key":"bibr3-1094342019882246","first-page":"11","volume":"15","author":"Ao Y","year":"2018","journal-title":"ACM Transactions on Architecture and Code Optimization (TACO)"},{"key":"bibr4-1094342019882246","first-page":"69","volume-title":"International conference on exascale applications and software","author":"Cebamanos L","year":"2014"},{"key":"bibr5-1094342019882246","unstructured":"Dongarra J (2016) Report on the Sunway TaihuLight system. UT EECS Technical Reports. University of Tennessee Computer Science Technical Report, UT-EECS-16-742, USA, pp. 1\u201324."},{"key":"bibr6-1094342019882246","unstructured":"Fischer P, Obabko A, Kerkemeier S, et al. (2010) Nek tutorial. URL Available at: https:\/\/www.mcs.anl.gov\/~fischer\/Nek5000\/nek_tutorial_1.pdf (accessed 10 July 2017)."},{"key":"bibr7-1094342019882246","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-016-1744-5"},{"key":"bibr8-1094342019882246","doi-asserted-by":"crossref","unstructured":"Gong J, Markidis S, Schliephake M, et al. (2014) Nek5000 with openacc. In: International conference on exascale applications and software, lecture notes in computer science, vol. 8759 (eds Markidis S, Laure E), Stockholm, Sweden, 2\u20133 April 2014. 
Cham: Springer.","DOI":"10.1007\/978-3-319-15976-8_4"},{"key":"bibr9-1094342019882246","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2017.2783929"},{"key":"bibr10-1094342019882246","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-016-5588-7"},{"key":"bibr11-1094342019882246","unstructured":"Hess B, Gong J, P\u00e1ll S, et al. (2016) Highly tuned small matrix multiplications applied to spectral element code nek5000. In: The third international workshop on sustainable ultrascale computing systems (NESUS 2016) (eds Carretero J, Blas JG, Margenov S), Sofia, Bulgaria, 6\u20137 October 2016."},{"key":"bibr12-1094342019882246","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2017.51"},{"key":"bibr13-1094342019882246","doi-asserted-by":"publisher","DOI":"10.1177\/1094342015576846"},{"key":"bibr14-1094342019882246","doi-asserted-by":"publisher","DOI":"10.1145\/1810085.1810120"},{"key":"bibr15-1094342019882246","doi-asserted-by":"publisher","DOI":"10.1029\/2018MS001276"}],"container-title":["The International Journal of High Performance Computing 
Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342019882246","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/1094342019882246","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342019882246","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,3]],"date-time":"2025-03-03T09:38:16Z","timestamp":1740994696000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/1094342019882246"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,10,9]]},"references-count":15,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2020,3]]}},"alternative-id":["10.1177\/1094342019882246"],"URL":"https:\/\/doi.org\/10.1177\/1094342019882246","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"value":"1094-3420","type":"print"},{"value":"1741-2846","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,10,9]]}}}