{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,26]],"date-time":"2026-06-26T20:43:38Z","timestamp":1782506618596,"version":"3.54.5"},"reference-count":53,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2020,3,4]],"date-time":"2020-03-04T00:00:00Z","timestamp":1583280000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Beijing Municipal Natural Science Foundation","award":["JQ18001"],"award-info":[{"award-number":["JQ18001"]}]},{"name":"National Key Research and Development Program of China","award":["2016YFB0200603"],"award-info":[{"award-number":["2016YFB0200603"]}]},{"name":"Beijing Academy of Artificial Intelligence"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2020,3,31]]},"abstract":"<jats:p>We present a systematic methodology for optimizing batched matrix multiplications on SW26010 many-core processor of the Sunway TaihuLight supercomputer. Five surrogate algorithms and a machine learning\u2013based algorithm selector are proposed to fully exploit the computing capability of SW26010 and cope with the sophisticated algorithm characteristics of batched matrix multiplications. Experiment results show that the algorithm selector is able to adaptively choose the appropriate algorithm for various matrix shapes and batch sizes with low overhead and high accuracy. In particular, the optimized batched matrix multiplications can substantially outperform the non-batched version and reach around 84.8% of the performance upper bound.<\/jats:p>","DOI":"10.1145\/3378176","type":"journal-article","created":{"date-parts":[[2020,3,4]],"date-time":"2020-03-04T12:50:12Z","timestamp":1583326212000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":16,"title":["Enabling Highly Efficient Batched Matrix Multiplications on SW26010 Many-core Processor"],"prefix":"10.1145","volume":"17","author":[{"given":"Lijuan","family":"Jiang","sequence":"first","affiliation":[{"name":"Chinese Academy of Sciences, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Chao","family":"Yang","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Wenjing","family":"Ma","sequence":"additional","affiliation":[{"name":"Chinese Academy of Sciences, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2020,3,4]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Intel Corporation. 2019. https:\/\/software.intel.com\/en-us\/intel-mkl. Intel Corporation. 2019. https:\/\/software.intel.com\/en-us\/intel-mkl."},{"key":"e_1_2_1_2_1","unstructured":"NVIDIA Corporation. 2019. https:\/\/docs.nvidia.com\/cuda\/cublas\/. NVIDIA Corporation. 2019. https:\/\/docs.nvidia.com\/cuda\/cublas\/."},{"key":"e_1_2_1_3_1","unstructured":"Eigen project. 2019. http:\/\/eigen.tuxfamily.org\/index.php?title=Main_Page. Eigen project. 2019. http:\/\/eigen.tuxfamily.org\/index.php?title=Main_Page."},{"key":"e_1_2_1_4_1","unstructured":"LAPACK project. 2019. http:\/\/www.netlib.org\/lapack\/. LAPACK project. 2019. http:\/\/www.netlib.org\/lapack\/."},{"key":"e_1_2_1_5_1","unstructured":"MAGMA project. 2019. http:\/\/icl.cs.utk.edu\/magma\/. MAGMA project. 2019. http:\/\/icl.cs.utk.edu\/magma\/."},{"key":"e_1_2_1_6_1","volume-title":"Matthieu Devin et al","author":"Abadi Mart\u0131n","year":"2015","unstructured":"Mart\u0131n Abadi , Ashish Agarwal , Paul Barham , Eugene Brevdo , Zhifeng Chen , Craig Citro , Greg S. Corrado , Andy Davis , Jeffrey Dean , Matthieu Devin et al . 2015 . TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow. org 1, 2 (2015). http:\/\/dx.doi.org\/10.1177\/1094342010385729 10.1177\/1094342010385729 Mart\u0131n Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin et al. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow. org 1, 2 (2015). http:\/\/dx.doi.org\/10.1177\/1094342010385729"},{"key":"e_1_2_1_7_1","volume-title":"Ian Masliah et al","author":"Abdelfattah Ahmad","year":"2016","unstructured":"Ahmad Abdelfattah , Marc Baboulin , Veselin Dobrev , Jack Dongarra , Christopher Earl , Joel Falcou , Azzam Haidar , Ian Karlin , Tz Kolev , Ian Masliah et al . 2016 . High-performance tensor contractions for GPUs. Proc. Comput. Sci. 80 (2016). Elsevier , 108--118. Ahmad Abdelfattah, Marc Baboulin, Veselin Dobrev, Jack Dongarra, Christopher Earl, Joel Falcou, Azzam Haidar, Ian Karlin, Tz Kolev, Ian Masliah et al. 2016. High-performance tensor contractions for GPUs. Proc. Comput. Sci. 80 (2016). Elsevier, 108--118."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-41321-1_2"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079079.3079103"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/HiPC.2015.9"},{"key":"e_1_2_1_11_1","volume-title":"Alexander Belopolsky et al","author":"Al-Rfou Rami","year":"2016","unstructured":"Rami Al-Rfou , Guillaume Alain , Amjad Almahairi , Christof Angermueller , Dzmitry Bahdanau , Nicolas Ballas , Fr\u00e9d\u00e9ric Bastien , Justin Bayer , Anatoly Belikov , Alexander Belopolsky et al . 2016 . Theano : A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs\/1605.02688 (2016). Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Fr\u00e9d\u00e9ric Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky et al. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs\/1605.02688 (2016)."},{"key":"e_1_2_1_12_1","doi-asserted-by":"crossref","first-page":"175","DOI":"10.1080\/00031305.1992.10475879","article-title":"An introduction to kernel and nearest-neighbor nonparametric regression","volume":"46","author":"Altman Naomi S.","year":"1992","unstructured":"Naomi S. Altman . 1992 . An introduction to kernel and nearest-neighbor nonparametric regression . Amer. Statist. 46 , 3 (1992), 175 -- 185 . Naomi S. Altman. 1992. An introduction to kernel and nearest-neighbor nonparametric regression. Amer. Statist. 46, 3 (1992), 175--185.","journal-title":"Amer. Statist."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1080\/00268970500275780"},{"key":"e_1_2_1_14_1","volume-title":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"50","author":"Austin","unstructured":"Austin R. Benson and Grey Ballard. 2015. A framework for practical parallel fast matrix multiplication . In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming , Vol. 50 . ACM, 42--53. Austin R. Benson and Grey Ballard. 2015. A framework for practical parallel fast matrix multiplication. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Vol. 50. ACM, 42--53."},{"key":"e_1_2_1_15_1","volume-title":"Classification and regression trees","author":"Breiman Leo","year":"1984","unstructured":"Leo Breiman , Jerome Friedman , Richard Olshen , and Charles Stone . 1984. Classification and regression trees . Belmont, CA : Wadsworth International Group ( 1984 ). https:\/\/doi.org\/10.1201\/9781315139470 10.1201\/9781315139470 Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone. 1984. Classification and regression trees. Belmont, CA: Wadsworth International Group (1984). https:\/\/doi.org\/10.1201\/9781315139470"},{"key":"e_1_2_1_16_1","unstructured":"Cris Cecka. 2017. Pro Tip: cuBLAS Strided Batched Matrix Multiply. Retrieved from https:\/\/devblogs.nvidia.com\/cublas-strided-batched-matrix-multiply\/. Cris Cecka. 2017. Pro Tip: cuBLAS Strided Batched Matrix Multiply. Retrieved from https:\/\/devblogs.nvidia.com\/cublas-strided-batched-matrix-multiply\/."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1093\/nsr\/nww044"},{"key":"e_1_2_1_18_1","volume-title":"Stanimire Tomov et al","author":"Dongarra Jack","year":"2016","unstructured":"Jack Dongarra , Iain Duff , Mark Gates , Azzam Haidar , Sven Hammarling , Nicholas J. Higham , Jonathon Hogg , Pedro Valero-Lara , Samuel D. Relton , Stanimire Tomov et al . 2016 . A proposed API for batched basic linear algebra subprograms. Manchester Institute for Mathematical Sciences, University of Manchester ( 2016). http:\/\/eprints.ma.man.ac.uk\/2464\/. Jack Dongarra, Iain Duff, Mark Gates, Azzam Haidar, Sven Hammarling, Nicholas J. Higham, Jonathon Hogg, Pedro Valero-Lara, Samuel D. Relton, Stanimire Tomov et al. 2016. A proposed API for batched basic linear algebra subprograms. Manchester Institute for Mathematical Sciences, University of Manchester (2016). http:\/\/eprints.ma.man.ac.uk\/2464\/."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2017.05.138"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/77626.79170"},{"key":"e_1_2_1_21_1","first-page":"1","article-title":"18.9-Pflops nonlinear earthquake simulation on Sunway TaihuLight: Enabling depiction of 18-Hz and 8-meter scenarios. In Proceedings of the International Conference for High Performance Computing","volume":"2","author":"Fu Haohuan","year":"2017","unstructured":"Haohuan Fu , Conghui He , Bingwei Chen , Zekun Yin , Zhenguo Zhang , Wenqiang Zhang , Tingjian Zhang , Wei Xue , Weiguo Liu , Wanwang Yin et al. 2017 . 18.9-Pflops nonlinear earthquake simulation on Sunway TaihuLight: Enabling depiction of 18-Hz and 8-meter scenarios. In Proceedings of the International Conference for High Performance Computing , Networking, Storage and Analysis. ACM , 2 : 1 -- 2 :12. Haohuan Fu, Conghui He, Bingwei Chen, Zekun Yin, Zhenguo Zhang, Wenqiang Zhang, Tingjian Zhang, Wei Xue, Weiguo Liu, Wanwang Yin et al. 2017. 18.9-Pflops nonlinear earthquake simulation on Sunway TaihuLight: Enabling depiction of 18-Hz and 8-meter scenarios. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2:1--2:12.","journal-title":"Networking, Storage and Analysis. ACM"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-016-5588-7"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.4304\/jcp.9.7.1566-1571"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1356052.1356053"},{"key":"e_1_2_1_25_1","volume-title":"van de Geijn","author":"Gunnels John A.","year":"2001","unstructured":"John A. Gunnels , Greg M. Henry , and Robert A . van de Geijn . 2001 . A family of high-performance matrix multiplication algorithms. In Proceedings of the International Conference on Computational Science. Springer , 51--60. John A. Gunnels, Greg M. Henry, and Robert A. van de Geijn. 2001. A family of high-performance matrix multiplication algorithms. In Proceedings of the International Conference on Computational Science. Springer, 51--60."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2016.83"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2013.113"},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the IEEE International Parallel and Distributed Processing Symposium. IEEE, 656--667","author":"Huang Jianyu","unstructured":"Jianyu Huang , Leslie Rice , Devin A. Matthews , and Robert A . van de Geijn. 2017. Generating families of practical fast matrix multiplication algorithms . In Proceedings of the IEEE International Parallel and Distributed Processing Symposium. IEEE, 656--667 . Jianyu Huang, Leslie Rice, Devin A. Matthews, and Robert A. van de Geijn. 2017. Generating families of practical fast matrix multiplication algorithms. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium. IEEE, 656--667."},{"key":"e_1_2_1_29_1","volume-title":"van de Geijn","author":"Huang Jianyu","year":"2016","unstructured":"Jianyu Huang , Tyler M. Smith , Greg M. Henry , and Robert A . van de Geijn . 2016 . Strassen\u2019s algorithm reloaded. In Proceedings of the International Conference for High Performance Computing, Networking, Storage. and Analysis. IEEE Press , 59. Jianyu Huang, Tyler M. Smith, Greg M. Henry, and Robert A. van de Geijn. 2016. Strassen\u2019s algorithm reloaded. In Proceedings of the International Conference for High Performance Computing, Networking, Storage. and Analysis. IEEE Press, 59."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2014.09.003"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2017.51"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ymben.2014.05.014"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.3850\/9783981537079_0647"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3293883.3295734"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2014.2313342"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2017.52"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-43659-3_48"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2018.10.003"},{"key":"e_1_2_1_39_1","volume-title":"Suzanne Parete-Koon, and Merek A. Chertkow.","author":"Messer O. E.","year":"2012","unstructured":"O. E. Messer , J. Austin Harris , Suzanne Parete-Koon, and Merek A. Chertkow. 2012 . Multicore and accelerator development for a leadership-class stellar astrophysics code. In Proceedings of the 11th International Conference on Applied Parallel and Scientific Computing. Springer-Verlag , 92--106. O. E. Messer, J. Austin Harris, Suzanne Parete-Koon, and Merek A. Chertkow. 2012. Multicore and accelerator development for a leadership-class stellar astrophysics code. In Proceedings of the 11th International Conference on Applied Parallel and Scientific Computing. Springer-Verlag, 92--106."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/1964218.1964227"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342010385729"},{"key":"e_1_2_1_42_1","unstructured":"NVIDIA. 2017. NVIDIA Tesla V100 GPU architecture. https:\/\/images.nvidia.com\/content\/volta-architecture\/pdf\/volta-architecture-whitepaper.pdf. NVIDIA. 2017. NVIDIA Tesla V100 GPU architecture. https:\/\/images.nvidia.com\/content\/volta-architecture\/pdf\/volta-architecture-whitepaper.pdf."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1137\/140993478"},{"key":"e_1_2_1_44_1","first-page":"18","article-title":"Spectral\/hp element methods for computational fluid dynamics. Oxford Sci","volume":"17","author":"Sherwin S. J.","year":"2005","unstructured":"S. J. Sherwin and G. E. Karniadakis . 2005 . Spectral\/hp element methods for computational fluid dynamics. Oxford Sci . Public. 17 (2005), 18 . S. J. Sherwin and G. E. Karniadakis. 2005. Spectral\/hp element methods for computational fluid dynamics. Oxford Sci. Public. 17 (2005), 18.","journal-title":"Public."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/HiPC.2016.031"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF02165411"},{"key":"e_1_2_1_47_1","first-page":"1","article-title":"Fast implementation of DGEMM on Fermi GPU. In Proceedings of the International Conference for High Performance Computing","volume":"35","author":"Tan Guangming","year":"2011","unstructured":"Guangming Tan , Linchuan Li , Sean Triechle , Everett Phillips , Yungang Bao , and Ninghui Sun . 2011 . Fast implementation of DGEMM on Fermi GPU. In Proceedings of the International Conference for High Performance Computing , Networking, Storage, and Analysis. ACM , 35 : 1 -- 35 :11. Guangming Tan, Linchuan Li, Sean Triechle, Everett Phillips, Yungang Bao, and Ninghui Sun. 2011. Fast implementation of DGEMM on Fermi GPU. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis. ACM, 35:1--35:11.","journal-title":"Networking, Storage, and Analysis. ACM"},{"key":"e_1_2_1_48_1","volume-title":"Proceedings of the ACM\/IEEE Conference on Supercomputing. IEEE Computer Society, 1--27","author":"Clint Whaley R.","unstructured":"R. Clint Whaley and Jack J. Dongarra . 1998. Automatically tuned linear algebra software . In Proceedings of the ACM\/IEEE Conference on Supercomputing. IEEE Computer Society, 1--27 . R. Clint Whaley and Jack J. Dongarra. 1998. Automatically tuned linear algebra software. In Proceedings of the ACM\/IEEE Conference on Supercomputing. IEEE Computer Society, 1--27."},{"key":"e_1_2_1_49_1","volume-title":"Artificial Neural Networks: Approximation and Learning Theory","author":"White Halbert","unstructured":"Halbert White . 1992. Artificial Neural Networks: Approximation and Learning Theory . Blackwell Publishers, Inc. Halbert White. 1992. Artificial Neural Networks: Approximation and Learning Theory. Blackwell Publishers, Inc."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/1498765.1498785"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.5555\/3014904.3014912"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.5555\/3014904.3014910"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11390-015-1510-9"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3378176","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3378176","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:24:00Z","timestamp":1750202640000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3378176"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,3,4]]},"references-count":53,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2020,3,31]]}},"alternative-id":["10.1145\/3378176"],"URL":"https:\/\/doi.org\/10.1145\/3378176","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,3,4]]},"assertion":[{"value":"2019-08-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-01-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-03-04","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}