{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:17:29Z","timestamp":1750220249221,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":34,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,6,28]],"date-time":"2022-06-28T00:00:00Z","timestamp":1656374400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"HPC-Europa3 Transnational Access programme","award":["HPC17DD65H"],"award-info":[{"award-number":["HPC17DD65H"]}]},{"name":"FAPERGS","award":["19\/2551-0001689-1"],"award-info":[{"award-number":["19\/2551-0001689-1"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,6,28]]},"DOI":"10.1145\/3524059.3532385","type":"proceedings-article","created":{"date-parts":[[2022,6,16]],"date-time":"2022-06-16T16:13:11Z","timestamp":1655395991000},"page":"1-11","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Seamless optimization of the GEMM kernel for task-based programming models"],"prefix":"10.1145","author":[{"given":"Arthur F.","family":"Lorenzon","sequence":"first","affiliation":[{"name":"Federal University of Pampa, Alegrete, RS, Brazil"}]},{"given":"Sandro M. V. N.","family":"Marques","sequence":"additional","affiliation":[{"name":"Federal University of Pampa, Alegrete, RS, Brazil"}]},{"given":"Antoni","family":"Navarro","sequence":"additional","affiliation":[{"name":"Barcelona Supercomputing Center (BSC), Barcelona, Spain"}]},{"given":"Vicen\u00e7","family":"Beltran","sequence":"additional","affiliation":[{"name":"Barcelona Supercomputing Center (BSC), Barcelona, Spain"}]}],"member":"320","published-online":{"date-parts":[[2022,6,28]]},"reference":[{"volume-title":"Intel Math Kernel Library. Reference Manual","unstructured":"2009. Intel Math Kernel Library. Reference Manual . Intel Corporation , Santa Clara, USA. ISBN 630813-054US. 2009. Intel Math Kernel Library. Reference Manual. Intel Corporation, Santa Clara, USA. ISBN 630813-054US.","key":"e_1_3_2_1_1_1"},{"unstructured":"2012. AMD Core Math Library (ACML) User Guide. Advanced Micro Systems (AMD) Santa Ana USA. https:\/\/developer.amd.com\/wordpress\/media\/2012\/10\/acml_userguide.pdf\\.pdf  2012. AMD Core Math Library (ACML) User Guide. Advanced Micro Systems (AMD) Santa Ana USA. https:\/\/developer.amd.com\/wordpress\/media\/2012\/10\/acml_userguide.pdf\\.pdf","key":"e_1_3_2_1_2_1"},{"unstructured":"2021. AMD Optimizing CPU Libraries User Guide. Advanced Micro Systems (AMD) Santa Ana USA. https:\/\/developer.amd.com\/wp-content\/resources\/AOCL_User%20Guide_3.0.pdf\/  2021. AMD Optimizing CPU Libraries User Guide. Advanced Micro Systems (AMD) Santa Ana USA. https:\/\/developer.amd.com\/wp-content\/resources\/AOCL_User%20Guide_3.0.pdf\/","key":"e_1_3_2_1_3_1"},{"volume-title":"Fast batched matrix multiplication for small sizes using half-precision arithmetic on gpus","author":"Abdelfattah Ahmad","unstructured":"Ahmad Abdelfattah , Stanimire Tomov , and Jack Dongarra . 2019. Fast batched matrix multiplication for small sizes using half-precision arithmetic on gpus . In IEEE IPDPS. IEEE , 111--122. Ahmad Abdelfattah, Stanimire Tomov, and Jack Dongarra. 2019. Fast batched matrix multiplication for small sizes using half-precision arithmetic on gpus. In IEEE IPDPS. IEEE, 111--122.","key":"e_1_3_2_1_4_1"},{"unstructured":"Emmanuel Agullo C\u00e9dric Augonnet Jack Dongarra Hatem Ltaief Raymond Namyst Samuel Thibault and Stanimire Tomov. 2010. Faster Cheaper Better - a Hybridization Methodology to Develop Linear Algebra Software for GPUs.  Emmanuel Agullo C\u00e9dric Augonnet Jack Dongarra Hatem Ltaief Raymond Namyst Samuel Thibault and Stanimire Tomov. 2010. Faster Cheaper Better - a Hybridization Methodology to Develop Linear Algebra Software for GPUs.","key":"e_1_3_2_1_5_1"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_6_1","DOI":"10.1109\/TPDS.2017.2766064"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_7_1","DOI":"10.1145\/3437801.3441601"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_8_1","DOI":"10.1145\/2858788.2688513"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_9_1","DOI":"10.1145\/77626.79170"},{"volume-title":"Task-Based Cholesky Decomposition on Knights Corner Using OpenMP","author":"Dorris Joseph","unstructured":"Joseph Dorris , Jakub Kurzak , Piotr Luszczek , Asim YarKhan , and Jack Dongarra . 2016. Task-Based Cholesky Decomposition on Knights Corner Using OpenMP . In High Performance Computing, Michela Taufer, Bernd Mohr, and Julian M. Kunkel (Eds.). Springer International Publishing , Cham , 544--562. Joseph Dorris, Jakub Kurzak, Piotr Luszczek, Asim YarKhan, and Jack Dongarra. 2016. Task-Based Cholesky Decomposition on Knights Corner Using OpenMP. In High Performance Computing, Michela Taufer, Bernd Mohr, and Julian M. Kunkel (Eds.). Springer International Publishing, Cham, 544--562.","key":"e_1_3_2_1_10_1"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_11_1","DOI":"10.1145\/3295500.3356223"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_12_1","DOI":"10.1145\/2425248.2425252"},{"volume-title":"Design and implementation of the linpack benchmark for single and multi-node systems based on intel\u00ae xeon phi coprocessor","author":"Heinecke Alexander","unstructured":"Alexander Heinecke , Karthikeyan Vaidyanathan , Mikhail Smelyanskiy , Alexander Kobotov , Roman Dubtsov , Greg Henry , Aniruddha G Shet , George Chrysos , and Pradeep Dubey . 2013. Design and implementation of the linpack benchmark for single and multi-node systems based on intel\u00ae xeon phi coprocessor . In IEEE IPDPS. IEEE , 126--137. Alexander Heinecke, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Alexander Kobotov, Roman Dubtsov, Greg Henry, Aniruddha G Shet, George Chrysos, and Pradeep Dubey. 2013. Design and implementation of the linpack benchmark for single and multi-node systems based on intel\u00ae xeon phi coprocessor. In IEEE IPDPS. IEEE, 126--137.","key":"e_1_3_2_1_13_1"},{"key":"e_1_3_2_1_14_1","volume-title":"SC20: Int. Conf. for High Performance Computing, Networking, Storage and Analysis. IEEE, 1--14","author":"Jeon Yongkweon","year":"2020","unstructured":"Yongkweon Jeon , Baeseong Park , Se Jung Kwon , Byeongwook Kim , Jeongin Yun , and Dongsoo Lee . 2020 . BiQGEMM: matrix multiplication with lookup table for binary-coding-based quantized DNNs . In SC20: Int. Conf. for High Performance Computing, Networking, Storage and Analysis. IEEE, 1--14 . Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, and Dongsoo Lee. 2020. BiQGEMM: matrix multiplication with lookup table for binary-coding-based quantized DNNs. In SC20: Int. Conf. for High Performance Computing, Networking, Storage and Analysis. IEEE, 1--14."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_15_1","DOI":"10.1109\/ICPADS.2015.68"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_16_1","DOI":"10.1109\/ICPP.2017.51"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_17_1","DOI":"10.1145\/3293320.3293334"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_18_1","DOI":"10.1002\/cpe.1467"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_19_1","DOI":"10.1145\/3176364.3176374"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_20_1","DOI":"10.1109\/HiPC.2019.00053"},{"volume-title":"OpenMP Taskloop Dependences","author":"Maro\u00f1as Marcos","unstructured":"Marcos Maro\u00f1as , Xavier Teruel , and Vicen\u00e7 Beltran . 2021. OpenMP Taskloop Dependences . In OpenMP: Enabling Massive Node-Level Parallelism, Simon McIntosh-Smith, Bronis R. de Supinski, and Jannis Klinkenberg (Eds.). Springer Int. Publishing , Cham, 50--64. Marcos Maro\u00f1as, Xavier Teruel, and Vicen\u00e7 Beltran. 2021. OpenMP Taskloop Dependences. In OpenMP: Enabling Massive Node-Level Parallelism, Simon McIntosh-Smith, Bronis R. de Supinski, and Jannis Klinkenberg (Eds.). Springer Int. Publishing, Cham, 50--64.","key":"e_1_3_2_1_21_1"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_22_1","DOI":"10.1109\/HPCC.2012.28"},{"key":"e_1_3_2_1_23_1","volume-title":"Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication. arXiv preprint arXiv:2106.10499","author":"Moon Gordon E","year":"2021","unstructured":"Gordon E Moon , Hyoukjun Kwon , Geonhwa Jeong , Prasanth Chatarasi , Sivasankaran Rajamanickam , and Tushar Krishna . 2021. Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication. arXiv preprint arXiv:2106.10499 ( 2021 ). Gordon E Moon, Hyoukjun Kwon, Geonhwa Jeong, Prasanth Chatarasi, Sivasankaran Rajamanickam, and Tushar Krishna. 2021. Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication. arXiv preprint arXiv:2106.10499 (2021)."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_24_1","DOI":"10.1007\/978-3-642-33078-0_32"},{"unstructured":"OpenMP Architecture Review Board. 2018. OpenMP Application Programming Interface. https:\/\/www.openmp.org\/wp-content\/uploads\/OpenMP-API-Specification-5.0.pdf Accessed: 2019-03-24.  OpenMP Architecture Review Board. 2018. OpenMP Application Programming Interface. https:\/\/www.openmp.org\/wp-content\/uploads\/OpenMP-API-Specification-5.0.pdf Accessed: 2019-03-24.","key":"e_1_3_2_1_25_1"},{"key":"e_1_3_2_1_26_1","volume-title":"Thi My Tuyen Nguyen, and Jaeyoung Choi","author":"Park Yoosang","year":"2021","unstructured":"Yoosang Park , Raehyun Kim , Thi My Tuyen Nguyen, and Jaeyoung Choi . 2021 . Improving blocked matrix-matrix multiplication routine by utilizing AVX-512 instructions on intel knights landing and xeon scalable processors. Cluster Computing ( 2021), 1--11. Yoosang Park, Raehyun Kim, Thi My Tuyen Nguyen, and Jaeyoung Choi. 2021. Improving blocked matrix-matrix multiplication routine by utilizing AVX-512 instructions on intel knights landing and xeon scalable processors. Cluster Computing (2021), 1--11."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_27_1","DOI":"10.1109\/IPDPS.2017.69"},{"volume-title":"Comparison of threading programming models","author":"Salehian Solmaz","unstructured":"Solmaz Salehian , Jiawen Liu , and Yonghong Yan . 2017. Comparison of threading programming models . In IEEE IPDPSW. IEEE , 766--774. Solmaz Salehian, Jiawen Liu, and Yonghong Yan. 2017. Comparison of threading programming models. In IEEE IPDPSW. IEEE, 766--774.","key":"e_1_3_2_1_28_1"},{"key":"e_1_3_2_1_29_1","volume-title":"Mikhail Smelyanskiy, Jeff R Hammond, and Field G Van Zee.","author":"Smith Tyler M","year":"2014","unstructured":"Tyler M Smith , Robert Van De Geijn , Mikhail Smelyanskiy, Jeff R Hammond, and Field G Van Zee. 2014 . Anatomy of high-performance many-threaded matrix multiplication. In IEEE IPDPS. IEEE , 1049--1059. Tyler M Smith, Robert Van De Geijn, Mikhail Smelyanskiy, Jeff R Hammond, and Field G Van Zee. 2014. Anatomy of high-performance many-threaded matrix multiplication. In IEEE IPDPS. IEEE, 1049--1059."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_30_1","DOI":"10.1145\/2063384.2063431"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_31_1","DOI":"10.1007\/978-3-642-40698-0_1"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_32_1","DOI":"10.1109\/PDP2018.2018.00065"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_33_1","DOI":"10.1145\/2764454"},{"key":"e_1_3_2_1_34_1","volume-title":"OpenBLAS: An optimized BLAS library. URL: http:\/\/xianyi.github.io\/OpenBLAS","author":"Xianyi Zhang","year":"2021","unstructured":"Zhang Xianyi , Wang Qian , and Werner Saar . 2021. OpenBLAS: An optimized BLAS library. URL: http:\/\/xianyi.github.io\/OpenBLAS ( 2021 ). Zhang Xianyi, Wang Qian, and Werner Saar. 2021. OpenBLAS: An optimized BLAS library. URL: http:\/\/xianyi.github.io\/OpenBLAS (2021)."}],"event":{"sponsor":["SIGARCH ACM Special Interest Group on Computer Architecture"],"acronym":"ICS '22","name":"ICS '22: 2022 International Conference on Supercomputing","location":"Virtual Event"},"container-title":["Proceedings of the 36th ACM International Conference on Supercomputing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3524059.3532385","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3524059.3532385","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:30:38Z","timestamp":1750188638000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3524059.3532385"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,6,28]]},"references-count":34,"alternative-id":["10.1145\/3524059.3532385","10.1145\/3524059"],"URL":"https:\/\/doi.org\/10.1145\/3524059.3532385","relation":{},"subject":[],"published":{"date-parts":[[2022,6,28]]},"assertion":[{"value":"2022-06-28","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}