{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,24]],"date-time":"2026-02-24T16:51:25Z","timestamp":1771951885159,"version":"3.50.1"},"reference-count":77,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2021,1,7]],"date-time":"2021-01-07T00:00:00Z","timestamp":1609977600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"UNIBZ RTD call 2018","award":["IN2087"],"award-info":[{"award-number":["IN2087"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2021,3,31]]},"abstract":"<jats:p>Efficient HPC libraries often expose multiple tunable parameters, algorithmic implementations, or a combination of them, to provide optimized routines. The optimal parameters and algorithmic choices may depend on input properties such as the shapes of the matrices involved in the operation. Traditionally, these parameters are manually tuned or set by auto-tuners. In emerging applications such as deep learning, this approach is not effective across the wide range of inputs and architectures used in practice. In this work, we analyze different machine learning techniques and predictive models to accelerate the convolution operator and GEMM. Moreover, we address the problem of dataset generation, and we study the performance, accuracy, and generalization ability of the models. Our insights allow us to improve the performance of computationally expensive deep learning primitives on high-end GPUs as well as low-power embedded GPU architectures on three different libraries. 
Experimental results show significant improvement in the target applications from 50% up to 300% compared to auto-tuned and highly optimized vendor-based heuristics by using simple decision tree- and MLP-based models.<\/jats:p>","DOI":"10.1145\/3434402","type":"journal-article","created":{"date-parts":[[2021,1,8]],"date-time":"2021-01-08T05:12:19Z","timestamp":1610082739000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":16,"title":["On the Anatomy of Predictive Models for Accelerating GPU Convolution Kernels and Beyond"],"prefix":"10.1145","volume":"18","author":[{"given":"Paolo Sylos","family":"Labini","sequence":"first","affiliation":[{"name":"Free University of Bozen-Bolzano, Bozen-Bolzano, Italy"}]},{"given":"Marco","family":"Cianfriglia","sequence":"additional","affiliation":[{"name":"National Research Council of Italy, Italy"}]},{"given":"Damiano","family":"Perri","sequence":"additional","affiliation":[{"name":"University of Perugia, Italy"}]},{"given":"Osvaldo","family":"Gervasi","sequence":"additional","affiliation":[{"name":"University of Perugia, Italy"}]},{"given":"Grigori","family":"Fursin","sequence":"additional","affiliation":[{"name":"cTuning Foundation, France"}]},{"given":"Anton","family":"Lokhmotov","sequence":"additional","affiliation":[{"name":"Dividiti, United Kingdom"}]},{"given":"Cedric","family":"Nugteren","sequence":"additional","affiliation":[{"name":"TomTom, Netherlands"}]},{"given":"Bruno","family":"Carpentieri","sequence":"additional","affiliation":[{"name":"Free University of Bozen-Bolzano, Bozen-Bolzano, Italy"}]},{"given":"Fabiana","family":"Zollo","sequence":"additional","affiliation":[{"name":"Ca\u2019 Foscari University of Venice, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5676-9228","authenticated-orcid":false,"given":"Flavio","family":"Vella","sequence":"additional","affiliation":[{"name":"Free University of Bozen-Bolzano, Bozen-Bolzano, 
Italy"}]}],"member":"320","published-online":{"date-parts":[[2021,1,7]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/2628071.2628092"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/2628071.2628092"},{"key":"e_1_2_1_3_1","unstructured":"ARM. 2018. A Software Library for Computer Vision and Machine Learning. Retrieved from https:\/\/www.arm.com\/why-arm\/technologies\/compute-library.  ARM. 2018. A Software Library for Computer Vision and Machine Learning. Retrieved from https:\/\/www.arm.com\/why-arm\/technologies\/compute-library."},{"key":"e_1_2_1_4_1","volume-title":"Article 96 (Sep.","author":"Ashouri Amir H.","year":"2018","unstructured":"Amir H. Ashouri , William Killian , John Cavazos , Gianluca Palermo , and Cristina Silvano . 2018. A survey on compiler autotuning using machine learning. ACM Comput. Surv. 51, 5 , Article 96 (Sep. 2018 ), 42 pages. DOI:https:\/\/doi.org\/10.1145\/3197978 10.1145\/3197978 Amir H. Ashouri, William Killian, John Cavazos, Gianluca Palermo, and Cristina Silvano. 2018. A survey on compiler autotuning using machine learning. ACM Comput. Surv. 51, 5, Article 96 (Sep. 2018), 42 pages. DOI:https:\/\/doi.org\/10.1145\/3197978"},{"key":"e_1_2_1_5_1","volume-title":"Proceedings of the Conference on Innovative Parallel Computing (InPar\u201912)","author":"Bergstra J.","year":"2012","unstructured":"J. Bergstra , N. Pinto , and D. Cox . 2012. Machine learning for predictive auto-tuning with boosted regression trees . In Proceedings of the Conference on Innovative Parallel Computing (InPar\u201912) . 1--9. DOI:https:\/\/doi.org\/10.1109\/InPar. 2012 .6339587 10.1109\/InPar.2012.6339587 J. Bergstra, N. Pinto, and D. Cox. 2012. Machine learning for predictive auto-tuning with boosted regression trees. In Proceedings of the Conference on Innovative Parallel Computing (InPar\u201912). 1--9. 
DOI:https:\/\/doi.org\/10.1109\/InPar.2012.6339587"},{"key":"e_1_2_1_6_1","first-page":"1","article-title":"Multilevel parallelism for the exploration of large-scale graphs","volume":"4","author":"Bernaschi M.","year":"2018","unstructured":"M. Bernaschi , M. Bisson , E. Mastrostefano , and F. Vella . 2018 . Multilevel parallelism for the exploration of large-scale graphs . IEEE Trans. Multi-Scale Comput. Syst. 4 , 3 (2018), 1 -- 1 . DOI:https:\/\/doi.org\/10.1109\/TMSCS.2018.2797195 10.1109\/TMSCS.2018.2797195 M. Bernaschi, M. Bisson, E. Mastrostefano, and F. Vella. 2018. Multilevel parallelism for the exploration of large-scale graphs. IEEE Trans. Multi-Scale Comput. Syst. 4, 3 (2018), 1--1. DOI:https:\/\/doi.org\/10.1109\/TMSCS.2018.2797195","journal-title":"IEEE Trans. Multi-Scale Comput. Syst."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/2903150.2903153"},{"key":"e_1_2_1_8_1","unstructured":"Somashekaracharya G. Bhaskaracharya Julien Demouth and Vinod Grover. 2020. Automatic kernel generation for volta tensor cores. Retrieved from https:\/\/arXiv:2006.12645.  Somashekaracharya G. Bhaskaracharya Julien Demouth and Vinod Grover. 2020. Automatic kernel generation for volta tensor cores. Retrieved from https:\/\/arXiv:2006.12645."},{"key":"e_1_2_1_9_1","volume-title":"Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918)","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen , Thierry Moreau , Ziheng Jiang , Lianmin Zheng , Eddie Q. Yan , Haichen Shen , Meghan Cowan , Leyuan Wang , Yuwei Hu , Luis Ceze , Carlos Guestrin , and Arvind Krishnamurthy . 2018 . TVM: An automated end-to-end optimizing compiler for deep learning . In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918) . Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Q. 
Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918)."},{"key":"e_1_2_1_10_1","volume-title":"Advances in Neural Information Processing Systems","author":"Chen Tianqi","unstructured":"Tianqi Chen , Lianmin Zheng , Eddie Yan , Ziheng Jiang , Thierry Moreau , Luis Ceze , Carlos Guestrin , and Arvind Krishnamurthy . 2018. Learning to optimize tensor programs . In Advances in Neural Information Processing Systems . MIT Press , 3389--3400. Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. Learning to optimize tensor programs. In Advances in Neural Information Processing Systems. MIT Press, 3389--3400."},{"key":"e_1_2_1_11_1","volume-title":"Proceedings of the 4th Symposium on the Frontiers of Massively Parallel Computation. IEEE, 120--127","author":"Choi Jaeyoung","unstructured":"Jaeyoung Choi , Jack J. Dongarra , Roldan Pozo , and David W. Walker . 1992. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers . In Proceedings of the 4th Symposium on the Frontiers of Massively Parallel Computation. IEEE, 120--127 . Jaeyoung Choi, Jack J. Dongarra, Roldan Pozo, and David W. Walker. 1992. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. In Proceedings of the 4th Symposium on the Frontiers of Massively Parallel Computation. IEEE, 120--127."},{"key":"e_1_2_1_12_1","volume-title":"Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP\u201910)","author":"Choi Jee W.","unstructured":"Jee W. Choi , Amik Singh , and Richard W. Vuduc . 2010. Model-driven autotuning of sparse matrix-vector multiply on GPUs . 
In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP\u201910) . ACM, New York, NY, 115--126. DOI:https:\/\/doi.org\/10.1145\/1693453.1693471 10.1145\/1693453.1693471 Jee W. Choi, Amik Singh, and Richard W. Vuduc. 2010. Model-driven autotuning of sparse matrix-vector multiply on GPUs. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP\u201910). ACM, New York, NY, 115--126. DOI:https:\/\/doi.org\/10.1145\/1693453.1693471"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF00994018"},{"key":"e_1_2_1_14_1","volume-title":"Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS\u201917)","author":"Cosenza B.","year":"2017","unstructured":"B. Cosenza , J. J. Durillo , S. Ermon , and B. Juurlink . 2017. Autotuning stencil computations with structural ordinal regression learning . In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS\u201917) . 287--296. DOI:https:\/\/doi.org\/10.1109\/IPDPS. 2017 .102 10.1109\/IPDPS.2017.102 B. Cosenza, J. J. Durillo, S. Ermon, and B. Juurlink. 2017. Autotuning stencil computations with structural ordinal regression learning. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS\u201917). 287--296. DOI:https:\/\/doi.org\/10.1109\/IPDPS.2017.102"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2019.8661187"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2010.5470479"},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS\u201917)","author":"Girolamo S. Di","year":"2017","unstructured":"S. Di Girolamo , F. Vella , and T. Hoefler . 2017. Transparent caching for RMA systems . 
In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS\u201917) . 1018--1027. DOI:https:\/\/doi.org\/10.1109\/IPDPS. 2017 .92 10.1109\/IPDPS.2017.92 S. Di Girolamo, F. Vella, and T. Hoefler. 2017. Transparent caching for RMA systems. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS\u201917). 1018--1027. DOI:https:\/\/doi.org\/10.1109\/IPDPS.2017.92"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2813885.2737969"},{"key":"e_1_2_1_19_1","volume-title":"GPU-based parallelism for ASP-solving","author":"Dovier Agostino","unstructured":"Agostino Dovier , Andrea Formisano , and Flavio Vella . 2020. GPU-based parallelism for ASP-solving . In Declarative Programming and Knowledge Management, Petra Hofstedt, Salvador Abreu, Ulrich John, Herbert Kuchen, and Dietmar Seipel (Eds.). Springer International Publishing , Cham , 3--23. Agostino Dovier, Andrea Formisano, and Flavio Vella. 2020. GPU-based parallelism for ASP-solving. In Declarative Programming and Knowledge Management, Petra Hofstedt, Salvador Abreu, Ulrich John, Herbert Kuchen, and Dietmar Seipel (Eds.). Springer International Publishing, Cham, 3--23."},{"key":"e_1_2_1_20_1","volume-title":"Proceedings of the International Parallel and Distributed Processing Symposium Workshop (IPDPSW\u201915)","author":"Thomas","unstructured":"Thomas L. Falch and Anne C. Elster. 2015. Machine learning based auto-tuning for enhanced opencl performance portability . In Proceedings of the International Parallel and Distributed Processing Symposium Workshop (IPDPSW\u201915) . IEEE, 1231--1240. Thomas L. Falch and Anne C. Elster. 2015. Machine learning based auto-tuning for enhanced opencl performance portability. In Proceedings of the International Parallel and Distributed Processing Symposium Workshop (IPDPSW\u201915). 
IEEE, 1231--1240."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.4029"},{"key":"e_1_2_1_22_1","volume-title":"Proceedings of the Design, Automation Test in Europe Conference Exhibition (DATE\u201916)","author":"Fursin G.","unstructured":"G. Fursin , A. Lokhmotov , and E. Plowman . 2016. Collective knowledge: Towards R&D sustainability . In Proceedings of the Design, Automation Test in Europe Conference Exhibition (DATE\u201916) . 864--869. G. Fursin, A. Lokhmotov, and E. Plowman. 2016. Collective knowledge: Towards R&D sustainability. In Proceedings of the Design, Automation Test in Europe Conference Exhibition (DATE\u201916). 864--869."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/PDP.2014.40"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1356052.1356053"},{"key":"e_1_2_1_25_1","unstructured":"Ananth Grama. 2003. Introduction to Parallel Computing. Pearson Education.  Ananth Grama. 2003. Introduction to Parallel Computing. Pearson Education."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536360.2536370"},{"key":"e_1_2_1_27_1","volume-title":"Proceedings of the 3rd International Conference on Document Analysis and Recognition","volume":"1","author":"Ho Tin Kam","year":"1995","unstructured":"Tin Kam Ho . 1995 . Random decision forests . In Proceedings of the 3rd International Conference on Document Analysis and Recognition , Vol. 1 . IEEE, 278--282. Tin Kam Ho. 1995. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Vol. 1. IEEE, 278--282."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW.2017.155"},{"key":"e_1_2_1_29_1","unstructured":"Forrest N. Iandola Song Han Matthew W. Moskewicz Khalid Ashraf William J. Dally and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with fewer parameters and &lt;0.5 MB model size. Retrieved from https:\/\/arXiv:1602.07360.  Forrest N. 
Iandola Song Han Matthew W. Moskewicz Khalid Ashraf William J. Dally and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with fewer parameters and &lt;0.5 MB model size. Retrieved from https:\/\/arXiv:1602.07360."},{"key":"e_1_2_1_30_1","unstructured":"Intel. 2018. Intel Math Kernel Library. Reference Manual. Retrieved from https:\/\/software.intel.com\/sites\/default\/files\/managed\/83\/0a\/mkl-2018-developer-reference-c_0.pdf.  Intel. 2018. Intel Math Kernel Library. Reference Manual. Retrieved from https:\/\/software.intel.com\/sites\/default\/files\/managed\/83\/0a\/mkl-2018-developer-reference-c_0.pdf."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3293320.3293334"},{"key":"e_1_2_1_32_1","volume-title":"Hinton","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky , Ilya Sutskever , and Geoffrey E . Hinton . 2012 . Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. MIT Press , 1097--1105. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. MIT Press, 1097--1105."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2011.311"},{"key":"e_1_2_1_34_1","volume-title":"Proceedings of the IEEE\/ACM International Symposium on Code Generation and Optimization (CGO\u201913)","author":"Lai Junjie","year":"2013","unstructured":"Junjie Lai and Andr\u00e9 Seznec . 2013 . Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs . In Proceedings of the IEEE\/ACM International Symposium on Code Generation and Optimization (CGO\u201913) . IEEE, 1--10. Junjie Lai and Andr\u00e9 Seznec. 2013. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. In Proceedings of the IEEE\/ACM International Symposium on Code Generation and Optimization (CGO\u201913). 
IEEE, 1--10."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.435"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-018-2702-1"},{"key":"e_1_2_1_37_1","volume-title":"Proceedings of the International Conference on Machine Learning (ICML\u201998)","volume":"98","author":"\u00a0al Dekang Lin","year":"1998","unstructured":"Dekang Lin et \u00a0al . 1998 . An information-theoretic definition of similarity . In Proceedings of the International Conference on Machine Learning (ICML\u201998) , Vol. 98 . Citeseer, 296--304. Dekang Lin et\u00a0al. 1998. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning (ICML\u201998), Vol. 98. Citeseer, 296--304."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3229762.3229767"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/2458523.2458530"},{"key":"e_1_2_1_40_1","doi-asserted-by":"crossref","unstructured":"Partha Maji Andrew Mundy Ganesh Dasika Jesse Beu Matthew Mattina and Robert Mullins. 2019. Efficient Winograd or Cook-Toom convolution kernel implementation on widely used mobile CPUs. Retrieved from https:\/\/arXiv:1903.01521.  Partha Maji Andrew Mundy Ganesh Dasika Jesse Beu Matthew Mattina and Robert Mullins. 2019. Efficient Winograd or Cook-Toom convolution kernel implementation on widely used mobile CPUs. Retrieved from https:\/\/arXiv:1903.01521.","DOI":"10.1109\/EMC249363.2019.00008"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.3115\/1118853.1118871"},{"key":"e_1_2_1_42_1","volume-title":"Sedukhin","author":"Matsumoto Kazuya","year":"2012","unstructured":"Kazuya Matsumoto , Naohito Nakasato , and Stanislav G . Sedukhin . 2012 . Performance tuning of matrix multiplication in OpenCL on different GPUs and CPUs. In Proceedings of the SC Companion: High Performance Computing, Networking Storage and Analysis. IEEE , 396--405. 
Kazuya Matsumoto, Naohito Nakasato, and Stanislav G. Sedukhin. 2012. Performance tuning of matrix multiplication in OpenCL on different GPUs and CPUs. In Proceedings of the SC Companion: High Performance Computing, Networking Storage and Analysis. IEEE, 396--405."},{"key":"e_1_2_1_43_1","unstructured":"Sharan Narang. [n.d.]. DeepBench. Retrieved from url https:\/\/github.com\/baidu-research\/DeepBench.  Sharan Narang. [n.d.]. DeepBench. Retrieved from url https:\/\/github.com\/baidu-research\/DeepBench."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3204919.3204924"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/MCSoC.2015.10"},{"key":"e_1_2_1_46_1","unstructured":"Nvidia. 2020. cuBLAS. Basic Linear Algebra on NVIDIA GPUs. Retrieved from https:\/\/developer.nvidia.com\/cublas.  Nvidia. 2020. cuBLAS. Basic Linear Algebra on NVIDIA GPUs. Retrieved from https:\/\/developer.nvidia.com\/cublas."},{"key":"e_1_2_1_47_1","volume-title":"Five Balltree Construction Algorithms","author":"Omohundro Stephen M.","unstructured":"Stephen M. Omohundro . 1989. Five Balltree Construction Algorithms . International Computer Science Institute Berkeley . Stephen M. Omohundro. 1989. Five Balltree Construction Algorithms. International Computer Science Institute Berkeley."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/72.159058"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-24289-3_49"},{"key":"e_1_2_1_50_1","volume-title":"Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201917)","author":"Pfaffe P.","year":"2017","unstructured":"P. Pfaffe , M. Tillmann , S. Walter , and W. F. Tichy . 2017. Online-autotuning in the presence of algorithmic choice . In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201917) . 1379--1388. DOI:https:\/\/doi.org\/10.1109\/IPDPSW. 
2017 .28 10.1109\/IPDPSW.2017.28 P. Pfaffe, M. Tillmann, S. Walter, and W. F. Tichy. 2017. Online-autotuning in the presence of algorithmic choice. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201917). 1379--1388. DOI:https:\/\/doi.org\/10.1109\/IPDPSW.2017.28"},{"key":"e_1_2_1_51_1","volume-title":"FFT-based 2D convolution. NVIDIA White Paper 32","author":"Podlozhnyuk Victor","year":"2007","unstructured":"Victor Podlozhnyuk . 2007. FFT-based 2D convolution. NVIDIA White Paper 32 ( 2007 ). Victor Podlozhnyuk. 2007. FFT-based 2D convolution. NVIDIA White Paper 32 (2007)."},{"key":"e_1_2_1_52_1","first-page":"2229","article-title":"Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation","volume":"2","author":"Powers David Martin","year":"2011","unstructured":"David Martin Powers . 2011 . Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation . J. Mach. Learn. Technol 2 (2011), 2229 -- 3981 . David Martin Powers. 2011. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. J. Mach. Learn. Technol 2 (2011), 2229--3981.","journal-title":"J. Mach. Learn. Technol"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.4423"},{"key":"e_1_2_1_54_1","volume-title":"Proceedings of the Workshop on Empirical Methods in Artificial Intelligence (IJCAI\u201901)","volume":"3","author":"\u00a0al Irina Rish","year":"2001","unstructured":"Irina Rish et \u00a0al . 2001 . An empirical study of the naive Bayes classifier . In Proceedings of the Workshop on Empirical Methods in Artificial Intelligence (IJCAI\u201901) , Vol. 3 . 41--46. Irina Rish et\u00a0al. 2001. An empirical study of the naive Bayes classifier. In Proceedings of the Workshop on Empirical Methods in Artificial Intelligence (IJCAI\u201901), Vol. 3. 
41--46."},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/21.97458"},{"key":"e_1_2_1_56_1","volume-title":"Machine Learning for Cyber Physical Systems","author":"Sailer Johannes","unstructured":"Johannes Sailer , Christian Frey , and Christian K\u00fchnert . 2019. GPU GEMM-kernel autotuning for scalable machine learners . In Machine Learning for Cyber Physical Systems . Springer , 66--76. Johannes Sailer, Christian Frey, and Christian K\u00fchnert. 2019. GPU GEMM-kernel autotuning for scalable machine learners. In Machine Learning for Cyber Physical Systems. Springer, 66--76."},{"key":"e_1_2_1_57_1","volume-title":"Proceedings of the International Conference on Machine Learning (ICML\u201900)","author":"Singer Bryan","year":"2000","unstructured":"Bryan Singer and Manuela Veloso . 2000 . Learning to predict performance from formula modeling and training data . In Proceedings of the International Conference on Machine Learning (ICML\u201900) . 887--894. Bryan Singer and Manuela Veloso. 2000. Learning to predict performance from formula modeling and training data. In Proceedings of the International Conference on Machine Learning (ICML\u201900). 887--894."},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3126908.3126971"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2010.69"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"e_1_2_1_61_1","doi-asserted-by":"crossref","unstructured":"Sanket Tavarageri Alexander Heinecke Sasikanth Avancha Gagandeep Goyal Ramakrishna Upadrasta and Bharat Kaul. 2020. PolyDL: Polyhedral optimizations for creation of high performance DL primitives. Retrieved from https:\/\/arXiv:2006.02230.  Sanket Tavarageri Alexander Heinecke Sasikanth Avancha Gagandeep Goyal Ramakrishna Upadrasta and Bharat Kaul. 2020. PolyDL: Polyhedral optimizations for creation of high performance DL primitives. 
Retrieved from https:\/\/arXiv:2006.02230.","DOI":"10.1145\/3433103"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/3126908.3126939"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2018.08.004"},{"key":"e_1_2_1_64_1","volume-title":"Seinstra","author":"Werkhoven Ben Van","year":"2014","unstructured":"Ben Van Werkhoven , Jason Maassen , Henri E. Bal , and Frank J . Seinstra . 2014 . Optimizing convolution operations on GPUs using adaptive tiling. Future Gener. Comput. Syst . 30 (Jan. 2014), 14--26. DOI:https:\/\/doi.org\/10.1016\/j.future.2013.09.003 10.1016\/j.future.2013.09.003 Ben Van Werkhoven, Jason Maassen, Henri E. Bal, and Frank J. Seinstra. 2014. Optimizing convolution operations on GPUs using adaptive tiling. Future Gener. Comput. Syst. 30 (Jan. 2014), 14--26. DOI:https:\/\/doi.org\/10.1016\/j.future.2013.09.003"},{"key":"e_1_2_1_65_1","unstructured":"Nicolas Vasilache Oleksandr Zinenko Theodoros Theodoridis Priya Goyal Zachary DeVito William S. Moses Sven Verdoolaege Andrew Adams and Albert Cohen. 2018. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. Retrieved from https:\/\/arXiv:1802.04730.  Nicolas Vasilache Oleksandr Zinenko Theodoros Theodoridis Priya Goyal Zachary DeVito William S. Moses Sven Verdoolaege Andrew Adams and Albert Cohen. 2018. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. Retrieved from https:\/\/arXiv:1802.04730."},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASAP.2017.7995254"},{"key":"e_1_2_1_67_1","volume-title":"Proceedings of the ACM\/IEEE Conference on Supercomputing. IEEE Computer Society, 1--27","author":"Clint Whaley R.","unstructured":"R. Clint Whaley and Jack J. Dongarra . 1998. Automatically tuned linear algebra software . In Proceedings of the ACM\/IEEE Conference on Supercomputing. IEEE Computer Society, 1--27 . R. Clint Whaley and Jack J. Dongarra. 1998. 
Automatically tuned linear algebra software. In Proceedings of the ACM\/IEEE Conference on Supercomputing. IEEE Computer Society, 1--27."},{"key":"e_1_2_1_68_1","first-page":"2001","article-title":"Automated empirical optimization of software and the ATLAS project","volume":"27","author":"Whaley R. Clint","year":"2000","unstructured":"R. Clint Whaley , Antoine Petitet , and Jack J. Dongarra . 2000 . Automated empirical optimization of software and the ATLAS project . Parallel Comput. 27 (2000), 2001 . R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. 2000. Automated empirical optimization of software and the ATLAS project. Parallel Comput. 27 (2000), 2001.","journal-title":"Parallel Comput."},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-32820-6_85"},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.5555\/3327757.3327928"},{"key":"e_1_2_1_71_1","unstructured":"Zhang Xianyi Wang Qian and Zaheer Chothia. 2014. Openblas. Retrieved from http:\/\/xianyi.github.io\/OpenBLAS.  Zhang Xianyi Wang Qian and Zaheer Chothia. 2014. Openblas. Retrieved from http:\/\/xianyi.github.io\/OpenBLAS."},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1145\/3330345.3330354"},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1145\/1645953.1646301"},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178487.3178495"},{"key":"e_1_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.3390\/a12050112"},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1145\/3229762.3229764"},{"key":"e_1_2_1_77_1","volume-title":"Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, et\u00a0al.","author":"Zheng Lianmin","year":"2020","unstructured":"Lianmin Zheng , Chengfan Jia , Minmin Sun , Zhao Wu , Cody Hao Yu , Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, et\u00a0al. 2020 . Ansor : Generating high-performance tensor programs for deep learning. 
Retrieved from https:\/\/arXiv:2006.06762. Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, et\u00a0al. 2020. Ansor: Generating high-performance tensor programs for deep learning. Retrieved from https:\/\/arXiv:2006.06762."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3434402","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3434402","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:31:48Z","timestamp":1750195908000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3434402"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,1,7]]},"references-count":77,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,3,31]]}},"alternative-id":["10.1145\/3434402"],"URL":"https:\/\/doi.org\/10.1145\/3434402","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,1,7]]},"assertion":[{"value":"2020-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-01-07","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}