{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,5]],"date-time":"2025-07-05T11:42:48Z","timestamp":1751715768403,"version":"3.41.0"},"reference-count":35,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,3,1]],"date-time":"2023-03-01T00:00:00Z","timestamp":1677628800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Bpifrance Programme d\u2019Investissements d\u2019Avenir (PIA) as part of the ES3CAP project"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2023,6,30]]},"abstract":"<jats:p>\n            A wide range of scientific and machine learning applications depend on highly optimized implementations of tensor computations. Exploiting the full capacity of a given processor architecture remains a challenging task, due to the complexity of the microarchitectural features that come into play when seeking near-peak performance. Among the state-of-the-art techniques for loop transformations for performance optimization, AutoScheduler\u00a0[Zheng et\u00a0al.\n            <jats:xref ref-type=\"bibr\">2020a<\/jats:xref>\n            ] tends to outperform other systems. It often yields higher performance as compared to vendor libraries, but takes a large number of runs to converge, while also involving a complex training environment.\n          <\/jats:p>\n          <jats:p>In this article, we define a structured configuration space that enables much faster convergence to high-performance code versions, using only random sampling of candidates. We focus on two-dimensional convolutions on CPUs. 
Compared to state-of-the-art libraries, our structured search space enables higher performance for typical tensor shapes encountered in convolution stages in deep learning pipelines. Compared to auto-tuning code generators like AutoScheduler, it prunes the search space while increasing the density of efficient implementations. We analyze the impact on convergence speed and performance distribution, on two Intel x86 processors and one ARM AArch64 processor. We match or outperform the performance of the state-of-the-art oneDNN library and TVM\u2019s AutoScheduler, while reducing the autotuning effort by at least an order of magnitude.<\/jats:p>","DOI":"10.1145\/3570641","type":"journal-article","created":{"date-parts":[[2022,11,8]],"date-time":"2022-11-08T10:14:58Z","timestamp":1667902498000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":13,"title":["Autotuning Convolutions Is Easier Than You Think"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5072-4109","authenticated-orcid":false,"given":"Nicolas","family":"Tollenaere","sequence":"first","affiliation":[{"name":"INRIA, Grenoble, France"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0326-1807","authenticated-orcid":false,"given":"Guillaume","family":"Iooss","sequence":"additional","affiliation":[{"name":"INRIA, Grenoble, France"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3950-5818","authenticated-orcid":false,"given":"St\u00e9phane","family":"Pouget","sequence":"additional","affiliation":[{"name":"University of California Los Angeles, Los Angeles, California, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5875-4660","authenticated-orcid":false,"given":"Hugo","family":"Brunie","sequence":"additional","affiliation":[{"name":"INRIA, Grenoble, 
France"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8308-8682","authenticated-orcid":false,"given":"Christophe","family":"Guillon","sequence":"additional","affiliation":[{"name":"INRIA, Grenoble, France"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8866-5343","authenticated-orcid":false,"given":"Albert","family":"Cohen","sequence":"additional","affiliation":[{"name":"Google, Paris, France"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4737-2034","authenticated-orcid":false,"given":"P.","family":"Sadayappan","sequence":"additional","affiliation":[{"name":"University of Utah, Utah, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6589-9956","authenticated-orcid":false,"given":"Fabrice","family":"Rastello","sequence":"additional","affiliation":[{"name":"INRIA, Grenoble, France"}]}],"member":"320","published-online":{"date-parts":[[2023,3]]},"reference":[{"key":"e_1_3_2_2_1","first-page":"265","volume-title":"12th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201916)","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et\u00a0al. 2016. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201916). USENIX Association, 265\u2013283."},{"key":"e_1_3_2_3_1","first-page":"193","volume-title":"IEEE\/ACM International Symposium on Code Generation and Optimization (CGO\u201919)","author":"Baghdadi Riyadh","year":"2019","unstructured":"Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil, and Saman P. Amarasinghe. 2019. Tiramisu: A polyhedral compiler for expressing fast and portable code. 
In IEEE\/ACM International Symposium on Code Generation and Optimization (CGO\u201919), Mahmut Taylan Kandemir, Alexandra Jimborean, and Tipp Moseley (Eds.). IEEE, 193\u2013205."},{"key":"e_1_3_2_4_1","doi-asserted-by":"publisher","DOI":"10.1287\/moor.19.4.769"},{"key":"e_1_3_2_5_1","doi-asserted-by":"publisher","DOI":"10.5555\/1025127.1025992"},{"key":"e_1_3_2_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3033019.3033023"},{"key":"e_1_3_2_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1375581.1375595"},{"key":"e_1_3_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCD46524.2019.00053"},{"key":"e_1_3_2_9_1","first-page":"578","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918)","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018a. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918). USENIX Association, 578\u2013594."},{"key":"e_1_3_2_10_1","first-page":"3389","article-title":"Learning to optimize tensor programs","volume":"31","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018b. Learning to optimize tensor programs. 
Advances in Neural Information Processing Systems 31 (2018), 3389\u20133400.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/223428.207162"},{"key":"e_1_3_2_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3211346.3211354"},{"key":"e_1_3_2_13_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF01407835"},{"key":"e_1_3_2_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF01379404"},{"key":"e_1_3_2_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3559009.3569682"},{"key":"e_1_3_2_16_1","unstructured":"Google. [n. d.]. XLA: Optimizing compiler for machine learning. https:\/\/www.tensorflow.org\/xla?hl=fr."},{"key":"e_1_3_2_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1377603.1377607"},{"issue":"4","key":"e_1_3_2_18_1","article-title":"Polly - performing polyhedral optimizations on a low-level intermediate representation","volume":"22","author":"Grosser Tobias","year":"2012","unstructured":"Tobias Grosser, Armin Gr\u00f6\u00dflinger, and Christian Lengauer. 2012. Polly - performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters 22, 4 (2012), 1250010\u20131\u201328.","journal-title":"Parallel Processing Letters"},{"key":"e_1_3_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2743016"},{"key":"e_1_3_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2016.83"},{"key":"e_1_3_2_22_1","unstructured":"Intel. 2018. oneAPI deep neural network library (oneDNN). https:\/\/01.org\/."},{"key":"e_1_3_2_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3295500.3356218"},{"key":"e_1_3_2_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3445814.3446759"},{"key":"e_1_3_2_25_1","doi-asserted-by":"publisher","DOI":"10.1137\/16M108968X"},{"key":"e_1_3_2_26_1","unstructured":"NVIDIA. 2018. CuDNN: GPU Accelerated Deep Learning. 
https:\/\/developer.nvidia.com\/cudnn."},{"key":"e_1_3_2_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2491956.2462176"},{"key":"e_1_3_2_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.690"},{"key":"e_1_3_2_29_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-49051-7_12"},{"key":"e_1_3_2_30_1","unstructured":"Paul Springer and Paolo Bientinesi. 2016. Design of a high-performance GEMM-like Tensor-Tensor Multiplication. arxiv:1607.00145."},{"issue":"3","key":"e_1_3_2_31_1","first-page":"14","article-title":"BLIS: A framework for rapidly instantiating BLAS functionality","volume":"41","author":"Zee Field G. Van","year":"2015","unstructured":"Field G. Van Zee and Robert A. van de Geijn. 2015. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Trans. Math. Software 41, 3, Article 14 (June 2015), 33 pages.","journal-title":"ACM Trans. Math. Software"},{"key":"e_1_3_2_32_1","unstructured":"Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2018. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions. arxiv:1802.04730 [cs.PL]."},{"key":"e_1_3_2_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2400682.2400713"},{"key":"e_1_3_2_34_1","first-page":"167","volume-title":"Intel Math Kernel Library","author":"Wang Endong","year":"2014","unstructured":"Endong Wang, Qing Zhang, Bo Shen, Guangyong Zhang, Xiaowei Lu, Qing Wu, and Yajuan Wang. 2014. Intel Math Kernel Library. Intel, 167\u2013188."},{"key":"e_1_3_2_35_1","first-page":"863","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201920)","author":"Zheng Lianmin","year":"2020","unstructured":"Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2020a. 
Ansor: Generating high-performance tensor programs for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201920). USENIX Association, 863\u2013879. https:\/\/www.usenix.org\/conference\/osdi20\/presentation\/zheng."},{"key":"e_1_3_2_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378508"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3570641","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3570641","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:34Z","timestamp":1750182574000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3570641"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,3]]},"references-count":35,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,6,30]]}},"alternative-id":["10.1145\/3570641"],"URL":"https:\/\/doi.org\/10.1145\/3570641","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2023,3]]},"assertion":[{"value":"2022-06-03","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-10-17","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-03-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}