{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,16]],"date-time":"2026-01-16T03:13:54Z","timestamp":1768533234685,"version":"3.49.0"},"reference-count":36,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2023,12,14]],"date-time":"2023-12-14T00:00:00Z","timestamp":1702512000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2023,12,31]]},"abstract":"<jats:p>Convolution is one of the most computationally intensive operations that must be performed for machine learning model inference. A traditional approach to computing convolutions is known as the Im2Col + BLAS method. This article proposes SConv: a direct-convolution algorithm based on an MLIR\/LLVM code-generation toolchain that can be integrated into machine-learning compilers. This algorithm introduces: (a) Convolution Slicing Analysis (CSA)\u2014a convolution-specific 3D cache-blocking analysis pass that focuses on tile reuse over the cache hierarchy; (b) Convolution Slicing Optimization\u2014a code-generation pass that uses CSA to generate a tiled direct-convolution macro-kernel; and (c) Vector-based Packing\u2014an architecture-specific optimized input-tensor packing solution based on vector-register shift instructions for convolutions with unitary stride. Experiments conducted on 393 convolutions from full ONNX-MLIR machine learning models indicate that the elimination of the Im2Col transformation and the use of fast packing routines result in a total packing time reduction, on full model inference, of 2.3\u00d7\u20134.0\u00d7 on Intel x86 and 3.3\u00d7\u20135.9\u00d7 on IBM POWER10. The speed-up over an Im2Col + BLAS method based on current BLAS implementations for end-to-end machine-learning model inference is in the range of 11%\u201327% for Intel x86 and 11%\u201334% for IBM POWER10 architectures. The total convolution speedup for model inference is 13%\u201328% on Intel x86 and 23%\u201339% on IBM POWER10. SConv\u00a0also outperforms BLAS GEMM, when computing pointwise convolutions in more than 82% of the 219 tested instances.<\/jats:p>","DOI":"10.1145\/3625004","type":"journal-article","created":{"date-parts":[[2023,9,20]],"date-time":"2023-09-20T11:25:49Z","timestamp":1695209149000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["Advancing Direct Convolution Using Convolution Slicing Optimization and ISA Extensions"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2051-2877","authenticated-orcid":false,"given":"Victor","family":"Ferrari","sequence":"first","affiliation":[{"name":"Institute of Computing-UNICAMP, Brazil"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2673-6601","authenticated-orcid":false,"given":"Rafael","family":"Sousa","sequence":"additional","affiliation":[{"name":"Institute of Computing-UNICAMP, Brazil"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1140-4513","authenticated-orcid":false,"given":"Marcio","family":"Pereira","sequence":"additional","affiliation":[{"name":"Institute of Computing-UNICAMP, Brazil"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3476-184X","authenticated-orcid":false,"given":"Jo\u00e3o P.","family":"L. De Carvalho","sequence":"additional","affiliation":[{"name":"University of Alberta, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9943-1809","authenticated-orcid":false,"given":"Jos\u00e9 Nelson","family":"Amaral","sequence":"additional","affiliation":[{"name":"University of Alberta, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7029-6327","authenticated-orcid":false,"given":"Jos\u00e9","family":"Moreira","sequence":"additional","affiliation":[{"name":"IBM Research, United States of America"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4869-5190","authenticated-orcid":false,"given":"Guido","family":"Araujo","sequence":"additional","affiliation":[{"name":"Institute of Computing\u2013UNICAMP, Brazil"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,12,14]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/SBAC-PAD49847.2020.00024"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.sysarc.2022.102806"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2022.3176529"},{"key":"e_1_3_1_5_2","volume-title":"Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition","author":"Chellapilla Kumar","year":"2006","unstructured":"Kumar Chellapilla, Sidd Puri, and Patrice Simard. 2006. High performance convolutional neural networks for document processing. In Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition, Guy Lorette (Ed.). Universit\u00e9 de Rennes 1, Suvisoft, La Baule (France). Retrieved from https:\/\/hal.inria.fr\/inria-00112631"},{"key":"e_1_3_1_6_2","first-page":"579","volume-title":"Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (OSDI\u201918)","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (OSDI\u201918). USENIX Association, 579\u2013594."},{"key":"e_1_3_1_7_2","unstructured":"Sharan Chetlur Cliff Woolley Philippe Vandermersch Jonathan M. Cohen John Tran Bryan Catanzaro and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. Retrieved from https:\/\/abs\/1410.0759"},{"key":"e_1_3_1_8_2","first-page":"815","volume-title":"Proceedings of the 34th International Conference on Machine Learning (ICML\u201917)","author":"Cho Minsik","year":"2017","unstructured":"Minsik Cho and Daniel Brand. 2017. MEC: Memory-efficient convolution for deep neural network. In Proceedings of the 34th International Conference on Machine Learning (ICML\u201917). JMLR.org, 815\u2013824."},{"key":"e_1_3_1_9_2","unstructured":"Marat Dukhan. 2019. The indirect convolution algorithm. Retrieved from https:\/\/abs\/1907.02129"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/1356052.1356053"},{"key":"e_1_3_1_11_2","article-title":"Eigen v3","author":"Guennebaud Ga\u00ebl","year":"2010","unstructured":"Ga\u00ebl Guennebaud, Beno\u00eet Jacob et\u00a0al. 2010. Eigen v3. Retrieved from http:\/\/eigen.tuxfamily.org","journal-title":"R"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_13_2","unstructured":"Forrest N. Iandola Matthew W. Moskewicz Khalid Ashraf Song Han William J. Dally and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. Retrieved from https:\/\/abs\/1602.07360"},{"key":"e_1_3_1_14_2","unstructured":"Intel Corporation 2022. oneAPI Specification . Intel Corporation. Retrieved May 19th 2023 from https:\/\/spec.oneapi.io\/versions\/latest\/index.html"},{"key":"e_1_3_1_15_2","first-page":"448","volume-title":"Proceedings of the 32nd International Conference on International Conference on Machine Learning (ICML\u201915)","author":"Ioffe Sergey","year":"2015","unstructured":"Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning (ICML\u201915). JMLR.org, 448\u2013456."},{"key":"e_1_3_1_16_2","article-title":"Caffe: Convolutional architecture for fast feature embedding","author":"Jia Yangqing","year":"2014","unstructured":"Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. Retrieved from https:\/\/arXiv:1408.5093","journal-title":"R"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10586-021-03494-y"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/SBAC-PAD49847.2020.00023"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3570305"},{"key":"e_1_3_1_20_2","unstructured":"Tung D. Le Gheorghe-Teodor Bercea Tong Chen Alexandre E. Eichenberger Haruki Imai Tian Jin Kiyokuni Kawachiya Yasushi Negishi and Kevin O\u2019Brien. 2020. Compiling ONNX neural network models using MLIR. Retrieved from https:\/\/abs\/2008.08272"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/3445814.3446759"},{"key":"e_1_3_1_22_2","unstructured":"Jos\u00e9 E. Moreira Kit Barton Steven Battle Peter Bergner Ramon Bertran Puneeth Bhat Pedro Caldeira David Edelsohn Gordon Fossum Brad Frey Nemanja Ivanovic Chip Kerchner Vincent Lim Shakti Kapoor Tulio Machado Filho Silvia Melitta Mueller Brett Olsson Satish Sadasivam Baptiste Saleil Bill Schmidt Rajalakshmi Srinivasaraghavan Shricharan Srivatsan Brian W. Thompto Andreas Wagner and Nelson Wu. 2021. A matrix math facility for Power ISA(TM) processors. Retrieved from https:\/\/arxiv.org\/abs\/2104.03142"},{"key":"e_1_3_1_23_2","unstructured":"ONNX 2019. ONNX Model Zoo . Retrieved June 26th 2022 from https:\/\/github.com\/onnx\/models"},{"key":"e_1_3_1_24_2","doi-asserted-by":"crossref","first-page":"222","DOI":"10.1145\/3498361.3538940","volume-title":"Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services","author":"Park Jongseok","year":"2022","unstructured":"Jongseok Park, Kyungmin Bin, and Kyunghan Lee. 2022. mGEMM: Low-latency convolution with minimal memory overhead optimized for mobile devices. In Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services. 222\u2013234."},{"key":"e_1_3_1_25_2","doi-asserted-by":"crossref","first-page":"108","DOI":"10.1007\/978-3-030-72789-5_9","volume-title":"Proceedings of the 32nd International Workshop on Languages and Compilers for Parallel Computing (LCPC\u201919)","author":"Patabandi Tharindu R.","year":"2021","unstructured":"Tharindu R. Patabandi, Anand Venkat, Rajkishore Barik, and Mary Hall. 2021. SWIRL++: Evaluating performance models to guide code transformation in convolutional neural networks. In Proceedings of the 32nd International Workshop on Languages and Compilers for Parallel Computing (LCPC\u201919). Springer, 108\u2013126."},{"key":"e_1_3_1_26_2","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. Retrieved from https:\/\/abs\/1409.1556"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/SBAC-PAD53543.2021.00020"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3570641"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/2764454"},{"key":"e_1_3_1_31_2","unstructured":"Nicolas Vasilache Oleksandr Zinenko Aart J. C. Bik Mahesh Ravishankar Thomas Raoux Alexander Belyaev Matthias Springer Tobias Gysi Diego Caballero Stephan Herhut Stella Laurenzo and Albert Cohen. 2022. Composable and Modular Code Generation in MLIR: A Structured and Retargetable Approach to Tensor Compiler Construction. Retrieved from https:\/\/arxiv:2202.03293"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASAP.2017.7995254"},{"key":"e_1_3_1_33_2","first-page":"43","volume-title":"Proceedings of the IEEE International Conference on Joint Cloud Computing","author":"Wang Qinglin","year":"2020","unstructured":"Qinglin Wang, Dongsheng Li, Songzhu Mei, Siqi Shen, and Xiandong Huang. 2020. Optimizing one by one direct convolution on armv8 multi-core cpus. In Proceedings of the IEEE International Conference on Joint Cloud Computing. IEEE, 43\u201347."},{"key":"e_1_3_1_34_2","unstructured":"Zhang Xianyi Martin Kroeker Werner Saar Wang Qian Zaheer Chothia Chen Shaohu and Luo Wen. [n.d.]. OpenBLAS: An Optimized BLAS Library. Retrieved from https:\/\/www.openblas.net\/"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/3332466.3374520"},{"key":"e_1_3_1_36_2","series-title":"Proceedings of the 35th International Conference on Machine Learning","first-page":"5776","volume":"80","author":"Zhang Jiyuan","year":"2018","unstructured":"Jiyuan Zhang, Franz Franchetti, and Tze Meng Low. 2018. High performance zero-memory overhead direct convolutions. In Proceedings of the 35th International Conference on Machine Learning(Proceedings of Machine Learning Research, Vol. 80), Jennifer Dy and Andreas Krause (Eds.). PMLR, 5776\u20135785. Retrieved from http:\/\/proceedings.mlr.press\/v80\/zhang18d.html"},{"key":"e_1_3_1_37_2","volume-title":"Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation (OSDI\u201920)","author":"Zheng Lianmin","year":"2020","unstructured":"Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2020. Ansor: Generating high-performance tensor programs for deep learning. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation (OSDI\u201920). USENIX Association, Article 49, 17 pages."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3625004","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3625004","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:46:48Z","timestamp":1750178808000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3625004"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12,14]]},"references-count":36,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,12,31]]}},"alternative-id":["10.1145\/3625004"],"URL":"https:\/\/doi.org\/10.1145\/3625004","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,12,14]]},"assertion":[{"value":"2023-02-17","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-09-12","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-12-14","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}