{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,20]],"date-time":"2025-12-20T22:26:26Z","timestamp":1766269586515,"version":"3.41.0"},"reference-count":26,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2023,2,10]],"date-time":"2023-02-10T00:00:00Z","timestamp":1675987200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2023,3,31]]},"abstract":"<jats:p>\n            This article introduces\n            <jats:monospace>YaConv<\/jats:monospace>\n            , a new algorithm to compute convolution using\n            <jats:monospace>GEMM<\/jats:monospace>\n            microkernels from a Basic Linear Algebra Subprograms library that is efficient for multiple CPU architectures. Previous approaches either create a copy of each image element for each filter element or reload these elements into cache for each\n            <jats:monospace>GEMM<\/jats:monospace>\n            call, leading to redundant instances of the image elements in cache. Instead,\n            <jats:monospace>YaConv<\/jats:monospace>\n            loads each image element once into the cache and maximizes the reuse of these elements. The output image is computed by scattering results of the\n            <jats:monospace>GEMM<\/jats:monospace>\n            microkernel calls to the correct locations in the output image. The main advantage of this new algorithm\u2014which leads to better performance in comparison to the existing\n            <jats:monospace>im2col<\/jats:monospace>\n            approach on several architectures\u2014is a more efficient use of the memory hierarchy. The experimental evaluation on convolutional layers from PyTorch, along with a parameterized study, indicates an average 24% speedup over\n            <jats:monospace>im2col<\/jats:monospace>\n            convolution. Increased performance comes as a result of 3\u00d7 reduction in L3 cache accesses and 2\u00d7 fewer branch instructions.\n          <\/jats:p>","DOI":"10.1145\/3570305","type":"journal-article","created":{"date-parts":[[2022,12,2]],"date-time":"2022-12-02T13:42:12Z","timestamp":1669988532000},"page":"1-18","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["YaConv: Convolution with Low Cache Footprint"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3451-1920","authenticated-orcid":false,"given":"Ivan","family":"Korostelev","sequence":"first","affiliation":[{"name":"University of Alberta, Edmonton, AB, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3476-184X","authenticated-orcid":false,"given":"Jo\u00e3o P.","family":"L. De Carvalho","sequence":"additional","affiliation":[{"name":"University of Alberta, Edmonton, AB, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7029-6327","authenticated-orcid":false,"given":"Jos\u00e9","family":"Moreira","sequence":"additional","affiliation":[{"name":"IBM Research, Yorktown Heights, NY, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9943-1809","authenticated-orcid":false,"given":"Jos\u00e9 Nelson","family":"Amaral","sequence":"additional","affiliation":[{"name":"University of Alberta, Edmonton, AB, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,2,10]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/SBAC-PAD49847.2020.00024"},{"key":"e_1_3_2_3_2","article-title":"Convolution Algorithms","author":"Burrus C. Sidney","year":"1985","unstructured":"C. Sidney Burrus and T. Parks. 1985. Convolution Algorithms. Citeseer, New York, NY.","journal-title":"Citeseer, New York, NY"},{"key":"e_1_3_2_4_2","volume-title":"Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition","author":"Chellapilla Kumar","year":"2006","unstructured":"Kumar Chellapilla, Sidd Puri, and Patrice Simard. 2006. High performance convolutional neural networks for document processing. In Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition, Guy Lorette (Ed.). Universit\u00e9 de Rennes 1, Suvisoft, La Baule (France). Retrieved from https:\/\/hal.inria.fr\/inria-00112631; http:\/\/www.suvisoft.com."},{"key":"e_1_3_2_5_2","first-page":"815","volume-title":"Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research)","volume":"70","author":"Cho Minsik","year":"2017","unstructured":"Minsik Cho and Daniel Brand. 2017. MEC: Memory-efficient convolution for deep neural network. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. PMLR, 815\u2013824. Retrieved from https:\/\/proceedings.mlr.press\/v70\/cho17a.html."},{"key":"e_1_3_2_6_2","unstructured":"Intel Corporation. 2016. oneAPI Deep Neural Network Library (oneDNN). Retrieved from https:\/\/github.com\/oneapi-src\/oneDNN."},{"key":"e_1_3_2_7_2","doi-asserted-by":"crossref","unstructured":"Marat Dukhan. 2019. The indirect convolution algorithm. Retrieved from https:\/\/arXiv:1907.02129.","DOI":"10.1109\/IPDPSW50202.2020.00154"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2018.00069"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS47924.2020.00032"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/1356052.1356053"},{"key":"e_1_3_2_11_2","unstructured":"Yangqing Jia Evan Shelhamer Jeff Donahue Sergey Karayev Jonathan Long Ross Girshick Sergio Guadarrama and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. Retrieved from https:\/\/arXiv:1408.5093."},{"key":"e_1_3_2_12_2","doi-asserted-by":"crossref","unstructured":"Pablo San Juan Adri\u00e1n Castell\u00f3 M. F. Dolz P. Alonso-Jord\u00e1 and E. S. Quintana-Ort\u00ed. 2020. High performance and portable convolution operators for ARM-based multicore processors. Retrieved from https:\/\/abs\/2005.06410.","DOI":"10.1109\/SBAC-PAD49847.2020.00023"},{"key":"e_1_3_2_13_2","volume-title":"LLVM: An Infrastructure for Multi-Stage Optimization","author":"Lattner Chris","year":"2002","unstructured":"Chris Lattner. 2002. LLVM: An Infrastructure for Multi-Stage Optimization. Master\u2019s Thesis. Computer Science Dept., University of Illinois at Urbana-Champaign, Urbana, IL. Retrieved from http:\/\/llvm.cs.uiuc.edu."},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/CGO51591.2021.9370308"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.435"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3445814.3446759"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/2925987"},{"key":"e_1_3_2_18_2","first-page":"8024","volume-title":"Advances in Neural Information Processing Systems 32","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch\u00e9-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, 8024\u20138035. Retrieved from http:\/\/papers.neurips.cc\/paper\/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf."},{"key":"e_1_3_2_19_2","first-page":"1","article-title":"FFT-based 2D convolution","volume":"32","author":"Podlozhnyuk Victor","year":"2007","unstructured":"Victor Podlozhnyuk. 2007. FFT-based 2D convolution. NVIDIA White Paper 32 (2007), 1.","journal-title":"NVIDIA White Paper"},{"key":"e_1_3_2_20_2","unstructured":"Steven W Smith et\u00a0al. 1997. The Scientist and Engineer\u2019s Guide to Digital Signal Processing."},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2014.110"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3532863"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3433103"},{"key":"e_1_3_2_24_2","unstructured":"Vincent M. Weaver. 2013. Linux Perf Event Features and Overhead. (2013)."},{"key":"e_1_3_2_25_2","first-page":"684","article-title":"Model-driven level 3 BLAS performance optimization on Loongson 3A processor","author":"Xianyi Zhang","year":"2012","unstructured":"Zhang Xianyi, Wang Qian, and Zhang Yunquan. 2012. Model-driven level 3 BLAS performance optimization on Loongson 3A processor. In Proceedings of the IEEE 18th International Conference on Parallel and Distributed Systems (2012), 684\u2013691.","journal-title":"Proceedings of the IEEE 18th International Conference on Parallel and Distributed Systems"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3177885"},{"key":"e_1_3_2_27_2","volume-title":"Ansor: Generating High-Performance Tensor Programs for Deep Learning","author":"Zheng Lianmin","year":"2020","unstructured":"Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2020. Ansor: Generating High-Performance Tensor Programs for Deep Learning. USENIX Association."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3570305","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3570305","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:38Z","timestamp":1750182578000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3570305"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,10]]},"references-count":26,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,3,31]]}},"alternative-id":["10.1145\/3570305"],"URL":"https:\/\/doi.org\/10.1145\/3570305","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2023,2,10]]},"assertion":[{"value":"2022-05-11","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-10-19","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-02-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}