{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T22:23:24Z","timestamp":1767997404392,"version":"3.49.0"},"reference-count":63,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,1,19]],"date-time":"2024-01-19T00:00:00Z","timestamp":1705622400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nd\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["2021ZD0110101"],"award-info":[{"award-number":["2021ZD0110101"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62090024, 62232015, and 62302479"],"award-info":[{"award-number":["62090024, 62232015, and 62302479"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100002858","name":"China Postdoctoral Science Foundation","doi-asserted-by":"crossref","award":["2023M733566"],"award-info":[{"award-number":["2023M733566"]}],"id":[{"id":"10.13039\/501100002858","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Innovation Funding of ICT, CAS","award":["E361010"],"award-info":[{"award-number":["E361010"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2024,3,31]]},"abstract":"<jats:p>Low-precision computation has emerged as one of the most effective techniques for accelerating convolutional neural networks and has garnered widespread support on modern hardware. 
Despite its effectiveness in accelerating convolutional neural networks, low-precision computation has not been commonly applied to fast convolutions, such as the Winograd algorithm, due to numerical issues. In this article, we propose an effective quantized Winograd convolution, named LoWino, which employs an in-side quantization method in the Winograd domain to reduce the precision loss caused by transformations. Meanwhile, we present an efficient implementation that integrates well-designed optimization techniques, allowing us to fully exploit the capabilities of low-precision computation on modern CPUs. We evaluate LoWino on two Intel Xeon Scalable Processor platforms with representative convolutional layers and neural network models. The experimental results demonstrate that our approach can achieve an average of 1.84\u00d7 and 1.91\u00d7 operator speedups over state-of-the-art implementations in the vendor library while preserving accuracy loss at a reasonable level.<\/jats:p>","DOI":"10.1145\/3632956","type":"journal-article","created":{"date-parts":[[2023,11,17]],"date-time":"2023-11-17T12:12:18Z","timestamp":1700223138000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7835-113X","authenticated-orcid":false,"given":"Xueying","family":"Wang","sequence":"first","affiliation":[{"name":"Beijing University of Posts and Telecommunications, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9738-261X","authenticated-orcid":false,"given":"Guangli","family":"Li","sequence":"additional","affiliation":[{"name":"Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese Academy of Sciences, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3543-2324","authenticated-orcid":false,"given":"Zhen","family":"Jia","sequence":"additional","affiliation":[{"name":"Amazon Web Services, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2909-7750","authenticated-orcid":false,"given":"Xiaobing","family":"Feng","sequence":"additional","affiliation":[{"name":"Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese Academy of Sciences, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8165-840X","authenticated-orcid":false,"given":"Yida","family":"Wang","sequence":"additional","affiliation":[{"name":"Amazon Web Services, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,1,19]]},"reference":[{"key":"e_1_3_1_2_2","first-page":"582","volume-title":"Proceedings of the International Symposium on Microarchitecture","author":"Andri Renzo","year":"2022","unstructured":"Renzo Andri, Beatrice Bussolino, Antonio Cipolletta, Lukas Cavigelli, and Zhe Wang. 2022. Going further with winograd convolutions: Tap-wise quantization for efficient inference on 4x4 tiles. In Proceedings of the International Symposium on Microarchitecture. IEEE, 582\u2013598."},{"issue":"4","key":"e_1_3_1_3_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3412380","article-title":"Error analysis and improving the accuracy of Winograd convolution for deep neural networks","volume":"46","author":"Barabasz Barbara","year":"2020","unstructured":"Barbara Barabasz, Andrew Anderson, Kirk M. Soodhalter, and David Gregg. 2020. Error analysis and improving the accuracy of Winograd convolution for deep neural networks. ACM Trans. Math. Softw. 46, 4 (2020), 1\u201333.","journal-title":"ACM Trans. Math. Softw."},{"key":"e_1_3_1_4_2","first-page":"5918","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Cai Zhaowei","year":"2017","unstructured":"Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. 
2017. Deep learning with low precision by half-wave gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5918\u20135926."},{"key":"e_1_3_1_5_2","first-page":"291","article-title":"Slide: In defense of smart algorithms over hardware acceleration for large-scale deep learning systems","volume":"2","author":"Chen Beidi","year":"2020","unstructured":"Beidi Chen, Tharun Medini, James Farwell, Charlie Tai, Anshumali Shrivastava, et\u00a0al. 2020. Slide: In defense of smart algorithms over hardware acceleration for large-scale deep learning systems. Proc. Mach. Learn. Syst. 2 (2020), 291\u2013306.","journal-title":"Proc. Mach. Learn. Syst."},{"issue":"1","key":"e_1_3_1_6_2","doi-asserted-by":"crossref","first-page":"64","DOI":"10.1631\/FITEE.1700789","article-title":"Recent advances in efficient computation of deep convolutional neural networks","volume":"19","author":"Cheng Jian","year":"2018","unstructured":"Jian Cheng, Pei-song Wang, Gang Li, Qing-hao Hu, and Han-qing Lu. 2018. Recent advances in efficient computation of deep convolutional neural networks. Front. Inf. Technol. Electr. Eng. 19, 1 (2018), 64\u201377.","journal-title":"Front. Inf. Technol. Electr. Eng."},{"key":"e_1_3_1_7_2","first-page":"12507","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Chikin Vladimir","year":"2022","unstructured":"Vladimir Chikin and Vladimir Kryzhanovskiy. 2022. Channel balancing for accurate quantization of winograd convolutions. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 12507\u201312516."},{"key":"e_1_3_1_8_2","first-page":"3009","volume-title":"Proceedings of the ICCV Workshops","author":"Choukroun Yoni","year":"2019","unstructured":"Yoni Choukroun, Eli Kravchik, Fan Yang, and Pavel Kisilev. 2019. Low-bit quantization of neural networks for efficient inference. In Proceedings of the ICCV Workshops. 
3009\u20133018."},{"key":"e_1_3_1_9_2","article-title":"Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or \u20131","author":"Courbariaux Matthieu","year":"2016","unstructured":"Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or \u20131. arXiv:1602.02830. Retrieved from https:\/\/arxiv.org\/abs\/1602.02830","journal-title":"arXiv:1602.02830"},{"key":"e_1_3_1_10_2","first-page":"156","article-title":"Accelerating slide deep learning on modern cpus: Vectorization, quantizations, memory optimizations, and more","volume":"3","author":"Daghaghi Shabnam","year":"2021","unstructured":"Shabnam Daghaghi, Nicholas Meisburger, Mengnan Zhao, and Anshumali Shrivastava. 2021. Accelerating slide deep learning on modern cpus: Vectorization, quantizations, memory optimizations, and more. Proc. Mach. Learn. Syst. 3 (2021), 156\u2013166.","journal-title":"Proc. Mach. Learn. Syst."},{"key":"e_1_3_1_11_2","first-page":"248","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Deng Jia","year":"2009","unstructured":"Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248\u2013255."},{"key":"e_1_3_1_12_2","article-title":"Training dnns with hybrid block floating point","volume":"31","author":"Drumond Mario","year":"2018","unstructured":"Mario Drumond, Tao Lin, Martin Jaggi, and Babak Falsafi. 2018. Training dnns with hybrid block floating point. Adv. Neural Inf. Process. Syst. 31 (2018).","journal-title":"Adv. Neural Inf. Process. 
Syst."},{"key":"e_1_3_1_13_2","first-page":"1","volume-title":"Proceedings of the Machine Learning and Systems Conference","author":"Fern\u00e1ndez-Marqu\u00e9s Javier","year":"2020","unstructured":"Javier Fern\u00e1ndez-Marqu\u00e9s, Paul N. Whatmough, Andrew Mundy, and Matthew Mattina. 2020. Searching for winograd-aware quantized networks. In Proceedings of the Machine Learning and Systems Conference. 1\u201316."},{"key":"e_1_3_1_14_2","first-page":"1","volume-title":"Proceedings of the International Conference on Parallel Processing","author":"Li Guangli","year":"2021","unstructured":"Guangli Li, Zhen Jia, Xiaobing Feng, and Yida Wang. 2021. LoWino: Towards efficient low-precision winograd convolutions on modern CPUs. In Proceedings of the International Conference on Parallel Processing. 1\u201311."},{"key":"e_1_3_1_15_2","first-page":"796","volume-title":"Proceedings of the International Symposium on Microarchitecture","author":"Gong Zhangxiaowen","year":"2020","unstructured":"Zhangxiaowen Gong, Houxiang Ji, Christopher W. Fletcher, Christopher J. Hughes, Sara Baghsorkhi, and Josep Torrellas. 2020. Save: Sparsity-aware vector engine for accelerating dnn training and inference on cpus. In Proceedings of the International Symposium on Microarchitecture. IEEE, 796\u2013810."},{"key":"e_1_3_1_16_2","first-page":"12175","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201922)","author":"Guo Jianyuan","year":"2022","unstructured":"Jianyuan Guo, Kai Han, Han Wu, Yehui Tang, Xinghao Chen, Yunhe Wang, and Chang Xu. 2022. CMT: Convolutional neural networks meet vision transformers. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201922). 12175\u201312185."},{"key":"e_1_3_1_17_2","article-title":"A survey on methods and theories of quantized neural networks","author":"Guo Yunhui","year":"2018","unstructured":"Yunhui Guo. 2018. 
A survey on methods and theories of quantized neural networks. arXiv:1808.04752. Retrieved from https:\/\/arxiv.org\/abs\/1808.04752","journal-title":"arXiv:1808.04752"},{"key":"e_1_3_1_18_2","article-title":"Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding","author":"Han Song","year":"2015","unstructured":"Song Han, Huizi Mao, and William J. Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv:1510.00149. Retrieved from https:\/\/arxiv.org\/abs\/1510.00149","journal-title":"arXiv:1510.00149"},{"key":"e_1_3_1_19_2","first-page":"770","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"He Kaiming","year":"2016","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770\u2013778."},{"issue":"8","key":"e_1_3_1_20_2","first-page":"3594","article-title":"Asymptotic soft filter pruning for deep convolutional neural networks","volume":"50","author":"He Yang","year":"2019","unstructured":"Yang He, Xuanyi Dong, Guoliang Kang, Yanwei Fu, Chenggang Yan, and Yi Yang. 2019. Asymptotic soft filter pruning for deep convolutional neural networks. IEEE Trans. Cybernet. 50, 8 (2019), 3594\u20133604.","journal-title":"IEEE Trans. Cybernet."},{"key":"e_1_3_1_21_2","first-page":"1389","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"He Yihui","year":"2017","unstructured":"Yihui He, Xiangyu Zhang, and Jian Sun. 2017. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision. 
1389\u20131397."},{"key":"e_1_3_1_22_2","first-page":"4174","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"34","author":"Huang Di","year":"2020","unstructured":"Di Huang, Xishan Zhang, Rui Zhang, Tian Zhi, Deyuan He, Jiaming Guo, Chang Liu, Qi Guo, Zidong Du, Shaoli Liu, et\u00a0al. 2020. DWM: A decomposable winograd method for convolution acceleration. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 4174\u20134181."},{"key":"e_1_3_1_23_2","volume-title":"Intrinsics Guide","year":"2021","unstructured":"Intel. 2021. Intrinsics Guide. Retrieved March 29, 2021 from https:\/\/software.intel.com\/sites\/landingpage\/IntrinsicsGuide\/"},{"key":"e_1_3_1_24_2","volume-title":"Introduction to Intel Deep Learning Boost on Second Generation Intel Xeon Scalable Processors","year":"2021","unstructured":"Intel. 2021. Introduction to Intel Deep Learning Boost on Second Generation Intel Xeon Scalable Processors. Retrieved March 24, 2021 from https:\/\/software.intel.com\/content\/www\/us\/en\/develop\/articles\/introduction-to-intel-deep-learning-boost-on-second-generation-intel-xeon-scalable.html"},{"key":"e_1_3_1_25_2","volume-title":"oneAPI Deep Neural Network Library (oneDNN)","year":"2021","unstructured":"Intel. 2021. oneAPI Deep Neural Network Library (oneDNN). Retrieved February 27, 2021 from https:\/\/github.com\/oneapi-src\/oneDNN"},{"key":"e_1_3_1_26_2","first-page":"2704","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Jacob Benoit","year":"2018","unstructured":"Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 
2704\u20132713."},{"issue":"7","key":"e_1_3_1_27_2","first-page":"986","article-title":"Enabling efficient fast convolution algorithms on GPUs via MegaKernels","volume":"69","author":"Jia Liancheng","year":"2020","unstructured":"Liancheng Jia, Yun Liang, Xiuhong Li, Liqiang Lu, and Shengen Yan. 2020. Enabling efficient fast convolution algorithms on GPUs via MegaKernels. IEEE Trans. Comput. 69, 7 (2020), 986\u2013997.","journal-title":"IEEE Trans. Comput."},{"key":"e_1_3_1_28_2","first-page":"47","volume-title":"Proceedings of the ACM Symposium on Operating Systems Principles","author":"Jia Zhihao","year":"2019","unstructured":"Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. 2019. TASO: Optimizing deep learning computation with automatic generation of graph substitutions. In Proceedings of the ACM Symposium on Operating Systems Principles. 47\u201362."},{"key":"e_1_3_1_29_2","first-page":"109","volume-title":"Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","author":"Jia Zhen","year":"2018","unstructured":"Zhen Jia, Aleksandar Zlateski, Fredo Durand, and Kai Li. 2018. Optimizing N-dimensional, winograd-based convolution for manycore CPUs. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 109\u2013123."},{"key":"e_1_3_1_30_2","unstructured":"Zhen Jia Aleksandar Zlateski Fredo Durand and Kai Li. 2018. Towards optimal winograd convolution on manycores. Proceedings of Machine Learning and Systems 1\u20133."},{"key":"e_1_3_1_31_2","article-title":"Flexpoint: An adaptive numerical format for efficient training of deep neural networks","volume":"30","author":"K\u00f6ster Urs","year":"2017","unstructured":"Urs K\u00f6ster, Tristan Webb, Xin Wang, Marcel Nassar, Arjun K. Bansal, William Constable, Oguz Elibol, Scott Gray, Stewart Hall, Luke Hornof, et\u00a0al. 2017. 
Flexpoint: An adaptive numerical format for efficient training of deep neural networks. Adv. Neural Inf. Process. Syst. 30 (2017).","journal-title":"Adv. Neural Inf. Process. Syst."},{"issue":"8","key":"e_1_3_1_32_2","doi-asserted-by":"crossref","first-page":"151","DOI":"10.3390\/computers12080151","article-title":"Convolutional neural networks: A survey","volume":"12","author":"Krichen Moez","year":"2023","unstructured":"Moez Krichen. 2023. Convolutional neural networks: A survey. Computers 12, 8 (2023), 151.","journal-title":"Computers"},{"key":"e_1_3_1_33_2","article-title":"Quantizing deep convolutional networks for efficient inference: A whitepaper","author":"Krishnamoorthi Raghuraman","year":"2018","unstructured":"Raghuraman Krishnamoorthi. 2018. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv:1806.08342. Retrieved from https:\/\/arxiv.org\/abs\/1806.08342","journal-title":"arXiv:1806.08342"},{"key":"e_1_3_1_34_2","first-page":"1097","article-title":"Imagenet classification with deep convolutional neural networks","volume":"25","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25 (2012), 1097\u20131105.","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"e_1_3_1_35_2","volume-title":"Information Theory and Statistics","author":"Kullback Solomon","year":"1997","unstructured":"Solomon Kullback. 1997. Information Theory and Statistics. Courier Corporation."},{"key":"e_1_3_1_36_2","volume-title":"wincnn","author":"Lavin Andrew","year":"2021","unstructured":"Andrew Lavin. 2021. wincnn. 
Retrieved February 27, 2021 from https:\/\/github.com\/andravin\/wincnn"},{"key":"e_1_3_1_37_2","first-page":"4013","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Lavin Andrew","year":"2016","unstructured":"Andrew Lavin and Scott Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4013\u20134021."},{"key":"e_1_3_1_38_2","first-page":"159","volume-title":"Proceedings of the IEEE Intl Conf on Parallel & Distributed Processing with Applications","author":"Li Chendi","year":"2021","unstructured":"Chendi Li, Haipeng Jia, Hang Cao, Jianyu Yao, Boqian Shi, Chunyang Xiang, Jinbo Sun, Pengqi Lu, and Yunquan Zhang. 2021. Autotsmm: An auto-tuning framework for building high-performance tall-and-skinny matrix-matrix multiplication on cpus. In Proceedings of the IEEE Intl Conf on Parallel & Distributed Processing with Applications. IEEE, 159\u2013166."},{"key":"e_1_3_1_39_2","first-page":"1","volume-title":"Proceedings of the International Conference on Parallel Processing","author":"Li Dongsheng","year":"2021","unstructured":"Dongsheng Li, Dan Huang, Zhiguang Chen, and Yutong Lu. 2021. Optimizing massively parallel winograd convolution on ARM processor. In Proceedings of the International Conference on Parallel Processing. 1\u201312."},{"key":"e_1_3_1_40_2","first-page":"3842","volume-title":"Proceedings of the International Conference on Acoustics, Speech and Signal Processing","author":"Li Guangli","year":"2020","unstructured":"Guangli Li, Lei Liu, Xueying Wang, Xiu Ma, and Xiaobing Feng. 2020. Lance: Efficient low-precision quantized winograd convolution for neural networks based on graphics processing units. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing. 
IEEE, 3842\u20133846."},{"key":"e_1_3_1_41_2","first-page":"90","volume-title":"Proceedings of the International Symposium on Code Generation and Optimization","author":"Li Guangli","year":"2021","unstructured":"Guangli Li, Jingling Xue, Lei Liu, Xueying Wang, Xiu Ma, Xiao Dong, Jiansong Li, and Xiaobing Feng. 2021. Unleashing the low-precision computation potential of tensor cores on GPUs. In Proceedings of the International Symposium on Code Generation and Optimization. IEEE, 90\u2013102."},{"key":"e_1_3_1_42_2","first-page":"1","volume-title":"Proceedings of the International Conference on Parallel Processing","author":"Liu Junhong","year":"2021","unstructured":"Junhong Liu, Dongxu Yang, and Junjie Lai. 2021. Optimizing Winograd-based convolution with tensor cores. In Proceedings of the International Conference on Parallel Processing. 1\u201310."},{"key":"e_1_3_1_43_2","first-page":"1025","volume-title":"Proceedings of the 2019 USENIX Annual Technical Conference (USENIX ATC \u201919)","author":"Liu Yizhi","year":"2019","unstructured":"Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, and Yida Wang. 2019. Optimizing CNN model inference on cpus. In Proceedings of the 2019 USENIX Annual Technical Conference (USENIX ATC \u201919). 1025\u20131040."},{"key":"e_1_3_1_44_2","first-page":"1","volume-title":"Proceedings of the European Conference on Computer Systems","author":"Mazaheri Arya","year":"2020","unstructured":"Arya Mazaheri, Tim Beringer, Matthew Moskewicz, Felix Wolf, and Ali Jannesari. 2020. Accelerating winograd convolutions using symbolic computation and meta-programming. In Proceedings of the European Conference on Computer Systems. 1\u201314."},{"key":"e_1_3_1_45_2","unstructured":"Paulius Micikevicius Dusan Stosic Neil Burgess Marius Cornea Pradeep Dubey Richard Grisenthwaite Sangwon Ha Alexander Heinecke Patrick Judd John Kamalu Naveen Mellempudi Stuart Oberman Mohammad Shoeybi Michael Siu and Hao Wu. 2022. FP8 formats for deep learning. 
arxiv:2209.05433 [cs.LG]. Retrieved from https:\/\/arxiv.org\/abs\/2209.05433"},{"key":"e_1_3_1_46_2","first-page":"5","volume-title":"Proceedings of the GPU Technology Conference","volume":"2","author":"Migacz Szymon","year":"2017","unstructured":"Szymon Migacz. 2017. 8-bit inference with tensorrt. In Proceedings of the GPU Technology Conference, Vol. 2. 5."},{"key":"e_1_3_1_47_2","volume-title":"CUDA C++ Programming Guide","year":"2021","unstructured":"NVIDIA. 2021. CUDA C++ Programming Guide. Retrieved March 29, 2021 from https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html"},{"key":"e_1_3_1_48_2","first-page":"608","volume-title":"Proceedings of the 15th European Conference on Computer Vision (ECCV \u201918)","author":"Park Eunhyeok","year":"2018","unstructured":"Eunhyeok Park, Sungjoo Yoo, and Peter Vajda. 2018. Value-aware quantization for training and inference of neural networks. In Proceedings of the 15th European Conference on Computer Vision (ECCV \u201918). 608\u2013624."},{"key":"e_1_3_1_49_2","first-page":"8024","volume-title":"Advances in Neural Information Processing Systems","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas K\u00f6pf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems. 8024\u20138035."},{"key":"e_1_3_1_50_2","doi-asserted-by":"crossref","unstructured":"Tran Minh Quan David Grant Colburn Hildebrand and Won-Ki Jeong. 2021. FusionNet: A deep fully residual convolutional neural network for image segmentation in connectomics. Frontiers in Computer Science. 
3 (2021) 613981.","DOI":"10.3389\/fcomp.2021.613981"},{"key":"e_1_3_1_51_2","article-title":"Yolov3: An incremental improvement","author":"Redmon Joseph","year":"2018","unstructured":"Joseph Redmon and Ali Farhadi. 2018. Yolov3: An incremental improvement. arXiv:1804.02767. Retrieved from https:\/\/arxiv.org\/abs\/1804.02767","journal-title":"arXiv:1804.02767"},{"key":"e_1_3_1_52_2","first-page":"234","volume-title":"International Conference on Medical Image Computing and Computer-Assisted Intervention","author":"Ronneberger Olaf","year":"2015","unstructured":"Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 234\u2013241."},{"key":"e_1_3_1_53_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Simonyan Karen","year":"2015","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations, 1\u201314."},{"key":"e_1_3_1_54_2","first-page":"1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Szegedy Christian","year":"2015","unstructured":"Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1\u20139."},{"key":"e_1_3_1_55_2","volume-title":"ncnn","year":"2021","unstructured":"Tencent. 2021. ncnn. 
Retrieved February 27, 2021 from https:\/\/github.com\/Tencent\/ncnn"},{"key":"e_1_3_1_56_2","first-page":"1","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Wang Yida","year":"2015","unstructured":"Yida Wang, Michael J. Anderson, Jonathan D. Cohen, Alexander Heinecke, Kai Li, Nadathur Satish, Narayanan Sundaram, Nicholas B. Turk-Browne, and Theodore L. Willke. 2015. Full correlation matrix analysis of fMRI data on Intel Xeon Phi coprocessors. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1\u201312."},{"key":"e_1_3_1_57_2","first-page":"77","volume-title":"Proceedings of the International Symposium on Code Generation and Optimization","author":"Weng Jian","year":"2021","unstructured":"Jian Weng, Animesh Jain, Jie Wang, Leyuan Wang, Yida Wang, and Tony Nowatzki. 2021. UNIT: Unifying tensorized instruction compilation. In Proceedings of the International Symposium on Code Generation and Optimization. IEEE, 77\u201389."},{"key":"e_1_3_1_58_2","doi-asserted-by":"crossref","DOI":"10.1137\/1.9781611970364","volume-title":"Arithmetic Complexity of Computations","author":"Winograd Shmuel","year":"1980","unstructured":"Shmuel Winograd. 1980. Arithmetic Complexity of Computations. Vol. 33. SIAM."},{"key":"e_1_3_1_59_2","first-page":"53","volume-title":"Proceedings of the ACM SIGOPS Asia-Pacific Workshop on Systems","author":"Xie Dedong","year":"2022","unstructured":"Dedong Xie, Zhen Jia, Zili Zhang, and Xin Jin. 2022. Optimizing half precision Winograd convolution on ARM many-core processors. In Proceedings of the ACM SIGOPS Asia-Pacific Workshop on Systems. 53\u201360."},{"key":"e_1_3_1_60_2","first-page":"32","volume-title":"Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","author":"Yan Da","year":"2020","unstructured":"Da Yan, Wei Wang, and Xiaowen Chu. 2020. 
Optimizing batched winograd convolution on GPUs. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 32\u201344."},{"key":"e_1_3_1_61_2","doi-asserted-by":"crossref","first-page":"1209","DOI":"10.1145\/3123266.3129393","volume-title":"Proceedings of the International Conference on Multimedia","author":"Yang Haojin","year":"2017","unstructured":"Haojin Yang, Martin Fritzsche, Christian Bartz, and Christoph Meinel. 2017. Bmxnet: An open-source binary neural network implementation based on mxnet. In Proceedings of the International Conference on Multimedia. 1209\u20131212."},{"key":"e_1_3_1_62_2","first-page":"1780","volume-title":"Proceedings of the IEEE International Conference on Multimedia and Expo","author":"Yao Yiwu","year":"2019","unstructured":"Yiwu Yao, Bin Dong, Yuke Li, Weiqiang Yang, and Haoqi Zhu. 2019. Efficient implementation of convolutional neural networks with end to end integer-only dataflow. In Proceedings of the IEEE International Conference on Multimedia and Expo. 1780\u20131785."},{"key":"e_1_3_1_63_2","unstructured":"Zhewei Yao Zhen Dong Zhangcheng Zheng Amir Gholami Jiali Yu Eric Tan Leyuan Wang Qijing Huang Yida Wang Michael Mahoney et\u00a0al. 2021. HAWQ-V3: Dyadic neural network quantization. In International Conference on Machine Learning. PMLR 11875\u201311886."},{"key":"e_1_3_1_64_2","doi-asserted-by":"crossref","first-page":"414","DOI":"10.1145\/3330345.3330382","volume-title":"Proceedings of the International Conference on Supercomputing","author":"Zlateski Aleksandar","year":"2019","unstructured":"Aleksandar Zlateski, Zhen Jia, Kai Li, and Fredo Durand. 2019. The anatomy of efficient FFT and winograd convolutions on modern CPUs. In Proceedings of the International Conference on Supercomputing. 
414\u2013424."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3632956","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3632956","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:35:50Z","timestamp":1750178150000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3632956"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,19]]},"references-count":63,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,3,31]]}},"alternative-id":["10.1145\/3632956"],"URL":"https:\/\/doi.org\/10.1145\/3632956","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,1,19]]},"assertion":[{"value":"2023-04-24","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-10-30","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-01-19","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}