{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,20]],"date-time":"2025-12-20T22:22:55Z","timestamp":1766269375032,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":29,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,8,9]],"date-time":"2021-08-09T00:00:00Z","timestamp":1628467200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,8,9]]},"DOI":"10.1145\/3472456.3472496","type":"proceedings-article","created":{"date-parts":[[2021,10,5]],"date-time":"2021-10-05T18:46:04Z","timestamp":1633459564000},"page":"1-12","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":10,"title":["Optimizing Massively Parallel Winograd Convolution on ARM Processor"],"prefix":"10.1145","author":[{"given":"Dongsheng","family":"Li","sequence":"first","affiliation":[{"name":"Sun Yat-sen University, China"}]},{"given":"Dan","family":"Huang","sequence":"additional","affiliation":[{"name":"Sun Yat-sen University, China"}]},{"given":"Zhiguang","family":"Chen","sequence":"additional","affiliation":[{"name":"Sun Yat-sen University, China"}]},{"given":"Yutong","family":"Lu","sequence":"additional","affiliation":[{"name":"Sun Yat-sen University, China"}]}],"member":"320","published-online":{"date-parts":[[2021,10,5]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"2016. FALCON Library: Fast Image Convolution in Neural Networks on Intel Architecture. https:\/\/colfaxresearch.com\/falcon-library\/ Accessed: 03-28-2021."},{"key":"e_1_3_2_1_2_1","unstructured":"Accessed: 03-28-2021. Arm Developer: Neon intrinsics guide. https:\/\/developer.arm.com\/architectures\/instruction-sets\/simd-isas\/neon\/intrinsics. Accessed: 03-28-2021."},{"key":"e_1_3_2_1_3_1","volume-title":"03-28-2021","author":"Accessed","year":"2021","unstructured":"Accessed: 03-28-2021. Intel\u00ae Math Kernel Library. https:\/\/software.intel.com\/en-us\/mkl. Accessed: 03-28-2021."},{"key":"e_1_3_2_1_4_1","unstructured":"Accessed: 03-28-2021. MKL-DNN. https:\/\/github.com\/oneapi-src\/oneDNN Accessed: 03-28-2021."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCS.2018.00068"},{"key":"e_1_3_2_1_6_1","unstructured":"Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient Primitives for Deep Learning. CoRR abs\/1410.0759 (2014). arxiv:1410.0759 http:\/\/arxiv.org\/abs\/1410.0759"},{"key":"e_1_3_2_1_7_1","unstructured":"Matthieu Courbariaux, Yoshua Bengio, and J. David. 2015. Low precision arithmetic for deep learning. CoRR abs\/1412.7024 (2015)."},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2851141.2851193"},{"key":"e_1_3_2_1_9_1","unstructured":"Marat Dukhan. Accessed: 03-28-2021. NNPACK. https:\/\/github.com\/Maratyszcza\/NNPACK Accessed: 03-28-2021."},{"key":"e_1_3_2_1_10_1","unstructured":"Vincent Dumoulin and Francesco Visin. 2016. A guide to convolution arithmetic for deep learning. CoRR abs\/1603.07285 (2016). arxiv:1603.07285 http:\/\/arxiv.org\/abs\/1603.07285"},{"key":"e_1_3_2_1_11_1","volume-title":"03-28-2021","author":"Giorgio Michele\u00a0Di","year":"2021","unstructured":"Michele\u00a0Di Giorgio. Accessed: 03-28-2021. ARM Compute Library. https:\/\/github.com\/ARM-software\/ComputeLibrary Accessed: 03-28-2021."},{"key":"e_1_3_2_1_12_1","volume-title":"Proceedings of the 32nd International Conference on International Conference on Machine Learning -","volume":"37","author":"Gupta Suyog","year":"2015","unstructured":"Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep Learning with Limited Numerical Precision. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 (Lille, France) (ICML\u201915). JMLR.org, 1737\u20131746."},{"key":"e_1_3_2_1_13_1","volume-title":"Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016","author":"He Kaiming","year":"2016","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 770\u2013778. https:\/\/doi.org\/10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_1_14_1","volume-title":"Densely Connected Convolutional Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017","author":"Huang Gao","year":"2017","unstructured":"Gao Huang, Zhuang Liu, Laurens van\u00a0der Maaten, and Kilian\u00a0Q. Weinberger. 2017. Densely Connected Convolutional Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 2261\u20132269. https:\/\/doi.org\/10.1109\/CVPR.2017.243"},{"key":"e_1_3_2_1_15_1","volume-title":"van\u00a0de Geijn","author":"Huang Jianyu","year":"2016","unstructured":"Jianyu Huang and Robert\u00a0A. van\u00a0de Geijn. 2016. BLISlab: A Sandbox for Optimizing GEMM. FLAME Working Note #80, TR-16-13. The University of Texas at Austin, Department of Computer Science. http:\/\/arxiv.org\/pdf\/1609.00076v1.pdf"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178487.3178496"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1181"},{"key":"e_1_3_2_1_18_1","volume-title":"Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey\u00a0E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, Peter\u00a0L. Bartlett, Fernando C.\u00a0N. Pereira, Christopher J.\u00a0C. Burges, L\u00e9on Bottou, and Kilian\u00a0Q. Weinberger (Eds.). 1106\u20131114. https:\/\/proceedings.neurips.cc\/paper\/2012\/hash\/c399862d3b9d6b76c8436e924a68c45b-Abstract.html"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2508834.2513149"},{"key":"e_1_3_2_1_20_1","volume-title":"Fast Algorithms for Convolutional Neural Networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016","author":"Lavin Andrew","year":"2016","unstructured":"Andrew Lavin and Scott Gray. 2016. Fast Algorithms for Convolutional Neural Networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 4013\u20134021. https:\/\/doi.org\/10.1109\/CVPR.2016.435"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/E2SC.2016.008"},{"key":"e_1_3_2_1_22_1","unstructured":"Tran\u00a0Minh Quan, David G.\u00a0C. Hildebrand, and Won-Ki Jeong. 2016. FusionNet: A deep fully residual convolutional neural network for image segmentation in connectomics. CoRR abs\/1612.05360 (2016). arxiv:1612.05360 http:\/\/arxiv.org\/abs\/1612.05360"},{"volume-title":"Theory and application of digital signal processing","author":"Rabiner R","key":"e_1_3_2_1_23_1","unstructured":"Lawrence\u00a0R Rabiner, Bernard Gold, and CK Yuen. 2016. Theory and application of digital signal processing. Prentice-Hall."},{"key":"e_1_3_2_1_24_1","unstructured":"David Seal. 2001. ARM architecture reference manual. Pearson Education."},{"key":"e_1_3_2_1_25_1","unstructured":"J Shalf, J Bashor, D Patterson, K Asanovic, Katherine Yelick, K Keutzer, and T Mattson. 2009. The manycore revolution: Will HPC lead or follow? SciDAC Review 14 (2009), 40\u201349."},{"key":"e_1_3_2_1_26_1","volume-title":"Benchmarking State-of-the-Art Deep Learning Software Tools. In 7th International Conference on Cloud Computing and Big Data, CCBD 2016","author":"Shi Shaohuai","year":"2016","unstructured":"Shaohuai Shi, Qiang Wang, Pengfei Xu, and Xiaowen Chu. 2016. Benchmarking State-of-the-Art Deep Learning Software Tools. In 7th International Conference on Cloud Computing and Big Data, CCBD 2016, Macau, China, November 16-18, 2016. IEEE Computer Society, 99\u2013104. https:\/\/doi.org\/10.1109\/CCBD.2016.029"},{"key":"e_1_3_2_1_27_1","volume-title":"3rd International Conference on Learning Representations, ICLR","author":"Simonyan Karen","year":"2015","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http:\/\/arxiv.org\/abs\/1409.1556"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"crossref","unstructured":"Shmuel Winograd. 1980. Arithmetic complexity of computations. Vol.\u00a033. Siam.","DOI":"10.1137\/1.9781611970364"},{"key":"e_1_3_2_1_29_1","unstructured":"Zhang Xianyi. Accessed: 03-28-2021. OpenBLAS. https:\/\/github.com\/xianyi\/OpenBLAS Accessed: 03-28-2021."}],"event":{"name":"ICPP 2021: 50th International Conference on Parallel Processing","acronym":"ICPP 2021","location":"Lemont IL USA"},"container-title":["50th International Conference on Parallel Processing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3472456.3472496","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3472456.3472496","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:48:12Z","timestamp":1750193292000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3472456.3472496"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,8,9]]},"references-count":29,"alternative-id":["10.1145\/3472456.3472496","10.1145\/3472456"],"URL":"https:\/\/doi.org\/10.1145\/3472456.3472496","relation":{},"subject":[],"published":{"date-parts":[[2021,8,9]]},"assertion":[{"value":"2021-10-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}