{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,17]],"date-time":"2026-03-17T00:45:42Z","timestamp":1773708342902,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":39,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,8,9]],"date-time":"2021-08-09T00:00:00Z","timestamp":1628467200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,8,9]]},"DOI":"10.1145\/3472456.3472473","type":"proceedings-article","created":{"date-parts":[[2021,10,5]],"date-time":"2021-10-05T18:39:57Z","timestamp":1633459197000},"page":"1-10","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":13,"title":["Optimizing Winograd-Based Convolution with Tensor Cores"],"prefix":"10.1145","author":[{"given":"Junhong","family":"Liu","sequence":"first","affiliation":[{"name":"NVIDIA, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Dongxu","family":"Yang","sequence":"additional","affiliation":[{"name":"NVIDIA, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Junjie","family":"Lai","sequence":"additional","affiliation":[{"name":"NVIDIA, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,10,5]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Andrew Anderson Aravind Vasudevan Cormac Keane and David Gregg. 2017. Low-memory GEMM-based convolution algorithms for deep neural networks. arxiv:1709.03395\u00a0[cs.CV]  Andrew Anderson Aravind Vasudevan Cormac Keane and David Gregg. 2017. Low-memory GEMM-based convolution algorithms for deep neural networks. 
arxiv:1709.03395\u00a0[cs.CV]"},{"key":"e_1_3_2_1_2_1","volume-title":"2020 42nd Annual International Conference of the IEEE Engineering in Medicine Biology Society (EMBC). 142\u2013145","author":"Avilov O.","year":"2020","unstructured":"O. Avilov , S. Rimbert , A. Popov , and L. Bougrain . 2020. Deep Learning Techniques to Improve Intraoperative Awareness Detection from Electroencephalographic Signals . In 2020 42nd Annual International Conference of the IEEE Engineering in Medicine Biology Society (EMBC). 142\u2013145 . https:\/\/doi.org\/10.1109\/EMBC44109. 2020 .9176228 O. Avilov, S. Rimbert, A. Popov, and L. Bougrain. 2020. Deep Learning Techniques to Improve Intraoperative Awareness Detection from Electroencephalographic Signals. In 2020 42nd Annual International Conference of the IEEE Engineering in Medicine Biology Society (EMBC). 142\u2013145. https:\/\/doi.org\/10.1109\/EMBC44109.2020.9176228"},{"key":"e_1_3_2_1_3_1","unstructured":"David Budden Alexander Matveev Shibani Santurkar Shraman\u00a0Ray Chaudhuri and Nir Shavit. 2017. Deep Tensor Convolution on Multicores. arxiv:1611.06565\u00a0[cs.CV]  David Budden Alexander Matveev Shibani Santurkar Shraman\u00a0Ray Chaudhuri and Nir Shavit. 2017. Deep Tensor Convolution on Multicores. arxiv:1611.06565\u00a0[cs.CV]"},{"key":"e_1_3_2_1_4_1","volume-title":"High Performance Convolutional Neural Networks for Document Processing. In Tenth International Workshop on Frontiers in Handwriting Recognition, Guy Lorette (Ed.). Universit\u00e9 de Rennes 1","author":"Chellapilla Kumar","year":"2006","unstructured":"Kumar Chellapilla , Sidd Puri , and Patrice Simard . 2006 . High Performance Convolutional Neural Networks for Document Processing. In Tenth International Workshop on Frontiers in Handwriting Recognition, Guy Lorette (Ed.). Universit\u00e9 de Rennes 1 , Suvisoft, La Baule (France). https:\/\/hal.inria.fr\/inria-00112631http:\/\/www.suvisoft.com. Kumar Chellapilla, Sidd Puri, and Patrice Simard. 2006. 
High Performance Convolutional Neural Networks for Document Processing. In Tenth International Workshop on Frontiers in Handwriting Recognition, Guy Lorette (Ed.). Universit\u00e9 de Rennes 1, Suvisoft, La Baule (France). https:\/\/hal.inria.fr\/inria-00112631http:\/\/www.suvisoft.com."},{"key":"e_1_3_2_1_5_1","volume-title":"Proceedings of the 34th International Conference on Machine Learning -","volume":"70","author":"Cho Minsik","year":"2017","unstructured":"Minsik Cho and Daniel Brand . 2017 . MEC: Memory-Efficient Convolution for Deep Neural Network . In Proceedings of the 34th International Conference on Machine Learning - Volume 70 (Sydney, NSW, Australia) (ICML\u201917). JMLR.org, 815\u2013824. Minsik Cho and Daniel Brand. 2017. MEC: Memory-Efficient Convolution for Deep Neural Network. In Proceedings of the 34th International Conference on Machine Learning - Volume 70 (Sydney, NSW, Australia) (ICML\u201917). JMLR.org, 815\u2013824."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/1390156.1390177"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1090\/S0002-9947-1969-0249212-8"},{"key":"e_1_3_2_1_8_1","unstructured":"Matthieu Courbariaux Yoshua Bengio and Jean-Pierre David. 2015. Training deep neural networks with low precision multiplications. arxiv:1412.7024\u00a0[cs.LG]  Matthieu Courbariaux Yoshua Bengio and Jean-Pierre David. 2015. Training deep neural networks with low precision multiplications. arxiv:1412.7024\u00a0[cs.LG]"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"crossref","unstructured":"Evangelos Georganas Sasikanth Avancha Kunal Banerjee Dhiraj Kalamkar Greg Henry Hans Pabst and Alexander Heinecke. 2018. Anatomy Of High-Performance Deep Learning Convolutions On SIMD Architectures. 
arxiv:1808.05567\u00a0[cs.DC]  Evangelos Georganas Sasikanth Avancha Kunal Banerjee Dhiraj Kalamkar Greg Henry Hans Pabst and Alexander Heinecke. 2018. Anatomy Of High-Performance Deep Learning Convolutions On SIMD Architectures. arxiv:1808.05567\u00a0[cs.DC]","DOI":"10.1109\/SC.2018.00069"},{"key":"e_1_3_2_1_11_1","unstructured":"Suyog Gupta Ankur Agrawal Kailash Gopalakrishnan and Pritish Narayanan. 2015. Deep Learning with Limited Numerical Precision. arxiv:1502.02551\u00a0[cs.LG]  Suyog Gupta Ankur Agrawal Kailash Gopalakrishnan and Pritish Narayanan. 2015. Deep Learning with Limited Numerical Precision. arxiv:1502.02551\u00a0[cs.LG]"},{"key":"e_1_3_2_1_12_1","unstructured":"Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arxiv:1512.03385\u00a0[cs.CV]  Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arxiv:1512.03385\u00a0[cs.CV]"},{"key":"e_1_3_2_1_13_1","volume-title":"Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arxiv:1502.03167\u00a0[cs.LG]","author":"Ioffe Sergey","year":"2015","unstructured":"Sergey Ioffe and Christian Szegedy . 2015 . Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arxiv:1502.03167\u00a0[cs.LG] Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arxiv:1502.03167\u00a0[cs.LG]"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178487.3178496"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2918851"},{"key":"e_1_3_2_1_16_1","volume-title":"Duplo: Lifting Redundant Memory Accesses of Deep Neural Networks for GPU Tensor Cores. In 2020 53rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO). 
725\u2013737","author":"Kim Hyeonjin","year":"2020","unstructured":"Hyeonjin Kim , Sungwoo Ahn , Yunho Oh , Bogil Kim , Won\u00a0Woo Ro , and William\u00a0 J. Song . 2020 . Duplo: Lifting Redundant Memory Accesses of Deep Neural Networks for GPU Tensor Cores. In 2020 53rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO). 725\u2013737 . https:\/\/doi.org\/10.1109\/MICRO50266.2020.00065 Hyeonjin Kim, Sungwoo Ahn, Yunho Oh, Bogil Kim, Won\u00a0Woo Ro, and William\u00a0J. Song. 2020. Duplo: Lifting Redundant Memory Accesses of Deep Neural Networks for GPU Tensor Cores. In 2020 53rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO). 725\u2013737. https:\/\/doi.org\/10.1109\/MICRO50266.2020.00065"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3065386"},{"key":"e_1_3_2_1_18_1","volume-title":"Proceedings of the 2013 IEEE\/ACM International Symposium on Code Generation and Optimization (CGO). 1\u201310","author":"Lai J.","year":"2013","unstructured":"J. Lai and A. Seznec . 2013. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs . In Proceedings of the 2013 IEEE\/ACM International Symposium on Code Generation and Optimization (CGO). 1\u201310 . https:\/\/doi.org\/10.1109\/CGO. 2013 .6494986 J. Lai and A. Seznec. 2013. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. In Proceedings of the 2013 IEEE\/ACM International Symposium on Code Generation and Optimization (CGO). 1\u201310. https:\/\/doi.org\/10.1109\/CGO.2013.6494986"},{"key":"e_1_3_2_1_19_1","unstructured":"Andrew Lavin and Scott Gray. 2015. Fast Algorithms for Convolutional Neural Networks. arxiv:1509.09308\u00a0[cs.NE]  Andrew Lavin and Scott Gray. 2015. Fast Algorithms for Convolutional Neural Networks. arxiv:1509.09308\u00a0[cs.NE]"},{"key":"e_1_3_2_1_20_1","volume-title":"Performance & Precision. 
2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (May 2018","author":"Markidis Stefano","year":"2018","unstructured":"Stefano Markidis , Steven Wei\u00a0Der Chien , Erwin Laure , Ivy\u00a0Bo Peng , and Jeffrey\u00a0 S. Vetter . 2018 . NVIDIA Tensor Core Programmability , Performance & Precision. 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (May 2018 ). https:\/\/doi.org\/10.1109\/ipdpsw.2018.00091 Stefano Markidis, Steven Wei\u00a0Der Chien, Erwin Laure, Ivy\u00a0Bo Peng, and Jeffrey\u00a0S. Vetter. 2018. NVIDIA Tensor Core Programmability, Performance & Precision. 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (May 2018). https:\/\/doi.org\/10.1109\/ipdpsw.2018.00091"},{"key":"e_1_3_2_1_21_1","unstructured":"Michael Mathieu Mikael Henaff and Yann LeCun. 2014. Fast Training of Convolutional Networks through FFTs. arxiv:1312.5851\u00a0[cs.CV]  Michael Mathieu Mikael Henaff and Yann LeCun. 2014. Fast Training of Convolutional Networks through FFTs. arxiv:1312.5851\u00a0[cs.CV]"},{"key":"e_1_3_2_1_22_1","unstructured":"Paulius Micikevicius Sharan Narang Jonah Alben Gregory Diamos Erich Elsen David Garcia Boris Ginsburg Michael Houston Oleksii Kuchaiev Ganesh Venkatesh and Hao Wu. 2018. Mixed Precision Training. arxiv:1710.03740\u00a0[cs.AI]  Paulius Micikevicius Sharan Narang Jonah Alben Gregory Diamos Erich Elsen David Garcia Boris Ginsburg Michael Houston Oleksii Kuchaiev Ganesh Venkatesh and Hao Wu. 2018. Mixed Precision Training. arxiv:1710.03740\u00a0[cs.AI]"},{"key":"e_1_3_2_1_23_1","unstructured":"Tran\u00a0Minh Quan David G.\u00a0C. Hildebrand and Won-Ki Jeong. 2016. FusionNet: A deep fully residual convolutional neural network for image segmentation in connectomics. arxiv:1612.05360\u00a0[cs.CV]  Tran\u00a0Minh Quan David G.\u00a0C. Hildebrand and Won-Ki Jeong. 2016. 
FusionNet: A deep fully residual convolutional neural network for image segmentation in connectomics. arxiv:1612.05360\u00a0[cs.CV]"},{"key":"e_1_3_2_1_24_1","volume-title":"Advances in Neural Information Processing Systems, C.\u00a0Cortes, N.\u00a0Lawrence, D.\u00a0Lee, M.\u00a0Sugiyama, and R.\u00a0Garnett (Eds.), Vol.\u00a028. Curran Associates","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren , Kaiming He , Ross Girshick , and Jian Sun . 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks . In Advances in Neural Information Processing Systems, C.\u00a0Cortes, N.\u00a0Lawrence, D.\u00a0Lee, M.\u00a0Sugiyama, and R.\u00a0Garnett (Eds.), Vol.\u00a028. Curran Associates , Inc .https:\/\/proceedings.neurips.cc\/paper\/ 2015 \/file\/14bfa6bb14875e45bba028a21ed38046-Paper.pdf Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems, C.\u00a0Cortes, N.\u00a0Lawrence, D.\u00a0Lee, M.\u00a0Sugiyama, and R.\u00a0Garnett (Eds.), Vol.\u00a028. Curran Associates, Inc.https:\/\/proceedings.neurips.cc\/paper\/2015\/file\/14bfa6bb14875e45bba028a21ed38046-Paper.pdf"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_3_2_1_26_1","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. arxiv:1409.1556\u00a0[cs.CV]  Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. arxiv:1409.1556\u00a0[cs.CV]"},{"key":"e_1_3_2_1_27_1","volume-title":"GPNPU: Enabling Efficient Hardware-Based Direct Convolution with Multi-Precision Support in GPU Tensor Cores. In 2020 57th ACM\/IEEE Design Automation Conference (DAC). 1\u20136. 
https:\/\/doi.org\/10","author":"Song Zhuoran","year":"2020","unstructured":"Zhuoran Song , Jianfei Wang , Tianjian Li , Li Jiang , Jing Ke , Xiaoyao Liang , and Naifeng Jing . 2020 . GPNPU: Enabling Efficient Hardware-Based Direct Convolution with Multi-Precision Support in GPU Tensor Cores. In 2020 57th ACM\/IEEE Design Automation Conference (DAC). 1\u20136. https:\/\/doi.org\/10 .1109\/DAC18072.2020.9218566 Zhuoran Song, Jianfei Wang, Tianjian Li, Li Jiang, Jing Ke, Xiaoyao Liang, and Naifeng Jing. 2020. GPNPU: Enabling Efficient Hardware-Based Direct Convolution with Multi-Precision Support in GPU Tensor Cores. In 2020 57th ACM\/IEEE Design Automation Conference (DAC). 1\u20136. https:\/\/doi.org\/10.1109\/DAC18072.2020.9218566"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063431"},{"key":"e_1_3_2_1_29_1","unstructured":"Andrei\u00a0L Toom. 1963. The complexity of a scheme of functional elements realizing the multiplication of integers. In Soviet Mathematics Doklady Vol.\u00a03. 714\u2013716.  Andrei\u00a0L Toom. 1963. The complexity of a scheme of functional elements realizing the multiplication of integers. In Soviet Mathematics Doklady Vol.\u00a03. 714\u2013716."},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/CBI.2017.23"},{"key":"e_1_3_2_1_31_1","volume-title":"Advances in Neural Information Processing Systems, C.\u00a0J.\u00a0C. Burges, L.\u00a0Bottou, M.\u00a0Welling, Z.\u00a0Ghahramani, and K.\u00a0Q. Weinberger(Eds.), Vol.\u00a026. Curran Associates","author":"van\u00a0den Oord Aaron","year":"2013","unstructured":"Aaron van\u00a0den Oord , Sander Dieleman , and Benjamin Schrauwen . 2013. Deep content-based music recommendation . In Advances in Neural Information Processing Systems, C.\u00a0J.\u00a0C. Burges, L.\u00a0Bottou, M.\u00a0Welling, Z.\u00a0Ghahramani, and K.\u00a0Q. Weinberger(Eds.), Vol.\u00a026. 
Curran Associates , Inc .https:\/\/proceedings.neurips.cc\/paper\/ 2013 \/file\/b3ba8f1bee1238a2f37603d90b58898d-Paper.pdf Aaron van\u00a0den Oord, Sander Dieleman, and Benjamin Schrauwen. 2013. Deep content-based music recommendation. In Advances in Neural Information Processing Systems, C.\u00a0J.\u00a0C. Burges, L.\u00a0Bottou, M.\u00a0Welling, Z.\u00a0Ghahramani, and K.\u00a0Q. Weinberger(Eds.), Vol.\u00a026. Curran Associates, Inc.https:\/\/proceedings.neurips.cc\/paper\/2013\/file\/b3ba8f1bee1238a2f37603d90b58898d-Paper.pdf"},{"key":"e_1_3_2_1_32_1","volume-title":"Deep Learning and Unsupervised Feature Learning Workshop, NIPS","author":"Vanhoucke Vincent","year":"2011","unstructured":"Vincent Vanhoucke , Andrew Senior , and Mark\u00a0 Z. Mao . 2011 . Improving the speed of neural networks on CPUs . In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011. Vincent Vanhoucke, Andrew Senior, and Mark\u00a0Z. Mao. 2011. Improving the speed of neural networks on CPUs. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011."},{"key":"e_1_3_2_1_33_1","unstructured":"Nicolas Vasilache Jeff Johnson Michael Mathieu Soumith Chintala Serkan Piantino and Yann LeCun. 2015. Fast Convolutional Nets With fbfft: A GPU Performance Evaluation. arxiv:1412.7580\u00a0[cs.LG]  Nicolas Vasilache Jeff Johnson Michael Mathieu Soumith Chintala Serkan Piantino and Yann LeCun. 2015. Fast Convolutional Nets With fbfft: A GPU Performance Evaluation. arxiv:1412.7580\u00a0[cs.LG]"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS47924.2020.00071"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3332466.3374520"},{"key":"e_1_3_2_1_36_1","volume-title":"Proceedings of the 35th International Conference on Machine Learning(Proceedings of Machine Learning Research, Vol.\u00a080)","author":"Zhang Jiyuan","year":"2018","unstructured":"Jiyuan Zhang , Franz Franchetti , and Tze\u00a0Meng Low . 2018 . 
High Performance Zero-Memory Overhead Direct Convolutions . In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol.\u00a080) , Jennifer Dy and Andreas Krause (Eds.). PMLR, Stockholmsm\u00e4ssan, Stockholm Sweden, 5776\u20135785. http:\/\/proceedings.mlr.press\/v80\/zhang18d.html Jiyuan Zhang, Franz Franchetti, and Tze\u00a0Meng Low. 2018. High Performance Zero-Memory Overhead Direct Convolutions. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol.\u00a080), Jennifer Dy and Andreas Krause (Eds.). PMLR, Stockholmsm\u00e4ssan, Stockholm Sweden, 5776\u20135785. http:\/\/proceedings.mlr.press\/v80\/zhang18d.html"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3018743.3018755"},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2016.72"},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2016.119"}],"event":{"name":"ICPP 2021: 50th International Conference on Parallel Processing","location":"Lemont IL USA","acronym":"ICPP 2021"},"container-title":["50th International Conference on Parallel 
Processing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3472456.3472473","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3472456.3472473","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:48:11Z","timestamp":1750193291000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3472456.3472473"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,8,9]]},"references-count":39,"alternative-id":["10.1145\/3472456.3472473","10.1145\/3472456"],"URL":"https:\/\/doi.org\/10.1145\/3472456.3472473","relation":{},"subject":[],"published":{"date-parts":[[2021,8,9]]},"assertion":[{"value":"2021-10-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}