{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,13]],"date-time":"2026-04-13T23:15:00Z","timestamp":1776122100667,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":27,"publisher":"ACM","license":[{"start":{"date-parts":[[2020,2,19]],"date-time":"2020-02-19T00:00:00Z","timestamp":1582070400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"HK Research Grants Council","award":["26213818"],"award-info":[{"award-number":["26213818"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2020,2,19]]},"DOI":"10.1145\/3332466.3374520","type":"proceedings-article","created":{"date-parts":[[2020,2,19]],"date-time":"2020-02-19T19:13:53Z","timestamp":1582139633000},"page":"32-44","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":48,"title":["Optimizing batched winograd convolution on GPUs"],"prefix":"10.1145","author":[{"given":"Da","family":"Yan","sequence":"first","affiliation":[{"name":"HKUST"}]},{"given":"Wei","family":"Wang","sequence":"additional","affiliation":[{"name":"HKUST"}]},{"given":"Xiaowen","family":"Chu","sequence":"additional","affiliation":[{"name":"Hong Kong Baptist University"}]}],"member":"320","published-online":{"date-parts":[[2020,2,19]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"cuDNN: Efficient Primitives for Deep Learning. CoRR abs\/1410.0759","author":"Chetlur Sharan","year":"2014","unstructured":"Sharan Chetlur , Cliff Woolley , Philippe Vandermersch , Jonathan Cohen , John Tran , Bryan Catanzaro , and Evan Shelhamer . 2014. cuDNN: Efficient Primitives for Deep Learning. CoRR abs\/1410.0759 ( 2014 ), 1--9. Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient Primitives for Deep Learning. CoRR abs\/1410.0759 (2014), 1--9."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2018.00069"},{"key":"e_1_3_2_1_3_1","volume-title":"Large Minibatch SGD: Training ImageNet in 1 Hour. CoRR abs\/1706.02677","author":"Goyal Priya","year":"2017","unstructured":"Priya Goyal , Piotr Doll\u00e1r , Ross B. Girshick , Pieter Noordhuis , Lukasz Wesolowski , Aapo Kyrola , Andrew Tulloch , Yangqing Jia , and Kaiming He. 2017. Accurate , Large Minibatch SGD: Training ImageNet in 1 Hour. CoRR abs\/1706.02677 ( 2017 ), 1--12. Priya Goyal, Piotr Doll\u00e1r, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. CoRR abs\/1706.02677 (2017), 1--12."},{"key":"e_1_3_2_1_4_1","volume-title":"Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR","author":"He Kaiming","year":"2016","unstructured":"Kaiming He , Xiangyu Zhang , Shaoqing Ren , and Jian Sun . 2016 . Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016. IEEE Computer Society, Las Vegas, NV, USA, 770--778. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016. IEEE Computer Society, Las Vegas, NV, USA, 770--778."},{"key":"e_1_3_2_1_5_1","volume-title":"Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. CoRR abs\/1804.06826","author":"Jia Zhe","year":"2018","unstructured":"Zhe Jia , Marco Maggioni , Benjamin Staiger , and Daniele Paolo Scarpazza . 2018. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. CoRR abs\/1804.06826 ( 2018 ), 1--66. Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele Paolo Scarpazza. 2018. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. CoRR abs\/1804.06826 (2018), 1--66."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178487.3178496"},{"key":"e_1_3_2_1_7_1","volume-title":"Retrieved","author":"Krizhevsky Alex","year":"2015","unstructured":"Alex Krizhevsky . 2015 . cuda-convnet2 . Retrieved Jan 12, 2019 from https:\/\/github.com\/akrizhevsky\/cuda-convnet2 Alex Krizhevsky. 2015. cuda-convnet2. Retrieved Jan 12, 2019 from https:\/\/github.com\/akrizhevsky\/cuda-convnet2"},{"key":"e_1_3_2_1_8_1","volume-title":"NIPS","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky , Ilya Sutskever , and Geoffrey E. Hinton . 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems , NIPS 2012 . NIPS, Lake Tahoe, NV, USA, 1106--1114. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, NIPS 2012. NIPS, Lake Tahoe, NV, USA, 1106--1114."},{"key":"e_1_3_2_1_9_1","volume-title":"Proceedings of the 2013 IEEE\/ACM International Symposium on Code Generation and Optimization, CGO 2013. IEEE Computer Society","author":"Lai Junjie","year":"2013","unstructured":"Junjie Lai and Andr\u00e9 Seznec . 2013 . Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs . In Proceedings of the 2013 IEEE\/ACM International Symposium on Code Generation and Optimization, CGO 2013. IEEE Computer Society , Shenzhen, China, 4:1--4:10. Junjie Lai and Andr\u00e9 Seznec. 2013. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. In Proceedings of the 2013 IEEE\/ACM International Symposium on Code Generation and Optimization, CGO 2013. IEEE Computer Society, Shenzhen, China, 4:1--4:10."},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/106972.106981"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.435"},{"key":"e_1_3_2_1_12_1","volume-title":"Fast Training of Convolutional Networks through FFTs. CoRR abs\/1312.5851","author":"Mathieu Michael","year":"2013","unstructured":"Michael Mathieu , Mikael Henaff , and Yann LeCun . 2013. Fast Training of Convolutional Networks through FFTs. CoRR abs\/1312.5851 ( 2013 ), 1--9. Michael Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast Training of Convolutional Networks through FFTs. CoRR abs\/1312.5851 (2013), 1--9."},{"key":"e_1_3_2_1_13_1","first-page":"72","article-title":"Dissecting GPU Memory Hierarchy Through Microbenchmarking","volume":"28","author":"Mei Xinxin","year":"2017","unstructured":"Xinxin Mei and Xiaowen Chu . 2017 . Dissecting GPU Memory Hierarchy Through Microbenchmarking . IEEE TPDS 28 (2017), 72 -- 86 . Xinxin Mei and Xiaowen Chu. 2017. Dissecting GPU Memory Hierarchy Through Microbenchmarking. IEEE TPDS 28 (2017), 72--86.","journal-title":"IEEE TPDS"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-662-44917-2_13"},{"key":"e_1_3_2_1_15_1","volume-title":"Retrieved","year":"2016","unstructured":"NervanaSystems. 2016 . Maxas . Retrieved Jan 12, 2019 from https:\/\/github.com\/NervanaSystems\/maxas NervanaSystems. 2016. Maxas. Retrieved Jan 12, 2019 from https:\/\/github.com\/NervanaSystems\/maxas"},{"key":"e_1_3_2_1_16_1","volume-title":"Retrieved","year":"2016","unstructured":"NervanaSystems. 2016 . Neon . Retrieved Jan 12, 2019 from https:\/\/github.com\/NervanaSystems\/neon\/tree\/master\/neon\/backends\/kernels\/sass NervanaSystems. 2016. Neon. Retrieved Jan 12, 2019 from https:\/\/github.com\/NervanaSystems\/neon\/tree\/master\/neon\/backends\/kernels\/sass"},{"key":"e_1_3_2_1_17_1","volume-title":"Retrieved","author":"NVIDIA.","year":"2018","unstructured":"NVIDIA. 2018 . NVIDIA TURING GPU ARCHITECTURE . Retrieved Jan 12, 2019 from https:\/\/www.nvidia.com\/content\/dam\/en-zz\/Solutions\/design-visualization\/technologies\/turing-architecture\/NVIDIA-Turing-Architecture-Whitepaper.pdf NVIDIA. 2018. NVIDIA TURING GPU ARCHITECTURE. Retrieved Jan 12, 2019 from https:\/\/www.nvidia.com\/content\/dam\/en-zz\/Solutions\/design-visualization\/technologies\/turing-architecture\/NVIDIA-Turing-Architecture-Whitepaper.pdf"},{"key":"e_1_3_2_1_18_1","volume-title":"CUDA C Programming Guide. Retrieved","author":"NVIDIA.","year":"2019","unstructured":"NVIDIA. 2019. CUDA C Programming Guide. Retrieved Jul 2, 2019 from https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html NVIDIA. 2019. CUDA C Programming Guide. Retrieved Jul 2, 2019 from https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html"},{"key":"e_1_3_2_1_19_1","volume-title":"How to Implement Performance Metrics in CUDA C\/C++. Retrieved","author":"NVIDIA.","year":"2019","unstructured":"NVIDIA. 2019. How to Implement Performance Metrics in CUDA C\/C++. Retrieved Jul 2, 2019 from https:\/\/devblogs.nvidia.com\/how-implement-performance-metrics-cuda-cc\/ NVIDIA. 2019. How to Implement Performance Metrics in CUDA C\/C++. Retrieved Jul 2, 2019 from https:\/\/devblogs.nvidia.com\/how-implement-performance-metrics-cuda-cc\/"},{"key":"e_1_3_2_1_20_1","unstructured":"NVIDIA. 2019. Nsight Compute. Retrieved Jul 2 2019 from https:\/\/docs.nvidia.com\/nsight-compute\/NsightCompute\/index.html  NVIDIA. 2019. Nsight Compute. Retrieved Jul 2 2019 from https:\/\/docs.nvidia.com\/nsight-compute\/NsightCompute\/index.html"},{"key":"e_1_3_2_1_21_1","unstructured":"MLPerf Org. 2019. MLPerf. Retrieved Jul 2 2019 from https:\/\/mlperf.org\/  MLPerf Org. 2019. MLPerf. Retrieved Jul 2 2019 from https:\/\/mlperf.org\/"},{"key":"e_1_3_2_1_22_1","volume-title":"NIPS","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren , Kaiming He , Ross Girshick , and Jian Sun . 2015 . Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems , NIPS 2015. NIPS, Montreal, Quebec, Canada, 91--99. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, NIPS 2015. NIPS, Montreal, Quebec, Canada, 91--99."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CCBD.2016.029"},{"key":"e_1_3_2_1_24_1","volume-title":"Very deep convolutional networks for large-scale image recognition. CoRR abs\/1409.1556","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman . 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs\/1409.1556 ( 2014 ), 1--14. Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs\/1409.1556 (2014), 1--14."},{"key":"e_1_3_2_1_25_1","volume-title":"Proceedings of the 2008 ACM\/IEEE Conference on SuperComputing (SC). IEEE Press","author":"Volkov Vasily","unstructured":"Vasily Volkov and James W. Demmel . 2008. Benchmarking GPUs to Tune Dense Linear Algebra . In Proceedings of the 2008 ACM\/IEEE Conference on SuperComputing (SC). IEEE Press , Piscataway, NJ, USA, 31:1--31:11. Vasily Volkov and James W. Demmel. 2008. Benchmarking GPUs to Tune Dense Linear Algebra. In Proceedings of the 2008 ACM\/IEEE Conference on SuperComputing (SC). IEEE Press, Piscataway, NJ, USA, 31:1--31:11."},{"key":"e_1_3_2_1_26_1","volume-title":"Arithmetic complexity of computations","author":"Winograd Shmuel","unstructured":"Shmuel Winograd . 1980. Arithmetic complexity of computations . Vol. 33 . Siam, Salt Lake City, UT, USA. Shmuel Winograd. 1980. Arithmetic complexity of computations. Vol. 33. Siam, Salt Lake City, UT, USA."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3018743.3018755"}],"event":{"name":"PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","location":"San Diego California","acronym":"PPoPP '20","sponsor":["SIGPLAN ACM Special Interest Group on Programming Languages","SIGHPC ACM Special Interest Group on High Performance Computing, Special Interest Group on High Performance Computing"]},"container-title":["Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3332466.3374520","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3332466.3374520","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:54:37Z","timestamp":1750204477000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3332466.3374520"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,2,19]]},"references-count":27,"alternative-id":["10.1145\/3332466.3374520","10.1145\/3332466"],"URL":"https:\/\/doi.org\/10.1145\/3332466.3374520","relation":{},"subject":[],"published":{"date-parts":[[2020,2,19]]},"assertion":[{"value":"2020-02-19","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}