{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,13]],"date-time":"2026-02-13T22:34:56Z","timestamp":1771022096138,"version":"3.50.1"},"reference-count":96,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2023,7,19]],"date-time":"2023-07-19T00:00:00Z","timestamp":1689724800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Science Foundation","award":["CCF-2130688, CCF-1900904, and CNS-210705"],"award-info":[{"award-number":["CCF-2130688, CCF-1900904, and CNS-210705"]}]},{"DOI":"10.13039\/501100007129","name":"Natural Science Foundation of Shandong Province","doi-asserted-by":"crossref","award":["ZR2019LZH014 and ZR2022MF328"],"award-info":[{"award-number":["ZR2019LZH014 and ZR2022MF328"]}],"id":[{"id":"10.13039\/501100007129","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61602284 and 61602285"],"award-info":[{"award-number":["61602284 and 61602285"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Funding for Study Abroad Program by the Government of Shandong Province"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2023,9,30]]},"abstract":"<jats:p>The convolutional neural network (CNN) is an important deep learning method, which is widely used in many fields. However, it is very time consuming to implement the CNN where convolution usually takes most of the time. There are many zero values in feature maps and filters, which leads to redundant calculations and memory accesses if dense methods are used to compute convolution. Many works recently have made use of sparsity to skip the calculations for zero values to reduce the inference time of the CNN. On the graphics processing unit platform, current works cannot fully exploit the sparsity of the feature map and achieve satisfactory performance. Therefore, we design a new parallel strategy to transform the feature map into a new storage format to avoid the redundant computation of zero values on graphics processing units. Also considering the sparsity in the feature map, we propose a fused storage format to combine the convolution operation with the following pooling operation, to further improve the performance. We carry out experiments with mainstream CNN models and achieve better performance compared with cuDNN and cuSPARSE. For VGG-19, ResNet-50, DenseNet-121, and RegNetX-16GF, 1.97\u00d7, 2.23\u00d7, 2.74\u00d7, and 1.58\u00d7 speedups respectively are obtained over cuDNN. The speedups over cuSPARSE respectively are 2.10\u00d7, 1.83\u00d7, 2.35\u00d7, and 1.35\u00d7 when only using the first method.<\/jats:p>","DOI":"10.1145\/3600092","type":"journal-article","created":{"date-parts":[[2023,5,27]],"date-time":"2023-05-27T10:27:33Z","timestamp":1685183253000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUs"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1211-1889","authenticated-orcid":false,"given":"Weizhi","family":"Xu","sequence":"first","affiliation":[{"name":"Shandong Normal University and University of Houston"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4938-5015","authenticated-orcid":false,"given":"Yintai","family":"Sun","sequence":"additional","affiliation":[{"name":"Shandong Normal University"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-9160-8540","authenticated-orcid":false,"given":"Shengyu","family":"Fan","sequence":"additional","affiliation":[{"name":"Shandong Normal University"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1769-1114","authenticated-orcid":false,"given":"Hui","family":"Yu","sequence":"additional","affiliation":[{"name":"Shandong Normal University and University of Houston"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9458-4769","authenticated-orcid":false,"given":"Xin","family":"Fu","sequence":"additional","affiliation":[{"name":"University of Houston"}]}],"member":"320","published-online":{"date-parts":[[2023,7,19]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"Mart\u00edn Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen Craig Citro Greg S. Corrado et\u00a0al. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467 (2016)."},{"key":"e_1_3_2_3_2","doi-asserted-by":"crossref","unstructured":"Peter Ahrens Fredrik Kjolstad and Saman Amarasinghe. 2022. Autoscheduling for sparse tensor algebra with an asymptotic cost model. In Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI\u201922) . ACM New York NY 269\u2013285.","DOI":"10.1145\/3519939.3523442"},{"key":"e_1_3_2_4_2","first-page":"1","volume-title":"Proceedings of the 2016 ACM\/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA\u201916)","author":"Albericio Jorge","year":"2016","unstructured":"Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and Andreas Moshovos. 2016. Cnvlutin: Ineffectual-neuron-free deep neural network computing. In Proceedings of the 2016 ACM\/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA\u201916). 1\u201313."},{"key":"e_1_3_2_5_2","volume-title":"Proceedings of the 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916)","author":"Alwani Manoj","year":"2016","unstructured":"Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder. 2016. Fused-layer CNN accelerators. In Proceedings of the 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916). IEEE, Los Alamitos, CA, Article 22, 12 pages."},{"key":"e_1_3_2_6_2","article-title":"High performance convolution using sparsity and patterns for inference in deep convolutional neural networks","author":"Amer Hossam","year":"2021","unstructured":"Hossam Amer, Ahmed H. Salamah, Ahmad Sajedi, and En-Hui Yang. 2021. High performance convolution using sparsity and patterns for inference in deep convolutional neural networks. arXiv preprint arXiv:2104.08314 (2021).","journal-title":"arXiv preprint arXiv:2104.08314"},{"key":"e_1_3_2_7_2","first-page":"99","volume-title":"Proceedings of the 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD\u201920)","author":"Anderson Andrew","year":"2020","unstructured":"Andrew Anderson, Aravind Vasudevan, Cormac Keane, and David Gregg. 2020. High-performance low-memory lowering: GEMM-based algorithms for DNN convolution. In Proceedings of the 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD\u201920). 99\u2013106."},{"key":"e_1_3_2_8_2","doi-asserted-by":"crossref","unstructured":"Srimat Chakradhar Murugan Sankaradas Venkata Jakkula and Srihari Cadambi. 2010. A dynamically configurable coprocessor for convolutional neural networks. ACM SIGARCH Computer Architecture News 38 3 (2010) 247\u2013257.","DOI":"10.1145\/1816038.1815993"},{"key":"e_1_3_2_9_2","doi-asserted-by":"crossref","first-page":"208","DOI":"10.1109\/HPCA51647.2021.00027","volume-title":"Proceedings of the 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201921)","author":"Chang Sung-En","year":"2021","unstructured":"Sung-En Chang, Yanyu Li, Mengshu Sun, Runbin Shi, Hayden K.-H. So, Xuehai Qian, Yanzhi Wang, and Xue Lin. 2021. Mix and match: A novel FPGA-centric deep neural network quantization framework. In Proceedings of the 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201921). 208\u2013220."},{"key":"e_1_3_2_10_2","doi-asserted-by":"crossref","unstructured":"Tianshi Chen Zidong Du Ninghui Sun Jia Wang Chengyong Wu Yunji Chen and Olivier Temam. 2014. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM SIGARCH Computer Architecture News 42 1 (2014) 269\u2013284.","DOI":"10.1145\/2654822.2541967"},{"key":"e_1_3_2_11_2","article-title":"Escoin: Efficient sparse convolutional neural network inference on GPUs","author":"Chen Xuhao","year":"2018","unstructured":"Xuhao Chen. 2018. Escoin: Efficient sparse convolutional neural network inference on GPUs. arXiv preprint arXiv:1802.10280 (2018).","journal-title":"arXiv preprint arXiv:1802.10280"},{"key":"e_1_3_2_12_2","doi-asserted-by":"crossref","unstructured":"Yu-Hsin Chen Joel Emer and Vivienne Sze. 2016. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In Proceedings of the 2016 ACM\/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA\u201916) . IEEE Los Alamitos CA.","DOI":"10.1109\/ISCA.2016.40"},{"key":"e_1_3_2_13_2","article-title":"cuDNN: Efficient primitives for deep learning","author":"Chetlur Sharan","year":"2014","unstructured":"Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).","journal-title":"arXiv preprint arXiv:1410.0759"},{"key":"e_1_3_2_14_2","first-page":"27","volume-title":"Proceedings of the 2016 ACM\/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA\u201916)","author":"Chi Ping","year":"2016","unstructured":"Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. 2016. PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. In Proceedings of the 2016 ACM\/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA\u201916). 27\u201339."},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00104"},{"key":"e_1_3_2_16_2","doi-asserted-by":"crossref","first-page":"48","DOI":"10.1109\/ISSCC42613.2021.9365803","volume-title":"Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC\u201921)","volume":"64","author":"Choquette Jack","year":"2021","unstructured":"Jack Choquette, Edward Lee, Ronny Krashinsky, Vishnu Balan, and Brucek Khailany. 2021. 3.2 The A100 datacenter GPU and ampere architecture. In Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC\u201921), Vol. 64. IEEE, Los Alamitos, CA, 48\u201350."},{"key":"e_1_3_2_17_2","first-page":"30","article-title":"Compilation of dynamic sparse tensor algebra","volume":"6","author":"Chou Stephen","year":"2022","unstructured":"Stephen Chou and Saman Amarasinghe. 2022. Compilation of dynamic sparse tensor algebra. Proceedings of the ACM on Programming Languages 6, OOPSLA2 (Oct. 2022), Article 175, 30 pages.","journal-title":"Proceedings of the ACM on Programming Languages"},{"key":"e_1_3_2_18_2","doi-asserted-by":"crossref","first-page":"293","DOI":"10.1007\/978-3-319-59072-1_35","volume-title":"Advances in Neural Networks\u2014ISNN 2017","author":"Daultani Vijay","year":"2017","unstructured":"Vijay Daultani, Yoshiyuki Ohno, and Kazuhisa Ishizaka. 2017. Sparse direct convolutional neural network. In Advances in Neural Networks\u2014ISNN 2017, Fengyu Cong, Andrew Leung, and Qinglai Wei (Eds.). Springer International Publishing, Cham, Switzerland, 293\u2013303."},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2020.2981068"},{"key":"e_1_3_2_20_2","first-page":"1110","volume-title":"Proceedings of the 2021 ACM\/IEEE 48th Annual International Symposium on Computer Architecture (ISCA\u201921)","author":"Deng Chunhua","year":"2021","unstructured":"Chunhua Deng, Yang Sui, Siyu Liao, Xuehai Qian, and Bo Yuan. 2021. GoSPA: An energy-efficient high-performance globally optimized sparse convolutional neural network accelerator. In Proceedings of the 2021 ACM\/IEEE 48th Annual International Symposium on Computer Architecture (ISCA\u201921). 1110\u20131123."},{"key":"e_1_3_2_21_2","first-page":"273","volume-title":"Proceedings of the International Conference on Artificial Neural Networks","author":"Dey Sourya","year":"2017","unstructured":"Sourya Dey, Yinan Shao, Keith M. Chugg, and Peter A. Beerel. 2017. Accelerating training of deep neural networks via sparse edge processing. In Proceedings of the International Conference on Artificial Neural Networks. 273\u2013280."},{"key":"e_1_3_2_22_2","article-title":"TensorFlow distributions","author":"Dillon Joshua V.","year":"2017","unstructured":"Joshua V. Dillon, Ian Langmore, Dustin Tran, Eugene Brevdo, Srinivas Vasudevan, Dave Moore, Brian Patton, Alex Alemi, Matt Hoffman, and Rif A. Saurous. 2017. TensorFlow distributions. arXiv preprint arXiv:1711.10604 (2017).","journal-title":"arXiv preprint arXiv:1711.10604"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2021.3067825"},{"key":"e_1_3_2_24_2","doi-asserted-by":"crossref","unstructured":"J. J. Dongarra Jeremy Du Croz Sven Hammarling and I. S. Duff. 1990. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software 16 1 (1990) 1\u201317.","DOI":"10.1145\/77626.79170"},{"key":"e_1_3_2_25_2","first-page":"92","volume-title":"Proceedings of the 2015 ACM\/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA\u201915)","author":"Du Zidong","year":"2015","unstructured":"Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. 2015. ShiDianNao: Shifting vision processing closer to the sensor. In Proceedings of the 2015 ACM\/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA\u201915). 92\u2013104."},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1137\/1026055"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2919554"},{"key":"e_1_3_2_28_2","doi-asserted-by":"crossref","unstructured":"K. Fatahalian J. Sugerman and P. Hanrahan. 2004. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. In Proceedings of the ACM SIGGRAPH\/EUROGRAPHICS Conference on Graphics Hardware (HWWS\u201904) . ACM New York NY 133\u2013137.","DOI":"10.1145\/1058129.1058148"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.5555\/3433701.3433723"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2014.106"},{"key":"e_1_3_2_31_2","doi-asserted-by":"crossref","unstructured":"Ashish Gondimalla Noah Chesnut Mithuna Thottethodi and T. N. Vijaykumar. 2019. SparTen: A sparse tensor accelerator for convolutional neural networks. In Proceedings of the 52nd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201919) . ACM New York NY 151\u2013165.","DOI":"10.1145\/3352460.3358291"},{"key":"e_1_3_2_32_2","doi-asserted-by":"crossref","first-page":"2694","DOI":"10.1109\/CVPR46437.2021.00272","volume-title":"Proceedings of the 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201921)","author":"Habibian Amirhossein","year":"2021","unstructured":"Amirhossein Habibian, Davide Abati, Taco S. Cohen, and Babak Ehteshami Bejnordi. 2021. Skip-convolutions for efficient video processing. In Proceedings of the 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201921). 2694\u20132703."},{"key":"e_1_3_2_33_2","first-page":"243","volume-title":"Proceedings of the 2016 ACM\/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA\u201916)","author":"Han Song","year":"2016","unstructured":"Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 2016 ACM\/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA\u201916). 243\u2013254."},{"key":"e_1_3_2_34_2","first-page":"1135","article-title":"Learning both weights and connections for efficient neural network","author":"Han Song","year":"2015","unstructured":"Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS\u201915). 1135\u20131143.","journal-title":"Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS\u201915)."},{"issue":"1","key":"e_1_3_2_35_2","first-page":"1","article-title":"Breast cancer multi-classification from histopathological images with structured deep learning model","volume":"7","author":"Han Zhongyi","year":"2017","unstructured":"Zhongyi Han, Benzheng Wei, Yuanjie Zheng, Yilong Yin, Kejian Li, and Shuo Li. 2017. Breast cancer multi-classification from histopathological images with structured deep learning model. Scientific Reports 7, 1 (2017), 1\u201310.","journal-title":"Scientific Reports"},{"key":"e_1_3_2_36_2","doi-asserted-by":"crossref","unstructured":"Changwan Hong Aravind Sukumaran-Rajam Bortik Bandyopadhyay Jinsung Kim S\u00fcreyya Emre Kurt Israt Nisa Shivani Sabhlok \u00dcmit V. \u00c7ataly\u00fcrek Srinivasan Parthasarathy and P. Sadayappan. 2018. Efficient sparse-matrix multi-vector product on GPUs. In Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing (HPDC\u201918) . ACM New York NY 66\u201379.","DOI":"10.1145\/3208040.3208062"},{"key":"e_1_3_2_37_2","unstructured":"Liancheng Jia Yun Liang Xiuhong Li Liqiang Lu and Shengen Yan. 2020. Enabling efficient fast convolution algorithms on GPUs via MegaKernels. IEEE Transactions on Computers 69 7 (2020) 986\u2013997."},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654889"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},{"key":"e_1_3_2_40_2","first-page":"209","volume-title":"Proceedings of the 2019 32nd International Conference on VLSI Design and the 2019 18th International Conference on Embedded Systems (VLSID\u201919)","author":"Kala S.","year":"2019","unstructured":"S. Kala, Jimson Mathew, Babita R. Jose, and S. Nalesh. 2019. UniWiG: Unified Winograd-GEMM architecture for accelerating CNN on FPGAs. In Proceedings of the 2019 32nd International Conference on VLSI Design and the 2019 18th International Conference on Embedded Systems (VLSID\u201919). 209\u2013214."},{"key":"e_1_3_2_41_2","doi-asserted-by":"crossref","unstructured":"David Kirk. 2008. NVIDIA CUDA software and GPU parallel computing architecture. In Proceedings of the 6th International Symposium on Memory Management (ISMM\u201907) . 103\u2013104.","DOI":"10.1145\/1296907.1296909"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.435"},{"key":"e_1_3_2_43_2","article-title":"Accelerating AI applications with sparse matrix compression in Halide","author":"Lee Chao-Lin","year":"2022","unstructured":"Chao-Lin Lee, Chen-Ting Chao, Wei-Hsu Chu, Ming-Yu Hung, and Jenq-Kuen Lee. 2022. Accelerating AI applications with sparse matrix compression in Halide. Journal of Signal Processing Systems. Early access, November 3, 2022.","journal-title":"Journal of Signal Processing Systems."},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/3339186.3339194"},{"key":"e_1_3_2_45_2","first-page":"633","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC\u201916)","author":"Li Chao","year":"2016","unstructured":"Chao Li, Yi Yang, Min Feng, Srimat Chakradhar, and Huiyang Zhou. 2016. Optimizing memory efficiency for deep convolutional neural networks on GPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC\u201916). IEEE, Los Alamitos, CA, 633\u2013644."},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2019.2924215"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/3293883.3295734"},{"key":"e_1_3_2_48_2","first-page":"67","volume-title":"Proceedings of the 2016 45th International Conference on Parallel Processing (ICPP\u201916)","author":"Li Xiaqing","year":"2016","unstructured":"Xiaqing Li, Guangyan Zhang, H. Howie Huang, Zhufan Wang, and Weimin Zheng. 2016. Performance analysis of GPU-based convolutional neural networks. In Proceedings of the 2016 45th International Conference on Parallel Processing (ICPP\u201916). IEEE, Los Alamitos, CA, 67\u201376."},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2008.31"},{"key":"e_1_3_2_50_2","first-page":"806","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Liu Baoyuan","year":"2015","unstructured":"Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. 2015. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 806\u2013814."},{"key":"e_1_3_2_51_2","unstructured":"Xiaolong Ma Fu-Ming Guo Wei Niu Xue Lin Jian Tang Kaisheng Ma Bin Ren and Yanzhi Wang. 2019. PCONV: The missing but desirable sparsity in DNN weight pruning for real-time execution on mobile devices. arXiv:1909.05073 (2019)."},{"key":"e_1_3_2_52_2","article-title":"End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF","author":"Ma Xuezhe","year":"2016","unstructured":"Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354 (2016).","journal-title":"arXiv preprint arXiv:1603.01354"},{"key":"e_1_3_2_53_2","doi-asserted-by":"crossref","first-page":"781","DOI":"10.1109\/MICRO50266.2020.00069","volume-title":"Proceedings of the 2020 53rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201920)","author":"Mahmoud Mostafa","year":"2020","unstructured":"Mostafa Mahmoud, Isak Edo, Ali Hadi Zadeh, Omar Mohamed Awad, Gennady Pekhimenko, Jorge Albericio, and Andreas Moshovos. 2020. TensorDash: Exploiting sparsity to accelerate deep neural network training. In Proceedings of the 2020 53rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201920). 781\u2013795."},{"key":"e_1_3_2_54_2","doi-asserted-by":"crossref","unstructured":"Abhinandan Majumdar Srihari Cadambi Michela Becchi Srimat T. Chakradhar and Hans Peter Graf. 2012. A massively parallel energy efficient programmable accelerator for learning and classification. ACM Transactions on Architecture and Code Optimization 9 1 (2012) Article 6 30 pages.","DOI":"10.1145\/2133382.2133388"},{"key":"e_1_3_2_55_2","first-page":"522","volume-title":"Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201918)","author":"Markidis Stefano","year":"2018","unstructured":"Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S. Vetter. 2018. NVIDIA Tensor Core programmability, performance and precision. In Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201918). 522\u2013531."},{"key":"e_1_3_2_56_2","article-title":"Fast training of convolutional networks through FFTs","author":"Mathieu Michael","year":"2013","unstructured":"Michael Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851 (2013).","journal-title":"arXiv preprint arXiv:1312.5851"},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00521-018-3761-1"},{"key":"e_1_3_2_58_2","doi-asserted-by":"crossref","unstructured":"Yuyao Niu Zhengyang Lu Haonan Ji Shuhui Song Zhou Jin and Weifeng Liu. 2022. TileSpGEMM: A tiled algorithm for parallel sparse general matrix-matrix multiplication on GPUs. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP\u201922) . ACM New York NY 90\u2013106.","DOI":"10.1145\/3503221.3508431"},{"key":"e_1_3_2_59_2","volume-title":"cuDNN\u2014GPU Accelerated Deep Learning","year":"2014","unstructured":"NVIDIA. 2014. cuDNN\u2014GPU Accelerated Deep Learning. NVIDIA."},{"key":"e_1_3_2_60_2","volume-title":"CUDA Documentation","year":"2018","unstructured":"NVIDIA. 2018. CUDA Documentation. NVIDIA."},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1145\/3140659.3080254"},{"key":"e_1_3_2_62_2","article-title":"Faster CNNs with direct sparse convolutions and guided pruning","author":"Park Jongsoo","year":"2016","unstructured":"Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and Pradeep Dubey. 2016. Faster CNNs with direct sparse convolutions and guided pruning. arXiv preprint arXiv:1608.01409 (2016).","journal-title":"arXiv preprint arXiv:1608.01409"},{"key":"e_1_3_2_63_2","first-page":"8026","article-title":"PyTorch: An imperative style, high-performance deep learning library","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, et\u00a0al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (NIPS\u201919). 8026\u20138037.","journal-title":"Proceedings of the 33rd International Conference on Neural Information Processing Systems (NIPS\u201919)."},{"key":"e_1_3_2_64_2","first-page":"13","volume-title":"Proceedings of the 2013 IEEE 31st International Conference on Computer Design (ICCD\u201913)","author":"Peemen Maurice","year":"2013","unstructured":"Maurice Peemen, Arnaud A. A. Setio, Bart Mesman, and Henk Corporaal. 2013. Memory-centric accelerator design for convolutional neural networks. In Proceedings of the 2013 IEEE 31st International Conference on Computer Design (ICCD\u201913). 13\u201319."},{"key":"e_1_3_2_65_2","doi-asserted-by":"crossref","unstructured":"Wajahat Qadeer Rehan Hameed Ofer Shacham Preethi Venkatesan Christos Kozyrakis and Mark A. Horowitz. 2015. Convolution engine: Balancing efficiency and flexibility in specialized computing. Communications of the ACM 58 4 (2015) 85\u201393.","DOI":"10.1145\/2735841"},{"key":"e_1_3_2_66_2","first-page":"58","volume-title":"Proceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA\u201920)","author":"Qin Eric","year":"2020","unstructured":"Eric Qin, Ananda Samajdar, Hyoukjun Kwon, Vineet Nadella, Sudarshan Srinivasan, Dipankar Das, Bharat Kaul, and Tushar Krishna. 2020. SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training. In Proceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA\u201920). 58\u201370."},{"key":"e_1_3_2_67_2","doi-asserted-by":"crossref","first-page":"24","DOI":"10.1109\/IISWC47752.2019.9042000","volume-title":"Proceedings of the 2019 IEEE International Symposium on Workload Characterization (IISWC\u201919)","author":"Radu Valentin","year":"2019","unstructured":"Valentin Radu, Kuba Kaszyk, Yuan Wen, Jack Turner, Jos\u00e9 Cano, Elliot J. Crowley, Bj\u00f6rn Franke, Amos Storkey, and Michael O\u2019Boyle. 2019. Performance aware convolutional neural network channel pruning for embedded GPUs. In Proceedings of the 2019 IEEE International Symposium on Workload Characterization (IISWC\u201919). 24\u201334."},{"key":"e_1_3_2_68_2","unstructured":"Simon Rovder Jos\u00e9 Cano and Michael O\u2019Boyle. 2019. Optimising convolutional neural networks inference on low-powered GPUs. In Proceedings of the 12th International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG\u201919) ."},{"key":"e_1_3_2_69_2","volume-title":"CUDA by Example: An Introduction to General-Purpose GPU Programming","author":"Sanders Jason","year":"2010","unstructured":"Jason Sanders and Edward Kandrot. 2010. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional."},{"key":"e_1_3_2_70_2","doi-asserted-by":"publisher","DOI":"10.3390\/electronics10080895"},{"key":"e_1_3_2_71_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.11941"},{"key":"e_1_3_2_72_2","unstructured":"Shaohuai Shi and Xiaowen Chu. 2017. Speeding up convolutional neural networks by exploiting the sparsity of rectifier units. arXiv:1704.07724 (2017)."},{"key":"e_1_3_2_73_2","doi-asserted-by":"crossref","unstructured":"Mohammadreza Soltaniyeh Richard P. Martin and Santosh Nagarakatte. 2022. An accelerator for sparse convolutional neural networks leveraging systolic general matrix-matrix multiplication. ACM Transactions on Architecture and Code Optimization 19 3 (2022) Article 42 26 pages.","DOI":"10.1145\/3532863"},{"key":"e_1_3_2_74_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2017.55"},{"key":"e_1_3_2_75_2","unstructured":"Zhuoran Song Yihong Xu Han Li Naifeng Jing Xiaoyao Liang and Li Jiang. 2022. DNN training acceleration via exploring GPGPU friendly sparsity. arXiv:2203.05705 (2022)."},{"key":"e_1_3_2_76_2","doi-asserted-by":"publisher","DOI":"10.1155\/2020\/3645729"},{"key":"e_1_3_2_77_2","first-page":"6105","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Tan Mingxing","year":"2019","unstructured":"Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning. 6105\u20136114."},{"key":"e_1_3_2_78_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2018.2844093"},{"key":"e_1_3_2_79_2","doi-asserted-by":"crossref","unstructured":"Xiaohua Wan Fa Zhang Qi Chu and Zhiyong Liu. 2012. High-performance blob-based iterative three-dimensional reconstruction in electron tomography using multi-GPUs. BMC Bioinformatics 13 Suppl. 10 (2012) S4.","DOI":"10.1186\/1471-2105-13-S10-S4"},{"key":"e_1_3_2_80_2","doi-asserted-by":"crossref","unstructured":"Xueying Wang Guangli Li Xiao Dong Jiansong Li Lei Liu and Xiaobing Feng. 2020. Accelerating deep learning inference with cross-layer data reuse on GPUs. In Euro-Par 2020: Parallel Processing . Lecture Notes in Computer Science Vol. 12247. Springer 219\u2013233.","DOI":"10.1007\/978-3-030-57675-2_14"},{"key":"e_1_3_2_81_2","doi-asserted-by":"crossref","unstructured":"Ziheng Wang. 2020. SparseRT: Accelerating unstructured sparsity on GPUs for deep learning inference. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (PACT\u201920) . ACM New York NY 31\u201342.","DOI":"10.1145\/3410463.3414654"},{"key":"e_1_3_2_82_2","first-page":"2082","article-title":"Learning structured sparsity in deep neural networks","author":"Wen Wei","year":"2016","unstructured":"Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. Learning structured sparsity in deep neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS\u201916). 2082\u20132090.","journal-title":"Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS\u201916)."},{"key":"e_1_3_2_83_2","first-page":"231","volume-title":"Proceedings of the 2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel\/Distributed Computing","author":"Xu Weizhi","year":"2012","unstructured":"Weizhi Xu, Hao Zhang, Shuai Jiao, Da Wang, Fenglong Song, and Zhiyong Liu. 2012. Optimizing sparse matrix vector multiplication using cache blocking method on Fermi GPU. In Proceedings of the 2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel\/Distributed Computing. IEEE, Los Alamitos, CA, 231\u2013235."},{"key":"e_1_3_2_84_2","first-page":"3634","volume-title":"Proceedings of the 2022 IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV\u201922)","author":"Xu Zirui","year":"2022","unstructured":"Zirui Xu, Fuxun Yu, Chenxi Liu, Zhe Wu, Hongcheng Wang, and Xiang Chen. 2022. FalCon: Fine-grained feature map sparsity computing with decomposed convolutions for inference optimization. In Proceedings of the 2022 IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV\u201922). 3634\u20133644."},{"key":"e_1_3_2_85_2","doi-asserted-by":"crossref","first-page":"894","DOI":"10.1109\/HPCA51647.2021.00079","volume-title":"Proceedings of the 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201921)","author":"Yang Jianxun","year":"2021","unstructured":"Jianxun Yang, Zhao Zhang, Zhuangzhi Liu, Jing Zhou, Leibo Liu, Shaojun Wei, and Shouyi Yin. 2021. FuseKNA: Fused kernel convolution based accelerator for deep neural networks. In Proceedings of the 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201921). 894\u2013907."},{"key":"e_1_3_2_86_2","first-page":"236","volume-title":"Proceedings of the 2019 ACM\/IEEE 46th Annual International Symposium on Computer Architecture (ISCA\u201919)","author":"Yang Tzu-Hsien","year":"2019","unstructured":"Tzu-Hsien Yang, Hsiang-Yun Cheng, Chia-Lin Yang, I-Ching Tseng, Han-Wen Hu, Hung-Sheng Chang, and Hsiang-Pang Li. 2019. Sparse ReRAM engine: Joint exploration of activation and weight sparsity in compressed neural networks. In Proceedings of the 2019 ACM\/IEEE 46th Annual International Symposium on Computer Architecture (ISCA\u201919). 236\u2013249."},{"key":"e_1_3_2_87_2","doi-asserted-by":"crossref","unstructured":"Zhuliang Yao Shijie Cao Wencong Xiao Chen Zhang and Lanshun Nie. 2019. Balanced sparsity for efficient DNN inference on GPU. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence the 31st Innovative Applications of Artificial Intelligence Conference and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence (AAAI\/IAAI\/EAAI\u201919) . 5676\u20135683.","DOI":"10.1609\/aaai.v33i01.33015676"},{"key":"e_1_3_2_88_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.2017.2778281"},{"key":"e_1_3_2_89_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2018.2858230"},{"key":"e_1_3_2_90_2","first-page":"548","volume-title":"Proceedings of the 2017 ACM\/IEEE 44th Annual International Symposium on Computer Architecture (ISCA\u201917)","author":"Yu Jiecao","year":"2017","unstructured":"Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. 2017. Scalpel: Customizing DNN pruning to the underlying hardware parallelism. In Proceedings of the 2017 ACM\/IEEE 44th Annual International Symposium on Computer Architecture (ISCA\u201917). 548\u2013560."},{"key":"e_1_3_2_91_2","doi-asserted-by":"publisher","DOI":"10.1145\/2684746.2689060"},{"key":"e_1_3_2_92_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2017.2785257"},{"key":"e_1_3_2_93_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.2020.3043870"},{"key":"e_1_3_2_94_2","first-page":"1","volume-title":"Proceedings of the 2016 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916)","author":"Zhang Shijin","year":"2016","unstructured":"Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. 2016. Cambricon-X: An accelerator for sparse neural networks. In Proceedings of the 2016 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916). 1\u201312."},{"key":"e_1_3_2_95_2","doi-asserted-by":"crossref","unstructured":"Yue Zhao Jiajia Li Chunhua Liao and Xipeng Shen. 2018. Bridging the gap between deep learning and sparse matrix format selection. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP\u201918) . ACM New York NY 94\u2013108.","DOI":"10.1145\/3178487.3178495"},{"key":"e_1_3_2_96_2","doi-asserted-by":"publisher","DOI":"10.1145\/3018661.3018665"},{"key":"e_1_3_2_97_2","unstructured":"Maohua Zhu Tao Zhang Zhenyu Gu and Yuan Xie. 2019. Sparse Tensor Core: Algorithm and hardware co-design for vector-wise sparse neural networks on modern GPUs. In Proceedings of the 52nd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201919) . ACM New York NY 359\u2013371."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3600092","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3600092","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:36:50Z","timestamp":1750178210000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3600092"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7,19]]},"references-count":96,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2023,9,30]]}},"alternative-id":["10.1145\/3600092"],"URL":"https:\/\/doi.org\/10.1145\/3600092","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,7,19]]},"assertion":[{"value":"2022-12-06","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-05-08","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-07-19","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}