{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T16:30:23Z","timestamp":1775665823810,"version":"3.50.1"},"reference-count":43,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2021,7,17]],"date-time":"2021-07-17T00:00:00Z","timestamp":1626480000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61672526"],"award-info":[{"award-number":["61672526"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"National Key Research and Development Project","award":["2018YFB0204301"],"award-info":[{"award-number":["2018YFB0204301"]}]},{"name":"Science and Technology Innovation project of Hunan","award":["2018RS3083 and 2019RS2027"],"award-info":[{"award-number":["2018RS3083 and 2019RS2027"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2021,12,31]]},"abstract":"<jats:p>\n            The systolic array architecture is one of the most popular choices for convolutional neural network hardware accelerators. The biggest advantage of the systolic array architecture is its simple and efficient design principle. Without complicated control and dataflow, hardware accelerators with the systolic array can calculate traditional convolution very efficiently. However, this advantage also brings new challenges to the systolic array. When computing special types of convolution, such as the small-scale convolution or depthwise convolution, the\n            <jats:bold>processing element (PE)<\/jats:bold>\n            utilization rate of the array decreases sharply. 
The main reason is that the simple architecture design limits the flexibility of the systolic array.\n          <\/jats:p>\n          <jats:p>\n            In this article, we design a\n            <jats:bold>configurable multi-directional systolic array (CMSA)<\/jats:bold>\n            to address these issues. First, we add a data path to the systolic array, which allows users to split the array through configuration to speed up the calculation of small-scale convolution. Second, we redesign the PE unit so that the array supports multiple data transmission modes and dataflow strategies, allowing users to switch the dataflow of the PE array to speed up the calculation of depthwise convolution. In addition, unlike other works, we make only a few modifications to the existing systolic array architecture. This avoids additional hardware overhead and allows easy deployment in application scenarios that require small systolic arrays, such as mobile terminals. Based on our evaluation, CMSA increases the PE utilization rate by up to 1.6 times compared to the typical systolic array when running the last layers of ResNet-18, and by up to 14.8 times when running depthwise convolution in MobileNet. 
At the same time, CMSA is similar to the traditional systolic array in area and energy consumption.\n          <\/jats:p>","DOI":"10.1145\/3460776","type":"journal-article","created":{"date-parts":[[2021,7,17]],"date-time":"2021-07-17T10:05:22Z","timestamp":1626516322000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":46,"title":["Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks"],"prefix":"10.1145","volume":"18","author":[{"given":"Rui","family":"Xu","sequence":"first","affiliation":[{"name":"National University of Defense Technology, China"}]},{"given":"Sheng","family":"Ma","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]},{"given":"Yaohua","family":"Wang","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]},{"given":"Xinhai","family":"Chen","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]},{"given":"Yang","family":"Guo","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]}],"member":"320","published-online":{"date-parts":[[2021,7,17]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/2541940.2541967"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.2016.2616357"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/JETCAS.2019.2910232"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.195"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCAS45731.2020.9180403"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750389"},{"key":"e_1_2_1_8_1","volume-title":"Borivoje Nikolic, Ion Stoica, and Krste Asanovic.","author":"Genc 
Hasan","year":"2019","unstructured":"Hasan Genc, Ameer Haj-Ali, Vighnesh Iyer, Alon Amid, Howard Mao, John Wright, Colin Schmidt, Jerry Zhao, Albert J. Ou, Max Banister, Yakun Sophia Shao, Borivoje Nikolic, Ion Stoica, and Krste Asanovic. 2019. Gemmini: An agile systolic array generator enabling systematic evaluations of deep-learning architectures. CoRR abs\/1911.09925 (2019). arXiv:1911.09925 http:\/\/arxiv.org\/abs\/1911.09925."},{"key":"e_1_2_1_9_1","volume-title":"High throughput matrix-matrix multiplication between asymmetric bit-width operands. CoRR abs\/2008.00638","author":"Gope Dibakar","year":"2020","unstructured":"Dibakar Gope, Jesse G. Beu, and Matthew Mattina. 2020. High throughput matrix-matrix multiplication between asymmetric bit-width operands. CoRR abs\/2008.00638 (2020). arXiv:2008.00638 https:\/\/arxiv.org\/abs\/2008.00638."},{"key":"e_1_2_1_10_1","volume-title":"32nd International Conference on Machine Learning (ICML\u201915)","volume":"37","author":"Gupta Suyog","year":"2015","unstructured":"Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep learning with limited numerical precision. In 32nd International Conference on Machine Learning (ICML\u201915), (Lille, France, July 6\u201311, 2015) (JMLR Workshop and Conference Proceedings), Francis R. Bach and David M. 
Blei (Eds.), Vol. 37. JMLR.org, 1737\u20131746. http:\/\/proceedings.mlr.press\/v37\/gupta15.html."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1142\/S1793351X16500045"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_13_1","volume-title":"2020 International Conference on Supercomputing (ICS\u201920)","author":"He Xin","year":"2020","unstructured":"Xin He, Subhankar Pal, Aporva Amarnath, Siying Feng, Dong-Hyeon Park, Austin Rovinski, Haojie Ye, Kuan-Yu Chen, Ronald G. Dreslinski, and Trevor N. Mudge. 2020. Sparse-TPU: Adapting systolic arrays for sparse matrices. In 2020 International Conference on Supercomputing (ICS\u201920) (Barcelona Spain, June 2020), Eduard Ayguad\u00e9, Wen-mei W. Hwu, Rosa M. Badia, and H. Peter Hofstee (Eds.). ACM, 19:1\u201319:12. https:\/\/dl.acm.org\/doi\/10.1145\/3392717.3392751."},{"key":"e_1_2_1_14_1","volume-title":"MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR abs\/1704.04861","author":"Howard Andrew G.","year":"2017","unstructured":"Andrew G. 
Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR abs\/1704.04861 (2017). arXiv:1704.04861 http:\/\/arxiv.org\/abs\/1704.04861."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISVLSI49217.2020.00088"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},{"key":"e_1_2_1_17_1","unstructured":"A. Krizhevsky and G. Hinton. 2009. Learning multiple layers of features from tiny images. Handbook of Systemic Autoimmune Diseases 1 4 (2009)."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3065386"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304028"},{"key":"e_1_2_1_20_1","volume-title":"Term Revealing: Furthering quantization at run time on quantized DNNs. CoRR abs\/2007.06389","author":"Kung H. T.","year":"2020","unstructured":"H. T. Kung, Bradley McDanel, and Sai Qian Zhang. 2020. Term Revealing: Furthering quantization at run time on quantized DNNs. CoRR abs\/2007.06389 (2020). 
arXiv:2007.06389 https:\/\/arxiv.org\/abs\/2007.06389."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASAP.2019.00-31"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358252"},{"key":"e_1_2_1_23_1","volume-title":"MAESTRO: An open-source infrastructure for modeling dataflows within deep learning accelerators. CoRR abs\/1805.02566","author":"Kwon Hyoukjun","year":"2018","unstructured":"Hyoukjun Kwon, Michael Pellauer, and Tushar Krishna. 2018. MAESTRO: An open-source infrastructure for modeling dataflows within deep learning accelerators. CoRR abs\/1805.02566 (2018). arXiv:1805.02566 http:\/\/arxiv.org\/abs\/1805.02566."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173162.3173176"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1989.1.4.541"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2020.2979965"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2017.29"},{"key":"e_1_2_1_28_1","volume-title":"Mini-batch Serialization: CNN training with inter-layer data reuse. In Machine Learning and Systems 2019 (MLSys 2019) (Stanford, CA, USA, March 31\u2013April 2","author":"Lym Sangkug","year":"2019","unstructured":"Sangkug Lym, Armand Behroozi, Wei Wen, Ge Li, Yongkee Kwon, and Mattan Erez. 2019. 
Mini-batch Serialization: CNN training with inter-layer data reuse. In Machine Learning and Systems 2019 (MLSys 2019) (Stanford, CA, USA, March 31\u2013April 2, 2019), Ameet Talwalkar, Virginia Smith, and Matei Zaharia (Eds.). mlsys.org. https:\/\/proceedings.mlsys.org\/book\/261.pdf."},{"key":"e_1_2_1_29_1","volume-title":"FlexSA: Flexible systolic array architecture for efficient pruned DNN model training. CoRR abs\/2004.13027","author":"Lym Sangkug","year":"2020","unstructured":"Sangkug Lym and Mattan Erez. 2020. FlexSA: Flexible systolic array architecture for efficient pruned DNN model training. CoRR abs\/2004.13027 (2020). arXiv:2004.13027 https:\/\/arxiv.org\/abs\/2004.13027."},{"key":"e_1_2_1_30_1","volume-title":"Ali Hadi Zadeh, Omar Mohamed Awad, Gennady Pekhimenko, Jorge Albericio, and Andreas Moshovos.","author":"Mahmoud Mostafa","year":"2020","unstructured":"Mostafa Mahmoud, Isak Edo Vivancos, Ali Hadi Zadeh, Omar Mohamed Awad, Gennady Pekhimenko, Jorge Albericio, and Andreas Moshovos. 2020. TensorDash: Exploiting sparsity to accelerate deep neural network training and inference. CoRR abs\/2009.00748 (2020). 
arXiv:2009.00748 https:\/\/arxiv.org\/abs\/2009.00748."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/HCS49909.2020.9220735"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/MWSCAS.2012.6292202"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00015"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS48437.2020.00016"},{"key":"e_1_2_1_35_1","volume-title":"SCALE-Sim: Systolic CNN accelerator. CoRR abs\/1811.02883","author":"Samajdar Ananda","year":"2018","unstructured":"Ananda Samajdar, Yuhao Zhu, Paul N. Whatmough, Matthew Mattina, and Tushar Krishna. 2018. SCALE-Sim: Systolic CNN accelerator. CoRR abs\/1811.02883 (2018). arXiv:1811.02883 http:\/\/arxiv.org\/abs\/1811.02883."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358302"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3307650.3322255"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2019.2924007"},{"key":"e_1_2_1_39_1","volume-title":"3rd International Conference on Learning Representations (ICLR\u201915)","author":"Simonyan Karen","year":"2015","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. 
In 3rd International Conference on Learning Representations (ICLR\u201915) (San Diego, CA, May 7-9, 2015), Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http:\/\/arxiv.org\/abs\/1409.1556."},{"key":"e_1_2_1_40_1","volume-title":"36th International Conference on Machine Learning (ICML\u201919)","author":"Tan Mingxing","year":"2019","unstructured":"Mingxing Tan and Quoc V. Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In 36th International Conference on Machine Learning (ICML\u201919), (Long Beach, CA, June 9-15, 2019) (Proceedings of Machine Learning Research), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.), Vol. 97. PMLR, 6105\u20136114. http:\/\/proceedings.mlr.press\/v97\/tan19a.html."},{"key":"e_1_2_1_41_1","volume-title":"CoRR abs\/1907.10701","author":"Wang Yu","year":"2019","unstructured":"Yu Wang, Gu-Yeon Wei, and David Brooks. 2019. Benchmarking TPU, GPU, and CPU platforms for deep learning. CoRR abs\/1907.10701 (2019). arXiv:1907.10701 http:\/\/arxiv.org\/abs\/1907.10701."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3061639.3062207"},{"key":"e_1_2_1_43_1","volume-title":"DNN dataflow choice is overrated. 
CoRR abs\/1809.04070","author":"Yang Xuan","year":"2018","unstructured":"Xuan Yang, Mingyu Gao, Jing Pu, Ankita Nayak, Qiaoyi Liu, Steven Bell, Jeff Setter, Kaidi Cao, Heonjae Ha, Christos Kozyrakis, and Mark Horowitz. 2018. DNN dataflow choice is overrated. CoRR abs\/1809.04070 (2018). arXiv:1809.04070 http:\/\/arxiv.org\/abs\/1809.04070."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3460776","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3460776","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:24:28Z","timestamp":1750195468000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3460776"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,17]]},"references-count":43,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2021,12,31]]}},"alternative-id":["10.1145\/3460776"],"URL":"https:\/\/doi.org\/10.1145\/3460776","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,7,17]]},"assertion":[{"value":"2020-11-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-04-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2021-07-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}