{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,10]],"date-time":"2026-04-10T02:12:21Z","timestamp":1775787141311,"version":"3.50.1"},"reference-count":39,"publisher":"Association for Computing Machinery (ACM)","issue":"6","license":[{"start":{"date-parts":[[2023,11,9]],"date-time":"2023-11-09T00:00:00Z","timestamp":1699488000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Key-Area Research and Development Program of Guangdong Province","award":["#2021B0101410004, 2019B010140002"],"award-info":[{"award-number":["#2021B0101410004, 2019B010140002"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Embed. Comput. Syst."],"published-print":{"date-parts":[[2023,11,30]]},"abstract":"<jats:p>Deep convolutional neural networks (DNNs) have been widely used in many applications, particularly in machine vision. It is challenging to accelerate DNNs on embedded systems because real-world machine vision applications should reserve a lot of external memory bandwidth for other tasks, such as video capture and display, while leaving little bandwidth for accelerating DNNs. In order to solve this issue, in this study, we propose a high-throughput accelerator, called reconfigurable tiny neural network accelerator (ReTiNNA), for the bandwidth-limited system and present a real-time object detection system for the high-resolution video image. We first present a dedicated computation engine that takes different data mapping methods for various filter types to improve data reuse and reduce hardware resources. We then propose an adaptive layer-wise tiling strategy that tiles the feature maps into strips to reduce the control complexity of data transmission dramatically and to improve the efficiency of data transmission. Finally, a design space exploration (DSE) approach is presented to explore design space more accurately in the case of insufficient bandwidth to improve the performance of the low-bandwidth accelerator. With a low bandwidth of 2.23 GB\/s and a low hardware consumption of 90.261K LUTs and 448 DSPs, ReTiNNA can still achieve a high performance of 155.86 GOPS on VGG16 and 68.20 GOPS on ResNet50, which is better than other state-of-the-art designs implemented on FPGA devices. Furthermore, the real-time object detection system can achieve a high object detection speed of 19 fps for high-resolution video.<\/jats:p>","DOI":"10.1145\/3530818","type":"journal-article","created":{"date-parts":[[2022,5,2]],"date-time":"2022-05-02T12:21:52Z","timestamp":1651494112000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1237-4945","authenticated-orcid":false,"given":"Xianghong","family":"Hu","sequence":"first","affiliation":[{"name":"Guangdong University of Technology, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8034-0616","authenticated-orcid":false,"given":"Hongmin","family":"Huang","sequence":"additional","affiliation":[{"name":"Guangdong University of Technology, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9700-4272","authenticated-orcid":false,"given":"Xueming","family":"Li","sequence":"additional","affiliation":[{"name":"Guangdong University of Technology, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4931-9664","authenticated-orcid":false,"given":"Xin","family":"Zheng","sequence":"additional","affiliation":[{"name":"Guangdong University of Technology, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9487-2675","authenticated-orcid":false,"given":"Qinyuan","family":"Ren","sequence":"additional","affiliation":[{"name":"Zhejiang University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5908-8738","authenticated-orcid":false,"given":"Jingyu","family":"He","sequence":"additional","affiliation":[{"name":"Hong Kong University of Science and Technology, Hong Kong"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2421-7621","authenticated-orcid":false,"given":"Xiaoming","family":"Xiong","sequence":"additional","affiliation":[{"name":"Guangdong University of Technology, China"}]}],"member":"320","published-online":{"date-parts":[[2023,11,9]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIE.2019.2962413"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/3476994"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2011.2134090"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIE.2009.2037650"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIE.2016.2609838"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIE.2020.2982115"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIE.2019.2962437"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIE.2019.2942548"},{"issue":"2","key":"e_1_3_1_10_2","first-page":"25","article-title":"Exploiting sparsity to accelerate fully connected layers of CNN-Based applications on mobile SoCs","volume":"17","author":"Xie X.","year":"2017","unstructured":"X. Xie, D. Du, Q. Li, et al. 2017. Exploiting sparsity to accelerate fully connected layers of CNN-Based applications on mobile SoCs. ACM Transactions on Embedded Computing Systems 17, 2 (2017), 1\u201325.","journal-title":"ACM Transactions on Embedded Computing Systems"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3358178"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/JETCAS.2019.2910232"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIE.2020.3032867"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3380548"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISVLSI.2019.00015"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2018.2821561"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.3004198"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3477016"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCAS.2018.8351666"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSII.2017.2690919"},{"key":"e_1_3_1_21_2","first-page":"1","article-title":"NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps","author":"Aimar A.","year":"2018","unstructured":"A. Aimar, H. Mostafa, E. Calabrese, et al. 2018. NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps. IEEE Transactions on Neural Networks and Learning Systems (2018), 1\u201313.","journal-title":"IEEE Transactions on Neural Networks and Learning Systems"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCAS.2017.8050809"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2919527"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/2684746.2689060"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/2847263.2847265"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2017.2688340"},{"key":"e_1_3_1_27_2","first-page":"281","article-title":"Minimizing computation in convolutional neural networks","author":"Cong J.","year":"2014","unstructured":"J. Cong and B. Xiao. 2014. Minimizing computation in convolutional neural networks. In Artificial Neural Networks and Machine Learning (ICANN\u201914), 281\u2013290.","journal-title":"Artificial Neural Networks and Machine Learning (ICANN\u201914)"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.2016.2616357"},{"issue":"4","key":"e_1_3_1_29_2","article-title":"Speeding up convolutional neural networks with low rank expansions","volume":"4","author":"Jaderberg M.","year":"2014","unstructured":"M. Jaderberg, A. Vedaldi, and A. Zisserman. 2014. Speeding up convolutional neural networks with low rank expansions. Computer Science 4, 4 (2014), XIII.","journal-title":"Computer Science"},{"key":"e_1_3_1_30_2","volume-title":"Proceedings of the 4th International Conference on Learning Representations (ICLR\u201916)","author":"Han S.","unstructured":"S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. In Proceedings of the 4th International Conference on Learning Representations (ICLR\u201916)."},{"key":"e_1_3_1_31_2","unstructured":"M. Courbariaux and Y. Bengio. 2016. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or \u22121. 2016."},{"key":"e_1_3_1_32_2","first-page":"15","volume-title":"Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA\u201917)","author":"Zhao R.","unstructured":"R. Zhao, W. Song, W. Zhang, T. Xing, J. Lin, M. Srivastava, R. Gupta, and Z. Zhang. Accelerating binarized convolutional neural networks with software-programmable FPGAs. In Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA\u201917), ACM, 15\u201324."},{"key":"e_1_3_1_33_2","unstructured":"N. Vasilache J. Johnson M. Mathieu S. Chintala S. Piantino and Y. LeCun. 2014. Fast convolutional nets with fbfft: A GPU performance evaluation. CoRR abs\/1412.7580 (2014)."},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2020.3012323"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSI.2019.2928682"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3240765.3240838"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3061639.3062207"},{"key":"e_1_3_1_38_2","article-title":"Very deep convolutional networks for large-scale image recognition","author":"Simonyan K.","year":"2014","unstructured":"K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. Computer Science (2014).","journal-title":"Computer Science"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.14569\/IJACSA.2018.091062"},{"key":"e_1_3_1_40_2","first-page":"770","article-title":"Deep residual learning for image recognition","volume":"1","author":"He K.","year":"2016","unstructured":"K. He, X. Zhang, S. Ren, et al. 2016. Deep residual learning for image recognition. Computer Vision and Pattern Recognition (CVPR'16) 1, 770\u2013778.","journal-title":"Computer Vision and Pattern Recognition (CVPR'16)"}],"container-title":["ACM Transactions on Embedded Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3530818","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3530818","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:09:25Z","timestamp":1750183765000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3530818"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,11,9]]},"references-count":39,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2023,11,30]]}},"alternative-id":["10.1145\/3530818"],"URL":"https:\/\/doi.org\/10.1145\/3530818","relation":{},"ISSN":["1539-9087","1558-3465"],"issn-type":[{"value":"1539-9087","type":"print"},{"value":"1558-3465","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,11,9]]},"assertion":[{"value":"2021-10-11","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-04-06","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-11-09","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}