{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,24]],"date-time":"2026-02-24T18:17:11Z","timestamp":1771957031621,"version":"3.50.1"},"reference-count":41,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2024,3,13]],"date-time":"2024-03-13T00:00:00Z","timestamp":1710288000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2024,6,30]]},"abstract":"<jats:p>Today, convolutional neural networks (CNNs) are widely used in computer vision applications. However, the trends of higher accuracy and higher resolution generate larger networks. The requirements of computation or I\/O are the key bottlenecks. In this article, we propose XVDPU: the AI Engine (AIE)-based CNN accelerator on Versal chips to meet heavy computation requirements. To resolve the I\/O bottleneck, we adopt several techniques to improve data reuse and reduce I\/O requirements. An arithmetic logic unit is further proposed that can better balance resource utilization, new feature support, and efficiency of the whole system. We have successfully deployed more than 100 CNN models with our accelerator. Our experimental results show that the 96-AIE-core implementation can achieve 1,653 frames per second (FPS) for ResNet50 on VCK190, which is 9.8\u00d7 faster than the design on ZCU102 running at 168.5 FPS. The 256-AIE-core implementation can further achieve 4,050 FPS. We propose a tiling strategy to achieve feature-map-stationary for high-definition CNN with the accelerator, achieving 3.8\u00d7 FPS improvement on the residual channel attention network and 3.1\u00d7 on super-efficient super-resolution. 
This accelerator can also solve the 3D convolution task in disparity estimation, achieving end-to-end performance of 10.1 FPS with all the optimizations.<\/jats:p>","DOI":"10.1145\/3617836","type":"journal-article","created":{"date-parts":[[2023,9,13]],"date-time":"2023-09-13T12:19:42Z","timestamp":1694607582000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":15,"title":["XVDPU: A High-Performance CNN Accelerator on the Versal Platform Powered by the AI Engine"],"prefix":"10.1145","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7895-5542","authenticated-orcid":false,"given":"Xijie","family":"Jia","sequence":"first","affiliation":[{"name":"AMD, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-6524-3263","authenticated-orcid":false,"given":"Yu","family":"Zhang","sequence":"additional","affiliation":[{"name":"AMD, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-5515-4741","authenticated-orcid":false,"given":"Guangdong","family":"Liu","sequence":"additional","affiliation":[{"name":"AMD, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-9852-6762","authenticated-orcid":false,"given":"Xinlin","family":"Yang","sequence":"additional","affiliation":[{"name":"AMD, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3908-7962","authenticated-orcid":false,"given":"Tianyu","family":"Zhang","sequence":"additional","affiliation":[{"name":"AMD, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-0893-9327","authenticated-orcid":false,"given":"Jia","family":"Zheng","sequence":"additional","affiliation":[{"name":"AMD, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-9546-3029","authenticated-orcid":false,"given":"Dongdong","family":"Xu","sequence":"additional","affiliation":[{"name":"AMD, Beijing, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6546-0337","authenticated-orcid":false,"given":"Zhuohuan","family":"Liu","sequence":"additional","affiliation":[{"name":"AMD, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6983-4773","authenticated-orcid":false,"given":"Mengke","family":"Liu","sequence":"additional","affiliation":[{"name":"AMD, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-3320-3974","authenticated-orcid":false,"given":"Xiaoyang","family":"Yan","sequence":"additional","affiliation":[{"name":"AMD, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-2874-2168","authenticated-orcid":false,"given":"Hong","family":"Wang","sequence":"additional","affiliation":[{"name":"AMD, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-9988-4712","authenticated-orcid":false,"given":"Rongzhang","family":"Zheng","sequence":"additional","affiliation":[{"name":"AMD, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-3347-1518","authenticated-orcid":false,"given":"Li","family":"Wang","sequence":"additional","affiliation":[{"name":"AMD, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7369-1445","authenticated-orcid":false,"given":"Dong","family":"Li","sequence":"additional","affiliation":[{"name":"AMD, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-0341-8452","authenticated-orcid":false,"given":"Satyaprakash","family":"Pareek","sequence":"additional","affiliation":[{"name":"AMD, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-3415-2077","authenticated-orcid":false,"given":"Jian","family":"Weng","sequence":"additional","affiliation":[{"name":"AMD, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-6734-1984","authenticated-orcid":false,"given":"Lu","family":"Tian","sequence":"additional","affiliation":[{"name":"AMD, Beijing, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-1668-4616","authenticated-orcid":false,"given":"Dongliang","family":"Xie","sequence":"additional","affiliation":[{"name":"AMD, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-3463-3238","authenticated-orcid":false,"given":"Hong","family":"Luo","sequence":"additional","affiliation":[{"name":"AMD, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-7549-8738","authenticated-orcid":false,"given":"Yi","family":"Shan","sequence":"additional","affiliation":[{"name":"PhiGent Robotics, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2024,3,13]]},"reference":[{"key":"e_1_3_1_2_2","volume-title":"Proceedings of the Embedded World Conference","author":"Alok G.","year":"2020","unstructured":"G. Alok. 2020. Architecture apocalypse dream architecture for deep learning inference and compute-VERSAL AI core. In Proceedings of the Embedded World Conference."},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/3390462"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.7717\/peerj-cs.621"},{"key":"e_1_3_1_5_2","first-page":"529","article-title":"Collapsible linear blocks for super-efficient super resolution","volume":"4","author":"Bhardwaj Kartikeya","year":"2022","unstructured":"Kartikeya Bhardwaj, Milos Milosavljevic, Liam O\u2019Neil, Dibakar Gope, Ramon Matas, Alex Chalfin, Naveen Suda, Lingchuan Meng, and Danny Loh. 2022. Collapsible linear blocks for super-efficient super resolution. 
Proceedings of Machine Learning and Systems 4 (2022), 529\u2013547.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00567"},{"key":"e_1_3_1_7_2","first-page":"1","volume-title":"Proceedings of the 2020 IEEE High Performance Extreme Computing Conference (HPEC\u201920)","author":"Chatarasi Prasanth","year":"2020","unstructured":"Prasanth Chatarasi, Stephen Neuendorffer, Samuel Bayliss, Kees Vissers, and Vivek Sarkar. 2020. Vyasa: A high-performance vectorizing compiler for tensor convolutions on the Xilinx AI engine. In Proceedings of the 2020 IEEE High Performance Extreme Computing Conference (HPEC\u201920). IEEE, Los Alamitos, CA, 1\u201310."},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.58"},{"key":"e_1_3_1_9_2","first-page":"181","volume-title":"Proceedings of the 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM\u201921)","author":"Deng Huipeng","year":"2021","unstructured":"Huipeng Deng, Jian Wang, Huafeng Ye, Shanlin Xiao, Xiangyu Meng, and Zhiyi Yu. 2021. 3D-VNPU: A flexible accelerator for 2D\/3D CNNs on FPGA. In Proceedings of the 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM\u201921). IEEE, Los Alamitos, CA, 181\u2013185."},{"key":"e_1_3_1_10_2","article-title":"A guide to convolution arithmetic for deep learning","author":"Dumoulin Vincent","year":"2016","unstructured":"Vincent Dumoulin and Francesco Visin. 2016. A guide to convolution arithmetic for deep learning. 
arXiv preprint arXiv:1603.07285 (2016).","journal-title":"arXiv preprint arXiv:1603.07285"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neunet.2017.12.012"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/3289602.3293906"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/TII.2018.2822828"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/COASE.2018.8560564"},{"key":"e_1_3_1_15_2","article-title":"A simple method to reduce off-chip memory accesses on convolutional neural networks","author":"Kim Doyun","year":"2019","unstructured":"Doyun Kim, Kyoung-Young Kim, Sangsoo Ko, and Sanghyuck Ha. 2019. A simple method to reduce off-chip memory accesses on convolutional neural networks. arXiv preprint arXiv:1901.09614 (2019).","journal-title":"arXiv preprint arXiv:1901.09614"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2021.3082868"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11554-019-00925-3"},{"key":"e_1_3_1_18_2","article-title":"OctCNN: A high throughput FPGA accelerator for CNNs using octave convolution algorithm","author":"Lou Wenqi","year":"2022","unstructured":"Wenqi Lou, Lei Gong, Chao Wang, Zidong Du, and Zhou Xuehai. 2022. OctCNN: A high throughput FPGA accelerator for CNNs using octave convolution algorithm. IEEE Transactions on Computers 71, 8 (2022), 1847\u20131859.","journal-title":"IEEE Transactions on Computers"},{"key":"e_1_3_1_19_2","first-page":"293","volume-title":"Proceedings of the 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201919)","author":"Lym Sangkug","year":"2019","unstructured":"Sangkug Lym, Donghyuk Lee, Mike O\u2019Connor, Niladrish Chatterjee, and Mattan Erez. 2019. DeLTA: GPU performance model for deep learning applications with in-depth memory system traffic analysis. 
In Proceedings of the 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201919). IEEE, Los Alamitos, CA, 293\u2013303."},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.438"},{"key":"e_1_3_1_21_2","first-page":"146","volume-title":"Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC\u201921)","volume":"64","author":"Mo Huiyu","year":"2021","unstructured":"Huiyu Mo, Wenping Zhu, Wenjing Hu, Guangbin Wang, Qiang Li, Ang Li, Shouyi Yin, Shaojun Wei, and Leibo Liu. 2021. 9.2 A 28nm 12.1 TOPS\/W Dual-Mode CNN processor using effective-weight-based convolution and error-compensation-based prediction. In Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC\u201921), Vol. 64. IEEE, Los Alamitos, CA, 146\u2013148."},{"key":"e_1_3_1_22_2","doi-asserted-by":"crossref","first-page":"155","DOI":"10.1109\/ZINC50678.2020.9161768","volume-title":"Proceedings of the 2020 Zooming Innovation in Consumer Technologies Conference (ZINC\u201920)","author":"Pranav K. B.","year":"2020","unstructured":"K. B. Pranav and J. Manikandan. 2020. Design and evaluation of a real-time pedestrian detection system for autonomous vehicles. In Proceedings of the 2020 Zooming Innovation in Consumer Technologies Conference (ZINC\u201920). IEEE, Los Alamitos, CA, 155\u2013159."},{"key":"e_1_3_1_23_2","doi-asserted-by":"crossref","first-page":"12","DOI":"10.23919\/ICCAS50221.2020.9268234","volume-title":"Proceedings of the 2020 20th International Conference on Control, Automation, and Systems (ICCAS\u201920)","author":"Putro Muhamad Dwisnanto","year":"2020","unstructured":"Muhamad Dwisnanto Putro, Duy-Linh Nguyen, and Kang-Hyun Jo. 2020. Fast eye detector using CPU based lightweight convolutional neural network. In Proceedings of the 2020 20th International Conference on Control, Automation, and Systems (ICCAS\u201920). 
IEEE, Los Alamitos, CA, 12\u201316."},{"key":"e_1_3_1_24_2","first-page":"250","volume-title":"Proceedings of the 2021 IEEE 32nd International Conference on Application-Specific Systems, Architectures, and Processors (ASAP\u201921)","author":"Qasaimeh Murad","year":"2021","unstructured":"Murad Qasaimeh, Joseph Zambreno, and Phillip H. Jones. 2021. An efficient hardware architecture for sparse convolution using linear feedback shift registers. In Proceedings of the 2021 IEEE 32nd International Conference on Application-Specific Systems, Architectures, and Processors (ASAP\u201921). IEEE, Los Alamitos, CA, 250\u2013257."},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3174243.3174257"},{"key":"e_1_3_1_26_2","first-page":"767","volume-title":"Proceedings of the 2019 IEEE Region 10 Conference (TENCON\u201919)","author":"Sudha S.","year":"2019","unstructured":"S. Sudha, K. B. Jayanthi, C. Rajasekaran, and T. Sunder. 2019. Segmentation of RoI in medical images using CNN-A comparative study. In Proceedings of the 2019 IEEE Region 10 Conference (TENCON\u201919). IEEE, Los Alamitos, CA, 767\u2013771."},{"key":"e_1_3_1_27_2","article-title":"EfficientNet: Improving accuracy and efficiency through AutoML and model scaling","author":"Tan Mingxing","year":"2019","unstructured":"Mingxing Tan and Quoc V. Le. 2019. EfficientNet: Improving accuracy and efficiency through AutoML and model scaling. arXiv preprint arXiv:1905.11946 (2019).","journal-title":"arXiv preprint arXiv:1905.11946"},{"key":"e_1_3_1_28_2","doi-asserted-by":"crossref","first-page":"101","DOI":"10.1109\/ICRA40945.2020.9197031","volume-title":"Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA\u201920)","author":"Wang Qiang","year":"2020","unstructured":"Qiang Wang, Shaohuai Shi, Shizhen Zheng, Kaiyong Zhao, and Xiaowen Chu. 2020. FadNet: A fast and accurate network for disparity estimation. 
In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA\u201920). IEEE, Los Alamitos, CA, 101\u2013107."},{"key":"e_1_3_1_29_2","first-page":"136","volume-title":"Proceedings of the 2019 29th International Conference on Field Programmable Logic and Applications (FPL\u201919)","author":"Wu Di","year":"2019","unstructured":"Di Wu, Yu Zhang, Xijie Jia, Lu Tian, Tianping Li, Lingzhi Sui, Dongliang Xie, and Yi Shan. 2019. A high-performance CNN processor based on FPGA for MobileNets. In Proceedings of the 2019 29th International Conference on Field Programmable Logic and Applications (FPL\u201919). IEEE, Los Alamitos, CA, 136\u2013143."},{"key":"e_1_3_1_30_2","volume-title":"UltraScale Architecture DSP Slice: User Guide (UG579)","author":"Xilinx AMD","year":"2021","unstructured":"AMD Xilinx. 2021. UltraScale Architecture DSP Slice: User Guide (UG579). Retrieved September 19, 2023 from https:\/\/docs.xilinx.com\/v\/u\/en-US\/ug579-ultrascale-dsp"},{"key":"e_1_3_1_31_2","volume-title":"VCK190 Evaluation Board User Guide (UG1366)","author":"Xilinx AMD","year":"2021","unstructured":"AMD Xilinx. 2021. VCK190 Evaluation Board User Guide (UG1366). Retrieved September 19, 2023 from https:\/\/docs.xilinx.com\/r\/en-US\/ug1366-vck190-eval-bd"},{"key":"e_1_3_1_32_2","volume-title":"Versal ACAP AI Engine Programming Environment User Guide (UG1076)","author":"Xilinx AMD","year":"2021","unstructured":"AMD Xilinx. 2021. Versal ACAP AI Engine Programming Environment User Guide (UG1076). Retrieved September 19, 2023 from https:\/\/docs.xilinx.com\/r\/en-US\/ug1076-ai-engine-environment"},{"key":"e_1_3_1_33_2","volume-title":"Versal ACAP DSP Engine Architecture Manual (AM004)","author":"Xilinx AMD","year":"2021","unstructured":"AMD Xilinx. 2021. Versal ACAP DSP Engine Architecture Manual (AM004). 
Retrieved September 19, 2023 from https:\/\/docs.xilinx.com\/r\/en-US\/am004-versal-dsp-engine"},{"key":"e_1_3_1_34_2","volume-title":"Vitis AI Library User Guide (UG1354)","author":"Xilinx AMD","year":"2021","unstructured":"AMD Xilinx. 2021. Vitis AI Library User Guide (UG1354). Retrieved September 19, 2023 from https:\/\/docs.xilinx.com\/r\/1.4.1-English\/ug1354-xilinx-ai-sdk\/ZCU102-Evaluation-Kit"},{"key":"e_1_3_1_35_2","volume-title":"Vitis AI Tool","author":"Xilinx AMD","year":"2021","unstructured":"AMD Xilinx. 2021. Vitis AI Tool. Retrieved September 19, 2023 from https:\/\/github.com\/Xilinx\/Vitis-AI"},{"key":"e_1_3_1_36_2","volume-title":"Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)","author":"Xilinx AMD","year":"2021","unstructured":"AMD Xilinx. 2021. Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393). Retrieved September 19, 2023 from https:\/\/docs.xilinx.com\/r\/en-US\/ug1393-vitis-application-acceleration"},{"key":"e_1_3_1_37_2","volume-title":"DPUCVDX8G for Versal ACAPs Product Guide","author":"Xilinx AMD","year":"2022","unstructured":"AMD Xilinx. 2022. DPUCVDX8G for Versal ACAPs Product Guide. Retrieved September 19, 2023 from https:\/\/docs.xilinx.com\/r\/en-US\/pg389-dpucvdx8g"},{"key":"e_1_3_1_38_2","volume-title":"DPUCZDX8G for Zynq UltraScale+ MPSoCs Product Guide","author":"Xilinx AMD","year":"2022","unstructured":"AMD Xilinx. 2022. DPUCZDX8G for Zynq UltraScale+ MPSoCs Product Guide. 
Retrieved September 19, 2023 from https:\/\/docs.xilinx.com\/r\/en-US\/pg338-dpu"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2919431"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL57034.2022.00029"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01234-2_18"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.3390\/electronics10101187"}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3617836","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3617836","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:37:57Z","timestamp":1750178277000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3617836"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,13]]},"references-count":41,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,6,30]]}},"alternative-id":["10.1145\/3617836"],"URL":"https:\/\/doi.org\/10.1145\/3617836","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"value":"1936-7406","type":"print"},{"value":"1936-7414","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,3,13]]},"assertion":[{"value":"2022-12-07","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-08-11","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-03-13","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}