{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,26]],"date-time":"2026-03-26T15:39:26Z","timestamp":1774539566488,"version":"3.50.1"},"reference-count":47,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,1,27]],"date-time":"2024-01-27T00:00:00Z","timestamp":1706313600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","award":["UIDB\/50021\/2020 and IPL\/2022\/eS2ST_ISEL"],"award-info":[{"award-number":["UIDB\/50021\/2020 and IPL\/2022\/eS2ST_ISEL"]}]},{"DOI":"10.13039\/501100022402","name":"Instituto Polit\u00e9cnico de Lisboa","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100022402","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2024,3,31]]},"abstract":"<jats:p>Deep learning models are becoming more complex and heterogeneous, with new layer types introduced to improve their accuracy. This poses a considerable challenge to the designers of deep neural network accelerators. There have been several architectures and design flows to map deep learning models onto hardware, but they are limited to particular models and\/or layer types. Also, the architectures generated by these tools generally target high-performance devices that are not appropriate for embedded computing. This article proposes a multi-engine architecture and a design flow to implement deep learning models on FPGA. The hardware design uses high-level synthesis to allow design space exploration. The architecture is scalable and therefore applicable to FPGAs of any density. 
The architecture and design flow were applied to the development of a hardware\/software system for image classification with ResNet50, object detection with YOLOv3-Tiny, and image segmentation with DeepLabV3+. The system was tested on a low-density Zynq UltraScale+ ZU3EG FPGA to show its scalability. The results show that the proposed multi-engine architecture generates efficient accelerators. The ResNet50 accelerator with 4-bit quantization achieves 67 FPS, the YOLOv3-Tiny object detector achieves a throughput of 36 FPS, and the image segmentation application achieves 1.4 FPS.<\/jats:p>","DOI":"10.1145\/3615870","type":"journal-article","created":{"date-parts":[[2023,10,10]],"date-time":"2023-10-10T11:27:52Z","timestamp":1696937272000},"page":"1-30","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["Designing Deep Learning Models on FPGA with Multiple Heterogeneous Engines"],"prefix":"10.1145","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0009-0006-3724-0763","authenticated-orcid":false,"given":"Miguel","family":"Reis","sequence":"first","affiliation":[{"name":"INESC-ID, Instituto Superior T\u00e9cnico, Universidade de Lisboa, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8556-4507","authenticated-orcid":false,"given":"M\u00e1rio","family":"V\u00e9stias","sequence":"additional","affiliation":[{"name":"INESC-ID, ISEL, Instituto Polit\u00e9cnico de Lisboa, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3621-8322","authenticated-orcid":false,"given":"Hor\u00e1cio","family":"Neto","sequence":"additional","affiliation":[{"name":"INESC-ID, Instituto Superior T\u00e9cnico, Universidade de Lisboa, Portugal"}]}],"member":"320","published-online":{"date-parts":[[2024,1,27]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"Kamel Abdelouahab Maxime Pelcat Jocelyn Serot and Fran\u00e7ois Berry. 2018. Accelerating CNN inference on FPGAs: A Survey. (2018). 
arxiv:cs.DC\/1806.01683"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2021.3120629"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCAS45731.2020.9180843"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2016.7783725"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3527156"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSII.2018.2865896"},{"key":"e_1_3_2_8_2","article-title":"Learning on hardware: A tutorial on neural network accelerators and co-processors","volume":"2104","author":"Baischer Lukas","year":"2021","unstructured":"Lukas Baischer, Matthias Wess, and Nima Taherinejad. 2021. Learning on hardware: A tutorial on neural network accelerators and co-processors. CoRR abs\/2104.09252 (2021).","journal-title":"CoRR"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.3390\/jlpea12010011"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","unstructured":"Suhail Basalama Atefeh Sohrabizadeh Jie Wang Licheng Guo and Jason Cong. 2023. FlexCNN: An end-to-end framework for composing CNN accelerators on FPGA. ACM Trans. Reconfigurable Technol. Syst. 16 2 Article 23 (June 2023) 32 pages. 10.1145\/3570928","DOI":"10.1145\/3570928"},{"key":"e_1_3_2_11_2","article-title":"FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks","volume":"1809","author":"Blott Michaela","year":"2018","unstructured":"Michaela Blott, Thomas B. Preu\u00dfer, Nicholas J. Fraser, Giulio Gambardella, Kenneth O\u2019Brien, and Yaman Umuroglu. 2018. FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks. 
CoRR abs\/1809.04570 (2018).","journal-title":"CoRR"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCAS45731.2020.9180402"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00027"},{"key":"e_1_3_2_14_2","doi-asserted-by":"crossref","unstructured":"Liang-Chieh Chen Yukun Zhu George Papandreou Florian Schroff and Hartwig Adam. 2018. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. (2018). arxiv:cs.CV\/1802.02611","DOI":"10.1007\/978-3-030-01234-2_49"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","unstructured":"FastML Team. 2021. fastmachinelearning\/hls4ml. DOI:10.5281\/zenodo.1201549","DOI":"10.5281\/zenodo.1201549"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.vlsi.2019.07.005"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2018.2857078"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2017.2705069"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCRD54409.2022.9730377"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11390-020-9414-8"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2021.3055814"},{"key":"e_1_3_2_22_2","unstructured":"Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. (2015). arxiv:cs.LG\/1502.03167"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2022.3180829"},{"key":"e_1_3_2_24_2","article-title":"Caffe: Convolutional Architecture for Fast Feature Embedding","author":"Jia Yangqing","year":"2014","unstructured":"Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. 
arXiv preprint arXiv:1408.5093 (2014).","journal-title":"arXiv preprint arXiv:1408.5093"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3594221"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2018.00018"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","unstructured":"S. Minaee Y. Boykov F. Porikli A. Plaza N. Kehtarnavaz and D. Terzopoulos. 2022. Image segmentation using deep learning: A survey. In IEEE Transactions on Pattern Analysis and Machine Intelligence 44 7 (2022) 3523\u20133542. DOI:10.1109\/TPAMI.2021.3059968","DOI":"10.1109\/TPAMI.2021.3059968"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3489517.3530424"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2019.2905242"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","unstructured":"Alessandro Pappalardo. 2022. Xilinx\/Brevitas. DOI:10.5281\/zenodo.3333552","DOI":"10.5281\/zenodo.3333552"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2021.3081818"},{"key":"e_1_3_2_32_2","first-page":"1","volume-title":"49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916)","author":"Sharma Hardik","year":"2016","unstructured":"Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916). IEEE, 1\u201312."},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080221"},{"key":"e_1_3_2_34_2","unstructured":"Kousai Smeda. 2019. Understand the architecture of CNN. 
Retrieved from https:\/\/towardsdatascience.com\/understand-the-architecture-of-cnn-90a25e244c7"},{"key":"e_1_3_2_35_2","volume-title":"30th ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA\u201922)","author":"Sun Mengshu","year":"2022","unstructured":"Mengshu Sun, Zhengang Li, Alec Lu, Yanyu Li, Sung-En Chang, Xiaolong Ma, Xue Lin, and Zhenman Fang. 2022. FILM-QNN: Efficient FPGA acceleration of deep neural networks with intra-layer, mixed-precision quantization. In 30th ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA\u201922)."},{"key":"e_1_3_2_36_2","article-title":"FINN: A framework for fast, scalable binarized neural network inference","volume":"1612","author":"Umuroglu Yaman","year":"2016","unstructured":"Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Philip Heng Wai Leong, Magnus Jahre, and Kees A. Vissers. 2016. FINN: A framework for fast, scalable binarized neural network inference. CoRR abs\/1612.07119 (2016).","journal-title":"CoRR"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2016.22"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2018.00072"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3186332"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.micpro.2020.103136"},{"key":"e_1_3_2_41_2","unstructured":"Chi-Feng Wang. 2018. A Basic Introduction to Separable Convolutions. Retrieved from https:\/\/towardsdatascience.com\/a-basic-introduction-to-separable-convolutions-b99ec3102728"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/2897937.2898003"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3061639.3062207"},{"key":"e_1_3_2_44_2","article-title":"The ALAMO approach to machine learning","volume":"1705","author":"Wilson Zachary T.","year":"2017","unstructured":"Zachary T. Wilson and Nikolaos V. Sahinidis. 2017. 
The ALAMO approach to machine learning. CoRR abs\/1705.10918 (2017).","journal-title":"CoRR"},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2019.00030"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1155\/2020\/8861886"},{"key":"e_1_3_2_47_2","first-page":"330","volume-title":"Applied Reconfigurable Computing. Architectures, Tools, and Applications","author":"Yu Zhewen","year":"2020","unstructured":"Zhewen Yu and Christos-Savvas Bouganis. 2020. A parameterisable FPGA-tailored architecture for YOLOv3-Tiny. In Applied Reconfigurable Computing. Architectures, Tools, and Applications, Fernando Rinc\u00f3n, Jes\u00fas Barba, Hayden K. H. So, Pedro Diniz, and Juli\u00e1n Caba (Eds.). Springer International Publishing, Cham, 330\u2013344."},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSI.2020.3030663"}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3615870","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3615870","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:10:18Z","timestamp":1750295418000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3615870"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,27]]},"references-count":47,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,3,31]]}},"alternative-id":["10.1145\/3615870"],"URL":"https:\/\/doi.org\/10.1145\/3615870","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"value":"1936-7406","type":"print"},{"value":"1936-7414","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,1,27]]},"assertion":[{"value":"2023-02-22","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-08-04","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-01-27","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}