{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T16:58:17Z","timestamp":1775667497760,"version":"3.50.1"},"reference-count":29,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2021,4,29]],"date-time":"2021-04-29T00:00:00Z","timestamp":1619654400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Science Foundation","award":["CNS-1562837"],"award-info":[{"award-number":["CNS-1562837"]}]},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["CNS-1619653"],"award-info":[{"award-number":["CNS-1619653"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"name":"National Science Foundation Career Award","award":["CNS-1629888"],"award-info":[{"award-number":["CNS-1629888"]}]},{"name":"National Science Foundation CAREER award","award":["CNS-1955593"],"award-info":[{"award-number":["CNS-1955593"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["J. Emerg. Technol. Comput. Syst."],"published-print":{"date-parts":[[2021,4,30]]},"abstract":"<jats:p>High-throughput and low-latency Convolutional Neural Network (CNN) inference is increasingly important for many cloud- and edge-computing applications. FPGA-based acceleration of CNN inference has demonstrated various benefits compared to other high-performance devices such as GPGPUs. Current FPGA CNN-acceleration solutions are based on a single FPGA design, which are limited by the available resources on an FPGA. In addition, they can only accelerate conventional 2D neural networks. To address these limitations, we present a generic multi-FPGA solution, written in OpenCL, which can accelerate more complex CNNs (e.g., C3D CNN) and achieve a near linear speedup with respect to the available single-FPGA solutions. The design is built upon the Intel Deep Learning Accelerator architecture, with three extensions. First, it includes updates for better area efficiency (up to 25%) and higher performance (up to 24%). Second, it supports 3D convolutions for more challenging applications such as video learning. Third, it supports multi-FPGA communication for higher inference throughput. The results show that utilizing multiple FPGAs can linearly increase the overall bandwidth while maintaining the same end-to-end latency. In addition, the design can outperform other FPGA 2D accelerators by up to 8.4 times and 3D accelerators by up to 1.7 times.<\/jats:p>","DOI":"10.1145\/3432816","type":"journal-article","created":{"date-parts":[[2021,4,29]],"date-time":"2021-04-29T22:10:22Z","timestamp":1619734222000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":35,"title":["Toward Multi-FPGA Acceleration of the Neural Networks"],"prefix":"10.1145","volume":"17","author":[{"given":"Saman","family":"Biookaghazadeh","sequence":"first","affiliation":[{"name":"Arizona State University, Tempe, AZ"}]},{"given":"Pravin Kumar","family":"Ravi","sequence":"additional","affiliation":[{"name":"Arizona State University, Tempe, AZ"}]},{"given":"Ming","family":"Zhao","sequence":"additional","affiliation":[{"name":"Arizona State University, Tempe, AZ"}]}],"member":"320","published-online":{"date-parts":[[2021,4,29]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/2934583.2934644"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2017.2785257"},{"key":"e_1_2_1_3_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3358192","article-title":"Achieving super-linear speedup across multi-FPGA for real-time DNN inference","volume":"18","author":"Jiang Weiwen","year":"2019","unstructured":"Weiwen Jiang , Edwin H.-M. Sha , Xinyi Zhang , Lei Yang , Qingfeng Zhuge , Yiyu Shi , and Jingtong Hu . 2019 . Achieving super-linear speedup across multi-FPGA for real-time DNN inference . ACM Transactions on Embedded Computing Systems 18 , 5s (2019), 1 -- 23 . Weiwen Jiang, Edwin H.-M. Sha, Xinyi Zhang, Lei Yang, Qingfeng Zhuge, Yiyu Shi, and Jingtong Hu. 2019. Achieving super-linear speedup across multi-FPGA for real-time DNN inference. ACM Transactions on Embedded Computing Systems 18, 5s (2019), 1--23.","journal-title":"ACM Transactions on Embedded Computing Systems"},{"key":"e_1_2_1_4_1","volume-title":"Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI \u201916)","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi , Paul Barham , Jianmin Chen , Zhifeng Chen , Andy Davis , Jeffrey Dean , Matthieu Devin , et\u00a0al. 2016 . TensorFlow: A system for large-scale machine learning . In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI \u201916) . 265--283. Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, et\u00a0al. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI \u201916). 265--283."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654889"},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the 2017 ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM","author":"Aydonat Utku","unstructured":"Utku Aydonat , Shane O\u2019Connell , Davor Capalija , Andrew C. Ling , and Gordon R. Chiu . 2017. An OpenCL deep learning accelerator on Arria 10 . In Proceedings of the 2017 ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM , New York, 55--64. Utku Aydonat, Shane O\u2019Connell, Davor Capalija, Andrew C. Ling, and Gordon R. Chiu. 2017. An OpenCL deep learning accelerator on Arria 10. In Proceedings of the 2017 ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, New York, 55--64."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.media.2017.07.005"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSP.2017.2743240"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/IROS.2015.7353481"},{"key":"e_1_2_1_10_1","volume-title":"Proceedings of the 2018 51st Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201918)","author":"Hegde Kartik","unstructured":"Kartik Hegde , Rohit Agrawal , Yulun Yao , and Christopher W. Fletcher . 2018. Morph: Flexible acceleration for 3D CNN-based video understanding . In Proceedings of the 2018 51st Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201918) . IEEE, Los Alamitos, CA, 933--946. Kartik Hegde, Rohit Agrawal, Yulun Yao, and Christopher W. Fletcher. 2018. Morph: Flexible acceleration for 3D CNN-based video understanding. In Proceedings of the 2018 51st Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201918). IEEE, Los Alamitos, CA, 933--946."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_2_1_12_1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision. 4597--4605","author":"Sun Lin","unstructured":"Lin Sun , Kui Jia , Dit-Yan Yeung , and Bertram E. Shi . 2015. Human action recognition using factorized spatio-temporal convolutional networks . In Proceedings of the IEEE International Conference on Computer Vision. 4597--4605 . Lin Sun, Kui Jia, Dit-Yan Yeung, and Bertram E. Shi. 2015. Human action recognition using factorized spatio-temporal convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 4597--4605."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2012.59"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.435"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3174243.3174257"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1137\/0209021"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},{"key":"e_1_2_1_18_1","volume-title":"Proceedings of the USENIX Workshop on Hot Topics in Edge Computing (HotEdge\u201918)","author":"Biookaghazadeh Saman","year":"2018","unstructured":"Saman Biookaghazadeh , Ming Zhao , and Fengbo Ren . 2018 . Are FPGAs suitable for edge computing? In Proceedings of the USENIX Workshop on Hot Topics in Edge Computing (HotEdge\u201918) . Saman Biookaghazadeh, Ming Zhao, and Fengbo Ren. 2018. Are FPGAs suitable for edge computing? In Proceedings of the USENIX Workshop on Hot Topics in Edge Computing (HotEdge\u201918)."},{"key":"e_1_2_1_19_1","unstructured":"Intel. n.d. Intel FPGA SDK for Open CL Programming Guide. Intel.  Intel. n.d. Intel FPGA SDK for Open CL Programming Guide. Intel."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/FPT.2017.8280160"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/2684746.2689060"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2847263.2847276"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2018.2815603"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.3390\/electronics8010065"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3242898"},{"key":"e_1_2_1_26_1","volume-title":"Retrieved","year":"2021","unstructured":"Intel. n.d. Fog Reference Unit . Retrieved February 22, 2021 from https:\/\/www.intel.com\/content\/www\/us\/en\/internet-of-things\/fog-reference-design-overview.html Intel. n.d. Fog Reference Unit. Retrieved February 22, 2021 from https:\/\/www.intel.com\/content\/www\/us\/en\/internet-of-things\/fog-reference-design-overview.html"},{"key":"e_1_2_1_27_1","volume-title":"n.d. Home Page. Retrieved","year":"2021","unstructured":"Pytorch. n.d. Home Page. Retrieved February 22, 2021 from https:\/\/pytorch.org Pytorch. n.d. Home Page. Retrieved February 22, 2021 from https:\/\/pytorch.org"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCAS.2017.8050344"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00012"}],"container-title":["ACM Journal on Emerging Technologies in Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3432816","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3432816","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3432816","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:47:11Z","timestamp":1750193231000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3432816"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,4,29]]},"references-count":29,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2021,4,30]]}},"alternative-id":["10.1145\/3432816"],"URL":"https:\/\/doi.org\/10.1145\/3432816","relation":{},"ISSN":["1550-4832","1550-4840"],"issn-type":[{"value":"1550-4832","type":"print"},{"value":"1550-4840","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,4,29]]},"assertion":[{"value":"2020-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-10-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-04-29","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}