{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,1]],"date-time":"2026-01-01T10:07:54Z","timestamp":1767262074699,"version":"3.41.0"},"reference-count":38,"publisher":"Association for Computing Machinery (ACM)","issue":"5s","license":[{"start":{"date-parts":[[2019,10,8]],"date-time":"2019-10-08T00:00:00Z","timestamp":1570492800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["CCF-1820537"],"award-info":[{"award-number":["CCF-1820537"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61472052"],"award-info":[{"award-number":["61472052"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Embed. Comput. Syst."],"published-print":{"date-parts":[[2019,10,31]]},"abstract":"<jats:p>Real-time Deep Neural Network (DNN) inference with low-latency requirement has become increasingly important for numerous applications in both cloud computing (e.g., Apple\u2019s Siri) and edge computing (e.g., Google\/Waymo\u2019s driverless car). FPGA-based DNN accelerators have demonstrated both superior flexibility and performance; in addition, for real-time inference with low batch size, FPGA is expected to achieve further performance improvement. However, the performance gain from the single-FPGA design is obstructed by the limited on-chip resource. In this paper, we employ multiple FPGAs to cooperatively run DNNs with the objective of achieving super-linear speed-up against single-FPGA design. 
In implementing such systems, we found two barriers that hinder us from achieving the design goal: (1) the lack of a clear partition scheme for each DNN layer to fully exploit parallelism, and (2) the insufficient bandwidth between the off-chip memory and the accelerator due to the growing size of DNNs. To tackle these issues, we propose a general framework, \u201cSuper-LIP\u201d, which can support different kinds of DNNs. In this paper, we take Convolutional Neural Network (CNN) as a vehicle to illustrate Super-LIP. We first formulate an accurate system-level model to support the exploration of best partition schemes. Then, we develop a novel design methodology to effectively alleviate the heavy loads on memory bandwidth by moving traffic from memory bus to inter-FPGA links. We implement Super-LIP based on ZCU102 FPGA boards. Results demonstrate that Super-LIP with 2 FPGAs can achieve 3.48\u00d7 speedup, compared to the state-of-the-art single-FPGA design. What is more, as the number of FPGAs scales up, the system latency can be further reduced while maintaining high energy efficiency.<\/jats:p>","DOI":"10.1145\/3358192","type":"journal-article","created":{"date-parts":[[2019,10,10]],"date-time":"2019-10-10T13:13:05Z","timestamp":1570713185000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":62,"title":["Achieving Super-Linear Speedup across Multi-FPGA for Real-Time DNN Inference"],"prefix":"10.1145","volume":"18","author":[{"given":"Weiwen","family":"Jiang","sequence":"first","affiliation":[{"name":"East China Normal University, University of Pittsburgh, University of Notre Dame"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Edwin H.-M.","family":"Sha","sequence":"additional","affiliation":[{"name":"East China Normal University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xinyi","family":"Zhang","sequence":"additional","affiliation":[{"name":"University 
of Pittsburgh"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lei","family":"Yang","sequence":"additional","affiliation":[{"name":"University of Pittsburgh"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Qingfeng","family":"Zhuge","sequence":"additional","affiliation":[{"name":"East China Normal University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yiyu","family":"Shi","sequence":"additional","affiliation":[{"name":"University of Notre Dame"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jingtong","family":"Hu","sequence":"additional","affiliation":[{"name":"University of Pittsburgh"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2019,10,8]]},"reference":[{"volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9252--9260","author":"Balakrishnan Guha","key":"e_1_2_1_1_1","unstructured":"Guha Balakrishnan, Amy Zhao, Mert R. Sabuncu, John Guttag, and Adrian V. Dalca. 2018. An unsupervised learning model for deformable medical image registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9252--9260."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2018.022071131"},{"key":"e_1_2_1_3_1","volume-title":"On the universal approximability and complexity bounds of quantized ReLU neural networks. arXiv preprint arXiv:1802.03646","author":"Ding Yukun","year":"2018","unstructured":"
Yukun Ding, Jinglan Liu, Jinjun Xiong, and Yiyu Shi. 2018. On the universal approximability and complexity bounds of quantized ReLU neural networks. arXiv preprint arXiv:1802.03646 (2018)."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00012"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2018.00021"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISVLSI.2016.129"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2018.2857098"},{"key":"e_1_2_1_8_1","volume-title":"arXiv preprint arXiv:1907.04650","author":"Jiang Weiwen","year":"2019","unstructured":"Weiwen Jiang, Lei Yang, Edwin Sha, Qingfeng Zhuge, Shouzhen Gu, Yiyu Shi, and Jingtong Hu. 2019. Hardware\/Software co-exploration of neural architectures. arXiv preprint arXiv:1907.04650 (2019)."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3316781.3317757"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},{"key":"e_1_2_1_11_1","volume-title":"Hinton","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 
1097--1105."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3020078.3021736"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/LES.2018.2815954"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/HOTCHIPS.2014.7478821"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.91"},{"key":"e_1_2_1_16_1","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems. 91--99."},{"key":"e_1_2_1_17_1","doi-asserted-by":"crossref","unstructured":"Junzhong Shen, Deguang Wang, You Huang, Mei Wen, and Chunyuan Zhang. 2019. Accelerating 3D CNN-based lung nodule segmentation on a multi-FPGA system. In FPGA. 117.","DOI":"10.1145\/3289602.3293935"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3316781.3317906"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3140659.3080221"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2847263.2847276"},{"volume-title":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 40--47","author":"Stylianos","key":"e_1_2_1_21_1","unstructured":"Stylianos I. Venieris and Christos-Savvas Bouganis. 2016. 
fpgaConvNet: A framework for mapping convolutional neural networks on FPGAs. In 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 40--47."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2018.2873210"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2018.2868062"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2018.2791440"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1347375.1347389"},{"key":"e_1_2_1_26_1","volume-title":"Rui Mao, Zili Shao, and Tao Li.","author":"Wu Shangyu","year":"2019","unstructured":"Shangyu Wu, Yi Wang, Amelie Chi Zhou, Rui Mao, Zili Shao, and Tao Li. 2019. Towards cross-platform inference on edge devices with emerging neuromorphic architecture. In 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 806--811."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41928-018-0059-3"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00866"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISQED.2018.8357326"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3061639.3062323"},{"key":"e_1_2_1_31_1","volume-title":"Dutt","author":"Yang Lei","year":"2018","unstructured":"Lei Yang, Weichen Liu, Nan Guan, and Nikil D. Dutt. 2018. Optimal application mapping and scheduling for network-on-chips with computation in STT-RAM based router. 
IEEE Trans. Comput. (2018)."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2016.2643669"},{"key":"e_1_2_1_33_1","volume-title":"Recent trends in deep learning based natural language processing","author":"Young Tom","year":"2018","unstructured":"Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2018. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine 13, 3 (2018), 55--75."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/2684746.2689060"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/2934583.2934644"},{"key":"e_1_2_1_36_1","volume-title":"Freedman","author":"Zhang Haoyu","year":"2017","unstructured":"Haoyu Zhang, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Paramvir Bahl, and Michael J. Freedman. 2017. Live video analytics at scale with approximation and delay-tolerance. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). 377--392."},{"volume-title":"Automation & Test in Europe Conference & Exhibition (DATE)","author":"Zhang Wentai","key":"e_1_2_1_37_1","unstructured":"
Wentai Zhang, Jiaxi Zhang, Minghua Shen, Guojie Luo, and Nong Xiao. 2019. An efficient mapping approach to large-scale DNNs on multi-FPGA architectures. In 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 1241--1244."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240765.3240801"}],"container-title":["ACM Transactions on Embedded Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3358192","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3358192","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3358192","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:32:58Z","timestamp":1750199578000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3358192"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,10,8]]},"references-count":38,"journal-issue":{"issue":"5s","published-print":{"date-parts":[[2019,10,31]]}},"alternative-id":["10.1145\/3358192"],"URL":"https:\/\/doi.org\/10.1145\/3358192","relation":{},"ISSN":["1539-9087","1558-3465"],"issn-type":[{"type":"print","value":"1539-9087"},{"type":"electronic","value":"1558-3465"}],"subject":[],"published":{"date-parts":[[2019,10,8]]},"assertion":[{"value":"2019-04-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-07-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-10-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}