{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,6]],"date-time":"2026-05-06T11:36:23Z","timestamp":1778067383067,"version":"3.51.4"},"reference-count":94,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2021,8,12]],"date-time":"2021-08-12T00:00:00Z","timestamp":1628726400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2021,9,30]]},"abstract":"<jats:p>Stencil-based algorithms are a relevant class of computational kernels in high-performance systems, as they appear in a plethora of fields, from image processing to seismic simulations, from numerical methods to physical modeling. Among the various incarnations of stencil-based computations, <jats:bold>Iterative Stencil Loops (ISLs)<\/jats:bold> and <jats:bold>Convolutional Neural Networks (CNNs)<\/jats:bold> represent two well-known examples of kernels belonging to the stencil class. Indeed, ISLs apply the same stencil several times until convergence, while CNN layers leverage stencils to extract features from an image. The computationally intensive essence of ISLs, CNNs, and in general stencil-based workloads, requires solutions able to produce efficient implementations in terms of throughput and power efficiency. In this context, FPGAs are ideal candidates for such workloads, as they allow design architectures tailored to the stencil regular computational pattern. Moreover, the ever-growing need for performance enhancement leads FPGA-based architectures to scale to multiple devices to benefit from a distributed acceleration. 
For this reason, we propose a library of HDL components to effectively compute ISLs and CNNs inference on FPGA, along with a scalable multi-FPGA architecture, based on custom PCB interconnects. Our solution eases the design flow and guarantees both scalability and performance competitive with state-of-the-art works.<\/jats:p>","DOI":"10.1145\/3461478","type":"journal-article","created":{"date-parts":[[2021,8,12]],"date-time":"2021-08-12T14:51:22Z","timestamp":1628779882000},"page":"1-33","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":14,"title":["Enhancing the Scalability of Multi-FPGA Stencil Computations via Highly Optimized HDL Components"],"prefix":"10.1145","volume":"14","author":[{"given":"Enrico","family":"Reggiani","sequence":"first","affiliation":[{"name":"Barcelona Supercomputing Center, Spain and Politecnico di Milano, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Emanuele","family":"Del Sozzo","sequence":"additional","affiliation":[{"name":"Politecnico di Milano, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Davide","family":"Conficconi","sequence":"additional","affiliation":[{"name":"Politecnico di Milano, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Giuseppe","family":"Natale","sequence":"additional","affiliation":[{"name":"Xilinx Research Labs, Ireland and Politecnico di Milano, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Carlo","family":"Moroni","sequence":"additional","affiliation":[{"name":"E-lysis, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Marco D.","family":"Santambrogio","sequence":"additional","affiliation":[{"name":"Politecnico di Milano, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,8,12]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the Design, Automation Test 
in Europe Conference Exhibition (DATE\u201919)","author":"Ahmad A.","unstructured":"A. Ahmad and M. A. Pasha. 2019. Towards design space exploration and optimization of fast algorithms for convolutional neural networks (CNNs) on FPGAs. In Proceedings of the Design, Automation Test in Europe Conference Exhibition (DATE\u201919). 1106\u20131111."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3380548"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.imavis.2009.09.002"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3020078.3021738"},{"key":"e_1_2_1_5_1","volume-title":"Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201917)","author":"Bacis M.","year":"2017","unstructured":"M. Bacis, G. Natale, E. Del Sozzo, and M. D. Santambrogio. 2017. A pipelined and scalable dataflow implementation of convolutional neural networks on FPGA. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201917). 90\u201397. DOI:https:\/\/doi.org\/10.1109\/IPDPSW.2017.44"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.5555\/2388996.2389051"},{"key":"e_1_2_1_7_1","doi-asserted-by":"crossref","unstructured":"Uday Bondhugula. 2008. 
PLUTO - An automatic parallelizer and locality optimizer for affine loop nests. Retrieved from http:\/\/pluto-compiler.sourceforge.net\/.","DOI":"10.1145\/1375581.1375595"},{"key":"e_1_2_1_8_1","unstructured":"Uday Bondhugula. 2008. PLUTO Compiler Repository - Examples. Retrieved from https:\/\/github.com\/bondhugula\/pluto\/tree\/master\/examples."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.5555\/1788374.1788386"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1379022.1375595"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2012.04.017"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/2842615"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.5555\/3195638.3195647"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240765.3240850"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.70"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3431920.3439291"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/2593069.2593090"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2015.2488491"},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201916)","author":"Sozzo E. Del","year":"2016","unstructured":"E. Del Sozzo, A. Solazzo, A. Miele, and M. D. Santambrogio. 2016. On the automation of high level synthesis of convolutional neural networks. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201916). 217\u2013224. 
DOI:https:\/\/doi.org\/10.1109\/IPDPSW.2016.153"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2005.09.021"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2009.5272559"},{"key":"e_1_2_1_22_1","volume-title":"Polyhedron model. Encyclopedia of Parallel Computing 1","author":"Feautrier Paul","year":"2011","unstructured":"Paul Feautrier and Christian Lengauer. 2011. Polyhedron model. Encyclopedia of Parallel Computing 1 (2011), 1581\u20131592."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-007-0111-y"},{"key":"e_1_2_1_24_1","volume-title":"Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS\u201917)","author":"Gokhale V.","unstructured":"V. Gokhale, A. Zaidy, A. X. M. Chang, and E. Culurciello. 2017. Snowflake: An efficient hardware accelerator for convolutional neural networks. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS\u201917). 
1\u20134."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/InPar.2012.6339595"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2017.2705069"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3289185"},{"key":"e_1_2_1_28_1","volume-title":"Dally","author":"Han Song","year":"2015","unstructured":"Song Han, Huizi Mao, and William J. Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149 (2015)."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.5555\/573190"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2304576.2304619"},{"key":"e_1_2_1_31_1","unstructured":"Amazon Inc. 2018. EC2 F1 Instances. Retrieved from https:\/\/aws.amazon.com\/it\/ec2\/instance-types\/f1\/."},{"key":"e_1_2_1_32_1","unstructured":"Microsoft Inc. 2018. Project Brainwave. Retrieved from https:\/\/www.microsoft.com\/en-us\/research\/blog\/microsoft-unveils-project-brainwave\/."},{"key":"e_1_2_1_33_1","unstructured":"Xilinx Inc. 2018. Aurora 64B\/66B link-layer protocol.g. 
Retrieved from https:\/\/www.xilinx.com\/support\/documentation\/ip_documentation\/aurora_64b66b_protocol_spec_sp011.pdf."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1137\/07070574X"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654889"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3140659.3080246"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-45234-8_73"},{"key":"e_1_2_1_38_1","volume-title":"Quantizing deep convolutional networks for efficient inference: A whitepaper. CoRR abs\/1806.08342","author":"Krishnamoorthi Raghuraman","year":"2018","unstructured":"Raghuraman Krishnamoorthi. 2018. Quantizing deep convolutional networks for efficient inference: A whitepaper. CoRR abs\/1806.08342 (2018)."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.5555\/2999134.2999257"},{"key":"e_1_2_1_40_1","volume-title":"Fast algorithms for convolutional neural networks. CoRR abs\/1509.09308","author":"Lavin Andrew","year":"2015","unstructured":"Andrew Lavin. 2015. Fast algorithms for convolutional neural networks. CoRR abs\/1509.09308 (2015). Retrieved from https:\/\/arxiv.org\/pdf\/1704.04760.pdf."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},{"key":"e_1_2_1_42_1","volume-title":"Proceedings of the 26th International Conference on Field-programmable Logic and Applications (FPL\u201916)","author":"Li Huimin","year":"2016","unstructured":"Huimin Li, Xitian Fan, Li Jiao, Wei Cao, Xuegong Zhou, and Lingli Wang. 2016. A high performance FPGA-based accelerator for large-scale convolutional neural networks. 
In Proceedings of the 26th International Conference on Field-programmable Logic and Applications (FPL\u201916). IEEE, 1\u20139."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2019.2897701"},{"key":"e_1_2_1_44_1","volume-title":"Proceedings of the International Conference on Field-programmable Technology (FPT\u201916)","author":"Liu Zhiqiang","year":"2016","unstructured":"Zhiqiang Liu, Yong Dou, Jingfei Jiang, and Jinwei Xu. 2016. Automatic code generation of convolutional neural networks in FPGA implementation. In Proceedings of the International Conference on Field-programmable Technology (FPT\u201916). IEEE, 61\u201368."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3020078.3021736"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.vlsi.2017.12.009"},{"key":"e_1_2_1_47_1","volume-title":"Proceedings of the 24th International Conference on Field Programmable Logic and Applications (FPL\u201914)","author":"Markettos A. Theodore","unstructured":"A. Theodore Markettos, Paul J. Fox, Simon W. Moore, and Andrew W. Moore. 2014. 
Interconnect for commodity FPGA clusters: Standardized or customized? In Proceedings of the 24th International Conference on Field Programmable Logic and Applications (FPL\u201914). IEEE, 1\u20138."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1029\/96JC02775"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/1542275.1542313"},{"key":"e_1_2_1_50_1","volume-title":"Proceedings of the IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP\u201918)","author":"Mondigo A.","unstructured":"A. Mondigo, K. Sano, and H. Takizawa. 2018. Performance Estimation of deeply pipelined fluid simulation on multiple FPGAs with high-speed communication subsystem. In Proceedings of the IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP\u201918). 1\u20134."},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463209.2488797"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1016\/0010-4655(94)90048-5"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/2966986.2966995"},{"key":"e_1_2_1_54_1","unstructured":"NVIDIA. 2018. TensorRT. Retrieved from https:\/\/developer.nvidia.com\/tensorrt."},{"key":"e_1_2_1_55_1","volume-title":"NIPS 2017 Workshop Autodiff Submission. Retrieved on","author":"Paszke Adam","year":"2017","unstructured":"Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. 
Automatic differentiation in PyTorch. NIPS 2017 Workshop Autodiff Submission. Retrieved on 28 Oct, 2017 from https:\/\/openreview.net\/forum?.id=BJJsrmfCZ."},{"key":"e_1_2_1_56_1","volume-title":"Proceedings of the IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP\u201917)","author":"Podili A.","unstructured":"A. Podili, C. Zhang, and V. Prasanna. 2017. Fast and efficient implementation of convolutional neural networks on FPGA. In Proceedings of the IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP\u201917). 11\u201318."},{"key":"e_1_2_1_57_1","volume-title":"Proceedings of the IEEE International Conference on Embedded Software and Systems (ICESS\u201919)","author":"Qasaimeh Murad","unstructured":"Murad Qasaimeh, Kristof Denolf, Jack Lo, Kees Vissers, Joseph Zambreno, and Phillip H. Jones. 2019. Comparing energy efficiency of CPU, GPU and FPGA implementations for vision kernels. In Proceedings of the IEEE International Conference on Embedded Software and Systems (ICESS\u201919). 
IEEE, 1\u20138."},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/2847263.2847265"},{"key":"e_1_2_1_59_1","volume-title":"Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201918)","author":"Raspa N.","year":"2018","unstructured":"N. Raspa, G. Natale, M. Bacis, and M. D. Santambrogio. 2018. A framework with cloud integration for CNN acceleration on FPGA devices. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201918). 170\u2013177. DOI:https:\/\/doi.org\/10.1109\/IPDPSW.2018.00033"},{"key":"e_1_2_1_60_1","volume-title":"Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201918)","author":"Reggiani Enrico","unstructured":"Enrico Reggiani, Giuseppe Natale, Carlo Moroni, and Marco D. Santambrogio. 2018. An FPGA-based acceleration methodology and performance model for iterative stencils. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201918). 
IEEE, 115\u2013122."},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW.2019.00028"},{"key":"e_1_2_1_62_1","volume-title":"Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA\u201912)","author":"Richter Franz","year":"2012","unstructured":"Franz Richter, Michael Schmidt, and Dietmar Fey. 2012. A Configurable VHDL template for parallelization of 3D stencil codes on FPGAs. In Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA\u201912). The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp)."},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASAP.2009.25"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2013.51"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2017.2691770"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.5555\/558008"},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.5555\/3195638.3195659"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/3174243.3174257"},{"key":"e_1_2_1_69_1","volume-title":"Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. 
Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)."},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1137\/S0036144599363084"},{"key":"e_1_2_1_71_1","volume-title":"Proceedings of the IEEE Computer Society Symposium on VLSI (ISVLSI\u201916)","author":"Solazzo A.","year":"2016","unstructured":"A. Solazzo, E. Del Sozzo, I. De Rose, M. De Silvestri, G. C. Durelli, and M. D. Santambrogio. 2016. Hardware design automation of convolutional neural networks. In Proceedings of the IEEE Computer Society Symposium on VLSI (ISVLSI\u201916). 224\u2013229. DOI:https:\/\/doi.org\/10.1109\/ISVLSI.2016.101"},{"key":"e_1_2_1_72_1","volume-title":"Parboil: A revised benchmark suite for scientific and commercial throughput computing","author":"Stratton John A.","year":"2012","unstructured":"John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Liu, and Wen mei W. Hwu. 2012. Parboil: A revised benchmark suite for scientific and commercial throughput computing. 
Center for Reliable and High-Performance Computing 127 (2012)."},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1109\/PDP.2010.43"},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1145\/2847263.2847276"},{"key":"e_1_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1145\/1989493.1989508"},{"key":"e_1_2_1_76_1","unstructured":"Maxeler Technologies. 2015. MPC-X Series. Retrieved from https:\/\/www.maxeler.com\/products\/mpc-xseries\/."},{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.5555\/3408352.3408729"},{"key":"e_1_2_1_78_1","volume-title":"Proceedings of the 27th International Conference on Field-programmable Logic and Applications (FPL\u201917)","author":"Venieris S. I.","unstructured":"S. I. Venieris and C. Bouganis. 2017. Latency-driven design for FPGA-based convolutional neural networks. In Proceedings of the 27th International Conference on Field-programmable Logic and Applications (FPL\u201917). 1\u20138."},{"key":"e_1_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.1145\/3186332"},{"key":"e_1_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2910824"},{"key":"e_1_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2016.2614981"},{"key":"e_1_2_1_82_1","doi-asserted-by":"publisher","DOI":"10.1145\/3061639.3062185"},{"key":"e_1_2_1_83_1","doi-asserted-by":"publisher","DOI":"10.1145\/2897937.2898003"},{"key":"e_1_2_1_84_1","doi-asserted-by":"publisher","DOI":"10.1145\/3061639.3062207"},{"key":"e_1_2_1_85_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF01217347"},{"key":"e_1_2_1_86_1","volume-title":"Multi-GPU training of convnets. 
arXiv preprint arXiv:1312.5853","author":"Yadan Omry","year":"2013","unstructured":"Omry Yadan, Keith Adams, Yaniv Taigman, and Facebook Ai Group. 2013. Multi-GPU training of convnets. arXiv preprint arXiv:1312.5853 (2013)."},{"key":"e_1_2_1_87_1","doi-asserted-by":"publisher","DOI":"10.1145\/2505515.2514699"},{"key":"e_1_2_1_88_1","doi-asserted-by":"publisher","DOI":"10.1145\/3174243.3174265"},{"key":"e_1_2_1_89_1","doi-asserted-by":"publisher","DOI":"10.1145\/2966986.2967011"},{"key":"e_1_2_1_90_1","doi-asserted-by":"publisher","DOI":"10.1145\/2684746.2689060"},{"key":"e_1_2_1_91_1","doi-asserted-by":"publisher","DOI":"10.1145\/3020078.3021727"},{"key":"e_1_2_1_92_1","doi-asserted-by":"publisher","DOI":"10.1145\/2934583.2934644"},{"key":"e_1_2_1_93_1","doi-asserted-by":"publisher","DOI":"10.1145\/3020078.3021698"},{"key":"e_1_2_1_94_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240765.3240801"}],"container-title":["ACM Transactions on Reconfigurable Technology and 
Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3461478","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3461478","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:28:35Z","timestamp":1750195715000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3461478"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,8,12]]},"references-count":94,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2021,9,30]]}},"alternative-id":["10.1145\/3461478"],"URL":"https:\/\/doi.org\/10.1145\/3461478","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"value":"1936-7406","type":"print"},{"value":"1936-7414","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,8,12]]},"assertion":[{"value":"2020-09-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-04-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-08-12","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}