{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T06:19:34Z","timestamp":1767853174824,"version":"3.49.0"},"reference-count":65,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,3,11]],"date-time":"2023-03-11T00:00:00Z","timestamp":1678492800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"EU Horizon 2020 Programme","award":["957269"],"award-info":[{"award-number":["957269"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2023,6,30]]},"abstract":"<jats:p>\n            Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory technologies, but when applications are memory-bound designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon in domain-specific experts. In this article, we propose an automated tool flow from a domain-specific language for tensor expressions to generate massively parallel accelerators on high-bandwidth-memory-equipped FPGAs. Designers can use this flow to integrate and evaluate various compiler or hardware optimizations. We use\n            <jats:bold>computational fluid dynamics (CFD)<\/jats:bold>\n            as a paradigmatic example. 
Our flow starts from the high-level specification of tensor operations and combines a multi-level intermediate representation\u2013based compiler with an in-house hardware generation flow to generate systems with parallel accelerators and a specialized memory architecture that moves data efficiently, aiming at fully exploiting the available CPU-FPGA bandwidth. We simulated applications with millions of elements, achieving up to 103 GFLOPS with one compute unit and custom precision when targeting a Xilinx Alveo U280. Our FPGA implementation is up to 25\u00d7 more energy efficient than expert-crafted Intel CPU implementations.\n          <\/jats:p>","DOI":"10.1145\/3563553","type":"journal-article","created":{"date-parts":[[2022,9,15]],"date-time":"2022-09-15T09:56:18Z","timestamp":1663235778000},"page":"1-34","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":8,"title":["Automatic Creation of High-bandwidth Memory Architectures from Domain-specific Languages: The Case of Computational Fluid Dynamics"],"prefix":"10.1145","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7379-8007","authenticated-orcid":false,"given":"Stephanie","family":"Soldavini","sequence":"first","affiliation":[{"name":"Politecnico di Milano, Milano, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9534-3978","authenticated-orcid":false,"given":"Karl","family":"Friebel","sequence":"additional","affiliation":[{"name":"Technische Universit\u00e4t Dresden, Dresden, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1113-3987","authenticated-orcid":false,"given":"Mattia","family":"Tibaldi","sequence":"additional","affiliation":[{"name":"Politecnico di Milano, Milano, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4737-8612","authenticated-orcid":false,"given":"Gerald","family":"Hempel","sequence":"additional","affiliation":[{"name":"Technische Universit\u00e4t Dresden, Dresden, 
Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5007-445X","authenticated-orcid":false,"given":"Jeronimo","family":"Castrillon","sequence":"additional","affiliation":[{"name":"Technische Universit\u00e4t Dresden, Dresden, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9315-1788","authenticated-orcid":false,"given":"Christian","family":"Pilato","sequence":"additional","affiliation":[{"name":"Politecnico di Milano, Milano, Italy"}]}],"member":"320","published-online":{"date-parts":[[2023,3,11]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/3183895"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.13140\/RG.2.2.15009.10082"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/3278122.3278131"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2017.06.012"},{"key":"e_1_3_2_6_2","unstructured":"Mart\u00edn Abadi Paul Barham Jianmin Chen Zhifeng Chen Andy Davis Jeffrey Dean Matthieu Devin Sanjay Ghemawat Geoffrey Irving Michael Isard Manjunath Kudlur Josh Levenberg Rajat Monga Sherry Moore Derek G. Murray Benoit Steiner Paul Tucker Vijay Vasudevan Pete Warden Martin Wicke Yuan Yu and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. arxiv:1605.08695 [cs.DC]. Retrieved from https:\/\/arxiv.org\/abs\/1605.08695."},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.1903.01855"},{"key":"e_1_3_2_8_2","unstructured":"AMD 2021. 2nd gen AMD EPYC 7282. AMD. Retrieved from https:\/\/www.amd.com\/en\/products\/cpu\/amd-epyc-7282."},{"issue":"1","key":"e_1_3_2_9_2","first-page":"48","article-title":"Scheduling algorithms for high-level synthesis","volume":"5","author":"Baruch Zoltan","year":"1996","unstructured":"Zoltan Baruch. 1996. Scheduling algorithms for high-level synthesis. ACAM Sci. J. 5, 1-2 (1996), 48\u201357.","journal-title":"ACAM Sci. 
J."},{"key":"e_1_3_2_10_2","first-page":"3","volume-title":"Proceedings of the Python in Science Conference (SciPy\u201910)","author":"Bergstra James","year":"2010","unstructured":"James Bergstra, Olivier Breuleux, Fr\u00e9d\u00e9ric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-farley, and Yoshua Bengio. 2010. Theano: A CPU and GPU math expression compiler. In Proceedings of the Python in Science Conference (SciPy\u201910). 3\u201310."},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/PACT52795.2021.00009"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL53798.2021.00022"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-29400-7_34"},{"key":"e_1_3_2_14_2","volume-title":"Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (OSDI\u201918)","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (OSDI\u201918). 
USENIX Association."},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/1862648.1862653"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3431920.3439301"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-26408-0_8"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3061639.3062208"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/DAC18074.2021.9586110"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/Cluster48925.2021.00112"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/Cluster48925.2021.00116"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476206"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.23919\/DATE48585.2020.9116317"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3469030"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW52791.2021.00030"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2021.3075765"},{"key":"e_1_3_2_27_2","first-page":"337","volume-title":"Proceedings of the International Conference on Parallel Processing and Applied Mathematics","author":"Huismann Immo","year":"2017","unstructured":"Immo Huismann, Matthias Lieber, J\u00f6rg Stiller, and Jochen Fr\u00f6hlich. 2017. Load balancing for cpu-gpu coupling in computational fluid dynamics. In Proceedings of the International Conference on Parallel Processing and Applied Mathematics. 
337\u2013347."},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/IMW.2017.7939084"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL50879.2020.00013"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3192366.3192379"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL50879.2020.00056"},{"key":"e_1_3_2_32_2","unstructured":"Young kyu Choi Yuze Chi Jie Wang Licheng Guo and Jason Cong. 2020. When HLS meets FPGA HBM: Benchmarking and bandwidth optimization. arxiv:2010.06075 [cs.AR]. Retrieved from https:\/\/arxiv.org\/abs\/2010.06075."},{"key":"e_1_3_2_33_2","unstructured":"Yi-Hsiang Lai Yuze Chi Yuwei Hu Jie Wang Cody Hao Yu Yuan Zhou Jason Cong and Zhiru Zhang. 2019. HeteroCL: A multi-paradigm programming infrastructure for software-defined reconfigurable computing. Association for Computing Machinery New York NY."},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/3469660"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2004.1281665"},{"key":"e_1_3_2_36_2","article-title":"MLIR: A compiler infrastructure for the end of Moore\u2019s law","author":"Lattner Chris","year":"2020","unstructured":"Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2020. MLIR: A compiler infrastructure for the end of Moore\u2019s law. arXiv:2002.11054. Retrieved from https:\/\/arxiv.org\/abs\/2002.11054.","journal-title":"arXiv:2002.11054"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3431920.3439463"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICFPT51103.2020.00027"},{"key":"e_1_3_2_39_2","article-title":"Spectral element methods for the incompressible Navier-Stokes equations","author":"Maday Yvon","year":"1989","unstructured":"Yvon Maday and Anthony T. Patera. 1989. 
Spectral element methods for the incompressible Navier-Stokes equations. In State-of-the-art Surveys on Computational Mechanics. American Society of Mechanical Engineers, New York, 71\u2013143.","journal-title":"State-of-the-art Surveys on Computational Mechanics"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/H2RC51942.2020.00007"},{"key":"e_1_3_2_41_2","unstructured":"William S. Moses Lorenzo Chelini Ruizhe Zhao and Oleksandr Zinenko. 2021. Polygeist: Affine C in MLIR. Retrieved from https:\/\/acohen.gitlabpages.inria.fr\/impact\/impact2021\/papers\/IMPACT_2021_paper_1.pdf."},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2015.2513673"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/PMBS51919.2020.00007"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.23919\/DATE51398.2021.9473940"},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2016.2611506"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/2491956.2462176"},{"key":"e_1_3_2_47_2","first-page":"1","volume-title":"Proceedings of the International Workshop on FPGAs for Software Programmers (FSP\u201919)","author":"Rajagopala Abhi D.","year":"2019","unstructured":"Abhi D. Rajagopala, Ron Sass, and Andrew Schmidt. 2019. Impact of off-chip memories on HLS-generated circuits. In Proceedings of the International Workshop on FPGAs for Software Programmers (FSP\u201919). 
1\u201310."},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/3315454.3329959"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPT.2007.4439254"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/eScience.2011.51"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM51124.2021.00017"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2003.1202420"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2021.3088396"},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL50879.2020.00014"},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2020.3012318"},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/3400302.3415730"},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.1177\/1094342018816368"},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-16214-0_42"},{"key":"e_1_3_2_59_2","article-title":"Composable and modular code generation in MLIR: A structured and retargetable approach to tensor compiler construction","author":"Vasilache Nicolas","year":"2022","unstructured":"Nicolas Vasilache, Oleksandr Zinenko, Aart J. C. Bik, Mahesh Ravishankar, Thomas Raoux, Alexander Belyaev, Matthias Springer, Tobias Gysi, Diego Caballero, Stephan Herhut, et\u00a0al. 2022. Composable and modular code generation in MLIR: A structured and retargetable approach to tensor compiler construction. arXiv:2202.03293. Retrieved from https:\/\/arxiv.org\/abs\/2202.03293.","journal-title":"arXiv:2202.03293"},{"key":"e_1_3_2_60_2","article-title":"Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions","author":"Vasilache Nicolas","year":"2018","unstructured":"Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2018. 
Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv:1802.04730. Retrieved from https:\/\/arxiv.org\/abs\/1802.04730.","journal-title":"arXiv:1802.04730"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1145\/3400302.3415643"},{"key":"e_1_3_2_62_2","doi-asserted-by":"crossref","first-page":"299","DOI":"10.1007\/978-3-642-15582-6_49","volume-title":"Mathematical Software\u2013ICMS 2010","author":"Verdoolaege Sven","year":"2010","unstructured":"Sven Verdoolaege. 2010. isl: An integer set library for the polyhedral model. In Mathematical Software\u2013ICMS 2010, Komei Fukuda, Joris van der Hoeven, Michael Joswig, and Nobuki Takayama (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 299\u2013302."},{"key":"e_1_3_2_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/UIC-ATC-ScalCom-CBDCom-IoP.2015.199"},{"key":"e_1_3_2_64_2","unstructured":"Xilinx. 2021. Vitis 2021.1 Acceleration Environment. Retrieved from https:\/\/www.xilinx.com\/support\/documentation-navigation\/design-hubs\/2021-1\/dh0088-vitis-acceleration.html."},{"key":"e_1_3_2_65_2","article-title":"Phism: Polyhedral high-level synthesis in MLIR","author":"Zhao Ruizhe","year":"2021","unstructured":"Ruizhe Zhao and Jianyi Cheng. 2021. Phism: Polyhedral high-level synthesis in MLIR. arXiv:2103.15103 [cs]. 
Retrieved from https:\/\/arxiv.org\/abs\/2103.15103.","journal-title":"arXiv:2103.15103 [cs]"},{"key":"e_1_3_2_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/eScience.2016.7870932"}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3563553","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3563553","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:36Z","timestamp":1750182576000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3563553"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,3,11]]},"references-count":65,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,6,30]]}},"alternative-id":["10.1145\/3563553"],"URL":"https:\/\/doi.org\/10.1145\/3563553","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"value":"1936-7406","type":"print"},{"value":"1936-7414","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,3,11]]},"assertion":[{"value":"2022-02-08","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-08-28","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-03-11","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}