{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,6]],"date-time":"2026-05-06T11:36:24Z","timestamp":1778067384767,"version":"3.51.4"},"reference-count":30,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,4,17]],"date-time":"2023-04-17T00:00:00Z","timestamp":1681689600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100000038","name":"Natural Sciences and Engineering Research Council of Canada","doi-asserted-by":"crossref","award":["RGPIN-2019-04613 and DGECR-2019-00120"],"award-info":[{"award-number":["RGPIN-2019-04613 and DGECR-2019-00120"]}],"id":[{"id":"10.13039\/501100000038","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Alliance","award":["ALLRP-552042-2020"],"award-info":[{"award-number":["ALLRP-552042-2020"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2023,6,30]]},"abstract":"<jats:p>Stencil computation is one of the fundamental computing patterns in many application domains such as scientific computing and image processing. While there are promising studies that accelerate stencils on FPGAs, there lacks an automated acceleration framework to systematically explore both spatial and temporal parallelisms for iterative stencils that could be either computation-bound or memory-bound. In this article, we present SASA, a scalable and automatic stencil acceleration framework on modern HBM-based FPGAs. 
SASA\u00a0takes the high-level stencil DSL and FPGA platform as inputs, automatically exploits the best spatial and temporal parallelism configuration based on our accurate analytical model, and generates the optimized FPGA design with the best parallelism configuration in TAPA high-level synthesis C++ as well as its corresponding host code. Compared to state-of-the-art automatic stencil acceleration framework SODA that only exploits temporal parallelism, SASA\u00a0achieves an average speedup of 3.41\u00d7 and up to 15.73\u00d7 speedup on the HBM-based Xilinx Alveo U280 FPGA board for a wide range of stencil kernels.<\/jats:p>","DOI":"10.1145\/3572547","type":"journal-article","created":{"date-parts":[[2023,1,31]],"date-time":"2023-01-31T12:05:40Z","timestamp":1675166740000},"page":"1-33","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["SASA: A Scalable and Automatic Stencil Acceleration Framework for Optimized Hybrid Spatial and Temporal Parallelism on HBM-based FPGAs"],"prefix":"10.1145","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6244-2101","authenticated-orcid":false,"given":"Xingyu","family":"Tian","sequence":"first","affiliation":[{"name":"School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0755-8843","authenticated-orcid":false,"given":"Zhifan","family":"Ye","sequence":"additional","affiliation":[{"name":"School of the Gifted Young, University of Science and Technology of China, Jinzhai Road, Hefei, Anhui, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3315-7368","authenticated-orcid":false,"given":"Alec","family":"Lu","sequence":"additional","affiliation":[{"name":"School of Engineering Science, Simon Fraser University, Burnaby, BC, 
Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0705-9510","authenticated-orcid":false,"given":"Licheng","family":"Guo","sequence":"additional","affiliation":[{"name":"Computer Science Department, University of California, Los Angeles, Westwood Plaza, Los Angeles, California, United States"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5885-0425","authenticated-orcid":false,"given":"Yuze","family":"Chi","sequence":"additional","affiliation":[{"name":"Computer Science Department, University of California, Los Angeles, Westwood Plaza, Los Angeles, California, United States"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0603-9697","authenticated-orcid":false,"given":"Zhenman","family":"Fang","sequence":"additional","affiliation":[{"name":"School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,4,17]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.partic.2013.11.004"},{"issue":"4","key":"e_1_3_2_3_2","article-title":"On how to accelerate iterative stencil loops: A scalable streaming-based approach","volume":"12","author":"Cattaneo Riccardo","year":"2015","unstructured":"Riccardo Cattaneo, Giuseppe Natale, Carlo Sicignano, Donatella Sciuto, and Marco Domenico Santambrogio. 2015. On how to accelerate iterative stencil loops: A scalable streaming-based approach. ACM Trans. Archit. Code Optim. 12, 4 (Dec.2015).","journal-title":"ACM Trans. Archit. 
Code Optim."},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.5555\/3437539.3437723"},{"key":"e_1_3_2_5_2","first-page":"1","volume-title":"Proceedings of the IEEE\/ACM International Conference on Computer-Aided Design (ICCAD)","author":"Chi Yuze","year":"2018","unstructured":"Yuze Chi, Jason Cong, Peng Wei, and Peipei Zhou. 2018. SODA: Stencil with optimized dataflow architecture. In Proceedings of the IEEE\/ACM International Conference on Computer-Aided Design (ICCAD). 1\u20138."},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM51124.2021.00032"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2018.00023"},{"key":"e_1_3_2_8_2","doi-asserted-by":"crossref","first-page":"278","DOI":"10.1145\/2435264.2435344","volume-title":"Proceedings of the ACM\/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA\u201913)","author":"Cooke Patrick","year":"2013","unstructured":"Patrick Cooke, Jeremy Fowers, Lee Hunt, and Greg Stitt. 2013. A high-performance, low-energy FPGA accelerator for correntropy-based feature tracking (Abstract Only). In Proceedings of the ACM\/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA\u201913). Association for Computing Machinery, New York, NY, 278."},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2016.10.023"},{"issue":"8","key":"e_1_3_2_10_2","article-title":"High-level synthesis design for stencil computations on FPGA with High bandwidth memory","volume":"9","author":"Du Changdao","year":"2020","unstructured":"Changdao Du and Yoshiki Yamaguchi. 2020. High-level synthesis design for stencil computations on FPGA with High bandwidth memory. 
Electronics 9, 8 (2020).","journal-title":"Electronics"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3174243.3174251"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/tip.2013.2237915"},{"key":"e_1_3_2_13_2","first-page":"121","volume-title":"Proceedings of the International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET)","author":"Firmansyah Iman","year":"2018","unstructured":"Iman Firmansyah, Yusuf Nur Wijayanto, and Yoshiki Yamaguchi. 2018. 2D stencil computation on cyclone V SoC FPGA using OpenCL. In Proceedings of the International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET). 121\u2013124."},{"key":"e_1_3_2_14_2","first-page":"81","volume-title":"Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays","author":"Guo Licheng","year":"2021","unstructured":"Licheng Guo, Yuze Chi, Jie Wang, Jason Lau, Weikang Qiao, Ecenur Ustun, Zhiru Zhang, and Jason Cong. 2021. Coupling coarse-grained floorplanning and pipelining for high-frequency HLS design on multi-die FPGAs. In Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays. Association for Computing Machinery, 81\u201392."},{"key":"e_1_3_2_15_2","first-page":"1","volume-title":"Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays","author":"Guo Licheng","year":"2022","unstructured":"Licheng Guo, Pongstorn Maidee, Yun Zhou, Chris Lavin, Jie Wang, Yuze Chi, Weikang Qiao, Alireza Kaviani, Zhiru Zhang, and Jason Cong. 2022. RapidStream: Parallel physical implementation of FPGA HLS designs. In Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays. 
1\u201312."},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/2304576.2304619"},{"key":"e_1_3_2_17_2","first-page":"1087","volume-title":"Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS)","author":"Kamalakkannan Kamalavasan","year":"2021","unstructured":"Kamalavasan Kamalakkannan, Gihan R. Mudalige, Istv\u00e1n Z. Reguly, and Suhaib A. Fahmy. 2021. High-level FPGA accelerator design for structured-mesh-based explicit numerical solvers. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS). 1087\u20131096."},{"issue":"1","key":"e_1_3_2_18_2","article-title":"Large-scale cellular automata on FPGAs: A new generic architecture and a framework","volume":"14","author":"Kyparissas Nikolaos","year":"2020","unstructured":"Nikolaos Kyparissas and Apostolos Dollas. 2020. Large-scale cellular automata on FPGAs: A new generic architecture and a framework. ACM Trans. Reconfig. Technol. Syst. 14, 1 (Dec.2020).","journal-title":"ACM Trans. Reconfig. Technol. Syst."},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3368826.3377904"},{"key":"e_1_3_2_20_2","first-page":"1","volume-title":"Proceedings of the IEEE\/ACM International Conference on Computer-Aided Design (ICCAD)","author":"Natale Giuseppe","year":"2016","unstructured":"Giuseppe Natale, Giulio Stramondo, Pietro Bressana, Riccardo Cattaneo, Donatella Sciuto, and Marco D. Santambrogio. 2016. A polyhedral model-based framework for dataflow implementation on FPGA devices of Iterative Stencil Loops. In Proceedings of the IEEE\/ACM International Conference on Computer-Aided Design (ICCAD). 1\u20138."},{"key":"e_1_3_2_21_2","first-page":"1","volume-title":"Proceedings of the ACM\/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Nguyen Anthony","year":"2010","unstructured":"Anthony Nguyen, Nadathur Satish, Jatin Chhugani, Changkyu Kim, and Pradeep Dubey. 
2010. 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. In Proceedings of the ACM\/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. 1\u201313."},{"issue":"3","key":"e_1_3_2_22_2","article-title":"Enhancing the scalability of multi-FPGA stencil computations via highly optimized HDL components","volume":"14","author":"Reggiani Enrico","year":"2021","unstructured":"Enrico Reggiani, Emanuele Del Sozzo, Davide Conficconi, Giuseppe Natale, Carlo Moroni, and Marco D. Santambrogio. 2021. Enhancing the scalability of multi-FPGA stencil computations via highly optimized HDL components. ACM Trans. Reconfig. Technol. Syst. 14, 3 (Aug.2021).","journal-title":"ACM Trans. Reconfig. Technol. Syst."},{"key":"e_1_3_2_23_2","first-page":"9","volume-title":"Proceedings of the 30th International Conference on Field-Programmable Logic and Applications (FPL)","author":"Singh Gagandeep","year":"2020","unstructured":"Gagandeep Singh, Dionysios Diamantopoulos, Christoph Hagleitner, Juan Gomez-Luna, Sander Stuijk, Onur Mutlu, and Henk Corporaal. 2020. NERO: A near high-bandwidth memory stencil accelerator for weather prediction modeling. In Proceedings of the 30th International Conference on Field-Programmable Logic and Applications (FPL). 9\u201317."},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2910824"},{"key":"e_1_3_2_25_2","first-page":"1","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Wang Hengjie","year":"2020","unstructured":"Hengjie Wang and Aparna Chandramowlishwaran. 2020. Pencil: A pipelined algorithm for distributed stencils. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 
1\u201316."},{"key":"e_1_3_2_26_2","first-page":"1","volume-title":"Proceedings of the 54th ACM\/EDAC\/IEEE Design Automation Conference (DAC)","author":"Wang Shuo","year":"2017","unstructured":"Shuo Wang and Yun Liang. 2017. A comprehensive framework for synthesizing stencil algorithms on FPGAs using OpenCL model. In Proceedings of the 54th ACM\/EDAC\/IEEE Design Automation Conference (DAC). 1\u20136."},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1007\/BF01217347"},{"key":"e_1_3_2_28_2","unstructured":"Xilinx. 2020. Alveo U280 Data Center Accelerator Cards Data Sheet. Retrieved from https:\/\/www.xilinx.com\/support\/documentation\/data_sheets\/ds963-u280.pdf."},{"key":"e_1_3_2_29_2","unstructured":"Xilinx. 2020. Vitis Unified Software Platform. Retrieved from https:\/\/www.xilinx.com\/products\/design-tools\/vitis\/vitis-platform.html#development."},{"key":"e_1_3_2_30_2","first-page":"153","volume-title":"Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays","author":"Zohouri Hamid Reza","year":"2018","unstructured":"Hamid Reza Zohouri, Artur Podobas, and Satoshi Matsuoka. 2018. Combined spatial and temporal blocking for high-performance stencil computation on FPGAs using OpenCL. In Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays. 153\u2013162."},{"key":"e_1_3_2_31_2","first-page":"123","volume-title":"Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","author":"Zohouri Hamid Reza","year":"2018","unstructured":"Hamid Reza Zohouri, Artur Podobas, and Satoshi Matsuoka. 2018. High-performance high-order stencil computation on FPGAs Using OpenCL. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 
123\u2013130."}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3572547","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3572547","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:51:14Z","timestamp":1750182674000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3572547"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,4,17]]},"references-count":30,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,6,30]]}},"alternative-id":["10.1145\/3572547"],"URL":"https:\/\/doi.org\/10.1145\/3572547","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"value":"1936-7406","type":"print"},{"value":"1936-7414","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,4,17]]},"assertion":[{"value":"2022-01-25","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-11-07","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-04-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}