{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,22]],"date-time":"2026-01-22T06:49:25Z","timestamp":1769064565822,"version":"3.49.0"},"reference-count":31,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2025,3,21]],"date-time":"2025-03-21T00:00:00Z","timestamp":1742515200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Key Digital Technologies Joint Undertaking under the REBECCA Project","award":["JPMJCR21D2"],"award-info":[{"award-number":["JPMJCR21D2"]}]},{"name":"Spoke 1 on Future HPC of the Italian Research Center on High-Performance Computing, Big Data and Quantum Computing"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2025,6,30]]},"abstract":"<jats:p>High-level synthesis (HLS) aims at democratizing custom hardware acceleration with highly abstracted software-like descriptions. However, efficient accelerators still require substantial low-level hardware optimizations, defeating the HLS intent. In the context of field-programmable gate arrays, digital signal processors (DSPs) are a crucial resource that typically requires a significant optimization effort for its efficient utilization, especially when used for sub-word vectorization. This work proposes SILVIA, an open-source LLVM transformation pass that automatically identifies superword-level parallelism within an HLS design and exploits it by packing multiple operations, such as additions, multiplications, and multiply-and-adds, into a single DSP. 
SILVIA is integrated in the flow of the commercial AMD Vitis HLS tool and proves its effectiveness by packing multiple operations on the DSPs without any manual source-code modifications on several diverse state-of-the-art HLS designs such as convolutional neural networks and basic linear algebra subprograms accelerators, reducing the DSP utilization for additions by 70% and for multiplications and multiply-and-adds by 50% on average.<\/jats:p>","DOI":"10.1145\/3705324","type":"journal-article","created":{"date-parts":[[2024,11,21]],"date-time":"2024-11-21T13:59:16Z","timestamp":1732197556000},"page":"1-16","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["SILVIA: Automated Superword-Level Parallelism Exploitation via HLS-specific LLVM Passes for Compute-Intensive FPGA Accelerators"],"prefix":"10.1145","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1656-8376","authenticated-orcid":false,"given":"Giovanni","family":"Brignone","sequence":"first","affiliation":[{"name":"Politecnico di Torino, Turin, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-3431-9618","authenticated-orcid":false,"given":"Roberto","family":"Bosio","sequence":"additional","affiliation":[{"name":"Politecnico di Torino, Turin, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2989-3634","authenticated-orcid":false,"given":"Fabrizio","family":"Ottati","sequence":"additional","affiliation":[{"name":"Politecnico di Torino, Turin, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2565-9077","authenticated-orcid":false,"given":"Claudio","family":"Sanso\u00e8","sequence":"additional","affiliation":[{"name":"Politecnico di Torino, Turin, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9762-6522","authenticated-orcid":false,"given":"Luciano","family":"Lavagno","sequence":"additional","affiliation":[{"name":"Politecnico di Torino, Turin, 
Italy"}]}],"member":"320","published-online":{"date-parts":[[2025,3,21]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/3508352.3549424"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/212094.212131"},{"key":"e_1_3_2_4_2","unstructured":"AMD. 2022. Versal ACAP DSP Engine Architecture Manual (AM004). AMD. Retrieved from https:\/\/docs.amd.com\/r\/en-US\/am004-versal-dsp-engine"},{"key":"e_1_3_2_5_2","doi-asserted-by":"crossref","first-page":"33","DOI":"10.1145\/1950413.1950423","volume-title":"Proceedings of the 19th ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays","author":"Canis Andrew","year":"2011","unstructured":"Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Jason H. Anderson, Stephen Brown, and Tomasz Czajkowski. 2011. LegUp: High-level synthesis for FPGA-based processor\/accelerator systems. In Proceedings of the 19th ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, New York, NY, 33\u201336."},{"key":"e_1_3_2_6_2","unstructured":"Yao Fu Ephrem Wu Ashish Sirasao Sedny Attia Kamran Khan and Ralph Wittig. 2017. Deep Learning with INT8 Optimization on Xilinx Devices. Xilinx. Retrieved from https:\/\/japan.origin.xilinx.com\/content\/dam\/xilinx\/support\/documents\/white_papers\/wp486-deep-learning-int8.pdf"},{"key":"e_1_3_2_7_2","first-page":"1192","volume-title":"Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS)","author":"Hara Yuko","year":"2008","unstructured":"Yuko Hara, Hiroyuki Tomiyama, Shinya Honda, Hiroaki Takada, and Katsuya Ishii. 2008. Chstone: A benchmark program suite for practical c-based high-level synthesis. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 1192\u20131195."},{"key":"e_1_3_2_8_2","unstructured":"Free Software Foundation Inc. 2023. Auto-Vectorization in GCC. Free Software Foundation Inc. 
Retrieved June 6 2024 from https:\/\/gcc.gnu.org\/projects\/tree-ssa\/vectorization.html"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3174243.3174264"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2023.3279349"},{"key":"e_1_3_2_11_2","first-page":"161","volume-title":"Proceedings of the International Conference on Field Programmable Technology (ICFPT)","author":"Liu Qi","year":"2023","unstructured":"Qi Liu, Mo Sun, Jie Sun, Liqiang Lu, Jieru Zhao, and Zeke Wang. 2023. SSiMD: Supporting six signed multiplications in a DSP block for low-precision CNN on FPGAs. In Proceedings of the International Conference on Field Programmable Technology (ICFPT). IEEE, 161\u2013169."},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD57390.2023.10323831"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/tcad.2024.3507570"},{"key":"e_1_3_2_14_2","volume-title":"Efficient Deep Learning Inference: A Digital Hardware Perspective-Evaluating and Improving Performance and Efficiency of Artificial and Spiking Neural Networks Hardware Accelerators","author":"Ottati Fabrizio","year":"2024","unstructured":"Fabrizio Ottati. 2024. Efficient Deep Learning Inference: A Digital Hardware Perspective-Evaluating and Improving Performance and Efficiency of Artificial and Spiking Neural Networks Hardware Accelerators. Ph.D. Dissertation. Politecnico di Torino."},{"key":"e_1_3_2_15_2","unstructured":"Thomas B. Preusser and Thomas A. Branca. 2020. Vectorization of wide integer data paths for parallel operations with side-band logic monitoring the numeric overflow between vector lanes. US Patent 10 671 388."},{"key":"e_1_3_2_16_2","first-page":"1099","volume-title":"Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA)","author":"Sarkar Rishov","year":"2023","unstructured":"Rishov Sarkar, Stefan Abi-Karam, Yuqi He, Lakshmi Sathidevi, and Cong Hao. 2023. 
FlowGNN: A dataflow architecture for real-time workload-agnostic graph neural network inference. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 1099\u20131112."},{"key":"e_1_3_2_17_2","first-page":"160","volume-title":"Proceedings of the 32nd International Conference on Field-Programmable Logic and Applications (FPL)","author":"Sommer Jan","year":"2022","unstructured":"Jan Sommer, M. Akif \u00d6zkan, Oliver Keszocze, and J\u00fcrgen Teich. 2022. DSP-packing: Squeezing low-precision arithmetic into FPGA DSP blocks. In Proceedings of the 32nd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 160\u2013166."},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2017.2761740"},{"key":"e_1_3_2_19_2","unstructured":"LLVM team. 2012. LLVM 3.1 Release Notes. LLVM team. Retrieved May 5 2024 from https:\/\/releases.llvm.org\/3.1\/docs\/ReleaseNotes.html"},{"key":"e_1_3_2_20_2","unstructured":"LLVM team. 2024. Auto-Vectorization in LLVM. LLVM team. Retrieved June 6 2024 from https:\/\/llvm.org\/docs\/Vectorizers.html"},{"key":"e_1_3_2_21_2","doi-asserted-by":"crossref","first-page":"65","DOI":"10.1145\/3020078.3021744","volume-title":"Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA \u201917)","author":"Umuroglu Yaman","year":"2017","unstructured":"Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. 2017. FINN: A framework for fast, scalable binarized neural network inference. In Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA \u201917). ACM, New York, NY, 65\u201374."},{"key":"e_1_3_2_22_2","unstructured":"Xilinx 2020. Convolutional Neural Network with INT4 Optimization on Xilinx Devices. Xilinx. 
Retrieved from https:\/\/docs.amd.com\/v\/u\/en-US\/wp521-4bit-optimization"},{"key":"e_1_3_2_23_2","unstructured":"Xilinx 2021. UltraScale Architecture DSP Slice. Xilinx. Retrieved from https:\/\/docs.amd.com\/v\/u\/en-US\/ug579-ultrascale-dsp"},{"key":"e_1_3_2_24_2","unstructured":"Xilinx. 2024. HLS. Retrieved May 4 2024 from https:\/\/github.com\/Xilinx\/HLS"},{"key":"e_1_3_2_25_2","unstructured":"Xilinx. 2024. Vitis HLS Introductory Examples. Retrieved May 4 2024 from https:\/\/github.com\/Xilinx\/Vitis-HLS-Introductory-Examples"},{"key":"e_1_3_2_26_2","unstructured":"Xilinx. 2024. Vitis Libraries. Retrieved May 4 2024 from https:\/\/github.com\/Xilinx\/Vitis_Libraries"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASP-DAC58780.2024.10473872"},{"key":"e_1_3_2_28_2","first-page":"1355","volume-title":"Proceedings of the 59th ACM\/IEEE Design Automation Conference","author":"Ye Hanchen","year":"2022","unstructured":"Hanchen Ye, HyeGang Jun, Hyunmin Jeong, Stephen Neuendorffer, and Deming Chen. 2022. ScaleHLS: A scalable high-level synthesis framework with multi-level transformations and optimizations: Invited. In Proceedings of the 59th ACM\/IEEE Design Automation Conference. ACM, New York, NY, 1355\u20131358."},{"key":"e_1_3_2_29_2","first-page":"1","volume-title":"Proceedings of the 60th ACM\/IEEE Design Automation Conference (DAC)","author":"Zhang Jingwei","year":"2023","unstructured":"Jingwei Zhang, Meng Zhang, Xinye Cao, and Guoqing Li. 2023. Uint-Packing: Multiply your DNN accelerator performance via unsigned integer DSP packing. In Proceedings of the 60th ACM\/IEEE Design Automation Conference (DAC). ACM, New York, NY, 1\u20136."},{"key":"e_1_3_2_30_2","first-page":"1","volume-title":"Proceedings of the 41st IEEE\/ACM International Conference on Computer-Aided Design","author":"Zhang Yunxiang","year":"2022","unstructured":"Yunxiang Zhang, Biao Sun, Weixiong Jiang, Yajun Ha, Miao Hu, and Wenfeng Zhao. 2022. 
Wsq-addernet: Efficient weight standardization based quantized addernet fpga accelerator design with high-density int8 dsp-lut co-packing optimization. In Proceedings of the 41st IEEE\/ACM International Conference on Computer-Aided Design. ACM, New York, NY, 1\u20139."},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/2435264.2435271"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","unstructured":"Giovanni Brignone and Fabrizio Ottati. (2024). brigio345\/SILVIA: FPT \u201924 (FPT \u201924). Zenodo. DOI: 10.5281\/zenodo.14198854","DOI":"10.5281\/zenodo.14198854"}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3705324","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3705324","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,2]],"date-time":"2025-07-02T12:52:27Z","timestamp":1751460747000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3705324"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,21]]},"references-count":31,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2025,6,30]]}},"alternative-id":["10.1145\/3705324"],"URL":"https:\/\/doi.org\/10.1145\/3705324","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"value":"1936-7406","type":"print"},{"value":"1936-7414","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,3,21]]},"assertion":[{"value":"2024-06-27","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-11-13","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2025-03-21","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}