{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T19:12:16Z","timestamp":1767985936892,"version":"3.49.0"},"reference-count":10,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2017,1,11]],"date-time":"2017-01-11T00:00:00Z","timestamp":1484092800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["SIGARCH Comput. Archit. News"],"published-print":{"date-parts":[[2017,1,11]]},"abstract":"<jats:p>In the last decade floating-point matrix multiplication on FPGAs has been studied extensively and efficient architectures as well as detailed performance models have been developed. By design these IP cores take a fixed footprint, which does not necessarily optimize the use of all available resources. Moreover, the low-level architectures are not easily amenable to a parameterized synthesis. In this paper high-level synthesis is used to fine-tune the configuration parameters in order to achieve the highest performance with maximal resource utilization. An exploration strategy is presented to optimize the use of critical resources (DSPs, memory) for any given FPGA. To account for the limited memory size on the FPGA, a block-oriented matrix multiplication is organized such that the block summation is done on the CPU while the block multiplication occurs on the logic fabric simultaneously. The communication overhead between the CPU and the FPGA is minimized by streaming the blocks in a Gray code ordering scheme which maximizes the data reuse for consecutive block matrix product calculations. 
Using high-level synthesis optimization, the programmable logic operates at 93% of the theoretical peak performance and the combined CPU-FPGA design achieves 76% of the available hardware processing speed for the floating-point multiplication of 2K by 2K matrices.<\/jats:p>","DOI":"10.1145\/3039902.3039916","type":"journal-article","created":{"date-parts":[[2017,1,17]],"date-time":"2017-01-17T13:42:08Z","timestamp":1484660528000},"page":"74-79","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["High-Level Synthesis Optimization for Blocked Floating-Point Matrix Multiplication"],"prefix":"10.1145","volume":"44","author":[{"given":"Erik H.","family":"D'Hollander","sequence":"first","affiliation":[{"name":"Ghent University, Ghent, Belgium"}]}],"member":"320","published-online":{"date-parts":[[2017,1,11]]},"reference":[{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/MAHC.2002.1114865"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1049\/iet-cdt.2011.0132"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-010-0131-8"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPACS.2004.1439140"},{"key":"e_1_2_1_6_1","unstructured":"Xilinx 2014. LogiCORE IP Floating-Point Operator v7.0 Product Guide PG060."},{"key":"e_1_2_1_7_1","doi-asserted-by":"crossref","unstructured":"Xilinx 2015. Vivado Design Suite User Guide: High-Level Synthesis (UG902 2015.1).","DOI":"10.1155\/2015\/581961"},{"key":"e_1_2_1_8_1","unstructured":"Xilinx 2015. Vivado Design Suite User Guide: Synthesis (UG901)."},{"key":"e_1_2_1_9_1","unstructured":"Xilinx 2014. 
Zynq-7000 Technical Reference Manual UG585 (v1.7)."},{"key":"e_1_2_1_10_1","unstructured":"ZedBoard 2013. ZedBoard Hardware User's Guide."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2007.1001"}],"container-title":["ACM SIGARCH Computer Architecture News"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3039902.3039916","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3039902.3039916","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T03:36:31Z","timestamp":1750217791000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3039902.3039916"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,1,11]]},"references-count":10,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2017,1,11]]}},"alternative-id":["10.1145\/3039902.3039916"],"URL":"https:\/\/doi.org\/10.1145\/3039902.3039916","relation":{},"ISSN":["0163-5964"],"issn-type":[{"value":"0163-5964","type":"print"}],"subject":[],"published":{"date-parts":[[2017,1,11]]},"assertion":[{"value":"2017-01-11","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}