{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T16:35:18Z","timestamp":1773246918884,"version":"3.50.1"},"reference-count":54,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2025,5,31]],"date-time":"2025-05-31T00:00:00Z","timestamp":1748649600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Science Foundation of China","doi-asserted-by":"crossref","award":["T2325001"],"award-info":[{"award-number":["T2325001"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2025,6,30]]},"abstract":"<jats:p>Linear algebra computations can be greatly accelerated using spatial accelerators on FPGAs. As a standard building block of linear algebra applications, BLAS covers a wide range of compute patterns that vary vastly in data reuse, bottleneck resources, matrix storage layouts, and data types. However, existing implementations of BLAS routines on FPGAs are stuck in the dilemma of productivity and performance. They either require extensive human effort or fail to leverage the properties of routines for acceleration.<\/jats:p>\n          <jats:p>We introduce Lasa, a framework composed of a programming model and a compiler, designed to address the dilemma by abstracting (for productivity) and specializing (for performance) the architecture of a spatial accelerator. The programming model realizes systolic arrays using uniform recurrence equations and space-time transforms. 
Streaming tensors, an intuitive dataflow-style abstraction, is proposed to uniformly describe the movement, storage, and transpose of input and output data across the spatial components. According to streaming tensors, a customized memory hierarchy is automatically built on an FPGA by our compiler. The compiler further specializes the architecture with transparent optimizations on FPGAs. Using this framework, we develop a complete BLAS library, demonstrating performance in parity with expert-written HLS code for BLAS level 3 routines, 76%\u201394% machine peak for level 1 and 2 routines, and 1.6X\u201313X speedup by leveraging the matrix properties such as symmetry, triangularity, and bandness.<\/jats:p>","DOI":"10.1145\/3723046","type":"journal-article","created":{"date-parts":[[2025,3,11]],"date-time":"2025-03-11T16:50:21Z","timestamp":1741711821000},"page":"1-32","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Productively Generating a High-Performance Linear Algebra Library on FPGAs"],"prefix":"10.1145","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-2127-6011","authenticated-orcid":false,"given":"Xiaochen","family":"Hao","sequence":"first","affiliation":[{"name":"Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-1464-5271","authenticated-orcid":false,"given":"Mingzhe","family":"Zhang","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-6915-1489","authenticated-orcid":false,"given":"Ce","family":"Sun","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0951-1811","authenticated-orcid":false,"given":"Zhuofu","family":"Tao","sequence":"additional","affiliation":[{"name":"University of California, Los Angeles, Los Angeles, California, 
USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3275-7791","authenticated-orcid":false,"given":"Hongbo","family":"Rong","sequence":"additional","affiliation":[{"name":"Parallel Computing Lab, Intel Corporation, Santa Clara, California, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6638-6442","authenticated-orcid":false,"given":"Yu","family":"Zhang","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5266-3805","authenticated-orcid":false,"given":"Lei","family":"He","sequence":"additional","affiliation":[{"name":"University of California, Los Angeles, Los Angeles, California, USA and Eastern Institute of Technology, Ningbo, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5047-1407","authenticated-orcid":false,"given":"Eric","family":"Petit","sequence":"additional","affiliation":[{"name":"Intel Corporation, Santa Clara, California, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4281-1018","authenticated-orcid":false,"given":"Wenguang","family":"Chen","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9076-7998","authenticated-orcid":false,"given":"Yun","family":"Liang","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,5,31]]},"reference":[{"key":"e_1_3_1_2_2","first-page":"012037","volume-title":"Journal of Physics: Conference Series","volume":"180","author":"Agullo Emmanuel","year":"2009","unstructured":"Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. 2009. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. 
Journal of Physics: Conference Series 180 (2009), 012037."},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/3656401"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1023\/A:1008012332212"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/3530775"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3240765.3240838"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/42288.42291"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/DAC18074.2021.9586216"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2001.924973"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3626202.3637566"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM57271.2023.00013"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2021.3123465"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3519939.3523446"},{"key":"e_1_3_1_14_2","unstructured":"Intel. 2024. DevCloud. Retrieved from https:\/\/software.intel.com\/devcloud"},{"key":"e_1_3_1_15_2","unstructured":"Intel. 2024. FPGA SDK for OpenCL Pro Edition: Programming Guide. Retrieved from https:\/\/www.intel.com\/content\/www\/us\/en\/docs\/programmable\/683846\/19-4\/introduction.html"},{"key":"e_1_3_1_16_2","unstructured":"Intel. 2024. Intra-Kernel Registered Assignment Built-In Function. Retrieved from https:\/\/www.intel.com\/content\/www\/us\/en\/docs\/programmable\/683846\/21-4\/intra-kernel-registered-assignment-built.html"},{"key":"e_1_3_1_17_2","unstructured":"Intel. 2024. Matrix Multiplication Design Example. Retrieved from https:\/\/www.intel.com\/content\/www\/us\/en\/support\/programmable\/support-resources\/design-examples\/horizontal\/matrix-multiplication.html"},{"key":"e_1_3_1_18_2","unstructured":"Intel. 2024. oneAPI Math Kernel Library. 
Retrieved from https:\/\/www.intel.com\/content\/www\/us\/en\/developer\/tools\/oneapi\/onemkl.html"},{"key":"e_1_3_1_19_2","unstructured":"Intel. 2024. Using a Single Kernel to Describe Systolic Arrays. Retrieved from https:\/\/www.intel.com\/content\/www\/us\/en\/docs\/programmable\/683521\/21-4\/using-a-single-kernel-to-describe-systolic.html"},{"key":"e_1_3_1_20_2","unstructured":"Intel. 2024. XML Elements Attributes and Parameters in the Board_spec.xml File\u2014global_mem. Retrieved from https:\/\/www.intel.com\/content\/www\/us\/en\/docs\/programmable\/683085\/20-3\/global-mem.html"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/DAC18074.2021.9586329"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3489517.3530411"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/321406.321418"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/MC.1982.1653825"},{"key":"e_1_3_1_25_2","first-page":"256","volume-title":"Sparse Matrix Proceedings 1978","volume":"1","author":"Kung Hsiang Tsung","year":"1979","unstructured":"Hsiang Tsung Kung and Charles E. Leiserson. 1979. Systolic arrays (for VLSI). In Sparse Matrix Proceedings 1978, Vol. 1. Society for Industrial and Applied Mathematics, Philadelphia, PA, 256\u2013282"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3289602.3293910"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/3400302.3415644"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3469660"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00062"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00063"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3174243.3174258"},{"key":"e_1_3_1_32_2","unstructured":"Netlib. 2024. BLAS. 
Retrieved from https:\/\/netlib.org\/blas\/"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/3385412.3385974"},{"key":"e_1_3_1_34_2","unstructured":"NVIDIA. 2024. Basic Linear Algebra on NVIDIA GPUs. Retrieved from https:\/\/developer.nvidia.com\/cublas"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/800015.808184"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/2491956.2462176"},{"key":"e_1_3_1_37_2","unstructured":"Hongbo Rong. 2017. Programmatic control of a compiler for generating high-performance spatial hardware. arXiv:1711.07606. Retrieved from http:\/\/arxiv.org\/abs\/1711.07606"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3494534"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2019.00033"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/1278349.1278353"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/3431920.3439292"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3061639.3062207"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3490422.3502369"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/3490422.3502351"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/DAC18072.2020.9218748"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00086"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/3626202.3637561"},{"key":"e_1_3_1_48_2","unstructured":"Xilinx. 2024. Vitis BLAS Library. Retrieved from https:\/\/github.com\/Xilinx\/Vitis_Libraries\/tree\/master\/blas"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO61859.2024.00062"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1145\/3508352.3549370"},{"key":"e_1_3_1_51_2","unstructured":"Jingling Xue. 1992. Formal Synthesis of Control Signals for Systolic Arrays. Ph.D. Dissertation. 
University of Edinburgh, Edinburgh, UK. Retrieved from https:\/\/hdl.handle.net\/1842\/11628"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA53966.2022.00060"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1145\/3470496.3527440"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/DAC56929.2023.10247981"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2008.55"}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3723046","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3723046","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:56:42Z","timestamp":1750298202000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3723046"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,31]]},"references-count":54,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2025,6,30]]}},"alternative-id":["10.1145\/3723046"],"URL":"https:\/\/doi.org\/10.1145\/3723046","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"value":"1936-7406","type":"print"},{"value":"1936-7414","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,5,31]]},"assertion":[{"value":"2024-06-11","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-02-16","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-05-31","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}