{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:26:07Z","timestamp":1750220767106,"version":"3.41.0"},"reference-count":51,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2021,5,10]],"date-time":"2021-05-10T00:00:00Z","timestamp":1620604800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000006","name":"Office of Naval Research","doi-asserted-by":"publisher","award":["N00014-18-1-2740"],"award-info":[{"award-number":["N00014-18-1-2740"]}],"id":[{"id":"10.13039\/100000006","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2021,9,30]]},"abstract":"<jats:p>QR decomposition is one of the most useful factorization kernels in modern numerical linear algebra algorithms. In particular, the decomposition of tall-and-skinny matrices (TSMs) has major applications in areas including scientific computing, machine learning, image processing, wireless networks, and numerical methods. Traditionally, CPUs and GPUs have achieved better throughput on these applications by using large cache hierarchies and compute cores running at a high frequency, leading to high power consumption. With the advent of heterogeneous platforms, however, FPGAs are emerging as a promising viable alternative. In this work, we propose a high-throughput FPGA-based engine that has a very high computational efficiency (ratio of achieved to peak throughput) compared to similar QR solvers running on FPGAs. Although comparable QR solvers achieve an efficiency of 36%, our design exhibits an efficiency of 54%. For TSMs, our experimental results show that our design can outperform highly optimized QR solvers running on CPUs and GPUs. For TSMs with more than 50K rows, our design outperforms the Intel MKL solver running on an Intel quad-core processor by a factor of 1.5\u00d7. For TSMs containing 256 columns or less, our design outperforms the NVIDIA CUBLAS solver running on a K40 GPU by a factor of 3.0\u00d7. In addition to being fast, our design is energy efficient\u2014competing platforms execute up to 0.6 GFLOPS\/Joule, whereas our design executes more than 1.0 GFLOPS\/Joule.<\/jats:p>","DOI":"10.1145\/3447775","type":"journal-article","created":{"date-parts":[[2021,5,10]],"date-time":"2021-05-10T22:15:17Z","timestamp":1620684917000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Acceleration of Parallel-Blocked QR Decomposition of Tall-and-Skinny Matrices on FPGAs"],"prefix":"10.1145","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1041-0017","authenticated-orcid":false,"given":"Jose M. Rodriguez","family":"Borbon","sequence":"first","affiliation":[{"name":"University of California, Riverside"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6761-7967","authenticated-orcid":false,"given":"Junjie","family":"Huang","sequence":"additional","affiliation":[{"name":"University of California, Riverside"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3477-8043","authenticated-orcid":false,"given":"Bryan M.","family":"Wong","sequence":"additional","affiliation":[{"name":"University of California, Riverside"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Walid","family":"Najjar","sequence":"additional","affiliation":[{"name":"University of California, Riverside"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,5,10]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.90"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.15"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/MWSCAS.2012.6292059"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/BigData.2013.6691583"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1137\/0908009"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-12133-3_39"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/2576779.2576783"},{"key":"e_1_2_1_8_1","volume-title":"Robust principal component analysis?Journal of the ACM 58, 3","author":"Cand\u00e8s Emmanuel J","year":"2011","unstructured":"Emmanuel J Cand\u00e8s , Xiaodong Li , Yi Ma , and John Wright . 2011. Robust principal component analysis?Journal of the ACM 58, 3 ( 2011 ), 11. Emmanuel J Cand\u00e8s, Xiaodong Li, Yi Ma, and John Wright. 2011. Robust principal component analysis?Journal of the ACM 58, 3 (2011), 11."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICESC.2014.20"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/SASP.2008.4570793"},{"key":"e_1_2_1_11_1","volume-title":"Proceedings of the 26th Annual IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM \u201918)","author":"Cong Jason","year":"2018","unstructured":"Jason Cong , Zhenman Fang , Michael Lo , Hanrui Wang , Jingxian Xu , and Shaochong Zhang . 2018 . Understanding performance differences of FPGAs and GPUs . In Proceedings of the 26th Annual IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM \u201918) . IEEE, Los Alamitos, CA, 93--96. Jason Cong, Zhenman Fang, Michael Lo, Hanrui Wang, Jingxian Xu, and Shaochong Zhang. 2018. Understanding performance differences of FPGAs and GPUs. In Proceedings of the 26th Annual IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM \u201918). IEEE, Los Alamitos, CA, 93--96."},{"volume-title":"Proceedings of the 2nd ACM International Workshop on MapReduce and Its Applications. ACM","author":"Paul","key":"e_1_2_1_12_1","unstructured":"Paul G. Constantine and David F. Gleich. 2011. Tall and skinny QR factorizations in MapReduce architectures . In Proceedings of the 2nd ACM International Workshop on MapReduce and Its Applications. ACM , New York, NY, 43--50. Paul G. Constantine and David F. Gleich. 2011. Tall and skinny QR factorizations in MapReduce architectures. In Proceedings of the 2nd ACM International Workshop on MapReduce and Its Applications. ACM, New York, NY, 43--50."},{"volume-title":"Proceedings of the 3rd ACM Workshop on General-Purpose Computation on Graphics Processing Units. ACM","author":"Danalis Anthony","key":"e_1_2_1_13_1","unstructured":"Anthony Danalis , Gabriel Marin , Collin McCurdy , Jeremy S. Meredith , Philip C. Roth , Kyle Spafford , Vinod Tipparaju , and Jeffrey S. Vetter . 2010. The Scalable Heterogeneous Computing (SHOC) benchmark suite . In Proceedings of the 3rd ACM Workshop on General-Purpose Computation on Graphics Processing Units. ACM , New York, NY, 63--74. Anthony Danalis, Gabriel Marin, Collin McCurdy, Jeremy S. Meredith, Philip C. Roth, Kyle Spafford, Vinod Tipparaju, and Jeffrey S. Vetter. 2010. The Scalable Heterogeneous Computing (SHOC) benchmark suite. In Proceedings of the 3rd ACM Workshop on General-Purpose Computation on Graphics Processing Units. ACM, New York, NY, 63--74."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1137\/080731992"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/MCISE.2000.814652"},{"key":"e_1_2_1_16_1","volume-title":"Retrieved","author":"Dua Dheeru","year":"2017","unstructured":"Dheeru Dua and Casey Graff . 2017 . UCI Machine Learning Repository . Retrieved February 23, 2021 from http:\/\/archive.ics.uci.edu\/ml Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. Retrieved February 23, 2021 from http:\/\/archive.ics.uci.edu\/ml"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSD.2009.141"},{"key":"e_1_2_1_19_1","first-page":"471","article-title":"The semantics of a simple language for parallel programming","volume":"74","author":"Gilles Kahn","year":"1974","unstructured":"Kahn Gilles . 1974 . The semantics of a simple language for parallel programming . Information Processing 74 (1974), 471 -- 475 . Kahn Gilles. 1974. The semantics of a simple language for parallel programming. Information Processing 74 (1974), 471--475.","journal-title":"Information Processing"},{"volume-title":"Matrix Computations","author":"Golub Gene H.","key":"e_1_2_1_20_1","unstructured":"Gene H. Golub and C. F. Van Loan . 2013. Matrix Computations . Johns Hopkins University Press , Baltimore, MD . Gene H. Golub and C. F. Van Loan. 2013. Matrix Computations. Johns Hopkins University Press, Baltimore, MD."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/968280.968304"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2010.5470443"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC.2014.6757323"},{"key":"e_1_2_1_25_1","first-page":"339","article-title":"Unitary triangularization of a nonsymmetric matrix","volume":"5","author":"Householder Alston S.","year":"1958","unstructured":"Alston S. Householder . 1958 . Unitary triangularization of a nonsymmetric matrix . Journal of the 5 , 4 (1958), 339 -- 342 . Alston S. Householder. 1958. Unitary triangularization of a nonsymmetric matrix. Journal of the 5, 4 (1958), 339--342.","journal-title":"Journal of the"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/1513895.1513904"},{"volume-title":"Programming Massively Parallel Processors: A Hands-On Approach. Morgan Kaufmann","author":"Kirk David B.","key":"e_1_2_1_27_1","unstructured":"David B. Kirk and W. Hwu Wen-Mei . 2016. Programming Massively Parallel Processors: A Hands-On Approach. Morgan Kaufmann , Waltham, MA . David B. Kirk and W. Hwu Wen-Mei. 2016. Programming Massively Parallel Processors: A Hands-On Approach. Morgan Kaufmann, Waltham, MA."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3174243.3174273"},{"volume-title":"Convey PDK Reference Manual. Retrieved","year":"2018","key":"e_1_2_1_29_1","unstructured":"Micron. 2018. Convey PDK Reference Manual. Retrieved September 1, 2018 from http:\/\/www.conveysupport.com\/ Micron. 2018. Convey PDK Reference Manual. Retrieved September 1, 2018 from http:\/\/www.conveysupport.com\/"},{"volume-title":"Retrieved","year":"2020","key":"e_1_2_1_30_1","unstructured":"Micron. 2020 . Micron SB-852 Wolverine II . Retrieved January 15, 2020 from https:\/\/www.micron.com\/products\/advanced-solutions\/advanced-computing-solutions\/hpc-single-board-accelerators\/sb-852 Micron. 2020. Micron SB-852 Wolverine II. Retrieved January 15, 2020 from https:\/\/www.micron.com\/products\/advanced-solutions\/advanced-computing-solutions\/hpc-single-board-accelerators\/sb-852"},{"key":"e_1_2_1_31_1","article-title":"High-throughput FPGA implementation of QR decomposition","author":"Mu\u00f1oz Sergio D.","year":"2015","unstructured":"Sergio D. Mu\u00f1oz and Javier Hormigo . 2015 . High-throughput FPGA implementation of QR decomposition . IEEE Transactions on Circuits and Systems II 62, 9 (2015), 861--865. Sergio D. Mu\u00f1oz and Javier Hormigo. 2015. High-throughput FPGA implementation of QR decomposition. IEEE Transactions on Circuits and Systems II 62, 9 (2015), 861--865.","journal-title":"IEEE Transactions on Circuits and Systems"},{"key":"e_1_2_1_32_1","volume-title":"CUBLAS NVIDIA\u2019s Dense Linear Algebra on GPUs. Retrieved","author":"NVIDIA.","year":"2018","unstructured":"NVIDIA. 2018. CUBLAS NVIDIA\u2019s Dense Linear Algebra on GPUs. Retrieved June 1, 2018 from https:\/\/developer.nvidia.com\/cublas NVIDIA. 2018. CUBLAS NVIDIA\u2019s Dense Linear Algebra on GPUs. Retrieved June 1, 2018 from https:\/\/developer.nvidia.com\/cublas"},{"key":"e_1_2_1_33_1","volume-title":"CUDA Toolkit Documentation. Retrieved","author":"NVIDIA.","year":"2018","unstructured":"NVIDIA. 2018. CUDA Toolkit Documentation. Retrieved September 4, 2018 from https:\/\/docs.nvidia.com\/cuda\/ NVIDIA. 2018. CUDA Toolkit Documentation. Retrieved September 4, 2018 from https:\/\/docs.nvidia.com\/cuda\/"},{"key":"e_1_2_1_34_1","volume-title":"NVML API Reference. Retrieved","author":"NVIDIA.","year":"2019","unstructured":"NVIDIA. 2019. NVML API Reference. Retrieved September 4, 2019 from https:\/\/docs.nvidia.com\/deploy\/nvml-api NVIDIA. 2019. NVML API Reference. Retrieved September 4, 2019 from https:\/\/docs.nvidia.com\/deploy\/nvml-api"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2014.6844463"},{"volume-title":"Proceedings of the IEEE International Conference on Advanced Science and Engineering (ICOASE \u201918)","author":"Safaa","key":"e_1_2_1_36_1","unstructured":"Safaa S. Omran and Ahmed K. Abdul-Abbas. 2018. Fast QR decomposition based on FPGA . In Proceedings of the IEEE International Conference on Advanced Science and Engineering (ICOASE \u201918) . IEEE, Los Alamitos, CA, 189--193. Safaa S. Omran and Ahmed K. Abdul-Abbas. 2018. Fast QR decomposition based on FPGA. In Proceedings of the IEEE International Conference on Advanced Science and Engineering (ICOASE \u201918). IEEE, Los Alamitos, CA, 189--193."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/NAECON.2016.7856841"},{"volume-title":"Proceedings of the 22nd IEEE International Conference on Field Programmable Logic and Applications (FPL \u201912)","author":"Rafique Abid","key":"e_1_2_1_38_1","unstructured":"Abid Rafique , Nachiket Kapre , and George A. Constantinides . 2012. Enhancing performance of Tall-Skinny QR factorization using FPGAs . In Proceedings of the 22nd IEEE International Conference on Field Programmable Logic and Applications (FPL \u201912) . IEEE, Los Alamitos, CA, 443--450. Abid Rafique, Nachiket Kapre, and George A. Constantinides. 2012. Enhancing performance of Tall-Skinny QR factorization using FPGAs. In Proceedings of the 22nd IEEE International Conference on Field Programmable Logic and Applications (FPL \u201912). IEEE, Los Alamitos, CA, 443--450."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1021\/acs.jctc.9b01284"},{"key":"e_1_2_1_40_1","series-title":"Lecture Notes in Computer Science","volume-title":"Parallel Processing and Applied Mathematics","author":"Sergyienko Anatoli","unstructured":"Anatoli Sergyienko and Oleg Maslennikov . 2001. Implementation of Givens QR-decomposition in FPGA . In Parallel Processing and Applied Mathematics . Lecture Notes in Computer Science , Vol. 2328 . Springer , 458--465. Anatoli Sergyienko and Oleg Maslennikov. 2001. Implementation of Givens QR-decomposition in FPGA. In Parallel Processing and Applied Mathematics. Lecture Notes in Computer Science, Vol. 2328. Springer, 458--465."},{"volume-title":"Proceedings of the 23rd Annual IEEE International Symposium on Field-Programmable Custom Computing Machines. IEEE","author":"Shannon Lesley","key":"e_1_2_1_41_1","unstructured":"Lesley Shannon , Veronica Cojocaru , Cong Nguyen Dao , and Philip H. W. Leong . 2015. Technology scaling in FPGAs: Trends in applications and architectures . In Proceedings of the 23rd Annual IEEE International Symposium on Field-Programmable Custom Computing Machines. IEEE , Los Alamitos, CA, 1--8. Lesley Shannon, Veronica Cojocaru, Cong Nguyen Dao, and Philip H. W. Leong. 2015. Technology scaling in FPGAs: Trends in applications and architectures. In Proceedings of the 23rd Annual IEEE International Symposium on Field-Programmable Custom Computing Machines. IEEE, Los Alamitos, CA, 1--8."},{"key":"e_1_2_1_42_1","first-page":"2008","article-title":"Where\u2019s the Beef? Why FPGAs Are so Fast","author":"Sirowy Scott","year":"2008","unstructured":"Scott Sirowy and Alessandro Forin . 2008 . Where\u2019s the Beef? Why FPGAs Are so Fast . MSR-TR 2008 - 2130 . Microsoft Research, Redmond, WA. Scott Sirowy and Alessandro Forin. 2008. Where\u2019s the Beef? Why FPGAs Are so Fast. MSR-TR 2008-130. Microsoft Research, Redmond, WA.","journal-title":"MSR-TR"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/5992.814658"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2011.91"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2009.12.005"},{"key":"e_1_2_1_46_1","volume-title":"Trefethen and David Bau III","author":"Lloyd","year":"1997","unstructured":"Lloyd N. Trefethen and David Bau III . 1997 . Numerical Linear Algebra. SIAM , Philadelphia, PA. Lloyd N. Trefethen and David Bau III. 1997. Numerical Linear Algebra. SIAM, Philadelphia, PA."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPPW.2010.38"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-06486-4_7"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/2503210.2503219"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISVLSI.2014.92"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/1596532.1596535"},{"volume-title":"Fundamentals of Matrix Computations","author":"Watkins David S.","key":"e_1_2_1_52_1","unstructured":"David S. Watkins . 2004. Fundamentals of Matrix Computations . John Wiley & Sons , Danvers, MA . David S. Watkins. 2004. Fundamentals of Matrix Computations. John Wiley & Sons, Danvers, MA."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2007.1068"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3447775","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3447775","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3447775","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:41:10Z","timestamp":1750200070000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3447775"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,5,10]]},"references-count":51,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2021,9,30]]}},"alternative-id":["10.1145\/3447775"],"URL":"https:\/\/doi.org\/10.1145\/3447775","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2021,5,10]]},"assertion":[{"value":"2020-02-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-01-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-05-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}