{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:32:52Z","timestamp":1750307572917,"version":"3.41.0"},"reference-count":41,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2010,1,1]],"date-time":"2010-01-01T00:00:00Z","timestamp":1262304000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100000266","name":"Engineering and Physical Sciences Research Council","doi-asserted-by":"publisher","award":["EP\/C549481\/1","EP\/E00024X\/1"],"award-info":[{"award-number":["EP\/C549481\/1","EP\/E00024X\/1"]}],"id":[{"id":"10.13039\/501100000266","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2010,1]]},"abstract":"<jats:p>\n            Recent developments in the capacity of modern Field Programmable Gate Arrays (FPGAs) have significantly expanded their applications. One such field is the acceleration of scientific computation and one type of calculation that is commonplace in scientific computation is the solution of systems of linear equations. A method that has proven in software to be very efficient and robust for finding such solutions is the Conjugate Gradient (CG) algorithm. In this article we present a widely parallel and deeply pipelined hardware CG implementation, targeted at modern FPGA architectures. This implementation is particularly suited for accelerating multiple small-to-medium-sized dense systems of linear equations and can be used as a stand-alone solver or as a building block to solve higher-order systems. 
In this article it is shown that through parallelization it is possible to reduce the computation time per iteration for an order\n            <jats:italic>n<\/jats:italic>\n            matrix from\n            <jats:italic>\u0398<\/jats:italic>\n            (\n            <jats:italic>n<\/jats:italic>\n            <jats:sup>2<\/jats:sup>\n            ) clock cycles on a microprocessor to\n            <jats:italic>\u0398<\/jats:italic>\n            (\n            <jats:italic>n<\/jats:italic>\n            ) on an FPGA. Through deep pipelining it is also possible to solve several problems in parallel and maximize both performance and efficiency. I\/O requirements are shown to be scalable and to converge to a constant value as the matrix order increases. Post place-and-route results on a readily available VirtexII-6000 demonstrate sustained performance of 5 GFlops, and results on a Virtex5-330 indicate sustained performance of 35 GFlops. A comparison with an optimized software implementation running on a high-end CPU demonstrates that this FPGA implementation represents a significant speedup of at least an order of magnitude.\n          <\/jats:p>","DOI":"10.1145\/1661438.1661439","type":"journal-article","created":{"date-parts":[[2010,1,26]],"date-time":"2010-01-26T14:01:38Z","timestamp":1264514498000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":26,"title":["A High Throughput FPGA-Based Floating Point Conjugate Gradient Implementation for Dense Matrices"],"prefix":"10.1145","volume":"3","author":[{"given":"Antonio","family":"Roldao","sequence":"first","affiliation":[{"name":"Imperial College London"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"George A.","family":"Constantinides","sequence":"additional","affiliation":[{"name":"Imperial College 
London"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2010,1]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"<scp>Atlas<\/scp>. 2008. Automatically Tuned Linear Algebra Software.  <scp>Atlas<\/scp>. 2008. Automatically Tuned Linear Algebra Software."},{"volume-title":"Proceedings of the International Conference on Field Programmable Technology. 49--56","author":"Bayliss S.","key":"e_1_2_1_2_1","unstructured":"<scp> Bayliss , S. , Bouganis , C. , and Constantinides , G . <\/scp> 2006. An FPGA implementation of the simplex algorithm . In Proceedings of the International Conference on Field Programmable Technology. 49--56 . <scp>Bayliss, S., Bouganis, C., and Constantinides, G.<\/scp> 2006. An FPGA implementation of the simplex algorithm. In Proceedings of the International Conference on Field Programmable Technology. 49--56."},{"volume-title":"PCI-Express - Creating a third generation I\/O interconnect","author":"Bhatt A.","key":"e_1_2_1_3_1","unstructured":"<scp> Bhatt , A. <\/scp> 2007. PCI-Express - Creating a third generation I\/O interconnect . In Intel Developer Network for PCI Express Architecture . 1--11. <scp>Bhatt, A.<\/scp> 2007. PCI-Express - Creating a third generation I\/O interconnect. In Intel Developer Network for PCI Express Architecture. 1--11."},{"key":"e_1_2_1_4_1","volume-title":"<\/scp>","author":"Biglieri E.","year":"2007","unstructured":"<scp> Biglieri , E. , Calderbank , R. , Constantinides , A. , Goldsmith , A. , and Paulraj , A . <\/scp> 2007 . MIMO Wireless Communications. Cambridge University Press , UK. <scp>Biglieri, E., Calderbank, R., Constantinides, A., Goldsmith, A., and Paulraj, A.<\/scp> 2007. MIMO Wireless Communications. Cambridge University Press, UK."},{"volume-title":"Proceedings of the Symposium on Industrial Embedded Systems. 148--155","author":"Bonato V.","key":"e_1_2_1_5_1","unstructured":"<scp> Bonato , V. , Peron , R. , Wolf , D. , Holanda , J. 
, Marques , E. , and Cardoso , J . <\/scp> 2007. An FPGA implementation for a Kalman filter with application to mobile robotics . In Proceedings of the Symposium on Industrial Embedded Systems. 148--155 . <scp>Bonato, V., Peron, R., Wolf, D., Holanda, J., Marques, E., and Cardoso, J.<\/scp> 2007. An FPGA implementation for a Kalman filter with application to mobile robotics. In Proceedings of the Symposium on Industrial Embedded Systems. 148--155."},{"volume-title":"Proceedings of the Conference on Field Programmable Logic and Applications. 29--35","author":"Callanan O.","key":"e_1_2_1_6_1","unstructured":"<scp> Callanan , O. , Gregg , D. , Nisbet , A. , and Peardon , M . <\/scp> 2006. High performance scientific computing using FPGAs with IEEE floating point and logarithmic arithmetic for lattice QCD . In Proceedings of the Conference on Field Programmable Logic and Applications. 29--35 . <scp>Callanan, O., Gregg, D., Nisbet, A., and Peardon, M.<\/scp> 2006. High performance scientific computing using FPGAs with IEEE floating point and logarithmic arithmetic for lattice QCD. In Proceedings of the Conference on Field Programmable Logic and Applications. 29--35."},{"key":"e_1_2_1_7_1","unstructured":"<scp>Clearspeed<\/scp>. 2006. CSX600 Product Brief. http:\/\/support.clearspeed.com\/documentation\/hardware\/csx600\/.  <scp>Clearspeed<\/scp>. 2006. CSX600 Product Brief. http:\/\/support.clearspeed.com\/documentation\/hardware\/csx600\/."},{"key":"e_1_2_1_8_1","unstructured":"<scp>CoreGen<\/scp>. 2006. Core Generator Floating Point v3. http:\/\/www.edaboard.com\/ftopic351915.html.  <scp>CoreGen<\/scp>. 2006. Core Generator Floating Point v3. http:\/\/www.edaboard.com\/ftopic351915.html."},{"key":"e_1_2_1_9_1","unstructured":"<scp>Cray<\/scp>. 2005. XD1 Datasheet. Cray Inc. Seattle WA.  <scp>Cray<\/scp>. 2005. XD1 Datasheet. Cray Inc. 
Seattle WA."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2008.50"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1046192.1046203"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1046192.1046204"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1058129.1058148"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2008.4536350"},{"volume-title":"Matrix Computations","author":"Golub G.","key":"e_1_2_1_15_1","unstructured":"<scp> Golub , G. and Van-Loan , F. <\/scp> 1996. Matrix Computations . The Johns Hopkins University Press , 53. <scp>Golub, G. and Van-Loan, F.<\/scp> 1996. Matrix Computations. The Johns Hopkins University Press, 53."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/1356052.1356053"},{"volume-title":"Proceedings of the Scalable High Performance Computing Conference. 76--83","author":"Grote M.","key":"e_1_2_1_17_1","unstructured":"<scp> Grote , M. and Simon , H . <\/scp> 1992. Parallel preconditioning and approximation inverses on the connection machine . In Proceedings of the Scalable High Performance Computing Conference. 76--83 . <scp>Grote, M. and Simon, H.<\/scp> 1992. Parallel preconditioning and approximation inverses on the connection machine. In Proceedings of the Scalable High Performance Computing Conference. 76--83."},{"key":"e_1_2_1_18_1","first-page":"411","article-title":"FPGA implementation of a Cholesky algorithm for a shared-memory multiprocessor architecture","volume":"19","author":"Haridas S.","year":"2004","unstructured":"<scp> Haridas , S. and Ziavras , S. <\/scp> 2004 . FPGA implementation of a Cholesky algorithm for a shared-memory multiprocessor architecture . J. Parall. Algor. Appl. 19 , 6, 411 -- 226 . <scp>Haridas, S. and Ziavras, S.<\/scp> 2004. FPGA implementation of a Cholesky algorithm for a shared-memory multiprocessor architecture. J. Parall. Algor. Appl. 
19, 6, 411--226.","journal-title":"J. Parall. Algor. Appl."},{"volume-title":"Proceedings of the International Conference on Control and Automation. 43--55","author":"He M.","key":"e_1_2_1_19_1","unstructured":"<scp> He , M. and Ling , K . <\/scp> 2005. Model predictive control on a chip . In Proceedings of the International Conference on Control and Automation. 43--55 . <scp>He, M. and Ling, K.<\/scp> 2005. Model predictive control on a chip. In Proceedings of the International Conference on Control and Automation. 43--55."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.6028\/jres.049.044"},{"key":"e_1_2_1_21_1","unstructured":"<scp>IEEE<\/scp>. 1985. 754 standard for binary floating-point arithmetic. http:\/\/grouper.ieee.org\/groups\/754.  <scp>IEEE<\/scp>. 1985. 754 standard for binary floating-point arithmetic. http:\/\/grouper.ieee.org\/groups\/754."},{"key":"e_1_2_1_22_1","volume-title":"<\/scp>","author":"Kelley C.","year":"1999","unstructured":"<scp> Kelley , C. and Sachs , E . <\/scp> 1999 . Truncated newton methods for optimization with inaccurate functions and gradients. SIAM J. Optimiz . 43--55. <scp>Kelley, C. and Sachs, E.<\/scp> 1999. Truncated newton methods for optimization with inaccurate functions and gradients. SIAM J. Optimiz. 43--55."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2007.70813"},{"key":"e_1_2_1_24_1","unstructured":"<scp>Langhammer M.<\/scp> 2004. RSSI - 2008 - Foundation of FPGA acceleration. http:\/\/www.rssi2008.org\/proceedings\/industry\/Altera.pdf.  <scp>Langhammer M.<\/scp> 2004. RSSI - 2008 - Foundation of FPGA acceleration. 
http:\/\/www.rssi2008.org\/proceedings\/industry\/Altera.pdf."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2008.4629963"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/11752578_63"},{"volume-title":"The Lanczos and Conjugate Gradient Algorithms From Theory to Finite Precision Computation","author":"Meurant G.","key":"e_1_2_1_27_1","unstructured":"<scp> Meurant , G. <\/scp> 2006. The Lanczos and Conjugate Gradient Algorithms From Theory to Finite Precision Computation . SIAM , 323--324. <scp>Meurant, G.<\/scp> 2006. The Lanczos and Conjugate Gradient Algorithms From Theory to Finite Precision Computation. SIAM, 323--324."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPAN.2005.18"},{"key":"e_1_2_1_29_1","unstructured":"<scp>Netlib<\/scp>. 2008. Basic linear algebra subprograms. http:\/\/www.netlib.org\/blas\/.  <scp>Netlib<\/scp>. 2008. Basic linear algebra subprograms. http:\/\/www.netlib.org\/blas\/."},{"volume-title":"Proceedings of the Conference on Field Programmable Logic. 323--328","author":"Pournara I.","key":"e_1_2_1_30_1","unstructured":"<scp> Pournara , I. , Bouganis , C. , and Constantinides , G . <\/scp> 2005. FPGA-Accelerated reconstruction of gene regulatory networks . In Proceedings of the Conference on Field Programmable Logic. 323--328 . <scp>Pournara, I., Bouganis, C., and Constantinides, G.<\/scp> 2005. FPGA-Accelerated reconstruction of gene regulatory networks. In Proceedings of the Conference on Field Programmable Logic. 323--328."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-78610-8_10"},{"key":"e_1_2_1_32_1","unstructured":"<scp>Sgi<\/scp>. 2006. RASC RC100 blade. http:\/\/www.sgi.com\/pdfs\/3920.pdf.  <scp>Sgi<\/scp>. 2006. RASC RC100 blade. http:\/\/www.sgi.com\/pdfs\/3920.pdf."},{"key":"e_1_2_1_33_1","unstructured":"<scp>Shewchuk J.<\/scp> 2003. An introduction to the conjugate gradient method without the agonizing pain. 
http:\/\/www.cs.cmu.edu\/~quake-papers\/painless-conjugate-gradient.pdf.  <scp>Shewchuk J.<\/scp> 2003. An introduction to the conjugate gradient method without the agonizing pain. http:\/\/www.cs.cmu.edu\/~quake-papers\/painless-conjugate-gradient.pdf."},{"key":"e_1_2_1_34_1","unstructured":"<scp>Spec<\/scp>. 2008. Floating point component of standard performance evaluation corporation CPU2000 benchmarks. http:\/\/www.spec.org\/cpu2000\/.  <scp>Spec<\/scp>. 2008. Floating point component of standard performance evaluation corporation CPU2000 benchmarks. http:\/\/www.spec.org\/cpu2000\/."},{"key":"e_1_2_1_35_1","unstructured":"<scp>Tomov S.<\/scp> 2008. GPUs for HPC - NVIDIA\u2019s compute unified device architecture. http:\/\/www.cs.utk.edu\/~dongarra\/WEBPAGES\/SPRING-2008\/Lect09_GPU.pdf.  <scp>Tomov S.<\/scp> 2008. GPUs for HPC - NVIDIA\u2019s compute unified device architecture. http:\/\/www.cs.utk.edu\/~dongarra\/WEBPAGES\/SPRING-2008\/Lect09_GPU.pdf."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/968280.968305"},{"key":"e_1_2_1_37_1","unstructured":"<scp>Virtex5<\/scp>. 2007. DS100 (v3.0) Virtex5 family overview - LX LXT and SXT platforms. http:\/\/www.silica.com\/fileadmin\/02_Products\/05_Product-News\/09_PLD\/XLX-XCSVSXT\/DS_XLX_XC5VSXT_rev3-0_Feb07.pdf.  <scp>Virtex5<\/scp>. 2007. DS100 (v3.0) Virtex5 family overview - LX LXT and SXT platforms. 
http:\/\/www.silica.com\/fileadmin\/02_Products\/05_Product-News\/09_PLD\/XLX-XCSVSXT\/DS_XLX_XC5VSXT_rev3-0_Feb07.pdf."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/1128022.1128027"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.5555\/3037621.3037632"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF00940784"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2005.31"}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1661438.1661439","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1661438.1661439","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T12:41:03Z","timestamp":1750250463000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1661438.1661439"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2010,1]]},"references-count":41,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2010,1]]}},"alternative-id":["10.1145\/1661438.1661439"],"URL":"https:\/\/doi.org\/10.1145\/1661438.1661439","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"type":"print","value":"1936-7406"},{"type":"electronic","value":"1936-7414"}],"subject":[],"published":{"date-parts":[[2010,1]]},"assertion":[{"value":"2008-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2008-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2010-01-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}