{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,29]],"date-time":"2025-09-29T11:57:50Z","timestamp":1759147070450,"version":"3.41.0"},"reference-count":38,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2013,9,1]],"date-time":"2013-09-01T00:00:00Z","timestamp":1377993600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Embed. Comput. Syst."],"published-print":{"date-parts":[[2013,9]]},"abstract":"<jats:p>The rise of multicore architectures across all computing domains has opened the door to heterogeneous multiprocessors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs, in particular, are becoming very popular for speeding up compute-intensive kernels of scientific, imaging, and simulation applications. New programming models that facilitate parallel processing on heterogeneous systems containing GPUs are spreading rapidly in the computing community. By leveraging these investments, the developers of other accelerators have an opportunity to significantly reduce the programming effort by supporting those accelerator models already gaining popularity. In this work, we adapt one such language, the CUDA programming model, into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool (available from Xilinx) which enables high-abstraction FPGA programming. 
FCUDA is based on a source-to-source compilation that transforms the SIMT (Single Instruction, Multiple Thread) CUDA code into task-level parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multicore accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.<\/jats:p>","DOI":"10.1145\/2514641.2514652","type":"journal-article","created":{"date-parts":[[2013,10,1]],"date-time":"2013-10-01T18:14:28Z","timestamp":1380651268000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":19,"title":["Efficient compilation of CUDA kernels for high-performance computing on FPGAs"],"prefix":"10.1145","volume":"13","author":[{"given":"Alexandros","family":"Papakonstantinou","sequence":"first","affiliation":[{"name":"University of Illinois at Urbana-Champaign, IL"}]},{"given":"Karthik","family":"Gururaj","sequence":"additional","affiliation":[{"name":"University of California, Los Angeles, CA"}]},{"given":"John A.","family":"Stratton","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign, IL"}]},{"given":"Deming","family":"Chen","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign, IL"}]},{"given":"Jason","family":"Cong","sequence":"additional","affiliation":[{"name":"University of California, Los Angeles, CA"}]},{"given":"Wen-Mei W.","family":"Hwu","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign, IL"}]}],"member":"320","published-online":{"date-parts":[[2013,9,30]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Aho A. V. Lam M. S. Sethi R. and Ullman J. D. 2006. Compilers Principles Techniques and Tools 2nd ed. Addison-Wesley.
"},{"key":"e_1_2_1_2_1","unstructured":"Allen R. and Kennedy K. 2002. Optimizing Compilers for Modern Architectures. Morgan Kaufmann Academic Press."},{"key":"e_1_2_1_3_1","unstructured":"AMD. 2012. Accelerated processing units. http:\/\/www.amd.com\/us\/products\/technologies\/fusion\/Pages\/fusion.aspx."},{"key":"e_1_2_1_4_1","unstructured":"BDTI. 2010. An independent evaluation of: The autoesl autopilot high-level synthesis tool. http:\/\/www.bdti.com\/MyBDTI\/pubs\/AutoPilot.pdf."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/SASP.2008.4570793"},{"volume-title":"Proceedings of the TechCon Conference.","author":"Chen D.","key":"e_1_2_1_6_1","unstructured":"Chen, D., Cong, J., Fan, Y., Han, G., Jiang, W., and Zhang Z. 2005. XPilot: A platform-based behavioral synthesis system. In Proceedings of the TechCon Conference."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1508128.1508144"},{"key":"e_1_2_1_8_1","unstructured":"CHREC. 2012. NSF center for high performance reconfigurable computing. 
http:\/\/www.chrec.org\/facilities.html."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2011.2110592"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1344671.1344683"},{"key":"e_1_2_1_11_1","unstructured":"Convey Computer. 2011. http:\/\/www.conveycomputer.com."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.micpro.2004.06.007"},{"key":"e_1_2_1_13_1","volume-title":"NISC: The ultimate reconfigurable component. Tech. rep. 03-28","author":"Gajski D.","year":"2003","unstructured":"Gajski, D. 2003. NISC: The ultimate reconfigurable component. Tech. rep. 03-28. Center for Embedded Computer Systems, UCI. http:\/\/www.cecs.uci.edu\/technical_report\/TR03-28.pdf."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1027084.1027087"},{"volume-title":"Proceedings of the 27th International Conference on Computer Design. IEEE, 412--418","author":"He C.","key":"e_1_2_1_15_1","unstructured":"He, C., Papakonstantinou, A., and Chen, D. 2009. A novel soc architecture on fpga for ultra fast face detection. In Proceedings of the 27th International Conference on Computer Design. IEEE, 412--418."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-70592-5_5"},{"key":"e_1_2_1_17_1","unstructured":"IBM. 2006. The cell architecture. 
http:\/\/domino.research.ibm.com\/comm\/research.nsf\/pages\/r.arch.innovation.html."},{"key":"e_1_2_1_18_1","unstructured":"Impact. 2012. Parboil benchmarks. http:\/\/impact.crhc.illinois.edu\/parboil.aspx."},{"key":"e_1_2_1_19_1","unstructured":"Impulse. 2003. Impulse accelerated technologies inc. http:\/\/www.impulseaccelerated.com."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1450095.1450105"},{"key":"e_1_2_1_21_1","unstructured":"Khronos. 2011. OpenCL specification version 1.1. http:\/\/www.khronos.org\/registry\/cl\/specs\/opencl-1.1.pdf."},{"volume-title":"Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing. Springer, 539--553","author":"Lee S.","key":"e_1_2_1_22_1","unstructured":"Lee, S., Johnson, T. A., and Eigenmann, R. 2003. Cetus - An extensible compiler infrastructure for source-to-source transformation. In Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing. Springer, 539--553."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2010.93"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1508128.1508172"},{"key":"e_1_2_1_25_1","unstructured":"LLVM. 2007. The LLVM compiler infrastructure. http:\/\/www.llvm.org."},{"key":"e_1_2_1_26_1","unstructured":"Mentor Graphics. 2012. 
Catapult C synthesis overview. http:\/\/www.mentor.com\/esl\/catapult\/overview\/."},{"key":"e_1_2_1_27_1","unstructured":"Nallatech. 2012. DATA v5. http:\/\/www.nallatech.com\/Modules\/data-v5-xilinx-virtex-5-fpga-ddr2-sdramqdr-ii-sram-and-io-module.html."},{"key":"e_1_2_1_28_1","unstructured":"Nvidia. 2012a. CUDA developer zone. http:\/\/developer.nvidia.com\/category\/zone\/cuda-zone."},{"key":"e_1_2_1_29_1","unstructured":"Nvidia. 2012b. GeForce 8 series. http:\/\/www.nvidia.com\/page\/geforce8.html."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2011.19"},{"key":"e_1_2_1_31_1","volume-title":"Proceedings of the DesignCon Conference.","author":"Parker M.","year":"2011","unstructured":"Parker, M. 2011. Hardware-based floating-point design flow. In Proceedings of the DesignCon Conference."},{"volume-title":"Proceedings of the 10th LCI International Conference on High-Performance Clustered Computing.","author":"Showerman M.","key":"e_1_2_1_32_1","unstructured":"Showerman, M., Enos, J., Kidratenko, C., Steffer, C., Pennington, R., and Hwu, W. W. 2009. QP: A heterogeneous multi-accelerator cluster. 
In Proceedings of the 10th LCI International Conference on High-Performance Clustered Computing."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-89740-8_2"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/1508128.1508139"},{"key":"e_1_2_1_35_1","unstructured":"Tilera. 2012. Tilera corporation. http:\/\/www.tilera.com."},{"volume-title":"Proceedings of the 4th Annual Reconfigurable Systems Summer Institute.","author":"Williams J.","key":"e_1_2_1_36_1","unstructured":"Williams, J., Richardson, J., Gosrani, K., and Suresh, S. 2008. Computational density of fixed and reconfigurable multi-core devices for application acceleration. In Proceedings of the 4th Annual Reconfigurable Systems Summer Institute."},{"key":"e_1_2_1_37_1","unstructured":"Xilinx. 2012. Virtex-5 FXT ML510 embedded development platform. http:\/\/www.xilinx.com\/products\/boards-and-kits\/HW-V5-ML510-G.htm."},{"key":"e_1_2_1_38_1","doi-asserted-by":"crossref","unstructured":"Zhang Z. Fan Y. Jiang W. Han G. Yang C. and Cong J. 2008. AutoPilot: A platform-based ESL synthesis system. In High-Level Synthesis: From Algorithm to Digital Circuit P. Coussy and A. Morawiec Eds. 
Springer 99--112.","DOI":"10.1007\/978-1-4020-8588-8_6"}],"container-title":["ACM Transactions on Embedded Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2514641.2514652","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2514641.2514652","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T08:39:18Z","timestamp":1750235958000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2514641.2514652"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2013,9]]},"references-count":38,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2013,9]]}},"alternative-id":["10.1145\/2514641.2514652"],"URL":"https:\/\/doi.org\/10.1145\/2514641.2514652","relation":{},"ISSN":["1539-9087","1558-3465"],"issn-type":[{"type":"print","value":"1539-9087"},{"type":"electronic","value":"1558-3465"}],"subject":[],"published":{"date-parts":[[2013,9]]},"assertion":[{"value":"2011-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2012-08-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2013-09-30","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}