{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,5]],"date-time":"2026-03-05T15:41:52Z","timestamp":1772725312300,"version":"3.50.1"},"reference-count":30,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2018,1,24]],"date-time":"2018-01-24T00:00:00Z","timestamp":1516752000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2018,3,31]]},"abstract":"<jats:p>Using field-programmable gate arrays (FPGAs) as a substrate to deploy soft graphics processing units (GPUs) would enable offering the FPGA compute power in a very flexible GPU-like tool flow. Application-specific adaptations like selective hardening of floating-point operations and instruction set subsetting would mitigate the high area and power demands of soft GPUs. This work explores the capabilities and limitations of soft General Purpose Computing on GPUs (GPGPU) for both fixed- and floating-point arithmetic. For this purpose, we have developed FGPU: a configurable, scalable, and portable GPU architecture designed especially for FPGAs. FGPU is open-source and implemented entirely in RTL. It can be programmed in OpenCL and controlled through a Python API. This article introduces its hardware architecture as well as its tool flow. We evaluated the proposed GPGPU approach against multiple other solutions. In comparison to homogeneous Multi-Processor System-On-Chips (MPSoCs), we found that using a soft GPU is a Pareto-optimal solution regarding throughput per area and energy consumption. 
On average, FGPU has a 2.9\u00d7 better compute density and 11.2\u00d7 less energy consumption than a single MicroBlaze processor when computing in IEEE-754 floating-point format. An average speedup of about 4\u00d7 over the ARM Cortex-A9 supported with the NEON vector co-processor has been measured for fixed- or floating-point benchmarks. In addition, the biggest FGPU cores we could implement on a Xilinx Zynq-7000 System-On-Chip (SoC) can deliver similar performance to equivalent implementations with High-Level Synthesis (HLS).<\/jats:p>","DOI":"10.1145\/3173548","type":"journal-article","created":{"date-parts":[[2018,1,26]],"date-time":"2018-01-26T13:05:50Z","timestamp":1516971950000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":20,"title":["General-Purpose Computing with Soft GPUs on FPGAs"],"prefix":"10.1145","volume":"11","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0365-730X","authenticated-orcid":false,"given":"Muhammed Al","family":"Kadi","sequence":"first","affiliation":[{"name":"Ruhr University of Bochum, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Benedikt","family":"Janssen","sequence":"additional","affiliation":[{"name":"Ruhr University of Bochum, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jones","family":"Yudi","sequence":"additional","affiliation":[{"name":"Ruhr University of Bochum, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Michael","family":"Huebner","sequence":"additional","affiliation":[{"name":"Ruhr University of Bochum, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2018,1,24]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the International Conference on Field-Programmable Technology 
(FPT\u201912)","author":"A."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/2847263.2847273"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISVLSI.2017.32"},{"key":"e_1_2_1_4_1","unstructured":"Altera Corp. Dec. 2015. Stratix 10 Device Overview. Initial Release."},{"key":"e_1_2_1_5_1","unstructured":"AMD Inc. 2017. AMD Accelerated Parallel Processing SDK v3.0. Retrieved from http:\/\/developer.amd.com\/amd-accelerated-parallel-processing-app-sdk\/."},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the 2013 International Conference on Field-Programmable Technology (FPT\u201913)","author":"Andryc K."},{"key":"e_1_2_1_7_1","volume-title":"Proceedings of the 2016 Second Workshop on Overlay Architectures for FPGAs (OLAF\u201916)","author":"Andryc K."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2764908"},{"key":"e_1_2_1_9_1","volume-title":"Proceedings of the 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201915)","author":"Bush J."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2010.85"},{"key":"e_1_2_1_11_1","volume-title":"Theia: Ray Graphic Processing Unit. Retrieved from opencores.com\/project,theia_gpu.","author":"Valverde Diego","year":"2011"},{"key":"e_1_2_1_12_1","volume-title":"Proceedings of the 2016 International Conference on Field-Programmable Technology (FPT\u201916)","author":"Al Kadi M."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2912884"},{"key":"e_1_2_1_14_1","unstructured":"Khronos Group. 2012. OpenCL 1.2 Specification. https:\/\/www.khronos.org\/registry\/OpenCL\/specs\/opencl-1.2.pdf."},
{"key":"e_1_2_1_15_1","volume-title":"Proceedings of the 2010 IEEE International Symposium on Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW\u201910)","author":"Kingyens J."},{"key":"e_1_2_1_16_1","volume-title":"Proceedings of the International Symposium on Code Generation and Optimization (CGO\u201904)","author":"Lattner C."},{"key":"e_1_2_1_17_1","unstructured":"T. Miller. 2016. OpenShader: Open Architecture GPU Simulator and Implementation. Retrieved from sourceforge.net\/projects\/openshader."},{"key":"e_1_2_1_18_1","doi-asserted-by":"crossref","unstructured":"Muhammed Al Kadi. 2017. FGPU Demo using PYNQ on the Xilinx ZC706. Retrieved from https:\/\/github.com\/malkadi\/FGPU_IPython.","DOI":"10.1145\/2847263.2847273"},{"key":"e_1_2_1_19_1","doi-asserted-by":"crossref","unstructured":"Muhammed Al Kadi. 2017. The FGPU Project. Retrieved from https:\/\/github.com\/malkadi\/FGPU.","DOI":"10.1145\/2847263.2847273"},{"key":"e_1_2_1_20_1","volume-title":"Proceedings of the IEEE International Conference on Field-Programmable Technology (FPT\u201914)","author":"Rashid R."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.5555\/2555692.2555698"},{"key":"e_1_2_1_22_1","unstructured":"VectorBlox Computing Inc. 2017. The MXP Vector Matrix Processor Repository. Retrieved from https:\/\/github.com\/VectorBlox\/mxp."},{"key":"e_1_2_1_23_1","unstructured":"Xilinx Inc. 2015. 
AXI DMA LogiCORE IP Product Guide (PG021 v7.1). https:\/\/www.xilinx.com\/support\/documentation\/ipdocumentation\/axidma\/v71\/pg021axidma.pdf."},{"key":"e_1_2_1_24_1","unstructured":"Xilinx Inc. 2015. Floating-Point Operator v7.1 LogiCORE IP Product Guide (PG060). https:\/\/www.xilinx.com\/support\/documentation\/ipdocumentation\/floatingpoint\/v71\/pg060-floating-point.pdf."},{"key":"e_1_2_1_25_1","unstructured":"Xilinx Inc. 2016. 7 Series FPGAs Configurable Logic Block v1.8 (UG474). https:\/\/www.xilinx.com\/support\/documentation\/userguides\/ug4747SeriesCLB.pdf."},{"key":"e_1_2_1_26_1","unstructured":"Xilinx Inc. 2016. The PYNQ Project. http:\/\/www.pynq.io {Online; accessed 15-Jan-2017}."},{"key":"e_1_2_1_27_1","unstructured":"Xilinx Inc. 2016. UltraScale Architecture and Product Overview (v3.1) DS890. https:\/\/www.xilinx.com\/support\/documentation\/datasheets\/ds890-ultrascale-overview.pdf."},{"key":"e_1_2_1_28_1","unstructured":"Xilinx Inc. 2016. Zynq-7000 All Programmable SoC Technical Reference Manual (UG585 v1.12.1). https:\/\/www.xilinx.com\/support\/documentation\/userguides\/ug585-Zynq-7000-TRM.pdf."},
{"key":"e_1_2_1_29_1","unstructured":"Xilinx Inc. 2016. SDAccel Development Environment Methodology Guide Performance Optimization (UG1207 v2.0). https:\/\/www.xilinx.com\/support\/documentation\/swmanuals\/ug1207-sdaccel-performance-optimization.pdf. (August 2016). Ch. 7."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1629395.1629411"}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3173548","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3173548","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T03:02:46Z","timestamp":1750215766000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3173548"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,1,24]]},"references-count":30,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2018,3,31]]}},"alternative-id":["10.1145\/3173548"],"URL":"https:\/\/doi.org\/10.1145\/3173548","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"value":"1936-7406","type":"print"},{"value":"1936-7414","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,1,24]]},"assertion":[{"value":"2017-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2017-12-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-01-24","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}