{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:21:59Z","timestamp":1750220519414,"version":"3.41.0"},"reference-count":29,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2021,6,30]],"date-time":"2021-06-30T00:00:00Z","timestamp":1625011200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Science Foundation","award":["1205721"],"award-info":[{"award-number":["1205721"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2021,6,30]]},"abstract":"<jats:p>\n            Overlay architectures are a good way to enable fast development and debug on FPGAs at the expense of potentially limited performance compared to fully customized FPGA designs. When used in concert with hand-tuned FPGA solutions, performant overlay architectures can improve time-to-solution and thus overall productivity of FPGA solutions. This work tunes and specializes FGPU, an open source OpenCL-programmable GPU overlay for FPGAs. We demonstrate that our\n            <jats:bold>persistent deep learning<\/jats:bold>\n            <jats:bold>(PDL<\/jats:bold>\n            )-FGPU architecture maintains the ease-of-programming and generality of GPU programming while achieving high performance from specialization for the persistent deep learning domain. We also propose an easy method to specialize for other domains. PDL-FGPU includes new instructions, along with micro-architecture and compiler enhancements. We evaluate both the FGPU baseline and the proposed PDL-FGPU on a modern high-end Intel Stratix 10 2800 FPGA in simulation running persistent DL applications (RNN, GRU, LSTM), and non-DL applications to demonstrate generality. 
PDL-FGPU requires 1.4\u20133\u00d7 more ALMs, 4.4\u20136.4\u00d7 more M20ks, and 1\u20139.5\u00d7 more DSPs than baseline, but improves performance by 56\u2013693\u00d7 for PDL applications with an average 23.1% degradation on non-PDL applications. We integrated the PDL-FGPU overlay into Intel OPAE to measure real-world performance\/power and demonstrate that PDL-FGPU is only 4.0\u201310.4\u00d7 slower than the Nvidia V100.\n          <\/jats:p>","DOI":"10.1145\/3457886","type":"journal-article","created":{"date-parts":[[2021,7,15]],"date-time":"2021-07-15T16:57:06Z","timestamp":1626368226000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["Specializing FGPU for Persistent Deep Learning"],"prefix":"10.1145","volume":"14","author":[{"given":"Rui","family":"Ma","sequence":"first","affiliation":[{"name":"The University of Texas at Austin, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jia-Ching","family":"Hsu","sequence":"additional","affiliation":[{"name":"The University of Texas at Austin, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tian","family":"Tan","sequence":"additional","affiliation":[{"name":"The University of Texas at Austin, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Eriko","family":"Nurvitadhi","sequence":"additional","affiliation":[{"name":"Intel Corporation, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"David","family":"Sheffield","sequence":"additional","affiliation":[{"name":"Intel Corporation, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rob","family":"Pelt","sequence":"additional","affiliation":[{"name":"Intel Corporation, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Martin","family":"Langhammer","sequence":"additional","affiliation":[{"name":"Intel Corporation, 
USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jaewoong","family":"Sim","sequence":"additional","affiliation":[{"name":"Intel Corporation, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Aravind","family":"Dasu","sequence":"additional","affiliation":[{"name":"Intel Corporation, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Derek","family":"Chiou","sequence":"additional","affiliation":[{"name":"Microsoft and The University of Texas at Austin, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,7,15]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173548"},{"volume-title":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM\u201919)","author":"Nurvitadhi E.","key":"e_1_2_1_2_1","unstructured":"E. Nurvitadhi, D. Kwon, A. Jafari, A. Boutros, J. Sim, P. Tomson, H. Sumbul, G. Chen, P. Knag, R. Kumar, R. Krishnamurthy, S. Gribok, B. Pasca, M. Langhammer, D. Marr, and A. Dasu. 2019. Why compete when you can work together: FPGA-ASIC integration for persistent RNNs. In 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM\u201919). 199\u2013207."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2019.00059"},{"key":"e_1_2_1_4_1","volume-title":"Retrieved on","author":"Intel Corporation","year":"2020","unstructured":"Intel Corporation. 2020. Open Programmable Acceleration Engine.
Retrieved on Jun 20, 2019 from https:\/\/01.org\/opae."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/MWSCAS.2017.8053243"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.5555\/3045390.3045604"},{"key":"e_1_2_1_7_1","volume-title":"International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=HkxF5RgC-","author":"Zhu Feiwen","year":"2018","unstructured":"Feiwen Zhu, Jeff Pool, Michael Andersch, Jeremy Appleyard, and Fung Xie. 2018. Sparse persistent RNNs: Squeezing large recurrent networks on-chip. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=HkxF5RgC-"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3302424.3303949"},{"key":"e_1_2_1_9_1","unstructured":"Intel Corporation. 2018. Intel\u00ae 64 and IA-32 Architectures Software Developer\u2019s Manual."},{"key":"e_1_2_1_10_1","volume-title":"Retrieved on","author":"Kernel Sources PDL-FGPU","year":"2019","unstructured":"PDL-FGPU Kernel Sources. Retrieved on Jun 20, 2019 from https:\/\/github.com\/paleolithicman\/PDL-FGPU_kernels."},{"key":"e_1_2_1_11_1","unstructured":"MIPS Technologies. 2001.
MIPS32\u00ae Architecture For Programmers Volume II: The MIPS32\u00ae Instruction Set."},{"volume-title":"Retrieved on","year":"2020","key":"e_1_2_1_12_1","unstructured":"Baidu. 2020. DeepBench. Retrieved on Jun 20, 2019 from https:\/\/github.com\/baidu-research\/DeepBench."},{"volume-title":"Retrieved on","year":"2019","key":"e_1_2_1_13_1","unstructured":"2018. cuDNN Developer Guide. Retrieved on Jun 20, 2019 from https:\/\/docs.nvidia.com\/deeplearning\/sdk\/cudnn-developer-guide\/index.html."},{"key":"e_1_2_1_14_1","volume-title":"Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio.","author":"Cho Kyunghyun","year":"2014","unstructured":"Kyunghyun Cho, Bart Van Merri\u00ebnboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation.
arXiv:1406.1078."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/ReConFig.2016.7857151"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASPDAC.2017.7858394"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.5555\/3130379.3130707"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3020078.3021745"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2017.25"},{"key":"e_1_2_1_20_1","volume-title":"Giulio Gambardella, Norbert Wehn, and Michaela Blott.","author":"Rybalkin Vladimir","year":"2018","unstructured":"Vladimir Rybalkin, Alessandro Pappalardo, Muhammad Mohsin Ghaffar, Giulio Gambardella, Norbert Wehn, and Michaela Blott. 2018. FINN-L: Library extensions and design trade-off analysis for variable precision LSTM networks on FPGAs. CoRR abs\/1807.04093 (2018). arxiv:1807.04093. Retrieved on Jun 20, 2019 from http:\/\/arxiv.org\/abs\/1807.04093."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2018.022071131"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM48280.2020.00011"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11265-020-01549-8"},{"key":"e_1_2_1_24_1","unstructured":"Daniele Bagni, A. Di Fresco, J. Noguera, and F. M. Vallina. 2016. A Zynq accelerator for floating point matrix multiplication designed with Vivado HLS.
Application Note (2016) 39\u201341."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/2555692.2555698"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/1450095.1450107"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW.2010.5470679"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CoolChips.2015.7158663"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.5555\/3277355.3277446"}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3457886","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3457886","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3457886","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:24:38Z","timestamp":1750195478000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3457886"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,6,30]]},"references-count":29,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2021,6,30]]}},"alternative-id":["10.1145\/3457886"],"URL":"https:\/\/doi.org\/10.1145\/3457886","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"type":"print","value":"1936-7406"},{"type":"electronic","value":"1936-7414"}],"subject":[],"published":{"date-parts":[[2021,6,30]]},"assertion":[{"value":"2019-12-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-03-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2021-07-15","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}