{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,21]],"date-time":"2025-11-21T14:57:40Z","timestamp":1763737060220,"version":"3.45.0"},"reference-count":78,"publisher":"Association for Computing Machinery (ACM)","issue":"4","funder":[{"name":"National Science Foundation","award":["1955820"],"award-info":[{"award-number":["1955820"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2025,12,31]]},"abstract":"<jats:p>The matrix operations that underpin today\u2019s deep learning models are routinely implemented in Single Instruction Multiple Data (SIMD) domain-specific accelerators. SIMD accelerators including GPUs and array processors can effectively leverage parallelism in models that are compute-bound, but their effectiveness can be diminished for models that are memory-bound. Processing-in-Memory (PIM) architectures are being explored to provide better energy efficiency and scalable performance for these memory-bound models. Modern Field Programmable Gate Arrays (FPGAs) feature hundreds of megabits of Static Random Access Memory (SRAM) distributed across the device as disaggregated memory resources. This makes FPGAs ideal programmable platforms for developing custom Processor In\/Near Memory accelerators. Several PIM array-based accelerator designs have been proposed to leverage this substantial internal bandwidth. However, results reported to date show the FPGA-based PIM architectures operating at system clock frequencies well below a chip\u2019s Block-RAM (BRAM) Fmax clock frequency. Results also show that the compute densities of the designs do not scale linearly with BRAM densities. 
These results indicate that FPGA PIM architectures will never be competitive with their custom Application-Specific Integrated Circuit (ASIC) counterparts.<\/jats:p>\n                  <jats:p>\n                    In this article, we introduce DA-VinCi, a\n                    <jats:italic toggle=\"yes\">D<\/jats:italic>\n                    eep-Learning\n                    <jats:italic toggle=\"yes\">A<\/jats:italic>\n                    ccelerator O\n                    <jats:italic toggle=\"yes\">v<\/jats:italic>\n                    erlay using\n                    <jats:italic toggle=\"yes\">In<\/jats:italic>\n                    -Memory\n                    <jats:italic toggle=\"yes\">C<\/jats:italic>\n                    omput\n                    <jats:italic toggle=\"yes\">i<\/jats:italic>\n                    ng. DA-VinCi is the first scalable FPGA-based PIM deep-learning accelerator overlay capable of clocking at the maximum frequency of a device\u2019s BRAM. Further, the architecture of DA-VinCi allows the number of compute units to scale linearly up to the maximum capacity of a device\u2019s BRAM, and at the maximum clock frequency of the BRAM. The DA-VinCi overlay has a programmable Instruction Set Architecture (ISA) that allows the same synthesized design to provide low-latency inferencing of a range of memory-bound deep-learning models, including Multilayer Perceptrons, Recurrent Neural Network, Long Short-Term Memory, and Gated Recurrent Unit networks. The scalability and high clocking frequency of DA-VinCi are achieved through a new Processor In Memory (PIM) tile architecture and a highly scalable system-level framework. We present results showing DA-VinCi linearly scaling the number of Processing Elements (PEs) to 100% of the BRAM capacity (over 60K PEs) on an Alveo U55 clocking at 737\u2009MHz, the chip\u2019s BRAM Fmax. 
We provide comparative studies on inference latency across multiple deep-learning applications that show DA-VinCi achieves up to a 201\n                    <jats:inline-formula content-type=\"math\/tex\">\n                      <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n                    <\/jats:inline-formula>\n                    improvement over a state-of-the-art PIM overlay accelerator, up to 87\n                    <jats:inline-formula content-type=\"math\/tex\">\n                      <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n                    <\/jats:inline-formula>\n                    improvement over existing PIM-based FPGA accelerators, and up to 57\n                    <jats:inline-formula content-type=\"math\/tex\">\n                      <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n                    <\/jats:inline-formula>\n                    improvement over custom deep-learning accelerators on FPGAs.\n                  <\/jats:p>","DOI":"10.1145\/3770756","type":"journal-article","created":{"date-parts":[[2025,10,7]],"date-time":"2025-10-07T13:49:20Z","timestamp":1759844960000},"page":"1-38","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["DA-VinCi: A Deep-Learning Accelerator Overlay Using In-Memory Computing"],"prefix":"10.1145","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9920-2985","authenticated-orcid":false,"given":"MD Arafat","family":"Kabir","sequence":"first","affiliation":[{"name":"University of Arkansas, Fayetteville, Arkansas, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-5329-8233","authenticated-orcid":false,"given":"Nathaniel","family":"Fredricks","sequence":"additional","affiliation":[{"name":"University of Arkansas, Fayetteville, Arkansas, 
USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7853-5780","authenticated-orcid":false,"given":"Tendayi","family":"Kamucheka","sequence":"additional","affiliation":[{"name":"University of Arkansas Fayetteville, Fayetteville, Arkansas, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9277-5043","authenticated-orcid":false,"given":"Joel","family":"Mandebi","sequence":"additional","affiliation":[{"name":"Advanced Micro Devices, Inc., (AMD), Santa Clara, California, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7376-3744","authenticated-orcid":false,"given":"Miaoqing","family":"Huang","sequence":"additional","affiliation":[{"name":"University of Arkansas, Fayetteville, Arkansas, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0821-6258","authenticated-orcid":false,"given":"Jason D.","family":"Bakos","sequence":"additional","affiliation":[{"name":"University of South Carolina, Columbia, South Carolina, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1464-7107","authenticated-orcid":false,"given":"David","family":"Andrews","sequence":"additional","affiliation":[{"name":"University of Arkansas, Fayetteville, Arkansas, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,11,21]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSI.2010.2041501"},{"key":"e_1_3_1_3_2","first-page":"1","volume-title":"Proceedings of the 2018 7th International Conference on Modern Circuits and Systems Technologies (MOCAST)","author":"Danopoulos Dimitrios","year":"2018","unstructured":"Dimitrios Danopoulos, Christoforos Kachris, and Dimitrios Soudris. 2018. Acceleration of image classification with caffe framework using FPGA. 
In Proceedings of the 2018 7th International Conference on Modern Circuits and Systems Technologies (MOCAST), 1\u20134."},{"key":"e_1_3_1_4_2","first-page":"14","volume-title":"Proceedings of the 2018 International Conference on Field-Programmable Technology (FPT)","author":"Fan Hongxiang","year":"2018","unstructured":"Hongxiang Fan, Shuanglong Liu, Martin Ferianc, Ho-Cheung Ng, Zhiqiang Que, Shen Liu, Xinyu Niu, and Wayne Luk. 2018. A real-time object detection accelerator with compressed SSDLite on FPGA. In Proceedings of the 2018 International Conference on Field-Programmable Technology (FPT), 14\u201321."},{"key":"e_1_3_1_5_2","first-page":"75","volume-title":"Proceedings of the 2017 ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays","author":"Han Song","year":"2017","unstructured":"Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, and William (Bill) J. Dally. 2017. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In Proceedings of the 2017 ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays, 75\u201384."},{"key":"e_1_3_1_6_2","first-page":"121","volume-title":"Proceedings of the 2019 IEEE International Workshop on Signal Processing Systems (SiPS)","author":"Hao Cong","year":"2019","unstructured":"Cong Hao, Atif Sarwari, Zhijie Jin, Husam Abu-Haimed, Daryl Sew, Yuhong Li, Xinheng Liu, Bryan Wu, Dongdong Fu, Junli Gu, and Deming Chen. 2019. A hybrid GPU + FPGA system design for autonomous driving cars. In Proceedings of the 2019 IEEE International Workshop on Signal Processing Systems (SiPS), 121\u2013126."},{"key":"e_1_3_1_7_2","first-page":"1","volume-title":"Proceedings of the 2021 XI Brazilian Symposium on Computing Systems Engineering (SBESC)","author":"Klock Jo\u00e3o Pedro","year":"2021","unstructured":"Jo\u00e3o Pedro Klock, Jhonatan Corr\u00eaa, Miguel Bessa, Janier Arias-Garcia, Felipe Barboza, and Carmo Meinertz. 2021. 
A new automated energy meter fraud detection system based on artificial intelligence. In Proceedings of the 2021 XI Brazilian Symposium on Computing Systems Engineering (SBESC), 1\u20138."},{"key":"e_1_3_1_8_2","doi-asserted-by":"crossref","first-page":"411","DOI":"10.1109\/FPT.2018.00087","volume-title":"Proceedings of the 2018 International Conference on Field-Programmable Technology (FPT)","author":"Kojima Akira","year":"2018","unstructured":"Akira Kojima and Yohei Nose. 2018. Development of an autonomous driving robot car using FPGA. In Proceedings of the 2018 International Conference on Field-Programmable Technology (FPT), 411\u2013414."},{"key":"e_1_3_1_9_2","first-page":"230","volume-title":"Proceedings of the 2016 IEEE International Workshop on Signal Processing Systems (SiPS)","author":"Lee Minjae","year":"2016","unstructured":"Minjae Lee, Kyuyeon Hwang, Jinhwan Park, Sungwook Choi, Sungho Shin, and Wonyong Sung. 2016. FPGA-based low-power speech recognition with recurrent neural networks. In Proceedings of the 2016 IEEE International Workshop on Signal Processing Systems (SiPS), 230\u2013235."},{"key":"e_1_3_1_10_2","first-page":"1894","volume-title":"Proceedings of the 2020 5th International Conference on Mechanical, Control and Computer Engineering (ICMCCE)","author":"Lv Peng","year":"2020","unstructured":"Peng Lv, Wei Liu, and Jinghui Li. 2020. A FPGA-based accelerator implementaion for YOLOv2 object detection using Winograd algorithm. In Proceedings of the 2020 5th International Conference on Mechanical, Control and Computer Engineering (ICMCCE), 1894\u20131898."},{"issue":"8","key":"e_1_3_1_11_2","doi-asserted-by":"crossref","first-page":"1861","DOI":"10.1109\/TVLSI.2019.2905242","article-title":"A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection","volume":"27","author":"Nguyen Duy Thanh","year":"2019","unstructured":"Duy Thanh Nguyen, Tuan Nghia Nguyen, Hyun Kim, and Hyuk-Jae Lee. 2019. 
A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 27, 8 (2019), 1861\u20131873.","journal-title":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems"},{"key":"e_1_3_1_12_2","first-page":"1","volume-title":"Proceedings of the 2019 International Symposium on Advanced Electrical and Communication Technologies (ISAECT)","author":"Sa\u011flam Serkan","year":"2019","unstructured":"Serkan Sa\u011flam, Fatih Tat, and Salih Bayar. 2019. FPGA implementation of CNN algorithm for detecting malaria diseased blood cells. In Proceedings of the 2019 International Symposium on Advanced Electrical and Communication Technologies (ISAECT), 1\u20135."},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.3233\/IDT-160261"},{"key":"e_1_3_1_14_2","first-page":"485","volume-title":"Proceedings of the 2015 IEEE 4th Global Conference on Consumer Electronics (GCCE)","author":"Syu Dong-Fong","year":"2015","unstructured":"Dong-Fong Syu, Su-Wei Syu, Shanq-Jang Ruan, Yu-Chang Huang, and Chuan-Kai Yang. 2015. FPGA implementation of automatic speech recognition system in a car environment. In Proceedings of the 2015 IEEE 4th Global Conference on Consumer Electronics (GCCE), 485\u2013486."},{"key":"e_1_3_1_15_2","first-page":"571","volume-title":"Proceedings of the 2021 11th International Conference on Information Science and Technology (ICIST)","author":"Wang Jin","year":"2021","unstructured":"Jin Wang and Shenshen Gu. 2021. FPGA implementation of object detection accelerator based on Vitis-AI. 
In Proceedings of the 2021 11th International Conference on Information Science and Technology (ICIST), 571\u2013577."},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1587\/transinf.2020EDL8153"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2021.3126838"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1088\/1742-6596\/2171\/1\/012010"},{"issue":"18","key":"e_1_3_1_19_2","doi-asserted-by":"crossref","first-page":"2272","DOI":"10.3390\/electronics10182272","article-title":"An efficient FPGA-based convolutional neural network for classification: Ad-MobileNet","volume":"10","author":"Bouguezzi Safa","year":"2021","unstructured":"Safa Bouguezzi, Hana Ben Fredj, Tarek Belabed, Carlos Valderrama, Hassene Faiedh, and Chokri Souani. 2021. An efficient FPGA-based convolutional neural network for classification: Ad-MobileNet. Electronics 10, 18 (2021), 2272.","journal-title":"Electronics"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1186\/s12859-021-04347-6"},{"key":"e_1_3_1_21_2","first-page":"1","volume-title":"Proceedings of the 2020 IEEE Symposium on VLSI Circuits","author":"Kim Ji-Hoon","year":"2020","unstructured":"Ji-Hoon Kim, Juhyoung Lee, Jinsu Lee, Hoi-Jun Yoo, and Joo-Young Kim. June. 2020. Z-PIM: An energy-efficient sparsity aware processing-in-memory architecture with fully-variable weight precision. In Proceedings of the 2020 IEEE Symposium on VLSI Circuits. IEEE, 1\u20132."},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.2022.3211290"},{"key":"e_1_3_1_23_2","first-page":"24","volume-title":"Proceedings of the 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits)","author":"Lee Chia-Fu","year":"2022","unstructured":"Chia-Fu Lee, Cheng-Han Lu, Cheng-En Lee, Haruki Mori, Hidehiro Fujiwara, Yi-Chun Shih, Tan-Li Chou, Yu-Der Chih, and Tsung-Yung Jonathan Chang. 2022. 
A 12nm 121-TOPS\/W 41.6-TOPS\/mm2 all digital full precision SRAM-based compute-in-memory with configurable bit-width for AI edge applications. In Proceedings of the 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits). IEEE, 24\u201325."},{"key":"e_1_3_1_24_2","first-page":"1","volume-title":"Proceedings of the 2023 IEEE Hot Chips 35 Symposium (HCS)","author":"Kwon Yongkee","year":"2023","unstructured":"Yongkee Kwon, Guhyun Kim, Nahsung Kim, Woojae Shin, Jongsoon Won, Hyunha Joo, Haerang Choi, Byeongju An, Gyeongcheol Shin, Dayeon Yun, et al. 2023. Memory-centric computing with SK Hynix\u2019s domain-specific memory. In Proceedings of the 2023 IEEE Hot Chips 35 Symposium (HCS), 1\u201326."},{"key":"e_1_3_1_25_2","first-page":"88","volume-title":"Proceedings of the 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","author":"Wang Xiaowei","year":"2021","unstructured":"Xiaowei Wang, Vidushi Goyal, Jiecao Yu, Valeria Bertacco, Andrew Boutros, Eriko Nurvitadhi, Charles Augustine, Ravi R. Iyer, and Reetuparna Das. 2021. Compute-capable block RAMs for efficient deep learning acceleration on FPGAs. In Proceedings of the 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 88\u201396."},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDIS50059.2020.00025"},{"key":"e_1_3_1_27_2","first-page":"24","volume-title":"Proceedings of the 2021 31st International Conference on Field-Programmable Logic and Applications (FPL)","author":"Panahi Atiyehsadat","year":"2021","unstructured":"Atiyehsadat Panahi, Suhail Balsalama, Ange-Thierry Ishimwe, Joel Mandebi Mbongue, and David Andrews. 2021. A customizable domain-specific memory-centric FPGA overlay for machine learning applications. 
In Proceedings of the 2021 31st International Conference on Field-Programmable Logic and Applications (FPL), 24\u201327."},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.5555\/AAI29319767"},{"key":"e_1_3_1_29_2","first-page":"1","volume-title":"Proceedings of the 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","author":"Arora A.","year":"2022","unstructured":"A. Arora, T. Anand, A. Borda, R. Sehgal, B. Hanindhito, J. Kulkarni, and L. K. John. 2022. CoMeFa: Compute-in-memory blocks for FPGAs. In Proceedings of the 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 1\u20139."},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3603504"},{"key":"e_1_3_1_31_2","first-page":"52","volume-title":"Proceedings of the 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","author":"Chen Yuzong","year":"2023","unstructured":"Yuzong Chen and Mohamed S. Abdelfattah. 2023. BRAMAC: Compute-in-BRAM architectures for multiply-accumulate on FPGAs. In Proceedings of the 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 52\u201362."},{"key":"e_1_3_1_32_2","doi-asserted-by":"crossref","unstructured":"Yuzong Chen Jordan Dotzel and Mohamed S. Abdelfattah. 2023. M4BRAM: Mixed-precision matrix-matrix multiplication in FPGA block RAMs. arXiv:2311.02758. Retrieved from http:\/\/arxiv.org\/abs\/2311.02758","DOI":"10.1109\/ICFPT59805.2023.00013"},{"key":"e_1_3_1_33_2","first-page":"224","volume-title":"Proceedings of the 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","author":"Kabir M. D. Arafat","year":"2023","unstructured":"M. D. Arafat Kabir, Joshua Hollis, Atiyehsadat Panahi, Jason Bakos, Miaoqing Huang, and David Andrews. 
2023. Making BRAMs compute: Creating scalable computational memory fabric overlays. In Proceedings of the 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 224\u2013224."},{"key":"e_1_3_1_34_2","first-page":"109","volume-title":"Proceedings of the 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL)","author":"Kabir M. D Arafat","year":"2023","unstructured":"M. D Arafat Kabir, Ehsan Kabir, Joshua Hollis, Eli Levy-Mackay, Atiyehsadat Panahi, Jason Bakos, Miaoqing Huang, and David Andrews. 2023. FPGA processor in memory architectures (PIMs): Overlay or overhaul? In Proceedings of the 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 109\u2013115."},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},{"key":"e_1_3_1_36_2","unstructured":"M. D. Arafat Kabir Tendayi Kamucheka Nathaniel Fredricks Joel Mandebi Jason Bakos Miaoqing Huang and David Andrews. 2024. IMAGine: An In-Memory Accelerated GEMV Engine Overlay. Retrieved from https:\/\/github.com\/Arafat-Kabir\/IMAGine"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL64840.2024.00038"},{"key":"e_1_3_1_38_2","unstructured":"M. D. Arafat Kabir Tendayi Kamucheka Nathaniel Fredricks Joel Mandebi Jason Bakos Miaoqing Huang and David Andrews. 2024. DA-VinCi: A Deeplearning Accelerator Overlay Using In-Memory Computing. Retrieved from https:\/\/github.com\/Arafat-Kabir\/DA-VinCi"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/T-C.1969.222754"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1970.5008902"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS47924.2020.00076"},{"key":"e_1_3_1_42_2","unstructured":"Christopher Wolters Xiaoxuan Yang Ulf Schlichtmann and Toyotaro Suzumura. 2024. 
Memory is all you need: An overview of compute-in-memory architectures for accelerating large language model inference. arXiv:2406.08413. Retrieved from https:\/\/arxiv.org\/abs\/2406.08413"},{"issue":"1","key":"e_1_3_1_43_2","doi-asserted-by":"crossref","first-page":"011305","DOI":"10.1063\/1.5129306","article-title":"The building blocks of a brain-inspired computer","volume":"7","author":"Kendall Jack D.","year":"2020","unstructured":"Jack D. Kendall and Suhas Kumar. 2020. The building blocks of a brain-inspired computer. Applied Physics Reviews 7, 1 (2020), 011305.","journal-title":"Applied Physics Reviews"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2022.3202350"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/2.375174"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/CICC.1992.591879"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/40.592312"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCD.1997.628842"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/2.612252"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2022.3174101"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/HOTCHIPS.2019.8875680"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/HCS52781.2021.9567191"},{"key":"e_1_3_1_53_2","doi-asserted-by":"crossref","first-page":"383","DOI":"10.1109\/ISCA.2018.00040","volume-title":"Proceedings of the 2018 ACM\/IEEE 45th Annual International Symposium on Computer Architecture (ISCA)","author":"Eckert Charles","year":"2018","unstructured":"Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, and Reetuparna Das. 2018. Neural cache: Bit-serial in-cache acceleration of deep neural networks. 
In Proceedings of the 2018 ACM\/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 383\u2013396."},{"key":"e_1_3_1_54_2","first-page":"88","volume-title":"Proceedings of the 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","author":"Wang Xiaowei","year":"2021","unstructured":"Xiaowei Wang, Vidushi Goyal, Jiecao Yu, Valeria Bertacco, Andrew Boutros, Eriko Nurvitadhi, Charles Augustine, Ravi Iyer, and Reetuparna Das. 2021. Compute-capable block RAMs for efficient deep learning acceleration on FPGAs. In Proceedings of the 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 88\u201396."},{"key":"e_1_3_1_55_2","first-page":"1","volume-title":"Proceedings of the 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","author":"Arora Aman","year":"2022","unstructured":"Aman Arora, Tanmay Anand, Aatman Borda, Rishabh Sehgal, Bagus Hanindhito, Jaydeep Kulkarni, and Lizy K. John. 2022. CoMeFa: Compute-in-memory blocks for FPGAs. In Proceedings of the 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 1\u20139."},{"key":"e_1_3_1_56_2","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1145\/3020078.3021738","volume-title":"Proceedings of the 2017 ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays","author":"Aydonat Utku","year":"2017","unstructured":"Utku Aydonat, Shane O\u2019Connell, Davor Capalija, Andrew C. Ling, and Gordon R. Chiu. 2017. An OpenCL\u2122 deep learning accelerator on arria 10. In Proceedings of the 2017 ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays. 
ACM, 55\u201364."},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/3174243.3174261"},{"issue":"1","key":"e_1_3_1_58_2","doi-asserted-by":"crossref","first-page":"1098","DOI":"10.1109\/TNNLS.2022.3180209","article-title":"Spartus: A 9.4 TOp\/s FPGA-based LSTM accelerator exploiting spatio-temporal sparsity","volume":"35","author":"Gao Chang","year":"2024","unstructured":"Chang Gao, Tobi Delbruck, and Shih-Chii Liu. 2024. Spartus: A 9.4 TOp\/s FPGA-based LSTM accelerator exploiting spatio-temporal sparsity. IEEE Transactions on Neural Networks and Learning Systems 35, 1 (2024), 1098\u20131112.","journal-title":"IEEE Transactions on Neural Networks and Learning Systems"},{"key":"e_1_3_1_59_2","first-page":"32","volume-title":"Proceedings of the International Symposium on Applied Reconfigurable Computing","author":"Kabir Ehsan","year":"2022","unstructured":"Ehsan Kabir, Arpan Poudel, Zeyad Aklah, Miaoqing Huang, and David Andrews. 2022. A runtime programmable accelerator for convolutional and multilayer perceptron neural networks on FPGA. In Proceedings of the International Symposium on Applied Reconfigurable Computing. Springer, 32\u201346."},{"key":"e_1_3_1_60_2","first-page":"327","volume-title":"Proceedings of the 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL)","author":"Kabir Ehsan","year":"2023","unstructured":"Ehsan Kabir, Daniel Coble, Joud N. Satme, Austin R. J. Downey, Jason D. Bakos, David Andrews, and Miaoqing Huang. 2023. Accelerating LSTM-based high-rate dynamic system models. In Proceedings of the 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 327\u2013332."},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ymssp.2019.106551"},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/3543069"},{"key":"e_1_3_1_63_2","unstructured":"M. D. 
Arafat Kabir Ehsan Kabir Joshua Hollis Eli Levy-Mackay Atiyehsadat Panahi Jason Bakos Miaoqing Huang and David Andrews. 2023. PiCaSO: A Scalable and Fast PIM Overlay. Retrieved from https:\/\/github.com\/Arafat-Kabir\/PiCaSO"},{"key":"e_1_3_1_64_2","first-page":"88","volume-title":"Proceedings of the 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","author":"Wang X.","year":"2021","unstructured":"X. Wang, V. Goyal, J. Yu, V. Bertacco, A. Boutros, E. Nurvitadhi, C. Augustine, R. Iyer, and R. Das. 2021. Compute-capable block RAMs for efficient deep learning acceleration on FPGAs. In Proceedings of the 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 88\u201396."},{"key":"e_1_3_1_65_2","doi-asserted-by":"crossref","first-page":"75","DOI":"10.1109\/CGO.2004.1281665","volume-title":"Proceedings of the International Symposium on Code Generation and Optimization, 2004 (CGO \u201904)","author":"Lattner Chris","year":"2004","unstructured":"Chris Lattner and Vikram Adve. 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization, 2004 (CGO \u201904). IEEE, 75\u201386. Retrieved from https:\/\/llvm.org\/docs\/index.html"},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/CGO51591.2021.9370308"},{"key":"e_1_3_1_67_2","unstructured":"M. D. Arafat Kabir. 2024. DA-VinCi IR3 Assember (DavinciAsm) Version 0.1. 
Retrieved from https:\/\/github.com\/Arafat-Kabir\/DA-VinCi\/blob\/master\/work\/scripts\/davinci_assembler.py"},{"key":"e_1_3_1_68_2","doi-asserted-by":"publisher","DOI":"10.1038\/323533a0"},{"key":"e_1_3_1_69_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10710-017-9314-z"},{"key":"e_1_3_1_70_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_1_71_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASRU.2013.6707742"},{"key":"e_1_3_1_72_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASPDAC.2017.7858394"},{"key":"e_1_3_1_73_2","doi-asserted-by":"crossref","first-page":"383","DOI":"10.1109\/ISCA.2018.00040","volume-title":"Proceedings of the 2018 ACM\/IEEE 45Th Annual International Symposium on Computer Architecture (ISCA)","author":"Eckert Charles","year":"2018","unstructured":"Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, and Reetuparna Das. 2018. Neural cache: Bit-serial in-cache acceleration of deep neural networks. In Proceedings of the 2018 ACM\/IEEE 45Th Annual International Symposium on Computer Architecture (ISCA), 383\u2013396."},{"key":"e_1_3_1_74_2","unstructured":"Intel. 2022. Intel\u00ae Stratix\u00ae 10 Device Datasheet. Retrieved from https:\/\/www.intel.com\/content\/www\/us\/en\/docs\/programmable\/683181\/current\/memory-block-specifications.html"},{"key":"e_1_3_1_75_2","volume-title":"DARPA TIMIT: Acoustic-Phonetic Continuous Speech Corpus","author":"Lyons John W.","year":"1993","unstructured":"John W. Lyons. 1993. DARPA TIMIT: Acoustic-Phonetic Continuous Speech Corpus. National Institute of Standards and Technology."},{"key":"e_1_3_1_76_2","unstructured":"TensorFlow. 2017. Shakespear Dataset. Retrieved from https:\/\/storage.googleapis.com\/download.tensorflow.org\/data\/shakespeare.txt"},{"key":"e_1_3_1_77_2","unstructured":"Yann LeCun Corinna Cortes Chris Burges. 2010. MNIST handwritten digit database. 
Retrieved from https:\/\/ieeexplore.ieee.org\/document\/6296535"},{"key":"e_1_3_1_78_2","unstructured":"Awni Hannun Carl Case Jared Casper Bryan Catanzaro Greg Diamos Erich Elsen Ryan Prenger Sanjeev Satheesh Shubho Sengupta Adam Coates et al. 2014. Deep speech: Scaling up end-to-end speech recognition. arXiv:1412.5567. Retrieved from https:\/\/arxiv.org\/abs\/1412.5567"},{"key":"e_1_3_1_79_2","unstructured":"Baidu Research. 2016. Baidu DeepBench. Retrieved from https:\/\/svail.github.io\/DeepBench\/"}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3770756","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,21]],"date-time":"2025-11-21T13:54:44Z","timestamp":1763733284000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3770756"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,21]]},"references-count":78,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2025,12,31]]}},"alternative-id":["10.1145\/3770756"],"URL":"https:\/\/doi.org\/10.1145\/3770756","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"type":"print","value":"1936-7406"},{"type":"electronic","value":"1936-7414"}],"subject":[],"published":{"date-parts":[[2025,11,21]]},"assertion":[{"value":"2025-03-06","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-22","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-11-21","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}