{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,23]],"date-time":"2026-04-23T14:47:01Z","timestamp":1776955621410,"version":"3.51.4"},"reference-count":43,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2023,2,10]],"date-time":"2023-02-10T00:00:00Z","timestamp":1675987200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100014188","name":"Ministry of Science and ICT","doi-asserted-by":"crossref","award":["NRF-2021-M3F3A2A02037893, NRF-2021R1F1A1062902"],"award-info":[{"award-number":["NRF-2021-M3F3A2A02037893, NRF-2021R1F1A1062902"]}],"id":[{"id":"10.13039\/501100014188","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Korean Government","award":["1711080972, 2021000853"],"award-info":[{"award-number":["1711080972, 2021000853"]}]},{"name":"Creative Pioneering Researchers Program through Seoul National University"},{"name":"Automation and Systems Research Institute (ASRI) and Inter-university Semiconductor Research Center (ISRC) at Seoul National University"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2023,3,31]]},"abstract":"<jats:p>Deep neural networks (DNNs) have become key solutions in the natural language processing (NLP) domain. However, the existing accelerators customized for their narrow target models cannot support diverse NLP models. Therefore, naively running complex NLP models on the existing accelerators often leads to very marginal performance improvements. For these reasons, architects are now in dire need of a new accelerator that can run various NLP models while taking its full performance potential. 
In this article, we propose FlexRun, an FPGA-based modular accelerator to efficiently support diverse and complex NLP models. First, we identify key components commonly used by NLP models and implement them on top of a current state-of-the-art FPGA-based accelerator. Next, FlexRun conducts an in-depth design space exploration to find the best accelerator architecture for a target NLP model. Last, FlexRun automatically reconfigures the accelerator based on the exploration results. Our FlexRun design outperforms the current state-of-the-art FPGA-based accelerator by 1.21\u00d7\u20132.73\u00d7 and 1.15\u00d7\u20131.50\u00d7 for BERT and GPT2, respectively. Compared to Nvidia\u2019s V100 GPU, FlexRun achieves 2.69\u00d7 higher performance on average for various BERT and GPT2 models.<\/jats:p>","DOI":"10.1145\/3564606","type":"journal-article","created":{"date-parts":[[2022,10,3]],"date-time":"2022-10-03T12:25:06Z","timestamp":1664799906000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":27,"title":["A Fast and Flexible FPGA-based Accelerator for Natural Language Processing Neural Networks"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4831-1137","authenticated-orcid":false,"given":"Suyeon","family":"Hur","sequence":"first","affiliation":[{"name":"Seoul National University, Seoul, Republic of Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9758-1811","authenticated-orcid":false,"given":"Seongmin","family":"Na","sequence":"additional","affiliation":[{"name":"Seoul National University, Seoul, Republic of Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0757-4165","authenticated-orcid":false,"given":"Dongup","family":"Kwon","sequence":"additional","affiliation":[{"name":"Seoul National University, Seoul, Republic of 
Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5432-7813","authenticated-orcid":false,"given":"Joonsung","family":"Kim","sequence":"additional","affiliation":[{"name":"Seoul National University, Seoul, Republic of Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8044-1644","authenticated-orcid":false,"given":"Andrew","family":"Boutros","sequence":"additional","affiliation":[{"name":"MangoBoost Inc., United States"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2347-9590","authenticated-orcid":false,"given":"Eriko","family":"Nurvitadhi","sequence":"additional","affiliation":[{"name":"MangoBoost Inc., United States"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2193-5748","authenticated-orcid":false,"given":"Jangwoo","family":"Kim","sequence":"additional","affiliation":[{"name":"Seoul National University, Seoul, Republic of Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,2,10]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"2017. Nvidia Tesla V100 GPU Architecture The World\u2019s Most Advanced Data Center GPU.Retrieved from https:\/\/images.nvidia.com\/content\/volta-architecture\/pdf\/volta-architecture-whitepaper.pdf."},{"key":"e_1_3_1_3_2","unstructured":"2021. DeepLearningExamples. Retrieved from https:\/\/github.com\/NVIDIA\/DeepLearningExamples\/tree\/master\/TensorFlow."},{"key":"e_1_3_1_4_2","unstructured":"2021. Nsight Systems Release Notes. 
Retrieved from https:\/\/docs.nvidia.com\/nsight-systems\/ReleaseNotes\/index.html."},{"key":"e_1_3_1_5_2","first-page":"265","volume-title":"12th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201916)","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et\u00a0al. 2016. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201916). 265\u2013283."},{"key":"e_1_3_1_6_2","article-title":"Neural machine translation by jointly learning to align and translate","author":"Bahdanau Dzmitry","year":"2014","unstructured":"Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).","journal-title":"arXiv preprint arXiv:1409.0473"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICM52667.2021.9664938"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICFPT51103.2020.00011"},{"key":"e_1_3_1_9_2","unstructured":"Chris Leary and Todd Wang. 2017. XLA: TensorFlow compiled.Retrieved from https:\/\/developers.googleblog.com\/2017\/03\/xla-tensorflow-compiled.html."},{"key":"e_1_3_1_10_2","unstructured":"Lance Brown Manish Deo and Jeffrey Schulz. 2019. Intel\u00ae Stratix\u00ae 10 MX Devices with Samsung* HBM2 Solve the Memory Bandwidth Challenge."},{"key":"e_1_3_1_11_2","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. 
arXiv:1810.04805 (2018).","journal-title":"arXiv:1810.04805"},{"key":"e_1_3_1_12_2","volume-title":"International Symposium on Computer Architecture","author":"Fowers Jeremy","year":"2018","unstructured":"Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, et\u00a0al. 2018. A configurable cloud-scale DNN processor for real-time AI. In International Symposium on Computer Architecture."},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCD50377.2020.00047"},{"key":"e_1_3_1_14_2","first-page":"328","volume-title":"IEEE International Symposium on High Performance Computer Architecture (HPCA)","author":"Ham Tae Jun","year":"2020","unstructured":"Tae Jun Ham, Sung Jun Jung, Seonghak Kim, Young H. Oh, Yeonhong Park, Yoonho Song, Jung-Hun Park, Sanghee Lee, Kyoung Park, Jae W. Lee, et\u00a0al. 2020. A^3: Accelerating attention mechanisms in neural networks with approximation. In IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 328\u2013341."},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00060"},{"key":"e_1_3_1_16_2","first-page":"75","volume-title":"ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays","author":"Han Song","year":"2017","unstructured":"Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, et\u00a0al. 2017. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays. 
75\u201384."},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8682336"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_1_19_2","doi-asserted-by":"crossref","first-page":"250","DOI":"10.1145\/3307650.3322214","volume-title":"46th International Symposium on Computer Architecture","author":"Jang Hanhwi","year":"2019","unstructured":"Hanhwi Jang, Joonsung Kim, Jae-Eon Jo, Jaewon Lee, and Jangwoo Kim. 2019. MnnFast: A fast and scalable system architecture for memory-augmented neural networks. In 46th International Symposium on Computer Architecture. 250\u2013263."},{"key":"e_1_3_1_20_2","article-title":"Beyond data and model parallelism for deep neural networks","author":"Jia Zhihao","year":"2018","unstructured":"Zhihao Jia, Matei Zaharia, and Alex Aiken. 2018. Beyond data and model parallelism for deep neural networks. arXiv preprint arXiv:1807.05358 (2018).","journal-title":"arXiv preprint arXiv:1807.05358"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/PACT52795.2021.00013"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.5555\/3437539.3437732"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3370748.3406567"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3466752.3480125"},{"key":"e_1_3_1_25_2","first-page":"522","volume-title":"IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","author":"Markidis Stefano","year":"2018","unstructured":"Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S. Vetter. 2018. Nvidia tensor core programmability, performance & precision. In IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 522\u2013531."},{"key":"e_1_3_1_26_2","article-title":"Structured pruning of a BERT-based question answering model","author":"McCarley J. S.","year":"2019","unstructured":"J. S. 
McCarley, Rishav Chakravarti, and Avirup Sil. 2019. Structured pruning of a BERT-based question answering model. arXiv preprint arXiv:1910.06360 (2019).","journal-title":"arXiv preprint arXiv:1910.06360"},{"key":"e_1_3_1_27_2","volume-title":"IEEE Symposium on Field-Programmable Custom Computing Machines","author":"Nurvitadhi Eriko","year":"2019","unstructured":"Eriko Nurvitadhi, Dongup Kwon, Ali Jafari, Andrew Boutros, Jaewoong Sim, Phillip Tomson, Huseyin Sumbul, Gregory Chen, Phil Knag, Raghavan Kumar, et\u00a0al. 2019. Why compete when you can work together: FPGA-ASIC integration for persistent RNNs. In IEEE Symposium on Field-Programmable Custom Computing Machines."},{"key":"e_1_3_1_28_2","unstructured":"Alec Radford Jeff Wu Rewon Child David Luan Dario Amodei and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. (2019)."},{"key":"e_1_3_1_29_2","article-title":"SQuAD: 100,000+ questions for machine comprehension of text","author":"Rajpurkar Pranav","year":"2016","unstructured":"Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016).","journal-title":"arXiv preprint arXiv:1606.05250"},{"key":"e_1_3_1_30_2","first-page":"446","volume-title":"ACM\/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)","author":"Reddi Vijay Janapa","year":"2020","unstructured":"Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breughe, Mark Charlebois, William Chou, et\u00a0al. 2020. MLPerf inference benchmark. In ACM\/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 446\u2013459."},{"key":"e_1_3_1_31_2","first-page":"318","volume-title":"Learning Internal Representations by Error Propagation","author":"Rumelhart David E.","year":"1987","unstructured":"David E. Rumelhart and James L. McClelland. 1987. 
Learning Internal Representations by Error Propagation. 318\u2013362."},{"key":"e_1_3_1_32_2","article-title":"SCALE-Sim: Systolic CNN accelerator simulator","author":"Samajdar Ananda","year":"2018","unstructured":"Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. 2018. SCALE-Sim: Systolic CNN accelerator simulator. arXiv preprint arXiv:1811.02883 (2018).","journal-title":"arXiv preprint arXiv:1811.02883"},{"key":"e_1_3_1_33_2","article-title":"DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter","author":"Sanh Victor","year":"2019","unstructured":"Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).","journal-title":"arXiv preprint arXiv:1910.01108"},{"key":"e_1_3_1_34_2","article-title":"Megatron-LM: Training multi-billion parameter language models using model parallelism","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019).","journal-title":"arXiv preprint arXiv:1909.08053"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/3466752.3480095"},{"key":"e_1_3_1_36_2","volume-title":"International Conference on Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. 
In International Conference on Advances in Neural Information Processing Systems."},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD45719.2019.8942127"},{"key":"e_1_3_1_38_2","article-title":"SpAtten: Efficient sparse attention architecture with cascade token and head pruning","volume":"2012","author":"Wang Hanrui","year":"2020","unstructured":"Hanrui Wang, Zhekai Zhang, and Song Han. 2020. SpAtten: Efficient sparse attention architecture with cascade token and head pruning. CoRR abs\/2012.09852 (2020).","journal-title":"CoRR"},{"key":"e_1_3_1_39_2","first-page":"11","volume-title":"ACM\/SIGDA International Symposium on Field-programmable Gate Arrays","author":"Wang Shuo","year":"2018","unstructured":"Shuo Wang, Zhe Li, Caiwen Ding, Bo Yuan, Qinru Qiu, Yanzhi Wang, and Yun Liang. 2018. C-LSTM: Enabling efficient LSTM using structured compression techniques on FPGAs. In ACM\/SIGDA International Symposium on Field-programmable Gate Arrays. 11\u201320."},{"key":"e_1_3_1_40_2","doi-asserted-by":"crossref","first-page":"811","DOI":"10.1109\/MICRO50266.2020.00071","volume-title":"53rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO)","author":"Zadeh Ali Hadi","year":"2020","unstructured":"Ali Hadi Zadeh, Isak Edo, Omar Mohamed Awad, and Andreas Moshovos. 2020. GOBO: Quantizing attention-based NLP models for low latency and energy efficient inference. In 53rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO). IEEE, 811\u2013824."},{"key":"e_1_3_1_41_2","article-title":"Q8BERT: Quantized 8bit BERT","author":"Zafrir Ofir","year":"2019","unstructured":"Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. 2019. Q8BERT: Quantized 8bit BERT. 
arXiv preprint arXiv:1910.06188 (2019).","journal-title":"arXiv preprint arXiv:1910.06188"},{"key":"e_1_3_1_42_2","article-title":"TernaryBERT: Distillation-aware ultra-low bit BERT","author":"Zhang Wei","year":"2020","unstructured":"Wei Zhang, Lu Hou, Yichun Yin, Lifeng Shang, Xiao Chen, Xin Jiang, and Qun Liu. 2020. TernaryBERT: Distillation-aware ultra-low bit BERT. arXiv preprint arXiv:2009.12812 (2020).","journal-title":"arXiv preprint arXiv:2009.12812"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3240765.3240801"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/3400302.3415609"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3564606","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3564606","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:09:11Z","timestamp":1750183751000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3564606"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,10]]},"references-count":43,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,3,31]]}},"alternative-id":["10.1145\/3564606"],"URL":"https:\/\/doi.org\/10.1145\/3564606","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,2,10]]},"assertion":[{"value":"2022-03-04","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-09-05","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2023-02-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}