{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,11]],"date-time":"2026-06-11T16:17:00Z","timestamp":1781194620211,"version":"3.54.1"},"reference-count":105,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,12,17]],"date-time":"2024-12-17T00:00:00Z","timestamp":1734393600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"ACE"},{"DOI":"10.13039\/100000028","name":"Semiconductor Research Corporation","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100000028","id-type":"DOI","asserted-by":"crossref"}]},{"name":"DARPA and NSF","award":["No. 2007832, No. 2019306, and No. 2118709"],"award-info":[{"award-number":["No. 2007832, No. 2019306, and No. 2118709"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2025,3,31]]},"abstract":"<jats:p>Recent advancements in large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. While hardware accelerators for Transformer-based models have been extensively studied, the majority of existing approaches rely on temporal architectures that reuse hardware units for different network layers and operators. However, these methods often encounter challenges in achieving low latency due to considerable memory access overhead.<\/jats:p>\n          <jats:p>This article investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on field-programmable gate arrays (FPGAs). 
Our approach involves the specialization of distinct hardware units for specific operators or layers, facilitating direct communication between them through a dataflow architecture while minimizing off-chip memory accesses. We introduce a comprehensive analytical model for estimating the performance of a spatial LLM accelerator, taking into account the on-chip compute and memory resources available on an FPGA. This model can be extended to multi-FPGA settings for distributed inference. Through our analysis, we can identify the most effective parallelization and buffering schemes for the accelerator and, crucially, determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart.<\/jats:p>\n          <jats:p>To enable more productive implementations of an LLM model on FPGAs, we further provide a library of high-level synthesis (HLS) kernels that are composable and reusable. This library will be made available as open-source. To validate the effectiveness of both our analytical model and HLS library, we have implemented Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformers (GPT2) on an AMD Xilinx Alveo U280 FPGA device. Experimental results demonstrate our approach can achieve up to 13.4\u00d7 speedup when compared to previous FPGA-based accelerators for the BERT model. 
For GPT generative inference, we attain a 2.2\u00d7 speedup compared to DFX, an FPGA overlay, in the prefill stage, while achieving a 1.9\u00d7 speedup and a 5.7\u00d7 improvement in energy efficiency compared to the NVIDIA A100 GPU in the decode stage.<\/jats:p>","DOI":"10.1145\/3656177","type":"journal-article","created":{"date-parts":[[2024,4,4]],"date-time":"2024-04-04T12:19:27Z","timestamp":1712233167000},"page":"1-29","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":66,"title":["Understanding the Potential of FPGA-based Spatial Acceleration for Large Language Model Inference"],"prefix":"10.1145","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6617-0075","authenticated-orcid":false,"given":"Hongzheng","family":"Chen","sequence":"first","affiliation":[{"name":"Cornell University, Ithaca, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-8379-7489","authenticated-orcid":false,"given":"Jiahao","family":"Zhang","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6106-1283","authenticated-orcid":false,"given":"Yixiao","family":"Du","sequence":"additional","affiliation":[{"name":"Cornell University, Ithaca, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6901-8837","authenticated-orcid":false,"given":"Shaojie","family":"Xiang","sequence":"additional","affiliation":[{"name":"Cornell University, Ithaca, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-8585-5947","authenticated-orcid":false,"given":"Zichao","family":"Yue","sequence":"additional","affiliation":[{"name":"Cornell University, Ithaca, United 
States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2850-0176","authenticated-orcid":false,"given":"Niansong","family":"Zhang","sequence":"additional","affiliation":[{"name":"Cornell University, Ithaca, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3785-3413","authenticated-orcid":false,"given":"Yaohui","family":"Cai","sequence":"additional","affiliation":[{"name":"Cornell University, Ithaca, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0778-0308","authenticated-orcid":false,"given":"Zhiru","family":"Zhang","sequence":"additional","affiliation":[{"name":"Cornell University, Ithaca, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2024,12,17]]},"reference":[{"key":"e_1_3_3_2_2","volume-title":"Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. 
In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation."},{"key":"e_1_3_3_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41404.2022.00051"},{"key":"e_1_3_3_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/3570928"},{"key":"e_1_3_3_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/3242897"},{"key":"e_1_3_3_6_2","article-title":"On the opportunities and risks of foundation models","author":"Bommasani Rishi","year":"2021","unstructured":"Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill et\u00a0al. 2021. On the opportunities and risks of foundation models. Retrieved from https:\/\/arXiv:2108.07258","journal-title":"R"},{"key":"e_1_3_3_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2016.7783710"},{"key":"e_1_3_3_8_2","article-title":"QuIP: 2-bit quantization of large language models with guarantees","author":"Chee Jerry","year":"2023","unstructured":"Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. 2023. QuIP: 2-bit quantization of large language models with guarantees. Retrieved from https:\/\/arXiv:2307.13304","journal-title":"Retrieved from"},{"key":"e_1_3_3_9_2","article-title":"Accelerating large language model decoding with speculative sampling","author":"Chen Charlie","year":"2023","unstructured":"Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. Accelerating large language model decoding with speculative sampling. 
Retrieved from https:\/\/arXiv:2302.01318","journal-title":"Retrieved from"},{"key":"e_1_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640399"},{"key":"e_1_3_3_11_2","article-title":"Evaluating large language models trained on code","author":"Chen Mark","year":"2021","unstructured":"Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. Retrieved from https:\/\/arXiv:2107.03374","journal-title":"Retrieved from https:\/\/arXiv:2107.03374"},{"key":"e_1_3_3_12_2","unstructured":"Wei-Lin Chiang Zhuohan Li Zi Lin Ying Sheng Zhanghao Wu Hao Zhang Lianmin Zheng Siyuan Zhuang Yonghao Zhuang Joseph E. Gonzalez Ion Stoica and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. 
Retrieved from https:\/\/lmsys.org\/blog\/2023-03-30-vicuna\/"},{"key":"e_1_3_3_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2018.022071131"},{"key":"e_1_3_3_14_2","volume-title":"Retrieved from","author":"Cohen Aaron Daniel","year":"2022","unstructured":"Aaron Daniel Cohen, Adam Roberts, Alejandra Molina, Alena Butryna, Alicia Jin, Apoorv Kulshreshtha, Ben Hutchinson, Ben Zevenbergen, Blaise Hilary Aguera-Arcas, Chung ching Chang, Claire Cui, Cosmo Du, Daniel De Freitas Adiwardana, Dehao Chen, Dmitry (Dima) Lepikhin, Ed H. Chi, Erin Hoffman-John, Heng-Tze Cheng, Hongrae Lee, Igor Krivokon, James Qin, Jamie Hall, Joe Fenton, Johnny Soraker, Kathy Meier-Hellstern, Kristen Olson, Lora Mois Aroyo, Maarten Paul Bosma, Marc Joseph Pickett, Marcelo Amorim Menegali, Marian Croak, Mark D\u00edaz, Matthew Lamm, Maxim Krikun, Meredith Ringel Morris, Noam Shazeer, Quoc V. Le, Rachel Bernstein, Ravi Rajakumar, Ray Kurzweil, Romal Thoppilan, Steven Zheng, Taylor Bos, Toju Duke, Tulsee Doshi, Vincent Y. Zhao, Vinodkumar Prabhakaran, Will Rusch, YaGuang Li, Yanping Huang, Yanqi Zhou, Yuanzhong Xu, and Zhifeng Chen. 2022. LaMDA: Language models for dialog applications. Retrieved from https:\/\/arXiv:2201.08239"},{"key":"e_1_3_3_15_2","article-title":"FlashAttention: Fast and memory-efficient exact attention with IO-awareness","author":"Dao Tri","year":"2022","unstructured":"Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Retrieved from https:\/\/arXiv:2205.14135","journal-title":"Retrieved from"},{"key":"e_1_3_3_16_2","volume-title":"Advances in Neural Information Processing Systems","author":"Dettmers Tim","year":"2022","unstructured":"Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. LLM.int8(): 8-bit matrix multiplication for transformers at scale. In Advances in Neural Information Processing Systems, Alice H. 
Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (Eds.)."},{"key":"e_1_3_3_17_2","article-title":"Bert: Pre-training of deep bidirectional transformers for language understanding","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. Retrieved from https:\/\/arXiv:1810.04805","journal-title":"Retrieved from"},{"key":"e_1_3_3_18_2","volume-title":"Retrieved from","author":"Driess Danny","year":"2023","unstructured":"Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. 2023. PaLM-E: An embodied multimodal language model. Retrieved from https:\/\/arXiv:2303.03378"},{"key":"e_1_3_3_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3490422.3502368"},{"key":"e_1_3_3_20_2","article-title":"kernl.ai","year":"2022","unstructured":"ELS-RD. 2022. kernl.ai. Retrieved from https:\/\/github.com\/ELS-RD\/kernl","journal-title":"R"},{"key":"e_1_3_3_21_2","unstructured":"Farah Fahim Benjamin Hawks Christian Herwig James Hirschauer Sergo Jindariani Nhan Tran Luca P. Carloni Giuseppe Di Guglielmo Philip Harris Jeffrey Krupa Dylan Rankin Manuel Blanco Valentin Josiah Hester Yingyi Luo John Mamish Seda Orgrenci-Memik Thea Aarrestad Hamza Javed Vladimir Loncar Maurizio Pierini Adrian Alan Pol Sioni Summers Javier Duarte Scott Hauck Shih-Chieh Hsu Jennifer Ngadiuba Mia Liu Duc Hoang Edward Kreinar and Zhenbin Wu. 2021. hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices. 
https:\/\/arxiv.org\/abs\/2103.05579"},{"key":"e_1_3_3_22_2","article-title":"GPTQ: Accurate post-training compression for generative pretrained transformers","author":"Frantar Elias","year":"2022","unstructured":"Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. GPTQ: Accurate post-training compression for generative pretrained transformers. Retrieved from https:\/\/arXiv:2210.17323","journal-title":"Retrieved from"},{"key":"e_1_3_3_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3431920.3439289"},{"key":"e_1_3_3_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO56248.2022.00051"},{"key":"e_1_3_3_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2021.3123465"},{"key":"e_1_3_3_26_2","article-title":"Text Generation Strategies","year":"2023","unstructured":"HuggingFace. 2023. Text Generation Strategies. Retrieved from https:\/\/huggingface.co\/docs\/transformers\/generation_strategies","journal-title":"Retrieved from"},{"key":"e_1_3_3_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/3564606"},{"key":"e_1_3_3_28_2","article-title":"Intel Agilex 7 FPGA and SoC FPGA","year":"2022","unstructured":"Intel. 2022. Intel Agilex 7 FPGA and SoC FPGA. Retrieved from https:\/\/www.intel.com\/content\/www\/us\/en\/products\/details\/fpga\/agilex\/7.html","journal-title":"R"},{"key":"e_1_3_3_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3431920.3439477"},{"key":"e_1_3_3_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM57271.2023.00011"},{"key":"e_1_3_3_31_2","volume-title":"Proceedings of the International Conference on Machine Learning (ICML\u201921)","author":"Kim Sehoon","year":"2021","unstructured":"Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer. 2021. I-BERT: Integer-only BERT quantization. 
In Proceedings of the International Conference on Machine Learning (ICML\u201921)."},{"key":"e_1_3_3_32_2","unstructured":"Sehoon Kim Coleman Hooper Amir Gholami Zhen Dong Xiuyu Li Sheng Shen Michael Mahoney and Kurt Keutzer. 2023. SqueezeLLM: Dense-and-sparse quantization. Retrieved from https:\/\/arxiv.org\/abs\/2306.07629"},{"key":"e_1_3_3_33_2","article-title":"Full stack optimization of transformer inference: A survey","author":"Kim Sehoon","year":"2023","unstructured":"Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, and Amir Gholami. 2023. Full stack optimization of transformer inference: A survey. Retrieved from https:\/\/arXiv:2302.14017","journal-title":"R"},{"key":"e_1_3_3_34_2","article-title":"Reducing activation recomputation in large transformer models","author":"Korthikanti Vijay","year":"2022","unstructured":"Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2022. Reducing activation recomputation in large transformer models. Retrieved from https:\/\/arXiv:2205.05198","journal-title":"Retrieved from"},{"key":"e_1_3_3_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_3_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3289602.3293910"},{"key":"e_1_3_3_37_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Lan Zhenzhong","year":"2020","unstructured":"Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. 
In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_3_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3431920.3439293"},{"key":"e_1_3_3_39_2","article-title":"xFormers: A Modular and Hackable Transformer Modelling Library","author":"Lefaudeux Benjamin","year":"2022","unstructured":"Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, and Daniel Haziza. 2022. xFormers: A Modular and Hackable Transformer Modelling Library. Retrieved from https:\/\/github.com\/facebookresearch\/xformers","journal-title":"R"},{"key":"e_1_3_3_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3370748.3406567"},{"key":"e_1_3_3_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2020.3047371"},{"key":"e_1_3_3_42_2","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415530"},{"key":"e_1_3_3_43_2","doi-asserted-by":"publisher","DOI":"10.1126\/science.abq1158"},{"key":"e_1_3_3_44_2","first-page":"663","volume-title":"Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201923)","author":"Li Zhuohan","year":"2023","unstructured":"Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. In Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201923). USENIX Association, Boston, MA, 663\u2013679. Retrieved from DOI:https:\/\/www.usenix.org\/conference\/osdi23\/presentation\/li-zhouhan"},{"key":"e_1_3_3_45_2","article-title":"AWQ: Activation-aware weight quantization for LLM compression and acceleration","author":"Lin Ji","year":"2023","unstructured":"Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. 2023. 
AWQ: Activation-aware weight quantization for LLM compression and acceleration. Retrieved from https:\/\/arXiv:2306.00978","journal-title":"Retrieved from"},{"key":"e_1_3_3_46_2","article-title":"Roberta: A robustly optimized bert pretraining approach","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. Retrieved from https:\/\/arXiv:1907.11692","journal-title":"Retrieved from"},{"key":"e_1_3_3_47_2","doi-asserted-by":"publisher","DOI":"10.23919\/DATE51398.2021.9474043"},{"key":"e_1_3_3_48_2","article-title":"Fully Sharded Data Parallel: Faster AI Training with Fewer GPUs","year":"2021","unstructured":"Meta. 2021. Fully Sharded Data Parallel: Faster AI Training with Fewer GPUs. Retrieved from https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/","journal-title":"R"},{"key":"e_1_3_3_49_2","doi-asserted-by":"publisher","DOI":"10.14778\/3570690.3570697"},{"key":"e_1_3_3_50_2","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_3_3_51_2","volume-title":"Proceedings of the 38th International Conference on Machine Learning","author":"Narayanan Deepak","year":"2021","unstructured":"Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. 2021. Memory-efficient pipeline-parallel DNN training. In Proceedings of the 38th International Conference on Machine Learning."},{"key":"e_1_3_3_52_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476209"},{"key":"e_1_3_3_53_2","volume-title":"Proceedings of the 11th International Conference on Learning Representations","author":"Nijkamp Erik","year":"2023","unstructured":"Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2023. CodeGen: An open large language model for code with multi-turn program synthesis. 
In Proceedings of the 11th International Conference on Learning Representations."},{"key":"e_1_3_3_54_2","article-title":"FasterTransformer","year":"2022","unstructured":"Nvidia. 2022. FasterTransformer. Retrieved from https:\/\/github.com\/NVIDIA\/FasterTransformer","journal-title":"R"},{"key":"e_1_3_3_55_2","article-title":"GPT-4 technical report","year":"2023","unstructured":"OpenAI. 2023. GPT-4 technical report. Retrieved from https:\/\/arXiv:2303.08774","journal-title":"Retrieved from"},{"key":"e_1_3_3_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/3489517.3530681"},{"key":"e_1_3_3_57_2","article-title":"The LAMBADA dataset: Word prediction requiring a broad discourse context","author":"Paperno Denis","year":"2016","unstructured":"Denis Paperno, Germ\u00e1n Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and R. Fern\u00e1ndez. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. Retrieved from https:\/\/arXiv:1606.06031","journal-title":"Retrieved from"},{"key":"e_1_3_3_58_2","volume-title":"Proceedings of the 33rd International Conference on Neural Information Processing Systems","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas K\u00f6pf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. 
In Proceedings of the 33rd International Conference on Neural Information Processing Systems."},{"key":"e_1_3_3_59_2","doi-asserted-by":"publisher","DOI":"10.1145\/3489517.3530585"},{"key":"e_1_3_3_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISQED51717.2021.9424344"},{"key":"e_1_3_3_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICFPT51103.2020.00016"},{"key":"e_1_3_3_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL57034.2022.00015"},{"key":"e_1_3_3_63_2","volume-title":"Proceedings of Machine Learning and Systems","author":"Pope Reiner","year":"2023","unstructured":"Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently scaling transformer inference. In Proceedings of Machine Learning and Systems."},{"key":"e_1_3_3_64_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD51958.2021.9643586"},{"key":"e_1_3_3_65_2","doi-asserted-by":"publisher","DOI":"10.1145\/3453688.3461739"},{"issue":"8","key":"e_1_3_3_66_2","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.","journal-title":"OpenAI Blog"},{"key":"e_1_3_3_67_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00024"},{"key":"e_1_3_3_68_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6409"},{"key":"e_1_3_3_69_2","article-title":"Megatron-LM: Training multi-billion parameter language models using model parallelism","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. 
Retrieved from https:\/\/arXiv:1909.08053","journal-title":"Retrieved from"},{"key":"e_1_3_3_70_2","doi-asserted-by":"publisher","DOI":"10.1145\/3489517.3530618"},{"key":"e_1_3_3_71_2","article-title":"LLaMA: Open and efficient foundation language models","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models. Retrieved from https:\/\/arXiv:2302.13971.","journal-title":"Retrieved from"},{"key":"e_1_3_3_72_2","article-title":"Llama 2: Open foundation and fine-tuned chat models","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. 
Retrieved from https:\/\/arXiv:2307.09288","journal-title":"Retrieved from"},{"key":"e_1_3_3_73_2","doi-asserted-by":"publisher","DOI":"10.1145\/3020078.3021744"},{"key":"e_1_3_3_74_2","first-page":"267","volume-title":"Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201922)","author":"Unger Colin","year":"2022","unstructured":"Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Pat McCormick, Jamaludin Mohd-Yusof, Xi Luo, Dheevatsa Mudigere, Jongsoo Park, Misha Smelyanskiy, and Alex Aiken. 2022. Unity: Accelerating DNN training through joint optimization of algebraic transformations and parallelization. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201922). USENIX Association, Carlsbad, CA, 267\u2013284. Retrieved from DOI:https:\/\/www.usenix.org\/conference\/osdi22\/presentation\/unger"},{"key":"e_1_3_3_75_2","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems."},{"key":"e_1_3_3_76_2","doi-asserted-by":"publisher","DOI":"10.1145\/3431920.3439292"},{"key":"e_1_3_3_77_2","doi-asserted-by":"publisher","DOI":"10.1145\/3567955.3567959"},{"key":"e_1_3_3_78_2","article-title":"Emergent abilities of large language models","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent abilities of large language models. Trans. Mach. Learn. Res. (2022). 
Retrieved from DOI:https:\/\/openreview.net\/forum?id=yzkSU5zdwD","journal-title":"Trans. Mach. Learn. Res."},{"key":"e_1_3_3_79_2","first-page":"24824","volume-title":"Advances in Neural Information Processing Systems","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, 24824\u201324837."},{"key":"e_1_3_3_80_2","article-title":"Huggingface\u2019s transformers: State-of-the-art natural language processing","author":"Wolf Thomas","year":"2019","unstructured":"Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, R\u00e9mi Louf, Morgan Funtowicz et\u00a0al. 2019. Huggingface\u2019s transformers: State-of-the-art natural language processing. Retrieved from https:\/\/arXiv:1910.03771","journal-title":"Retrieved from"},{"key":"e_1_3_3_81_2","doi-asserted-by":"publisher","DOI":"10.1145\/3490422.3502369"},{"key":"e_1_3_3_82_2","volume-title":"Proceedings of the 40th International Conference on Machine Learning","author":"Xiao Guangxuan","year":"2023","unstructured":"Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. Smoothquant: Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning."},{"key":"e_1_3_3_83_2","first-page":"548","volume-title":"Proceedings of Machine Learning and Systems","volume":"4","author":"Xie Ningning","year":"2022","unstructured":"Ningning Xie, Tamara Norman, Dominik Grewe, and Dimitrios Vytiniotis. 2022. Synthesizing optimal parallelism placement and reduction strategies on hierarchical systems for deep learning. 
In Proceedings of Machine Learning and Systems, D. Marculescu, Y. Chi, and C. Wu (Eds.), Vol. 4. 548\u2013566. Retrieved from DOI:https:\/\/proceedings.mlsys.org\/paper_files\/paper\/2022\/file\/f0f9e98bc2e2f0abc3e315eaa0d808fc-Paper.pdf"},{"key":"e_1_3_3_84_2","article-title":"Alveo U280 Data Center Accelerator Card","author":"Xilinx AMD","year":"2021","unstructured":"AMD Xilinx. 2021. Alveo U280 Data Center Accelerator Card. Retrieved from https:\/\/www.xilinx.com\/products\/boards-and-kits\/alveo\/u280.html#specifications","journal-title":"R"},{"key":"e_1_3_3_85_2","volume-title":"AI Engines and Their Applications","author":"Xilinx AMD","year":"2022","unstructured":"AMD Xilinx. 2022. AI Engines and Their Applications. White Paper. AMD Xilinx."},{"key":"e_1_3_3_86_2","unstructured":"AMD Xilinx. 2022. QSFP Module Connector. Retrieved from DOI:https:\/\/docs.xilinx.com\/r\/en-US\/ug1411-vmk180-eval-bd\/QSFP-Module-Connector"},{"key":"e_1_3_3_87_2","article-title":"VCK5000 Versal Development Card","author":"Xilinx AMD","year":"2022","unstructured":"AMD Xilinx. 2022. VCK5000 Versal Development Card. Retrieved from https:\/\/www.xilinx.com\/products\/boards-and-kits\/vck5000.html#specs","journal-title":"R"},{"key":"e_1_3_3_88_2","article-title":"Vitis Accelerated Libraries","author":"Xilinx AMD","year":"2022","unstructured":"AMD Xilinx. 2022. Vitis Accelerated Libraries. Retrieved from https:\/\/github.com\/Xilinx\/Vitis_Libraries","journal-title":"R"},{"key":"e_1_3_3_89_2","article-title":"Vitis AI: Adaptable & Real-Time AI Inference Acceleration","author":"Xilinx AMD","year":"2022","unstructured":"AMD Xilinx. 2022. Vitis AI: Adaptable & Real-Time AI Inference Acceleration. Retrieved from https:\/\/github.com\/Xilinx\/Vitis-AI","journal-title":"R"},{"key":"e_1_3_3_90_2","article-title":"Vitis HLS v2022.1","author":"Xilinx AMD","year":"2022","unstructured":"AMD Xilinx. 2022. Vitis HLS v2022.1. 
Retrieved from https:\/\/www.xilinx.com\/products\/design-tools\/vitis\/vitis-platform.html","journal-title":"R"},{"key":"e_1_3_3_91_2","article-title":"Versal VHK158","author":"Xilinx AMD","year":"2023","unstructured":"AMD Xilinx. 2023. Versal VHK158. Retrieved from https:\/\/www.xilinx.com\/products\/boards-and-kits\/vhk158.html","journal-title":"R"},{"key":"e_1_3_3_92_2","volume-title":"Proceedings of the 37th International Conference on Machine Learning (ICML\u201920)","author":"Xiong Ruibin","year":"2020","unstructured":"Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. 2020. On layer normalization in the transformer architecture. In Proceedings of the 37th International Conference on Machine Learning (ICML\u201920). JMLR.org, Article 975, 10 pages."},{"key":"e_1_3_3_93_2","volume-title":"Proceedings of Machine Learning and Systems","author":"Yang Bowen","year":"2021","unstructured":"Bowen Yang, Jian Zhang, Jonathan Li, Christopher Re, Christopher Aberger, and Christopher De Sa. 2021. PipeMare: Asynchronous pipeline parallel DNN training. In Proceedings of Machine Learning and Systems."},{"key":"e_1_3_3_94_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD57390.2023.10323754"},{"key":"e_1_3_3_95_2","first-page":"27168","article-title":"Zeroquant: Efficient and affordable post-training quantization for large-scale transformers","volume":"35","author":"Yao Zhewei","year":"2022","unstructured":"Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. 2022. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. Adv. Neural Info. Process. Syst. 35 (2022), 27168\u201327183.","journal-title":"Adv. Neural Info. Process. 
Syst."},{"key":"e_1_3_3_96_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA53966.2022.00060"},{"key":"e_1_3_3_97_2","article-title":"RPTQ: Reorder-based post-training quantization for large language models","author":"Yuan Zhihang","year":"2023","unstructured":"Zhihang Yuan, Lin Niu, Jiawei Liu, Wenyu Liu, Xinggang Wang, Yuzhang Shang, Guangyu Sun, Qiang Wu, Jiaxiang Wu, and Bingzhe Wu. 2023. RPTQ: Reorder-based post-training quantization for large language models. Retrieved from https:\/\/arXiv:2304.01089","journal-title":"Retrieved from"},{"key":"e_1_3_3_98_2","doi-asserted-by":"publisher","DOI":"10.1145\/2684746.2689060"},{"key":"e_1_3_3_99_2","doi-asserted-by":"publisher","DOI":"10.1145\/3240765.3240801"},{"key":"e_1_3_3_100_2","doi-asserted-by":"publisher","DOI":"10.1145\/3477002"},{"key":"e_1_3_3_101_2","doi-asserted-by":"publisher","DOI":"10.1145\/3400302.3415609"},{"key":"e_1_3_3_102_2","article-title":"Binarized neural machine translation","author":"Zhang Yichi","year":"2023","unstructured":"Yichi Zhang, Ankush Garg, Yuan Cao, \u0141ukasz Lew, Behrooz Ghorbani, Zhiru Zhang, and Orhan Firat. 2023. Binarized neural machine translation. Retrieved from https:\/\/arXiv:2302.04907","journal-title":"Retrieved from"},{"key":"e_1_3_3_103_2","article-title":"FracBNN: Accurate and FPGA-efficient binary neural networks with fractional activations","author":"Zhang Yichi","year":"2021","unstructured":"Yichi Zhang, Junhao Pan, Xinheng Liu, Hongzheng Chen, Deming Chen, and Zhiru Zhang. 2021. FracBNN: Accurate and FPGA-efficient binary neural networks with fractional activations. 
In Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays.","journal-title":"Proceedings of the ACM\/SIGDA International Symposium on Field-Programmable Gate Arrays"},{"key":"e_1_3_3_104_2","article-title":"Atom: Low-bit quantization for efficient and accurate LLM serving","author":"Zhao Yilong","year":"2023","unstructured":"Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. 2023. Atom: Low-bit quantization for efficient and accurate LLM serving. Retrieved from https:\/\/arXiv:2310.19102","journal-title":"Retrieved from"},{"key":"e_1_3_3_105_2","unstructured":"Lianmin Zheng Wei-Lin Chiang Ying Sheng Siyuan Zhuang Zhanghao Wu Yonghao Zhuang Zi Lin Zhuohan Li Dacheng Li Eric. P Xing Hao Zhang Joseph E. Gonzalez and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Retrieved from https:\/\/arxiv:2306.05685"},{"key":"e_1_3_3_106_2","first-page":"559","volume-title":"Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201922)","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, and Ion Stoica. 2022. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201922). USENIX Association, 559\u2013578. 
Retrieved from DOI:https:\/\/www.usenix.org\/conference\/osdi22\/presentation\/zheng-lianmin"}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3656177","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3656177","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:09:33Z","timestamp":1750295373000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3656177"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,12,17]]},"references-count":105,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,3,31]]}},"alternative-id":["10.1145\/3656177"],"URL":"https:\/\/doi.org\/10.1145\/3656177","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"value":"1936-7406","type":"print"},{"value":"1936-7414","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,12,17]]},"assertion":[{"value":"2023-12-21","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-03-19","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-12-17","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}