{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,12]],"date-time":"2026-03-12T02:10:08Z","timestamp":1773281408613,"version":"3.50.1"},"reference-count":78,"publisher":"Association for Computing Machinery (ACM)","issue":"1","funder":[{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["2023YFB4502200"],"award-info":[{"award-number":["2023YFB4502200"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62325405, 62104128, 62203257, 62031017, 62406159, U21B2031"],"award-info":[{"award-number":["62325405, 62104128, 62203257, 62031017, 62406159, U21B2031"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Tsinghua University Initiative Scientific Research Program, Tsinghua EE Xilinx AI Research Fund, Tsinghua-Meituan Joint Institute for Digital Life"},{"name":"EAI Computation and Perception, Beijing National Research Center for Information Science, Technology","award":["BNR2024TD03001"],"award-info":[{"award-number":["BNR2024TD03001"]}]},{"name":"Beijing Innovation Center for Future Chips, and State Key laboratory of Space Network and Communications"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2026,3,31]]},"abstract":"<jats:p>\n                    Large Language Models (LLMs) with 70 billion or more parameters are increasingly being deployed in cloud-based Model-as-a-Service (MaaS) scenarios. To meet the demands of such deployments, MaaS providers require batched LLM decoding systems that can deliver high System Throughput (STP) while minimizing Total Cost of Ownership (TCO). 
However, existing FPGA-based solutions predominantly focus on small-batch or single-batch inference, which fails to meet the computational requirements of batched LLM decoding, resulting in performance gaps of up to 7.96\n                    <jats:inline-formula content-type=\"math\/tex\">\n                      <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n                    <\/jats:inline-formula>\n                    . Moreover, the low utilization of multi-head attention operations in batched decoding scenarios, e.g., only 3.72% on A100 GPUs, further constrains throughput and inflates TCO.\n                  <\/jats:p>\n                  <jats:p>\n                    To address these challenges, this article introduces\n                    <jats:italic toggle=\"yes\">CD-LLM<\/jats:italic>\n                    , a heterogeneous multi-FPGA system designed for efficient batched decoding of LLMs with 70B+ parameters, built upon a\n                    <jats:italic toggle=\"yes\">C<\/jats:italic>\n                    ompute-\n                    <jats:italic toggle=\"yes\">D<\/jats:italic>\n                    edicated architecture. First, we propose a memory-aligned mixed-precision quantization engine to reduce workload. By employing importance-aware quantization, we compress Llama-3.1-70B to an effective 3.45-bit representation and achieve 72.33% bandwidth utilization through memory-aligned data packing. Second, we present a compute-dedicated FPGA architecture that maximizes peak performance by leveraging FPGA-specific resources such as DSPs, BRAMs, and LUTs. The compute-dedicated architecture enables CD-LLM to reach a peak performance of 59.90 TOPS at 600\u2009MHz on U250 FPGA. At last, we introduce a heterogeneous master-slave multi-FPGA system to achieve higher utilization. 
By pipelining attention and linear layer computations across master and slave FPGAs, CD-LLM achieves utilization rates of 83.08% for linear layers and 68.30% for attention layers.\n                  <\/jats:p>\n                  <jats:p>\n                    CD-LLM is designed with a heterogeneous multi-FPGA architecture, with an HBM-enabled FPGA as the master accelerator and eight DDR-based FPGAs as slave accelerators. When deployed for inference on the Llama-3.1-70B model with a batch size of 256, CD-LLM achieves a throughput of 2,721.79 tokens\/s. This represents a 6.11\n                    <jats:inline-formula content-type=\"math\/tex\">\n                      <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n                    <\/jats:inline-formula>\n                    improvement in STP and a 4.71\n                    <jats:inline-formula content-type=\"math\/tex\">\n                      <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n                    <\/jats:inline-formula>\n                    reduction in TCO compared to an eight-card RTX3090 GPU system. 
Furthermore, CD-LLM substantially outperforms the state-of-the-art eight-card FPGA accelerator FlightLLM, delivering 16.15\n                    <jats:inline-formula content-type=\"math\/tex\">\n                      <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n                    <\/jats:inline-formula>\n                    higher STP and 14.56\n                    <jats:inline-formula content-type=\"math\/tex\">\n                      <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n                    <\/jats:inline-formula>\n                    lower TCO.\n                  <\/jats:p>","DOI":"10.1145\/3771288","type":"journal-article","created":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T16:07:14Z","timestamp":1761149234000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["CD-LLM: A Heterogeneous Multi-FPGA System for Batched Decoding of 70B+ LLMs Using a Compute-Dedicated Architecture"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2349-7286","authenticated-orcid":false,"given":"Wenheng","family":"Ma","sequence":"first","affiliation":[{"name":"Tsinghua University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-9739-2930","authenticated-orcid":false,"given":"Xinhao","family":"Yang","sequence":"additional","affiliation":[{"name":"Electronic Engineering, Tsinghua University, Beijing, China and Infinigence-AI, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1030-3748","authenticated-orcid":false,"given":"Shulin","family":"Zeng","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-8943-1577","authenticated-orcid":false,"given":"Tengxuan","family":"Liu","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China and Infinigence-AI, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-1717-2093","authenticated-orcid":false,"given":"Libo","family":"Shen","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong, Hong Kong, Hong Kong"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-7095-7963","authenticated-orcid":false,"given":"Hongyi","family":"Wang","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-4755-6881","authenticated-orcid":false,"given":"Shiyao","family":"Li","sequence":"additional","affiliation":[{"name":"Electronic Engineering, Tsinghua University, Beijing, China and Infinigence-AI, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5768-6037","authenticated-orcid":false,"given":"Ke","family":"Hong","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China and Infinigence-AI, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-9259-7180","authenticated-orcid":false,"given":"Zhenhua","family":"Zhu","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2209-8312","authenticated-orcid":false,"given":"Xuefei","family":"Ning","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7348-5625","authenticated-orcid":false,"given":"Tsung-Yi","family":"Ho","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong, Hong Kong, Hong Kong"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0849-3252","authenticated-orcid":false,"given":"Guohao","family":"Dai","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China and Infinigence-AI, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6108-5157","authenticated-orcid":false,"given":"Yu","family":"Wang","sequence":"additional","affiliation":[{"name":"Electronic Engineering, Tsinghua University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2026,3,6]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"Siyuan Chen Mengyue Wu Kenny Q. Zhu Kunyao Lan Zhiling Zhang and Lyuchun Cui. 2023. Llm-empowered chatbots for psychiatrist and patient simulation: Application and evaluation. arXiv:2305.13614. Retrieved from https:\/\/arxiv.org\/abs\/2305.13614"},{"key":"e_1_3_2_3_2","unstructured":"Chaozheng Wang Junhao Hu Cuiyun Gao Yu Jin Tao Xie Hailiang Huang Zhenyu Lei and Yuetang Deng. 2023. Practitioners\u2019 expectations on code completion. arXiv:2301.03846. Retrieved from https:\/\/arxiv.org\/abs\/2301.03846"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41591-023-02448-8"},{"key":"e_1_3_2_5_2","unstructured":"Jared Kaplan Sam McCandlish Tom Henighan Tom B. Brown Benjamin Chess Rewon Child Scott Gray Alec Radford Jeffrey Wu and Dario Amodei. 2020. Scaling laws for neural language models. arXiv:2001.08361. 
Retrieved from https:\/\/arxiv.org\/abs\/2001.08361"},{"key":"e_1_3_2_6_2","unstructured":"Wayne Xin Zhao Kun Zhou Junyi Li Tianyi Tang Xiaolei Wang Yupeng Hou Yingqian Min Beichen Zhang Junjie Zhang Zican Dong et al. 2023. A survey of large language models. arXiv:2303.18223. Retrieved from https:\/\/arxiv.org\/abs\/2303.18223"},{"key":"e_1_3_2_7_2","unstructured":"Shervin Minaee Tomas Mikolov Narjes Nikzad Meysam Chenaghlu Richard Socher Xavier Amatriain and Jianfeng Gao. 2024. Large language models: A survey. arXiv:2402.06196. Retrieved from https:\/\/arxiv.org\/abs\/2402.06196"},{"key":"e_1_3_2_8_2","unstructured":"Together.AI. 2024. Together.ai products: Inference. Retrieved from https:\/\/www.together.ai\/"},{"key":"e_1_3_2_9_2","unstructured":"Amazon Web Services. 2024. Amazon bedrock: Foundation models at scale. Retrieved from https:\/\/aws.amazon.com\/cn\/bedrock\/"},{"key":"e_1_3_2_10_2","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Amy Yang Angela Fan et al. 2024. The llama 3 herd of models. arXiv:2407.21783. Retrieved from https:\/\/arxiv.org\/abs\/2407.21783"},{"key":"e_1_3_2_11_2","unstructured":"xAI. 2024. Announcing grok-1.5. Retrieved from https:\/\/ai-x.chat\/models\/grok1-5\/"},{"key":"e_1_3_2_12_2","unstructured":"An Yang Baosong Yang Binyuan Hui Bo Zheng Bowen Yu Chang Zhou Chengpeng Li Chengyuan Li Dayiheng Liu Fei Huang et al. 2024. Qwen2 technical report. arXiv:2407.10671. Retrieved from https:\/\/arxiv.org\/abs\/2407.10671"},{"key":"e_1_3_2_13_2","unstructured":"Xuefei Ning Zinan Lin Zixuan Zhou Zifu Wang Huazhong Yang and Yu Wang. 2023. Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation. arXiv:2307.15337. 
Retrieved from https:\/\/arxiv.org\/abs\/2307.15337"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_2_15_2","unstructured":"Amey Agrawal Ashish Panwar Jayashree Mohan Nipun Kwatra Bhargav S. Gulavani and Ramachandran Ramjee. 2023. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. arXiv:2308.16369. Retrieved from https:\/\/arxiv.org\/abs\/2308.16369"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3672198.3673797"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/3020078.3021745"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3490422.3502357"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3489517.3530420"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/3490422.3502368"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM57271.2023.00023"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3626202.3637557"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL57034.2022.00027"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA56546.2023.10071047"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3626202.3637569"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3687480"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO56248.2022.00051"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3626202.3637562"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3656177"},{"key":"e_1_3_2_30_2","unstructured":"Jinhao Li Jiaming Xu Shan Huang Yonghua Chen Wen Li Jun Liu Yaoxiu Lian Jiayi Pan Li Ding Hao Zhou and Guohao Dai. 2024. Large language model inference acceleration: A comprehensive hardware perspective. arXiv:2410.04466. 
Retrieved from https:\/\/arxiv.org\/abs\/2410.04466"},{"key":"e_1_3_2_31_2","unstructured":"Ji Lin Jiaming Tang Haotian Tang Shang Yang Wei-Ming Chen Wei-Chen Wang Guangxuan Xiao Xingyu Dang Chuang Gan and Song Han. 2024. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems 6 (2024) 87\u2013100."},{"key":"e_1_3_2_32_2","unstructured":"AMD. 2021. Alveo u280 data center accelerator card data sheet. Retrieved from https:\/\/docs.amd.com\/r\/en-US\/ds963-u280"},{"key":"e_1_3_2_33_2","unstructured":"Susan Zhang Stephen Roller Naman Goyal Mikel Artetxe Moya Chen Shuohui Chen Christopher Dewan Mona Diab Xian Li Xi Victoria Lin et al. 2022. Opt: Open pre-trained transformer language models. arXiv:2205.01068. Retrieved from https:\/\/arxiv.org\/abs\/2205.01068"},{"key":"e_1_3_2_34_2","article-title":"Attention is all you need","volume":"30","author":"Vaswani A.","year":"2017","unstructured":"A. Vaswani. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_35_2","unstructured":"Yinmin Zhong Shengyu Liu Junda Chen Jianbo Hu Yibo Zhu Xuanzhe Liu Xin Jin and Hao Zhang. 2024. Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving. arXiv:2401.09670. Retrieved from https:\/\/arxiv.org\/abs\/2401.09670"},{"key":"e_1_3_2_36_2","doi-asserted-by":"crossref","unstructured":"Ruoyu Qin Zheming Li Weiran He Mingxing Zhang Yongwei Wu Weimin Zheng and Xinran Xu. 2024. Mooncake: Kimi\u2019s kvcache-centric architecture for llm serving. arXiv:2407.00079. 
Retrieved from https:\/\/arxiv.org\/abs\/2407.00079","DOI":"10.1145\/3773772"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA59077.2024.00019"},{"key":"e_1_3_2_38_2","unstructured":"Aaron Grattafiori Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Alex Vaughan et al. 2024. The llama 3 herd of models. arXiv:2407.21783. Retrieved from https:\/\/arxiv.org\/abs\/2407.21783"},{"key":"e_1_3_2_39_2","unstructured":"Nvidia. 2024. Nvidia h200 tensor core gpu data sheet. Retrieved from https:\/\/nvdam.widen.net\/s\/nb5zzzsjdf\/hpc-datasheet-sc23-h200-datasheet-3002446"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_2_41_2","first-page":"38087","volume-title":"Proceedings of the 40th International Conference on Machine Learning","author":"Xiao Guangxuan","year":"2023","unstructured":"Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning. PMLR, 38087\u201338099."},{"key":"e_1_3_2_42_2","unstructured":"Changhun Lee Jungyu Jin Taesu Kim Hyungjun Kim and Eunhyeok Park. 2023. Owq: Lessons learned from activation outliers for weight quantization in large language models. arXiv:2306.02272. Retrieved from https:\/\/arxiv.org\/abs\/2306.02272"},{"key":"e_1_3_2_43_2","first-page":"196","volume-title":"Proceedings of the 6th Machine Learning and Systems","author":"Zhao Yilong","year":"2024","unstructured":"Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. 2024. Atom: Low-bit quantization for efficient and accurate llm serving. 
In Proceedings of the 6th Machine Learning and Systems, 196\u2013209."},{"key":"e_1_3_2_44_2","unstructured":"Elias Frantar Saleh Ashkboos Torsten Hoefler and Dan Alistarh. 2023. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv:2210.17323. Retrieved from https:\/\/arxiv.org\/abs\/2210.17323"},{"key":"e_1_3_2_45_2","doi-asserted-by":"crossref","unstructured":"Tim Dettmers Mike Lewis Younes Belkada and Luke Zettlemoyer. 2022. Gpt3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems 35 (2022) 30318\u201330332.","DOI":"10.52202\/068431-2198"},{"key":"e_1_3_2_46_2","first-page":"27168","volume-title":"Advances in Neural Information Processing Systems","author":"Yao Zhewei","year":"2022","unstructured":"Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. 2022. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 27168\u201327183."},{"key":"e_1_3_2_47_2","unstructured":"Yujun Lin Haotian Tang Shang Yang Zhekai Zhang Guangxuan Xiao Chuang Gan and Song Han. 2024. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving. arXiv:2405.04532. Retrieved from https:\/\/arxiv.org\/abs\/2405.04532"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2024.3524356"},{"key":"e_1_3_2_49_2","unstructured":"Iz Beltagy Matthew E. Peters and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150. Retrieved from https:\/\/arxiv.org\/abs\/2004.05150"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1145\/3572848.3577500"},{"key":"e_1_3_2_51_2","unstructured":"Rewon Child Scott Gray Alec Radford and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv:1904.10509. 
Retrieved from https:\/\/arxiv.org\/abs\/1904.10509"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00018"},{"key":"e_1_3_2_53_2","unstructured":"Manzil Zaheer Guru Guruganesh Avinava Dubey Joshua Ainslie Chris Alberti Santiago Ontanon Philip Pham Anirudh Ravula Qifan Wang Li Yang and Amr Ahmed. 2020. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems 33 (2020) 17283\u201317297."},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1145\/3579371.3589057"},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1145\/3466752.3480125"},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/3370748.3406567"},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/3575693.3575747"},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00060"},{"key":"e_1_3_2_59_2","first-page":"328","volume-title":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","author":"Ham Tae Jun","year":"2020","unstructured":"Tae Jun Ham, Sung Jun Jung, Seonghak Kim, Young H. Oh, Yeonhong Park, Yoonho Song, Jung-Hun Park, Sanghee Lee, Kyoung Park, Jae W. Lee, and Deog-Kyoon Jeong. 2020. A3: Accelerating attention mechanisms in neural networks with approximation. 
In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 328\u2013341."},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO56248.2022.00050"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640383"},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/3543622.3573182"},{"key":"e_1_3_2_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52734.2025.00394"},{"key":"e_1_3_2_64_2","doi-asserted-by":"publisher","DOI":"10.1145\/3656401"},{"key":"e_1_3_2_65_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM57271.2023.00015"},{"key":"e_1_3_2_66_2","unstructured":"AMD. 2024. Aurora 64b\/66b v13.0 logicore ip product guide. Retrieved from https:\/\/docs.amd.com\/r\/en-US\/pg074-aurora-64b66b"},{"key":"e_1_3_2_67_2","unstructured":"Karl Cobbe Vineet Kosaraju Mohammad Bavarian Mark Chen Heewoo Jun Lukasz Kaiser Matthias Plappert Jerry Tworek Jacob Hilton Reiichiro Nakano et al. 2021. Training verifiers to solve math word problems. arXiv:2110.14168. Retrieved from https:\/\/arxiv.org\/abs\/2110.14168"},{"key":"e_1_3_2_68_2","unstructured":"Mark Chen Jerry Tworek Heewoo Jun Qiming Yuan Henrique Ponde De Oliveira Pinto Jared Kaplan Harri Edwards Yuri Burda Nicholas Joseph Greg Brockman et al. 2021. Evaluating large language models trained on code. arXiv:2107.03374. Retrieved from https:\/\/arxiv.org\/abs\/2107.03374"},{"key":"e_1_3_2_69_2","unstructured":"HyperAccel. 2023. Hyper-accelerated hardware solutions for emerging ai applications. Retrieved from https:\/\/hyperaccel.ai\/"},{"key":"e_1_3_2_70_2","unstructured":"NVIDIA. 2023. A tensorrt toolbox for optimized large language model inference. Retrieved from https:\/\/github.com\/NVIDIA\/TensorRT-LLM"},{"key":"e_1_3_2_71_2","unstructured":"Zeng Shulin Yang Xinhao Liu Jun Li Jingtao and Dai Yadong. 2024. Artifact evaluation for fpga 2024 paper #57. 
Retrieved from https:\/\/zenodo.org\/records\/10462167"},{"key":"e_1_3_2_72_2","unstructured":"AMD HyperAccel. 2024. Hyperaccel taps amd accelerator card and fpgas for new ai inference server. Retrieved from https:\/\/www.amd.com\/content\/dam\/amd\/en\/documents\/resources\/case-studies\/hyperaccel-case-study.pdf"},{"key":"e_1_3_2_73_2","unstructured":"ShareGPT Teams. 2023. Sharegpt. Retrieved from https:\/\/sharegpt.com\/"},{"key":"e_1_3_2_74_2","unstructured":"Yushi Bai Xin Lv Jiajie Zhang Hongchang Lyu Jiankai Tang Zhidian Huang Zhengxiao Du Xiao Liu Aohan Zeng Lei Hou et al. 2023. Longbench: A bilingual multitask benchmark for long context understanding. arXiv:2308.14508. Retrieved from https:\/\/arxiv.org\/abs\/2308.14508"},{"key":"e_1_3_2_75_2","unstructured":"Lenovo Press. 2025. On-premise vs. cloud: Generative ai total cost of ownership. Technical Report Lenovo Press."},{"key":"e_1_3_2_76_2","unstructured":"U.S. Energy Information Administration. 2025. Electric power monthly. Retrieved from https:\/\/www.eia.gov\/electricity\/monthly\/"},{"key":"e_1_3_2_77_2","unstructured":"Vu13p fpga price. 2025. Retrieved from https:\/\/www.win-source.net\/products\/detail\/xilinx-inc\/xilinx-inc.-xcvu13p-1fhgb2104e.html"},{"key":"e_1_3_2_78_2","unstructured":"Daya Guo Dejian Yang Haowei Zhang Junxiao Song Ruoyu Zhang Runxin Xu Qihao Zhu Shirong Ma Peiyi Wang Xiao Bi et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv:2501.12948. Retrieved from https:\/\/arxiv.org\/abs\/2501.12948"},{"key":"e_1_3_2_79_2","unstructured":"AMD. 2025. Versal ai edge series gen 2. 
Retrieved from https:\/\/www.amd.com\/en\/products\/adaptive-socs-and-fpgas\/versal\/gen2\/ai-edge-series.html"}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3771288","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T11:16:51Z","timestamp":1773227811000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3771288"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,6]]},"references-count":78,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,3,31]]}},"alternative-id":["10.1145\/3771288"],"URL":"https:\/\/doi.org\/10.1145\/3771288","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"value":"1936-7406","type":"print"},{"value":"1936-7414","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3,6]]},"assertion":[{"value":"2025-06-30","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-22","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-03-06","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}