{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,15]],"date-time":"2026-03-15T11:24:22Z","timestamp":1773573862714,"version":"3.50.1"},"reference-count":43,"publisher":"Association for Computing Machinery (ACM)","issue":"4","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2025,12,31]]},"abstract":"<jats:p>The rapid advancement of generative AI and exponential growth of model parameters have driven the pursuit of alternative arithmetic formats to enhance efficiency while maintaining inference accuracy. While earlier efforts to implement small floating point formats (minifloats) have been somewhat ad-hoc, the Microscaling MX formats proposed by the Open Compute Project offer a standard around which hardware designers can converge. This article explores the design space of systolic array architectures for Microscaling MX formats on FPGAs. It provides a detailed analysis of the area and timing characteristics of systolic arrays implemented on AMD UltraScale+ FPGAs for 6-bit and 8-bit MX formats, exploring different accumulation and pipelining strategies while maintaining high computational throughput. 
Our most optimized design reaches a peak throughput of 568 GOPS using 4-stage and 3-stage pipelined Exact accumulation for MXFP6 E3M2 and E2M3, respectively.<\/jats:p>","DOI":"10.1145\/3773041","type":"journal-article","created":{"date-parts":[[2025,10,24]],"date-time":"2025-10-24T14:01:55Z","timestamp":1761314515000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Exploring Microscaling MX Minifloat Systolic Arrays on\u00a0FPGAs"],"prefix":"10.1145","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-5852-067X","authenticated-orcid":false,"given":"Abdurauf","family":"Abdurakhmanov","sequence":"first","affiliation":[{"name":"King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0568-5048","authenticated-orcid":false,"given":"Suhaib A.","family":"Fahmy","sequence":"additional","affiliation":[{"name":"King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia"}]}],"member":"320","published-online":{"date-parts":[[2025,12,4]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"GitHub Inc. 2025. MX Systolic Array FPGA Repository. Retrieved from https:\/\/github.com\/accl-kaust\/mx-systolic-fpga"},{"key":"e_1_3_1_3_2","doi-asserted-by":"crossref","unstructured":"Shivam Aggarwal Hans Jakob Damsgaard Alessandro Pappalardo Giuseppe Franco Thomas B. Preu\u00dfer Michaela Blott and Tulika Mitra. 2024. Shedding the bits: Pushing the boundaries of quantization with minifloats on FPGAs. arXiv:2311.12359. Retrieved from https:\/\/arxiv.org\/abs\/2311.12359","DOI":"10.1109\/FPL64840.2024.00048"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.7873\/DATE.2013.053"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","unstructured":"Hongzheng Chen Jiahao Zhang Yixiao Du Shaojie Xiang Zichao Yue Niansong Zhang Yaohui Cai and Zhiru Zhang. 2025. 
Understanding the potential of FPGA-based spatial acceleration for large language model inference. ACM Transactions on Reconfigurable Technology and Systems 18 1 (2025) 1\u201329. DOI: 10.1145\/3656177","DOI":"10.1145\/3656177"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","unstructured":"Mannhee Cho and Youngmin Kim. 2021. FPGA-based convolutional neural network accelerator with resource-optimized approximate multiply-accumulate unit. Electronics 10 22 (2021) 2859. DOI: 10.3390\/electronics10222859","DOI":"10.3390\/electronics10222859"},{"key":"e_1_3_1_7_2","doi-asserted-by":"crossref","unstructured":"Stef Cuyckens Xiaoling Yi Nitish Satya Murthy Chao Fang and Marian Verhelst. 2025. Efficient precision-scalable hardware for microscaling (MX) processing in robotics learning. arXiv:2505.22404. Retrieved from https:\/\/arxiv.org\/abs\/2505.22404","DOI":"10.1109\/ISLPED65674.2025.11261796"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2024.3511343"},{"key":"e_1_3_1_9_2","volume-title":"Advances in Neural Information Processing Systems","author":"Rouhani Bita Darvish","year":"2020","unstructured":"Bita Darvish Rouhani, Daniel Lo, Ritchie Zhao, Ming Liu, Jeremy Fowers, Kalin Ovtcharov, Anna Vinogradsky, Sarah Massengill, Lita Yang, Ray Bittner, et al. 2020. Pushing the limits of narrow precision inferencing at cloud scale with Microsoft floating point. In Advances in Neural Information Processing Systems. H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33."},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/DSD60849.2023.00093"},{"key":"e_1_3_1_11_2","article-title":"GPT3.int8(): 8-bit matrix multiplication for transformers at scale","volume":"35","author":"Dettmers Tim","year":"2022","unstructured":"Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. 
In Advances in Neural Information Processing Systems, Vol. 35.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA56546.2023.10071047"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/AICAS57966.2023.10168556"},{"key":"e_1_3_1_14_2","unstructured":"Danila Gorodecky and Leonel Sousa. 2024. Hardware for converting floating-point to the microscaling (MX) format. arXiv:2411.03149. Retrieved from https:\/\/arxiv.org\/abs\/2411.03149"},{"key":"e_1_3_1_15_2","unstructured":"Jude Haris Rappy Saha Wenhao Hu and Jos\u00e9 Cano. 2024. Designing efficient LLM accelerators for edge devices. arXiv:2408.00462. Retrieved from https:\/\/arxiv.org\/abs\/2408.00462"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2020.2987202"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/3508352.3549374"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},{"key":"e_1_3_1_19_2","unstructured":"Dhiraj Kalamkar Dheevatsa Mudigere Naveen Mellempudi Dipankar Das Kunal Banerjee Sasikanth Avancha Dharma Teja Vooturi Nataraj Jammalamadaka Jianyu Huang Hector Yuen et al. 2019. A study of BFLOAT16 for deep learning training. arXiv:1905.12322. Retrieved from https:\/\/arxiv.org\/abs\/1905.12322"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1515\/9783110301793"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00607-010-0127-7"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3650200.3656622"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/ARITH61463.2024.00016"},{"key":"e_1_3_1_24_2","unstructured":"Paulius Micikevicius Dusan Stosic Neil Burgess Marius Cornea Pradeep Dubey Richard Grisenthwaite Sangwon Ha Alexander Heinecke Patrick Judd John Kamalu et al. 2022. FP8 formats for deep learning. arXiv:2209.05433. 
Retrieved from https:\/\/arxiv.org\/abs\/2209.05433"},{"key":"e_1_3_1_25_2","unstructured":"Microsoft. 2023. MicroXcaling: MX PyTorch Emulation Library. Retrieved from https:\/\/github.com\/microsoft\/microxcaling"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSTCC59206.2023.10308496"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640364"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","unstructured":"Sergio P. Perez Yan Zhang James Briggs Charlie Blake Josh Levy-Kramer Paul Balanca Carlo Luschi Stephen Barlow and Andrew William Fitzgibbon. 2023. Training and inference of large language models using 8-bit floating point. arXiv:2309.17224. DOI: 10.48550\/arXiv.2309.17224","DOI":"10.48550\/arXiv.2309.17224"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2015.2474363"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2016.2629421"},{"key":"e_1_3_1_31_2","unstructured":"Bita Darvish Rouhani Nitin Garegrat Tom Savell Ankit More Kyung-Nam Han Ritchie Zhao Mathew Hall Jasmine Klar Eric Chung Yuan Yu et al. 2023. OCP Microscaling Formats (MX) Specification v1.0. Open Compute Project. Retrieved from https:\/\/www.opencompute.org\/documents\/ocp-microscaling-formats-mx-v1-0-spec-final-pdf"},{"key":"e_1_3_1_32_2","unstructured":"Bita Darvish Rouhani Ritchie Zhao Ankit More Mathew Hall Alireza Khodamoradi Summer Deng Dhruv Choudhary Marius Cornea Eric Dellinger Kristof Denolf et al. 2023. Microscaling data formats for deep learning. arXiv:2310.10537. Retrieved from https:\/\/arxiv.org\/abs\/2310.10537"},{"key":"e_1_3_1_33_2","unstructured":"E. Samson. 2023. MX-for-FPGA. Retrieved from https:\/\/github.com\/ebby-s\/MX-for-FPGA"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL64840.2024.00049"},{"key":"e_1_3_1_35_2","unstructured":"Mart van Baalen Andrey Kuzmin Suparna S. 
Nair Yuwei Ren Eric Mahurin Chirag Patel Sundar Subramanian Sanghyuk Lee Markus Nagel Joseph Soriaga et al. 2023. FP8 versus INT8 for efficient deep learning inference. arXiv:2303.17951. Retrieved from https:\/\/arxiv.org\/abs\/2303.17951"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.5120\/3084-4222"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3431920.3439292"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3061639.3062207"},{"key":"e_1_3_1_39_2","volume-title":"Rounding Errors in Algebraic Processes","author":"Wilkinson James Hardy","year":"1963","unstructured":"James Hardy Wilkinson. 1963. Rounding Errors in Algebraic Processes. SIAM."},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW63119.2024.00045"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL64840.2024.00044"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/2684746.2689060"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/DAC56929.2023.10247773"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCAS.2019.8702071"}],"container-title":["ACM Transactions on Reconfigurable Technology and 
Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3773041","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,15]],"date-time":"2026-03-15T10:42:04Z","timestamp":1773571324000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3773041"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,4]]},"references-count":43,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2025,12,31]]}},"alternative-id":["10.1145\/3773041"],"URL":"https:\/\/doi.org\/10.1145\/3773041","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"value":"1936-7406","type":"print"},{"value":"1936-7414","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,4]]},"assertion":[{"value":"2025-06-27","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-06","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-12-04","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}