{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,4]],"date-time":"2025-12-04T10:09:13Z","timestamp":1764842953427,"version":"3.41.0"},"reference-count":32,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2025,3,21]],"date-time":"2025-03-21T00:00:00Z","timestamp":1742515200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2025,6,30]]},"abstract":"<jats:p>\n            Time series forecasting is the problem of predicting future data samples from historical information, and recent deep neural network (DNN) based techniques have achieved excellent results compared with conventional statistical approaches. Many applications at the edge can utilize this technology. Most implementations have focused on inference, but an ability to train at the edge would enable the DNN to adapt to changing conditions. Unfortunately, training requires approximately three times more memory and computation than inference. Moreover, edge applications are often constrained by energy efficiency. In this work, we implement a block minifloat (BM) training accelerator for a time series prediction network, N-BEATS. Our architecture involves a mixed-precision GEMM accelerator that utilizes BM arithmetic. We use a 4-bit DSP packing scheme to optimize the implementation further, achieving a throughput of 779 Gops. 
The resulting power efficiency is 42.4 Gops\/W, 3.1\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            better than a graphics processing unit in a similar technology.\n          <\/jats:p>","DOI":"10.1145\/3707209","type":"journal-article","created":{"date-parts":[[2024,12,6]],"date-time":"2024-12-06T13:12:16Z","timestamp":1733490736000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["FPGA-based Block Minifloat Training Accelerator for a Time Series Prediction Network"],"prefix":"10.1145","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-7145-1639","authenticated-orcid":false,"given":"Wenjie","family":"Zhou","sequence":"first","affiliation":[{"name":"Department of Electrical and Computer Engineering, The University of Sydney, Sydney, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-6173-0466","authenticated-orcid":false,"given":"Haoyan","family":"Qi","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, The University of Sydney, Sydney, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5370-4464","authenticated-orcid":false,"given":"David","family":"Boland","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, The University of Sydney, Sydney, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3923-3499","authenticated-orcid":false,"given":"Philip H. 
W.","family":"Leong","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, The University of Sydney, Sydney, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,3,21]]},"reference":[{"key":"e_1_3_2_2_2","article-title":"Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks","volume":"32","author":"Sun Xiao","unstructured":"Xiao Sun, Jungwook Choi, Chia-Yu Chen, Naigang Wang, Swagath Venkataramani, Vijayalakshmi (Viji) Srinivasan, Xiaodong Cui, Wei Zhang, and Kailash Gopalakrishnan. Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks. In Advances in Neural Information Processing Systems. H. Wallach, H. Larochelle, A. Beygelzimer, F. d\u2019Alch\u00e9-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, Curran Associates, Inc.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_3_2","unstructured":"L\u00e9opold Cambier Anahita Bhiwandiwalla Ting Gong Mehran Nekuii Oguz H. Elibol and Hanlin Tang. 2020. Shifted and squeezed 8-bit floating point format for low-precision training of deep neural networks. arXiv:2001.05674. Retrieved from https:\/\/arxiv.org\/abs\/arXiv:2001.05674"},{"key":"e_1_3_2_4_2","first-page":"1796","article-title":"Ultra-low precision 4-bit training of deep neural networks","author":"Sun Xiao","year":"2020","unstructured":"Xiao Sun, Naigang Wang, Chia-Yu Chen, Jiamin Ni, Ankur Agrawal, Xiaodong Cui, Swagath Venkataramani, Kaoutar El Maghraoui, Vijayalakshmi Viji Srinivasan, and Kailash Gopalakrishnan. 2020. Ultra-low precision 4-bit training of deep neural networks. 
In Advances in Neural Information Processing Systems, 1796\u20131807.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2022.3202747"},{"key":"e_1_3_2_6_2","unstructured":"Paulius Micikevicius Dusan Stosic Neil Burgess Marius Cornea Pradeep Dubey Richard Grisenthwaite Sangwon Ha Alexander Heinecke Patrick Judd John Kamalu et al. 2022. Fp8 formats for deep learning. arXiv:2209.05433. Retrieved from https:\/\/arxiv.org\/abs\/arXiv:2209.05433"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA53966.2022.00067"},{"key":"e_1_3_2_8_2","volume-title":"International Conference on Learning Representations","author":"Fox Sean","year":"2020","unstructured":"Sean Fox, Seyedramin Rasoulinezhad, Julian Faraone, David Boland, and Philip Leong. 2020. A block minifloat representation for training deep neural networks. In International Conference on Learning Representations."},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/APCCAS55924.2022.10090282"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD57390.2023.10323638"},{"key":"e_1_3_2_11_2","unstructured":"David Elam and Cesar Lovescu. 2003. A block floating point implementation for an N-point FFT on the TMS320C55X DSP. Texas Instruments Application Report. Retrieved from https:\/\/www.ti.com\/lit\/an\/spra948\/spra948.pdf"},{"key":"e_1_3_2_12_2","article-title":"Training DNNs with hybrid block floating point","author":"Drumond Mario","year":"2018","unstructured":"Mario Drumond, Tao Lin, Martin Jaggi, and Babak Falsafi. 2018. Training DNNs with hybrid block floating point. In Advances in Neural Information Processing Systems.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_13_2","unstructured":"Simla Burcu Harma Ayan Chakraborty Babak Falsafi Martin Jaggi and Yunho Oh. 2022. 
Accuracy boosters: Epoch-driven mixed-mantissa block floating-point for DNN training. arXiv:2211.10737. Retrieved from https:\/\/arxiv.org\/abs\/arXiv:2211.10737"},{"key":"e_1_3_2_14_2","volume-title":"International Conference on Learning Representations","author":"Oreshkin Boris N.","year":"2019","unstructured":"Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. 2019. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations."},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1137\/18M1165748"},{"key":"e_1_3_2_16_2","article-title":"Training deep neural networks with 8-bit floating point numbers","author":"Wang Naigang","year":"2018","unstructured":"Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan. 2018. Training deep neural networks with 8-bit floating point numbers. In Advances in Neural Information Processing Systems.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_17_2","unstructured":"Bita Darvish Rouhani Ritchie Zhao Ankit More Mathew Hall Alireza Khodamoradi Summer Deng Dhruv Choudhary Marius Cornea Eric Dellinger Kristof Denolf et al. 2023. Microscaling data formats for deep learning. arXiv:2310.10537. Retrieved from https:\/\/arxiv.org\/abs\/arXiv:2310.10537"},{"key":"e_1_3_2_18_2","unstructured":"AMD Xilinx. 2019. Systolic Array. Retrieved from https:\/\/xilinx.github.io\/Vitis_Accel_Examples\/2019.2\/html\/systolic_array.html"},{"key":"e_1_3_2_19_2","unstructured":"Xilinx. 2017. Deep Learning with INT8 Optimization on Xilinx Devices White Paper (WP486). Retrieved from https:\/\/docs.xilinx.com\/v\/u\/en-US\/wp486-deep-learning-int8"},{"key":"e_1_3_2_20_2","unstructured":"Xilinx. 2020. Convolutional Neural Network with Int4 Optimization on Xilinx Devices (wp521). 
Retrieved from https:\/\/docs.xilinx.com\/v\/u\/en-US\/wp521-4bit-optimization"},{"key":"e_1_3_2_21_2","doi-asserted-by":"crossref","first-page":"161","DOI":"10.1109\/ICFPT59805.2023.00023","volume-title":"2023 International Conference on Field Programmable Technology (ICFPT)","author":"Qi Liu","year":"2023","unstructured":"Liu Qi, Sun Mo, Sun Jie, Lu Liqiang, Zhao Jieru, and Wang Zeke. 2023. SSiMD: Supporting six signed multiplications in a DSP block for low-precision CNN on FPGAs. In 2023 International Conference on Field Programmable Technology (ICFPT). IEEE, 161\u2013169."},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474597"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ijforecast.2018"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3400302.3415643"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2018.00021"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1088\/1674-4926\/41\/2\/022403"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/IEEESTD.2019.8766229"},{"key":"e_1_3_2_28_2","first-page":"265","volume-title":"12th USENIX Symposium on Operating Systems Design and Implementation (OSDI \u201916)","volume":"16","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI \u201916), Vol. 16, 265\u2013283."},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2021.3078316"},{"key":"e_1_3_2_30_2","unstructured":"Greg Palmer Michael Andersch. 2022. Inside the Nvidia Hopper Architecture. 
Retrieved from https:\/\/rd.yyrcd.com\/CUDA\/2022-GTC\/S42663-Inside_the_NVIDIA_Hopper_Architecture.pdf"},{"key":"e_1_3_2_31_2","unstructured":"Using FP8 with Transformer Engine. 2022. Retrieved from https:\/\/docs.nvidia.com\/deeplearning\/transformer-engine\/user-guide\/examples\/fp8_primer.html"},{"key":"e_1_3_2_32_2","unstructured":"Bita Darvish Rouhani Ritchie Zhao Venmugil Elango Rasoul Shafipour Mathew Hall Maral Mesmakhosroshahi Ankit More Levi Melnick Maximilian Golub Girish Varatkar et al. 2023. Microscaling data formats for deep learning. arXiv:2310.10537. Retrieved from https:\/\/arxiv.org\/abs\/2310.10537"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/3289602.3293977"}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3707209","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3707209","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,5]],"date-time":"2025-07-05T11:52:45Z","timestamp":1751716365000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3707209"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,21]]},"references-count":32,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2025,6,30]]}},"alternative-id":["10.1145\/3707209"],"URL":"https:\/\/doi.org\/10.1145\/3707209","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"type":"print","value":"1936-7406"},{"type":"electronic","value":"1936-7414"}],"subject":[],"published":{"date-parts":[[2025,3,21]]},"assertion":[{"value":"2024-06-30","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2024-11-13","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-03-21","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}