{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,26]],"date-time":"2026-03-26T15:39:28Z","timestamp":1774539568181,"version":"3.50.1"},"reference-count":65,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2021,11,9]],"date-time":"2021-11-09T00:00:00Z","timestamp":1636416000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2022,3,31]]},"abstract":"<jats:p>\n            Low-precision data representation is important to reduce storage size and memory access for convolutional neural networks (CNNs). Yet, existing methods have two major limitations: (1) requiring re-training to maintain accuracy for\n            <jats:italic>deep<\/jats:italic>\n            CNNs and (2) needing 16-bit floating-point or 8-bit fixed-point for a good accuracy. In this article, we propose a low-precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration to overcome the above limitations. Without any re-training, LPFP finds an optimal 8-bit data representation with negligible top-1\/top-5 accuracy loss (within 0.5%\/0.3% in our experiments, respectively, and significantly better than existing methods for\n            <jats:italic>deep<\/jats:italic>\n            CNNs). Furthermore, we implement one 8-bit LPFP multiplication by one 4-bit multiply-adder and one 3-bit adder, and therefore implement\n            <jats:italic>four<\/jats:italic>\n            8-bit LPFP multiplications using one DSP48E1 of Xilinx Kintex-7 family or DSP48E2 of Xilinx Ultrascale\/Ultrascale+ family, whereas one DSP can implement only\n            <jats:italic>two<\/jats:italic>\n            8-bit fixed-point multiplications. Experiments on six typical CNNs for inference show that on average, we improve throughput by\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:inline-graphic xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" content-type=\"gif\" xlink:href=\"3474597-inline1.gif\"\/>\n            <\/jats:inline-formula>\n            over existing FPGA accelerators. 
Particularly for VGG16 and YOLO, compared to six recent FPGA accelerators, we improve average throughput by 3.5\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:inline-graphic xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" content-type=\"gif\" xlink:href=\"3474597-inline2.gif\"\/>\n            <\/jats:inline-formula>\n            and 27.5\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:inline-graphic xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" content-type=\"gif\" xlink:href=\"3474597-inline3.gif\"\/>\n            <\/jats:inline-formula>\n            and average throughput per DSP by 4.1\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:inline-graphic xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" content-type=\"gif\" xlink:href=\"3474597-inline4.gif\"\/>\n            <\/jats:inline-formula>\n            and 5\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:inline-graphic xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" content-type=\"gif\" xlink:href=\"3474597-inline5.gif\"\/>\n            <\/jats:inline-formula>\n            , respectively.\n          <\/jats:p>","DOI":"10.1145\/3474597","type":"journal-article","created":{"date-parts":[[2021,11,9]],"date-time":"2021-11-09T21:11:34Z","timestamp":1636492294000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":39,"title":["Low-precision Floating-point Arithmetic for High-performance FPGA-based CNN Acceleration"],"prefix":"10.1145","volume":"15","author":[{"given":"Chen","family":"Wu","sequence":"first","affiliation":[{"name":"Electrical and Computer Engineering, University of California, Westwood, Los Angeles, CA"}]},{"given":"Mingyu","family":"Wang","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering, University of California, Westwood, Los Angeles, CA"}]},{"given":"Xinyuan","family":"Chu","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering, University of California, Westwood, Los Angeles, CA"}]},{"given":"Kun","family":"Wang","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering, University of California, Westwood, Los Angeles, CA"}]},{"given":"Lei","family":"He","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering, University of California, Westwood, Los Angeles, CA"}]}],"member":"320","published-online":{"date-parts":[[2021,11,9]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00061"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.5555\/3045390.3045410"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/2654822.2541967"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/3289602.3293915"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3174243.3174999"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.5555\/2969442.2969588"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.5555\/2999134.2999271"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2017.39"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2017.2705069"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3007787.3001163"},{"key":"e_1_3_2_12_2","article-title":"Deep compression: Compressing deep neural networks with pruning, trained 
quantization and Huffman coding","author":"Han Song","year":"2015","unstructured":"Song Han, Huizi Mao, and William J. Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv:1510.00149. Retrieved from https:\/\/arxiv.org\/abs\/1510.00149.","journal-title":"arXiv:1510.00149"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.5555\/2969239.2969366"},{"key":"e_1_3_2_14_2","first-page":"770","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"He Kaiming","year":"2016","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770\u2013778."},{"key":"e_1_3_2_15_2","first-page":"4700","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Huang Gao","year":"2017","unstructured":"Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700\u20134708."},{"key":"e_1_3_2_16_2","unstructured":"Intel Inc. 2021. INT8 vs FP32 Comparison on Select Networks and Platforms. Retrieved from https:\/\/docs.openvinotoolkit.org\/latest\/openvino_docs_performance_int8_vs_fp32.html."},{"key":"e_1_3_2_17_2","unstructured":"Intel Inc. 2021. OpenVINO\u2122 Model Server Benchmark Results. Retrieved from https:\/\/docs.openvinotoolkit.org\/latest\/openvino_docs_performance_benchmarks_ovms.html."},{"key":"e_1_3_2_18_2","unstructured":"Nvidia Inc. 2020. Jetson Xavier NX. Retrieved from https:\/\/www.nvidia.com\/en-us\/autonomous-machines\/embedded-systems\/jetson-xavier-nx\/."},{"key":"e_1_3_2_19_2","unstructured":"Nvidia Inc. 2021. torch2trt. Retrieved from https:\/\/github.com\/NVIDIA-AI-IOT\/torch2trt."},{"key":"e_1_3_2_20_2","article-title":"Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks","author":"Jain Sambhav R","year":"2019","unstructured":"Sambhav R. Jain, Albert Gural, Michael Wu, and Chris H. Dick. 2019. Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks. arXiv:1903.08066. Retrieved from https:\/\/arxiv.org\/abs\/1903.08066.","journal-title":"arXiv:1903.08066"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/3431920.3439477"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.5555\/2999134.2999257"},{"key":"e_1_3_2_23_2","unstructured":"Liangzhen Lai, Naveen Suda, and Vikas Chandra. 2017. Deep convolutional neural network inference with floating-point weights and fixed-point activations. arXiv:1703.03073. Retrieved from https:\/\/arxiv.org\/abs\/1703.03073."},{"key":"e_1_3_2_24_2","doi-asserted-by":"crossref","unstructured":"Xiaocong Lian, Zhenyu Liu, Zhourui Song, Jiwu Dai, Wei Zhou, and Xiangyang Ji. 2019. High-performance FPGA-based CNN accelerator with block-floating-point arithmetic. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 27, 8 (2019), 1874\u20131885.","DOI":"10.1109\/TVLSI.2019.2913958"},{"key":"e_1_3_2_25_2","unstructured":"Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. 2015. Neural networks with few multiplications. arXiv:1510.03009. Retrieved from https:\/\/arxiv.org\/abs\/1510.03009."},{"key":"e_1_3_2_26_2","unstructured":"LogiCORE IP. 2012.
Floating-Point Operator v6.0. Xilinx Inc."},{"key":"e_1_3_2_27_2","first-page":"101","volume-title":"Proceedings of the IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM\u201917)","author":"Lu Liqiang","year":"2017","unstructured":"Liqiang Lu, Yun Liang, Qingcheng Xiao, and Shengen Yan. 2017. Evaluating fast algorithms for convolutional neural networks on FPGAs. In Proceedings of the IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM\u201917). IEEE, 101\u2013108."},{"key":"e_1_3_2_28_2","first-page":"60","volume-title":"Proceedings of the 28th International Conference on Field Programmable Logic and Applications (FPL\u201918)","author":"Luo Cheng","year":"2018","unstructured":"Cheng Luo, Yuhua Wang, Wei Cao, Philip H. W. Leong, and Lingli Wang. 2018. RNA: An accurate residual network accelerator for quantized and reconstructed deep neural networks. In Proceedings of the 28th International Conference on Field Programmable Logic and Applications (FPL\u201918). IEEE, 60\u2013603."},{"key":"e_1_3_2_29_2","first-page":"224","volume-title":"Proceedings of the International Forum on Digital TV and Wireless Multimedia Communications","author":"Ma Jing","year":"2017","unstructured":"Jing Ma, Li Chen, and Zhiyong Gao. 2017. Hardware implementation and optimization of Tiny-YOLO network. In Proceedings of the International Forum on Digital TV and Wireless Multimedia Communications. Springer, 224\u2013234."},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2018.2815603"},{"key":"e_1_3_2_31_2","first-page":"784","volume-title":"Proceedings of the IEEE Global Conference on Signal and Information Processing (GlobalSIP\u201917)","author":"Mei Chunsheng","year":"2017","unstructured":"Chunsheng Mei, Zhenyu Liu, Yue Niu, Xiangyang Ji, Wei Zhou, and Dongsheng Wang. 2017. A 200 MHz 202.4 GFLOPS @ 10.8 W VGG16 accelerator in Xilinx VX690T. In Proceedings of the IEEE Global Conference on Signal and Information Processing (GlobalSIP\u201917). IEEE, 784\u2013788."},{"key":"e_1_3_2_32_2","volume-title":"Proceedings of the GPU Technology Conference","author":"Migacz Szymon","year":"2017","unstructured":"Szymon Migacz. 2017. 8-bit inference with TensorRT. In Proceedings of the GPU Technology Conference."},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/3174243.3174266"},{"issue":"11","key":"e_1_3_2_34_2","first-page":"1","article-title":"Accelerating deep convolutional neural networks using specialized hardware","volume":"2","author":"Ovtcharov Kalin","year":"2015","unstructured":"Kalin Ovtcharov, Olatunji Ruwase, Joo-Young Kim, Jeremy Fowers, Karin Strauss, and Eric S. Chung. 2015. Accelerating deep convolutional neural networks using specialized hardware. Microsoft Research Whitepaper 2, 11 (2015), 1\u20134.","journal-title":"Microsoft Research Whitepaper"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00063"},{"key":"e_1_3_2_36_2","first-page":"580","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Park Eunhyeok","year":"2018","unstructured":"Eunhyeok Park, Sungjoo Yoo, and Peter Vajda. 2018. Value-aware quantization for training and inference of neural networks. In Proceedings of the European Conference on Computer Vision.
580\u2013595."},{"key":"e_1_3_2_37_2","first-page":"1","volume-title":"Proceedings of the 27th International Conference on Field Programmable Logic and Applications","author":"Prost-Boucle Adrien","year":"2017","unstructured":"Adrien Prost-Boucle, Alban Bourge, Fr\u00e9d\u00e9ric P\u00e9trot, Hande Alemdar, Nicholas Caldwell, and Vincent Leroy. 2017. Scalable high-performance architecture for convolutional ternary neural networks on FPGA. In Proceedings of the 27th International Conference on Field Programmable Logic and Applications. 1\u20137."},{"key":"e_1_3_2_38_2","first-page":"525","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Rastegari Mohammad","year":"2016","unstructured":"Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Proceedings of the European Conference on Computer Vision. 525\u2013542."},{"key":"e_1_3_2_39_2","unstructured":"Joseph Redmon. 2013\u20132016. Darknet: Open Source Neural Networks in C. Retrieved from http:\/\/pjreddie.com\/darknet\/."},{"key":"e_1_3_2_40_2","first-page":"779","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Redmon Joseph","year":"2016","unstructured":"Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 779\u2013788."},{"key":"e_1_3_2_41_2","first-page":"7263","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Redmon Joseph","year":"2017","unstructured":"Joseph Redmon and Ali Farhadi. 2017. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7263\u20137271."},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_3_2_43_2","article-title":"Quantizing convolutional neural networks for low-power high-throughput inference engines","author":"Settle Sean O","year":"2018","unstructured":"Sean O. Settle, Manasa Bollavaram, Paolo D\u2019Alberto, Elliott Delaye, Oscar Fernandez, Nicholas Fraser, Aaron Ng, Ashish Sirasao, and Michael Wu. 2018. Quantizing convolutional neural networks for low-power high-throughput inference engines. arXiv:1805.07941. Retrieved from https:\/\/arxiv.org\/abs\/1805.07941.","journal-title":"arXiv:1805.07941"},{"key":"e_1_3_2_44_2","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. Retrieved from https:\/\/arxiv.org\/abs\/1409.1556."},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00068"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.5555\/3504035.3504135"},{"key":"e_1_3_2_47_2","first-page":"1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Szegedy Christian","year":"2015","unstructured":"Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
1\u20139."},{"key":"e_1_3_2_48_2","first-page":"6105","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Tan Mingxing","year":"2019","unstructured":"Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning. 6105\u20136114."},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.14569\/IJACSA.2018.091062"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1145\/3309551"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00881"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1145\/3174243.3174253"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1145\/3061639.3062207"},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1145\/3373087.3375361"},{"key":"e_1_3_2_55_2","unstructured":"Chen Wu, Mingyu Wang, Xiayu Li, Jicheng Lu, Kun Wang, and Lei He. 2020. Phoenix: A low-precision floating-point quantization oriented architecture for convolutional neural networks. arXiv:2003.02628. Retrieved from https:\/\/arxiv.org\/abs\/2003.02628."},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2018.00019"},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/3061639.3062244"},{"key":"e_1_3_2_58_2","doi-asserted-by":"crossref","unstructured":"Yunxuan Yu, Chen Wu, Tiandong Zhao, Kun Wang, and Lei He. 2019. OPU: An FPGA-based overlay processor for convolutional neural networks. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28, 1 (2019), 35\u201347.","DOI":"10.1109\/TVLSI.2019.2939726"},{"key":"e_1_3_2_59_2","doi-asserted-by":"publisher","DOI":"10.1145\/3373087.3375311"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2020.2995741"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1145\/2966986.2967011"},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.5555\/3195638.3195662"},{"key":"e_1_3_2_63_2","doi-asserted-by":"publisher","DOI":"10.1145\/3240765.3240801"},{"key":"e_1_3_2_64_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM51124.2021.00051"},{"key":"e_1_3_2_65_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2018.00011"},{"key":"e_1_3_2_66_2","first-page":"1","article-title":"IEEE standard for floating-point arithmetic","author":"Zuras Dan","year":"2008","unstructured":"Dan Zuras, Mike Cowlishaw, Alex Aiken, Matthew Applegate, David Bailey, Steve Bass, Dileep Bhandarkar, Mahesh Bhat, David Bindel, Sylvie Boldo, et\u00a0al. 2008. IEEE standard for floating-point arithmetic.
IEEE Std 754-2008 (2008), 1\u201370.","journal-title":"IEEE Std 754-2008"}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3474597","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3474597","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:18:50Z","timestamp":1750191530000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3474597"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,11,9]]},"references-count":65,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2022,3,31]]}},"alternative-id":["10.1145\/3474597"],"URL":"https:\/\/doi.org\/10.1145\/3474597","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"value":"1936-7406","type":"print"},{"value":"1936-7414","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,11,9]]},"assertion":[{"value":"2021-01-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-07-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-11-09","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}