{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,17]],"date-time":"2026-04-17T16:42:37Z","timestamp":1776444157983,"version":"3.51.2"},"reference-count":42,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2023,3,22]],"date-time":"2023-03-22T00:00:00Z","timestamp":1679443200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Taiwan NSTC"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Des. Autom. Electron. Syst."],"published-print":{"date-parts":[[2023,5,31]]},"abstract":"<jats:p>Today, as deep learning (DL) is applied more often in daily life, dedicated processors such as CPUs and GPUs have become very important for accelerating model executions. With the growth of technology, people are becoming accustomed to using edge devices, such as mobile phones, smart watches, and VR devices in their daily lives. A variety of technologies using DL are gradually being applied to these edge devices. However, there is a large number of computations in DL. It faces a challenging problem how to provide solutions in the edge devices. In this article, the proposed method enables a flow with the RISC-V Packed extension (P extension) in TVM. TVM, an open deep learning compiler for neural network models, is growing as a key infrastructure for DL computing. RISC-V is an open instruction set architecture (ISA) with customized and flexible features. The Packed-SIMD extension is a RISC-V extension that enables subword single-instruction multiple-data (SIMD) computations in RISC-V architectures to support fallback engines in AI computing. In the proposed flow, a fixed-point type that is supported by an integer of 16-bit type and saturation instructions is added to replace the original 32-bit float type. In addition, an auto-tuning method is proposed to use a uniform selector mechanism (USM) to find the binary point position for fixed-point type use. The tensorization feature of TVM can be used to optimize specific hardware such as subword SIMD instructions with RISC-V P extension. With our experiment on the Spike simulator, the proposed method with the USM can improve performance by approximately 2.54 to 6.15\u00d7 in terms of instruction counts with little accuracy loss.<\/jats:p>","DOI":"10.1145\/3569939","type":"journal-article","created":{"date-parts":[[2022,11,2]],"date-time":"2022-11-02T13:21:23Z","timestamp":1667395283000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":18,"title":["Auto-tuning Fixed-point Precision with TVM on RISC-V Packed SIMD Extension"],"prefix":"10.1145","volume":"28","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1897-5338","authenticated-orcid":false,"given":"Chun-Chieh","family":"Yang","sequence":"first","affiliation":[{"name":"Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8429-3282","authenticated-orcid":false,"given":"Yi-Ru","family":"Chen","sequence":"additional","affiliation":[{"name":"Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4674-9872","authenticated-orcid":false,"given":"Hui-Hsin","family":"Liao","sequence":"additional","affiliation":[{"name":"Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7831-987X","authenticated-orcid":false,"given":"Yuan-Ming","family":"Chang","sequence":"additional","affiliation":[{"name":"Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9919-6258","authenticated-orcid":false,"given":"Jenq-Kuen","family":"Lee","sequence":"additional","affiliation":[{"name":"Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,3,22]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Szymon Migacz. 2017. 8-bit Inference with TensorRT. Retrieved from https:\/\/on-demand.gputechconf.com\/gtc\/2017\/presentation\/s7310-8-bit-inference-with-tensorrt.pdf."},{"key":"e_1_3_1_3_2","first-page":"265","volume-title":"12th Symposium on Operating Systems Design and Implementation","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et\u00a0al. 2016. Tensorflow: A system for large-scale machine learning. In 12th Symposium on Operating Systems Design and Implementation. 265\u2013283."},{"key":"e_1_3_1_4_2","unstructured":"Andes Technology. 2019. Andes has donated RISC-V P-extension draft 2019. Retrieved from http:\/\/www.andestech.com\/en\/2019\/12\/31\/a-look-back-at-the-achievements-andes-made-in-2019\/."},{"key":"e_1_3_1_5_2","unstructured":"Andes Technology. 2005. Andes Technology. Retrieved from http:\/\/www.andestech.com\/en\/homepage\/."},{"key":"e_1_3_1_6_2","unstructured":"Apache MXNet. 2015. Apache MXNet (incubating) for Deep Learning. Retrieved from https:\/\/github.com\/apache\/incubator-mxnet."},{"key":"e_1_3_1_7_2","article-title":"Case study: Devise quantized schedule primitives in halide to support darknet computation","author":"Lee Ming-Yu Hung, Chao-Lin Lee, and Jenq-Kuen","year":"2021","unstructured":"Ming-Yu Hung, Chao-Lin Lee, and Jenq-Kuen Lee. 2021. Case study: Devise quantized schedule primitives in halide to support darknet computation. In Workshop on Compiler Techniques and System Software for High-Performance and Embedding Computing.","journal-title":"Workshop on Compiler Techniques and System Software for High-Performance and Embedding Computing"},{"key":"e_1_3_1_8_2","article-title":"MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems","author":"Chen Tianqi","year":"2015","unstructured":"Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015).","journal-title":"arXiv preprint arXiv:1512.01274"},{"key":"e_1_3_1_9_2","first-page":"578","volume-title":"13th Symposium on Operating Systems Design and Implementation","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et\u00a0al. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th Symposium on Operating Systems Design and Implementation. 578\u2013594."},{"key":"e_1_3_1_10_2","first-page":"1","volume-title":"International Symposium on VLSI Design, Automation and Test (VLSI-DAT)","author":"Chen Yi-Ru","year":"2020","unstructured":"Yi-Ru Chen, Hui-Hsin Liao, Chia-Hsuan Chang, Che-Chia Lin, Chao-Lin Lee, Yuan-Ming Chang, Chun-Chieh Yang, and Jenq-Kuen Lee. 2020. Experiments and optimizations for TVM on RISC-V architectures with p extension. In International Symposium on VLSI Design, Automation and Test (VLSI-DAT). IEEE, 1\u20134."},{"key":"e_1_3_1_11_2","unstructured":"Core ML. 2017. Core ML. Retrieved from https:\/\/developer.apple.com\/documentation\/coreml."},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_1_13_2","unstructured":"Fixed-Point Real Numbers. 2018. Fixed-Point Real Numbers. Retrieved from http:\/\/www.open-std.org\/jtc1\/sc22\/wg21\/docs\/papers\/2018\/p0037r5.html."},{"issue":"5","key":"e_1_3_1_14_2","doi-asserted-by":"crossref","first-page":"363","DOI":"10.1159\/000505021","article-title":"Machine learning in fetal cardiology: What to expect","volume":"47","author":"Garcia-Canadilla Patricia","year":"2020","unstructured":"Patricia Garcia-Canadilla, Sergio Sanchez-Martinez, Fatima Crispi, and Bart Bijnens. 2020. Machine learning in fetal cardiology: What to expect. Fetal Diag. Therap. 47, 5 (2020), 363\u2013372.","journal-title":"Fetal Diag. Therap."},{"key":"e_1_3_1_15_2","article-title":"Deep residual learning for image recognition","volume":"1512","author":"He Kaiming","year":"2015","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. CoRR abs\/1512.03385 (2015).","journal-title":"CoRR"},{"key":"e_1_3_1_16_2","unstructured":"William Dally. 2015. High-Performance Hardware for Machine Learning. Retrieved from https:\/\/media.nips.cc\/Conferences\/2015\/tutorialslides\/Dally-NIPS-Tutorial-2015.pdf."},{"key":"e_1_3_1_17_2","unstructured":"Sepp Hochreiter Yoshua Bengio Paolo Frasconi J\u00fcrgen Schmidhuber and others. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. A field guide to dynamical recurrent neural networks. IEEE Press In."},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC.2014.6757323"},{"key":"e_1_3_1_19_2","article-title":"Mobilenets: Efficient convolutional neural networks for mobile vision applications","author":"Howard Andrew G.","year":"2017","unstructured":"Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).","journal-title":"arXiv preprint arXiv:1704.04861"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.5555\/3122009.3242044"},{"key":"e_1_3_1_21_2","article-title":"SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and< 0.5 MB model size","author":"Iandola Forrest N.","year":"2016","unstructured":"Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and< 0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016).","journal-title":"arXiv preprint arXiv:1602.07360"},{"key":"e_1_3_1_22_2","first-page":"2704","volume-title":"IEEE Conference on Computer Vision and Pattern Recognition","author":"Jacob Benoit","year":"2018","unstructured":"Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In IEEE Conference on Computer Vision and Pattern Recognition. 2704\u20132713."},{"key":"e_1_3_1_23_2","unstructured":"Keras. 2015. Keras. Retrieved from https:\/\/keras.io\/."},{"key":"e_1_3_1_24_2","unstructured":"KMADA. 2019. RISC-V P Extension Proposal. Retrieved from https:\/\/github.com\/riscv\/riscv-p-spec\/blob\/master\/P-ext-proposal.adoc#kmada-kmaxda."},{"issue":"7","key":"e_1_3_1_25_2","first-page":"1","article-title":"Convolutional deep belief networks on CIFAR-10","volume":"40","author":"Krizhevsky Alex","year":"2010","unstructured":"Alex Krizhevsky and Geoff Hinton. 2010. Convolutional deep belief networks on CIFAR-10. Unpublished Manuscript 40, 7 (2010), 1\u20139.","journal-title":"Unpublished Manuscript"},{"key":"e_1_3_1_26_2","first-page":"1097","article-title":"ImageNet classification with deep convolutional neural networks","volume":"25","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25 (2012), 1097\u20131105.","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.1845"},{"key":"e_1_3_1_28_2","first-page":"1","volume-title":"International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Kung Hsiang-Tsung","year":"2020","unstructured":"Hsiang-Tsung Kung, Bradley McDanel, and Sai Qian Zhang. 2020. Term quantization: Furthering quantization at run time. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1\u201316."},{"key":"e_1_3_1_29_2","first-page":"75","volume-title":"International Symposium on Code Generation and Optimization.","author":"Lattner Chris","year":"2004","unstructured":"Chris Lattner and Vikram Adve. 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization. IEEE, 75\u201386."},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.sysarc.2020.101783"},{"key":"e_1_3_1_32_2","article-title":"Convolutional neural networks using logarithmic data representation","author":"Miyashita Daisuke","year":"2016","unstructured":"Daisuke Miyashita, Edward H. Lee, and Boris Murmann. 2016. Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025 (2016).","journal-title":"arXiv preprint arXiv:1603.01025"},{"key":"e_1_3_1_33_2","unstructured":"numpy. 1995. numpy. Retrieved from https:\/\/numpy.org\/."},{"key":"e_1_3_1_34_2","unstructured":"OpenCL 2009. Open Computing Language. Retrieved from https:\/\/https:\/\/www.khronos.org\/opencl\/."},{"key":"e_1_3_1_35_2","first-page":"8026","article-title":"Pytorch: An imperative style, high-performance deep learning library","volume":"32","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et\u00a0al. 2019. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32 (2019), 8026\u20138037.","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/2668119"},{"key":"e_1_3_1_37_2","article-title":"Very deep convolutional networks for large-scale image recognition","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).","journal-title":"arXiv preprint arXiv:1409.1556"},{"key":"e_1_3_1_38_2","unstructured":"RISC-V International. 2010. Spike a RISC-V ISA Simulator. Retrieved from https:\/\/github.com\/riscv-software-src\/riscv-isa-sim."},{"key":"e_1_3_1_39_2","unstructured":"Yi-Ru Chen. 2020. Support TVM QNN Flow on RISC-V with SIMD Computation. Retrieved from https:\/\/discuss.tvm.apache.org\/t\/rfc-enable-tvm-qnn-on-risc-v-with-subword-simd-computation\/7967."},{"key":"e_1_3_1_40_2","unstructured":"TensorFlow. 2021. TensorFlow Lite 8-bit quantization specification. Retrieved from https:\/\/www.tensorflow.org\/lite\/performance\/quantization_spec."},{"key":"e_1_3_1_41_2","unstructured":"Microsoft. 2015. The Microsoft Cognitive Toolkit (CNTK). Retrieved from https:\/\/github.com\/microsoft\/CNTK."},{"issue":"2","key":"e_1_3_1_42_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3133218","article-title":"Architecture and compiler support for GPUs using energy-efficient affine register files","volume":"23","author":"Wang Shao-Chung","year":"2017","unstructured":"Shao-Chung Wang, Li-Chen Kan, Chao-Lin Lee, Yuan-Shin Hwang, and Jenq-Kuen Lee. 2017. Architecture and compiler support for GPUs using energy-efficient affine register files. ACM Trans. Des. Autom. Electron. Syst. 23, 2 (2017), 1\u201325.","journal-title":"ACM Trans. Des. Autom. Electron. Syst."},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1007\/s13244-018-0639-9"}],"container-title":["ACM Transactions on Design Automation of Electronic Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3569939","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3569939","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:07:51Z","timestamp":1750183671000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3569939"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,3,22]]},"references-count":42,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2023,5,31]]}},"alternative-id":["10.1145\/3569939"],"URL":"https:\/\/doi.org\/10.1145\/3569939","relation":{},"ISSN":["1084-4309","1557-7309"],"issn-type":[{"value":"1084-4309","type":"print"},{"value":"1557-7309","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,3,22]]},"assertion":[{"value":"2022-02-27","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-10-12","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-03-22","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}