{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,10]],"date-time":"2026-04-10T02:49:44Z","timestamp":1775789384893,"version":"3.50.1"},"reference-count":50,"publisher":"Association for Computing Machinery (ACM)","issue":"1","funder":[{"name":"Japan Society for the Promotion of Science (JSPS) KAKENHI","award":["24KJ2152"],"award-info":[{"award-number":["24KJ2152"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Embed. Comput. Syst."],"published-print":{"date-parts":[[2026,1,31]]},"abstract":"<jats:p>\n                    Deploying deep neural networks (DNNs) on edge devices presents notable challenges, including execution time, power consumption, and memory footprint. To address these limitations, the co-design of software-based model compression techniques and dedicated hardware has become crucial for the efficient deployment of DNNs on edge devices. However, the hardware needs to support various model compression techniques, and specific compression formats introduce limitations to the effective use of the conventional SIMD, such as low-bit-width precision, fine-grained mixed precision, and sparse matrices. To overcome these issues, we propose SIMD-CP, a SIMD architecture featuring tag-based precision detection and redundant bit-width compression, which is represented as compression packing. Specifically, we introduce two novel SIMD instructions: (i) a tagged vector load instruction (\n                    <jats:italic toggle=\"yes\">tvl<\/jats:italic>\n                    ), which fetches quantized vectors from memory while appending bit-width metadata as tags, and (ii) a packing dot-product instruction (\n                    <jats:italic toggle=\"yes\">pdotp<\/jats:italic>\n                    ), which detects the precision levels of elements and packs them into suitable multipliers. 
Experimental evaluations show that our approach achieves a 2.0\u00d7 MAC\/cycle gain on both fine-grained mixed-precision and sparse-matrix formats by a series of instructions, i.e.,\n                    <jats:italic toggle=\"yes\">tvl<\/jats:italic>\n                    and\n                    <jats:italic toggle=\"yes\">pdotp<\/jats:italic>\n                    . Furthermore, SIMD-CP obtains a 2.70 \u223c 3.40\u00d7 GOPs\/W and a 2.31 \u223c 2.42\u00d7 OPs\/LUT improvement for mixed-precision convolution, outperforming the cutting-edge mixed-precision SIMD. These diverse model compression supports allow 28.8 \u223c 45.5% latency reduction for DNN applications, including tiny CNN and edge-aware Vision Transformer, with mitigating accuracy degradation within 1.2 \u223c 2.1%. We also provide the scaling of the SIMD-CP architecture, resulting in a 1.8% LUT utilization increase in the small-scale compared with the conventional mixed-precision SIMD.\n                  <\/jats:p>","DOI":"10.1145\/3771939","type":"journal-article","created":{"date-parts":[[2025,10,15]],"date-time":"2025-10-15T10:14:48Z","timestamp":1760523288000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["SIMD-CP: SIMD with Redundant Bits Compression and Mixed-Precision Packing for Quantized DNNs"],"prefix":"10.1145","volume":"25","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-2294-5208","authenticated-orcid":false,"given":"Hayata","family":"Kaneko","sequence":"first","affiliation":[{"name":"Ritsumeikan University College of Science and Engineering Graduate School of Science and Engineering","place":["Kusatsu, Japan"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-3161-2200","authenticated-orcid":false,"given":"Ryuto","family":"Ishibashi","sequence":"additional","affiliation":[{"name":"Ritsumeikan University College of Science and Engineering Graduate School of Science and Engineering","place":["Kusatsu, 
Japan"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4351-6923","authenticated-orcid":false,"given":"Lin","family":"Meng","sequence":"additional","affiliation":[{"name":"Ritsumeikan University College of Science and Engineering Graduate School of Science and Engineering","place":["Kusatsu, Japan"]}]}],"member":"320","published-online":{"date-parts":[[2026,1,7]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/3676536.3676840"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/IGSC55832.2022.9969370"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/COINS51742.2021.9524173"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00027"},{"key":"e_1_3_2_6_2","unstructured":"Zihan Chen Bike Xie Jundong Li and Cong Shen. 2024. Channel-wise mixed-precision quantization for large language models. arXiv:2410.13056. Retrieved from https:\/\/arxiv.org\/abs\/2410.13056"},{"key":"e_1_3_2_7_2","first-page":"348","article-title":"Accurate and efficient 2-bit quantized neural networks","volume":"1","author":"Choi Jungwook","year":"2019","unstructured":"Jungwook Choi, Swagath Venkataramani, Vijayalakshmi Viji Srinivasan, Kailash Gopalakrishnan, Zhuo Wang, and Pierce Chuang. 2019. Accurate and efficient 2-bit quantized neural networks. 
Proceedings of Machine Learning and Systems 1 (2019), 348\u2013359.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2019.00363"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11265-015-1070-9"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/PATMOS.2017.8106976"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/NEWCAS50681.2021.9462781"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASAP.2018.8445101"},{"key":"e_1_3_2_13_2","volume-title":"Proceedings of the 5th Workshop on Computer Architecture Research With RISC-V (CARRV)","author":"Gallmann Noam","year":"2021","unstructured":"Noam Gallmann, Pirmin Vogel, Pasquale Davide Schiavone, and Luca Benini. 2021. From swift to mighty: A cost-benefit analysis of ibex and CV32E40P regarding application performance, power and area. In Proceedings of the 5th Workshop on Computer Architecture Research With RISC-V (CARRV)."},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1098\/rsta.2019.0155"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.23919\/DATE48585.2020.9116529"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2017.2654506"},{"key":"e_1_3_2_17_2","volume-title":"CV32E40P User Manual","author":"Group OpenHW","year":"2020","unstructured":"OpenHW Group. 2020. CV32E40P User Manual. 
Retrieved from https:\/\/docs.openhwgroup.org\/projects\/cv32e40p-user-manual\/en\/latest\/index.html"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01186"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASP-DAC58780.2024.10473817"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10489-025-06265-z"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00286"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/CODES-ISSS60120.2024.00013"},{"key":"e_1_3_2_23_2","doi-asserted-by":"crossref","unstructured":"Hongwei Jiang Dongsheng Liu Xinyi Ding Yaning Chen and Hongtao Li. 2025. TCM: An efficient lightweight MLP-based network with affine transformation for long-term time series forecasting. Neurocomputing 617 C (2025) 128960.","DOI":"10.1016\/j.neucom.2024.128960"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/IAI63275.2024.10730301"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/MDAT.2017.2741463"},{"key":"e_1_3_2_26_2","doi-asserted-by":"crossref","unstructured":"Daeun Kim Jinwoo Hwang Changhun Oh and Jongse Park. 2025. MixDiT: Accelerating image diffusion transformer inference with mixed-precision MX quantization. IEEE Computer Architecture Letters 24 1 (2025) 141\u2013144.","DOI":"10.1109\/LCA.2025.3560786"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV57701.2024.00259"},{"key":"e_1_3_2_28_2","unstructured":"Alex Krizhevsky. 2009. Learning multiple layers of features from tiny images. Master\u2019s thesis University of Toronto (2009) 5."},{"key":"e_1_3_2_29_2","doi-asserted-by":"crossref","unstructured":"Eldar Kurtic Alexandre Marques Shubhra Pandit Mark Kurtz and Dan Alistarh. 2024. \u201cGive Me BF16 or Give Me Death\u201d? Accuracy-performance trade-offs in LLM quantization. arXiv:2411.02355. 
Retrieved from https:\/\/arxiv.org\/abs\/2411.02355","DOI":"10.18653\/v1\/2025.acl-long.1304"},{"key":"e_1_3_2_30_2","unstructured":"Liangzhen Lai Naveen Suda and Vikas Chandra. 2018. Cmsis-nn: Efficient neural network kernels for arm cortex-m cpus. arXiv:1801.06601. Retrieved from https:\/\/arxiv.org\/abs\/1801.06601"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/TWC.2019.2946140"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10462-022-10221-5"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11554-024-01496-8"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.3390\/computers12030060"},{"key":"e_1_3_2_35_2","unstructured":"Simone Machetti Pasquale Davide Schiavone Thomas Christoph M\u00fcller Miguel Pe\u00f3n-Quir\u00f3s and David Atienza. 2024. X-heep: An open-source configurable and extendible risc-v microcontroller for the exploration of ultra-low-power edge accelerators. arXiv:2401.05548. Retrieved from https:\/\/arxiv.org\/abs\/2401.05548"},{"key":"e_1_3_2_36_2","article-title":"An end-to-end flow to deploy and accelerate TinyML mixed-precision models on RISC-V MCUs","author":"Manca Edward","year":"2024","unstructured":"Edward Manca, Luca Urbinati, and Mario R Casu. 2024. An end-to-end flow to deploy and accelerate TinyML mixed-precision models on RISC-V MCUs. Authorea Preprints (2024).","journal-title":"Authorea Preprints"},{"issue":"3","key":"e_1_3_2_37_2","first-page":"1708","article-title":"Hardware accelerator design for sparse DNN inference and training: A tutorial","volume":"71","author":"Mao Wendong","year":"2023","unstructured":"Wendong Mao, Meiqi Wang, Xiaoru Xie, Xiao Wu, and Zhongfeng Wang. 2023. Hardware accelerator design for sparse DNN inference and training: A tutorial. 
IEEE Transactions on Circuits and Systems II: Express Briefs 71, 3 (2023), 1708\u20131714.","journal-title":"IEEE Transactions on Circuits and Systems II: Express Briefs"},{"key":"e_1_3_2_38_2","volume-title":"Arm-Helium-Technology: A Reference Book","author":"Marsh Jon","year":"2020","unstructured":"Jon Marsh. 2020. Arm-Helium-Technology: A Reference Book. Retrieved from https:\/\/github.com\/arm-university\/Arm-Helium-Technology"},{"key":"e_1_3_2_39_2","unstructured":"Sachin Mehta and Mohammad Rastegari. 2021. Mobilevit: Light-weight general-purpose and mobile-friendly vision transformer. arXiv:2110.02178. Retrieved from https:\/\/arxiv.org\/abs\/2110.02178"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISLPED58423.2023.10244508"},{"key":"e_1_3_2_41_2","unstructured":"Markus Nagel Marios Fournarakis Rana Ali Amjad Yelysei Bondarenko Mart Van Baalen and Tijmen Blankevoort. 2021. A white paper on neural network quantization. arXiv:2106.08295. Retrieved from https:\/\/arxiv.org\/abs\/2106.08295"},{"key":"e_1_3_2_42_2","volume-title":"core-v-verif","year":"2020","unstructured":"OpenHWGroup. 2020. core-v-verif. Retrieved from https:\/\/github.com\/openhwgroup\/core-v-verif"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISVLSI49217.2020.000-5"},{"key":"e_1_3_2_44_2","doi-asserted-by":"crossref","unstructured":"Hayat Rajani Nuno Gracias and Rafael Garcia. 2023. A convolutional vision transformer for semantic segmentation of side-scan sonar data. Ocean Eng. 286 2 (2023) 115647.","DOI":"10.1016\/j.oceaneng.2023.115647"},{"key":"e_1_3_2_45_2","doi-asserted-by":"crossref","unstructured":"Muhammad Sabih Abrarul Karim Jakob Wittmann Frank Hannig and J\u00fcrgen Teich. 2024. Hardware\/software co-design of RISC-V extensions for accelerating sparse DNNs on FPGAs. In 2024 International Conference on Field Programmable Technology (ICFPT). 
IEEE 01\u201309.","DOI":"10.1109\/ICFPT64416.2024.11113397"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/S3S.2018.8640145"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/RCAR61438.2024.10671295"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2025.3551802"},{"issue":"1","key":"e_1_3_2_49_2","first-page":"1","article-title":"A survey of design and optimization for systolic array-based dnn accelerators","volume":"56","author":"Xu Rui","year":"2023","unstructured":"Rui Xu, Sheng Ma, Yang Guo, and Dongsheng Li. 2023. A survey of design and optimization for systolic array-based dnn accelerators. ACM Computing Surveys 56, 1 (2023), 1\u201337.","journal-title":"ACM Computing Surveys"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCT.2017.8359956"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.23919\/DATE54114.2022.9774692"}],"container-title":["ACM Transactions on Embedded Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3771939","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,7]],"date-time":"2026-01-07T15:59:18Z","timestamp":1767801558000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3771939"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,7]]},"references-count":50,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,1,31]]}},"alternative-id":["10.1145\/3771939"],"URL":"https:\/\/doi.org\/10.1145\/3771939","relation":{},"ISSN":["1539-9087","1558-3465"],"issn-type":[{"value":"1539-9087","type":"print"},{"value":"1558-3465","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,7]]},"assertion":[{"value":"2025-08-15","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2025-10-07","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-01-07","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}