{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T08:09:23Z","timestamp":1772179763378,"version":"3.50.1"},"reference-count":27,"publisher":"SAGE Publications","issue":"6","license":[{"start":{"date-parts":[[2025,7,13]],"date-time":"2025-07-13T00:00:00Z","timestamp":1752364800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"},{"start":{"date-parts":[[2025,7,13]],"date-time":"2025-07-13T00:00:00Z","timestamp":1752364800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2025,11]]},"abstract":"<jats:p>Transformers, particularly large language models (LLMs), are revolutionizing applications in natural language processing and computer vision but at a high cost in memory, energy, and computational resources. Quantization has emerged as an effective compression method to alleviate these demands, reducing the bitwidth of model data and arithmetic precision to enable efficient inference on resource-constrained devices. This paper focuses on optimizing inference with transformer encoders on low-power general-purpose CPUs, as those often found in edge devices. Our key contributions include exposing the critical role of linear layers within transformer encoders on CPUs with a limited number of cores; developing mixed integer precision matrix multiplication on ARM and RISC-V CPUs; and evaluating performance impact and energy savings of quantized inference. In summary, this work highlights the advantages of applying quantization to transformer encoders on current single-core and multi-core low power CPUs, offering insights for efficient LLM deployment on edge platforms.<\/jats:p>","DOI":"10.1177\/10943420251355115","type":"journal-article","created":{"date-parts":[[2025,7,14]],"date-time":"2025-07-14T03:21:48Z","timestamp":1752463308000},"page":"803-821","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":1,"title":["Characterization of quantized inference with transformer encoders on low power CPUs"],"prefix":"10.1177","volume":"39","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5891-4479","authenticated-orcid":false,"given":"H\u00e9ctor","family":"Mart\u00ednez","sequence":"first","affiliation":[{"name":"Universidad de C\u00f3rdoba"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9321-2728","authenticated-orcid":false,"given":"Sandra","family":"Catal\u00e1n","sequence":"additional","affiliation":[{"name":"Universitat Jaume I"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8576-8451","authenticated-orcid":false,"given":"Adri\u00e1n","family":"Castell\u00f3","sequence":"additional","affiliation":[{"name":"Universitat Polit\u00e8cnica de Val\u00e8ncia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5454-165X","authenticated-orcid":false,"given":"Enrique S","family":"Quintana-Ort\u00ed","sequence":"additional","affiliation":[{"name":"Universitat Polit\u00e8cnica de Val\u00e8ncia"}]}],"member":"179","published-online":{"date-parts":[[2025,7,13]]},"reference":[{"key":"e_1_3_4_2_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.sysarc.2023.102990"},{"key":"e_1_3_4_3_1","unstructured":"Dettmers T Lewis M Belkada Y et al (2024) LLM.int8(): 8-bit matrix multiplication for transformers at scale. In: Proceedings of the 36th International Conference on Neural Information Processing Systems NIPS \u201922 Red Hook NY USA 28 Nov - 9 Dec 2022. Curran Associates Inc. ISBN 9781713871088."},{"key":"e_1_3_4_4_1","doi-asserted-by":"crossref","unstructured":"Devlin J Chang MW Lee K et al. (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proc. 2019 Conf. North American Chapter Assoc June 3 2019 - June 5 Minneapolis MN USA. Computational Linguistics: Human Language Techn 4171\u20134186.","DOI":"10.18653\/v1\/N19-1423"},{"key":"e_1_3_4_5_1","volume-title":"Optimizing Inference Performance of Transformers on CPUs","author":"Dice D","year":"2021","unstructured":"Dice D, Kogan A (2021) Optimizing Inference Performance of Transformers on CPUs. https:\/\/arxiv.org\/abs\/2102.06621"},{"key":"e_1_3_4_6_1","volume-title":"High Performance Computing","author":"Dowd K","year":"1998","unstructured":"Dowd K, Severance CR (1998) High Performance Computing. 2nd edition. Sebastopol, California: O\u2019Reilly.","edition":"2"},{"key":"e_1_3_4_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3210754"},{"key":"e_1_3_4_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1356052.1356053"},{"key":"e_1_3_4_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1356052.1356053"},{"key":"e_1_3_4_10_1","doi-asserted-by":"crossref","unstructured":"Heinecke A Henry G Hutchinson M et al. (2016) LIBXSMM: accelerating small matrix multiplications by runtime code generation. In: Proc. Int. Conference for High Performance Computing Networking Storage and Analysis SC \u201916 November 13-18 Salt Lake City Utah USA. IEEE Press. ISBN 9781467388153.","DOI":"10.1109\/SC.2016.83"},{"issue":"1","key":"e_1_3_4_11_1","first-page":"6869","article-title":"Quantized neural networks: training neural networks with low precision weights and activations","volume":"18","author":"Hubara I","year":"2017","unstructured":"Hubara I, Courbariaux M, Soudry D, et al. (2017) Quantized neural networks: training neural networks with low precision weights and activations. Journal of Machine Learning Research 18(1): 6869\u20136898.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_4_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/L-CA.2013.6"},{"key":"e_1_3_4_13_1","volume-title":"INA3221 Triple-Channel, High-Side Measurement, Shunt and Bus Voltage Monitor with I2C- and SMBUS-Compatible Interface","author":"Instruments T","year":"2016","unstructured":"Instruments T (2016) INA3221 Triple-Channel, High-Side Measurement, Shunt and Bus Voltage Monitor with I2C- and SMBUS-Compatible Interface. https:\/\/www.ti.com\/product\/INA3221#tech-docs"},{"key":"e_1_3_4_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2023.3280805"},{"key":"e_1_3_4_15_1","volume-title":"I-BERT: Integer-Only BERT Quantization","author":"Kim S","year":"2021","unstructured":"Kim S, Gholami A, Yao Z, et al. (2021) I-BERT: Integer-Only BERT Quantization. https:\/\/arxiv.org\/abs\/2101.01321"},{"key":"e_1_3_4_16_1","volume-title":"Full Stack Optimization of Transformer Inference: A Survey","author":"Kim S","year":"2023","unstructured":"Kim S, Hooper C, Wattanawong T, et al. (2023) Full Stack Optimization of Transformer Inference: A Survey. https:\/\/arxiv.org\/abs\/2302.14017"},{"issue":"2","key":"e_1_3_4_17_1","first-page":"12:1","article-title":"Analytical modeling is enough for high-performance BLIS","volume":"43","author":"Low TM","year":"2016","unstructured":"Low TM, Igual FD, Smith TM, et al. (2016) Analytical modeling is enough for high-performance BLIS. ACM Transactions on Mathematical Software 43(2): 12:1\u201312:18.","journal-title":"ACM Transactions on Mathematical Software"},{"key":"e_1_3_4_18_1","doi-asserted-by":"crossref","unstructured":"Mart\u00ednez H Igual FD Rodr\u00edguez-S\u00e1nchez R et al. (2024) Inference with transformer encoders on ARM and RISC-V multicore processors. In: Euro-Par 2024: Parallel Processing August 26-30 Madrid Spain. Springer 377\u2013392. ISBN 978-3-031-69766-1.","DOI":"10.1007\/978-3-031-69766-1_26"},{"key":"e_1_3_4_19_1","doi-asserted-by":"crossref","unstructured":"Shankar S Reuther A (2022) Trends in energy estimates for computing in AI\/machine learning accelerators supercomputers and compute-intensive applications. In: 2022 IEEE High Performance Extreme Computing Conference (HPEC) September 19 - 23 1\u20138.","DOI":"10.1109\/HPEC55821.2022.9926296"},{"key":"e_1_3_4_20_1","doi-asserted-by":"crossref","unstructured":"Smith TM van de Geijn R Smelyanskiy M et al. (2014) Anatomy of high-performance many-threaded matrix multiplication. In: Proc. IEEE 28th Int. Parallel and Distributed Processing Symp. IPDPS\u201914 May 19-23 Phoenix Arizona USA. 1049\u20131059.","DOI":"10.1109\/IPDPS.2014.110"},{"key":"e_1_3_4_21_1","doi-asserted-by":"crossref","unstructured":"Socher R Perelygin A Wu J et al. (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing Seattle Washington USA October 18\u201321 2013. Association for Computational Linguistics 1631\u20131642. https:\/\/www.aclweb.org\/anthology\/D13-1170","DOI":"10.18653\/v1\/D13-1170"},{"key":"e_1_3_4_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2764454"},{"key":"e_1_3_4_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2755561"},{"key":"e_1_3_4_24_1","first-page":"5998","article-title":"Attention is all you need","volume":"30","author":"Vaswani A","year":"2017","unstructured":"Vaswani A, Shazeer N, Parmar N, et al. (2017) Attention is all you need. Advances in Neural Information Processing Systems 30: 5998\u20136008.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_4_25_1","doi-asserted-by":"publisher","unstructured":"Wu D Meng J Zhu W et al. (2024) autoGEMM: pushing the limits of irregular matrix multiplication on Arm architectures. In: Proceedings of the International Conference for High Performance Computing Networking Storage and Analysis SC \u201924 November 17 - 22 Atlanta Georgia. IEEE Press. ISBN 9798350352917. DOI: 10.1109\/SC41406.2024.00027.","DOI":"10.1109\/SC41406.2024.00027"},{"key":"e_1_3_4_26_1","unstructured":"Xiao G Lin J Seznec M et al. (2023) Smoothquant: accurate and efficient post-training quantization for large language models. In: Proceedings of the 40th International Conference on Machine Learning ICML\u201923 July 23-29 Honolulu HI USA. JMLR.org."},{"key":"e_1_3_4_27_1","doi-asserted-by":"publisher","unstructured":"Yang W Fang J Dong D et al (2021) LIBSHALOM: optimizing small and irregular-shaped matrix multiplications on ARMv8 multi-cores. In: Proceedings of the International Conference for High Performance Computing Networking Storage and Analysis SC \u201921 New York NY USA November 14\u201319 2021. Association for Computing Machinery. DOI: 10.1145\/3458817.3476217. ISBN 9781450384421.","DOI":"10.1145\/3458817.3476217"},{"key":"e_1_3_4_28_1","volume-title":"A Survey of Large Language Models","author":"Zhao WX","year":"2023","unstructured":"Zhao WX, Zhou K, Li J, et al. (2023) A Survey of Large Language Models. https:\/\/arxiv.org\/abs\/2303.18223"}],"container-title":["The International Journal of High Performance Computing Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/10943420251355115","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/10943420251355115","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/10943420251355115","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,17]],"date-time":"2025-11-17T04:10:55Z","timestamp":1763352655000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/10943420251355115"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,13]]},"references-count":27,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2025,11]]}},"alternative-id":["10.1177\/10943420251355115"],"URL":"https:\/\/doi.org\/10.1177\/10943420251355115","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"value":"1094-3420","type":"print"},{"value":"1741-2846","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,13]]}}}