{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,19]],"date-time":"2026-06-19T22:42:32Z","timestamp":1781908952328,"version":"3.54.5"},"reference-count":88,"publisher":"Association for Computing Machinery (ACM)","issue":"4","funder":[{"name":"ENFIELD project","award":["101120657"],"award-info":[{"award-number":["101120657"]}]},{"name":"NSFEuropean Commission within the HEU Programme, and the National Research Foundation of Korea","award":["RS-2023-00268071"],"award-info":[{"award-number":["RS-2023-00268071"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Internet Things"],"published-print":{"date-parts":[[2025,11,30]]},"abstract":"<jats:p>Deploying Large Language Models (LLMs) on edge devices presents significant challenges due to computational constraints, memory limitations, inference speed, and energy consumption. Model quantization has emerged as a key technique to enable efficient LLM inference by reducing model size and computational overhead. In this study, we conduct a comprehensive analysis of 28 quantized LLMs from the Ollama library, which applies by default Post-Training Quantization (PTQ) and weight-only quantization techniques, deployed on an edge device (Raspberry Pi 4 with 4 GB RAM). We evaluate energy efficiency, inference performance, and output accuracy across multiple quantization levels and task types. Models are benchmarked on five standardized datasets (CommonsenseQA, BIG-Bench Hard, TruthfulQA, GSM8K, and HumanEval), and we employ a high-resolution, hardware-based energy measurement tool to capture real-world power consumption. Our findings reveal the trade-offs between energy efficiency, inference speed, and accuracy in different quantization settings, highlighting configurations that optimize LLM deployment for resource-constrained environments. 
By integrating hardware-level energy profiling with LLM benchmarking, this study provides actionable insights for sustainable AI, bridging a critical gap in existing research on energy-aware LLM deployment.<\/jats:p>","DOI":"10.1145\/3767742","type":"journal-article","created":{"date-parts":[[2025,9,17]],"date-time":"2025-09-17T10:35:15Z","timestamp":1758105315000},"page":"1-35","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":23,"title":["Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency"],"prefix":"10.1145","volume":"6","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9325-1604","authenticated-orcid":false,"given":"Erik Johannes","family":"Husom","sequence":"first","affiliation":[{"name":"SINTEF Digital","place":["Oslo, Norway"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2170-2066","authenticated-orcid":false,"given":"Arda","family":"Goknil","sequence":"additional","affiliation":[{"name":"SINTEF Digital","place":["Oslo, Norway"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4181-963X","authenticated-orcid":false,"given":"Merve","family":"Astekin","sequence":"additional","affiliation":[{"name":"SINTEF Digital","place":["Oslo, Norway"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5130-0407","authenticated-orcid":false,"given":"Lwin Khin","family":"Shar","sequence":"additional","affiliation":[{"name":"Singapore Management University","place":["Singapore, Singapore"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-3745-9113","authenticated-orcid":false,"given":"Andre","family":"K\u00e5sen","sequence":"additional","affiliation":[{"name":"Oslo Metropolitan University","place":["Oslo, 
Norway"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5784-7355","authenticated-orcid":false,"given":"Sagar","family":"Sen","sequence":"additional","affiliation":[{"name":"SINTEF Digital","place":["Oslo, Norway"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-5875-4023","authenticated-orcid":false,"given":"Benedikt Andreas","family":"Mithassel","sequence":"additional","affiliation":[{"name":"SINTEF Digital","place":["Oslo, Norway"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6034-4137","authenticated-orcid":false,"given":"Ahmet","family":"Soylu","sequence":"additional","affiliation":[{"name":"Kristiania University College","place":["Oslo, Norway"]},{"name":"Seoul National University","place":["Oslo, Norway"]}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2025,11,18]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2024.3409745"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/3674805.3686684"},{"key":"e_1_3_1_4_2","unstructured":"Guangji Bai Zheng Chai Chen Ling Shiyu Wang Jiaying Lu Nan Zhang Tingwei Shi Ziyang Yu Mengdan Zhu Yifei Zhang Xinyuan Song Carl Yang Yue Cheng and Liang Zhao. 2024. Beyond efficiency: A systematic survey of resource-efficient large language models. arXiv:2401.00625. Retrieved from https:\/\/arxiv.org\/abs\/2401.00625"},{"key":"e_1_3_1_5_2","unstructured":"Colby Banbury Vijay Janapa Reddi Peter Torelli Jeremy Holleman Nat Jeffries Csaba Kiraly Pietro Montino David Kanter Sebastian Ahmed Danilo Pau Urmish Thakker Antonio Torrini Peter Warden Jay Cordaro Giuseppe Di Guglielmo Javier Duarte Stephen Gibellini Videet Parekh Honson Tran Nhan Tran Niu Wenxu and Xu Xuesong. 2021. Mlperf tiny benchmark. arXiv:2106.07597. 
Retrieved from https:\/\/arxiv.org\/abs\/2106.07597"},{"key":"e_1_3_1_6_2","volume-title":"https:\/\/github.com\/suzgunmirac\/BIG-Bench-Hard","unstructured":"BIG-Bench-Hard. Visited in 2024. Retrieved from https:\/\/github.com\/suzgunmirac\/BIG-Bench-Hard"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICWS62655.2024.00099"},{"key":"e_1_3_1_8_2","article-title":"FlexQuant: Elastic quantization framework for locally hosted LLM on edge devices","author":"Chai Yuji","year":"2025","unstructured":"Yuji Chai, Mujin Kwen, David Brooks, and Gu-Yeon Wei. 2025. FlexQuant: Elastic quantization framework for locally hosted LLM on edge devices. arXiv:2501.07139. Retrieved from https:\/\/arxiv.org\/abs\/2501.07139","journal-title":"arXiv:2501.07139."},{"key":"e_1_3_1_9_2","unstructured":"Mark Chen Jerry Tworek and etc. 2021. Evaluating large language models trained on code. arXiv:2107.03374. Retrieved from https:\/\/arxiv.org\/abs\/2107.03374"},{"key":"e_1_3_1_10_2","volume-title":"Retrieved from https:\/\/www.tau-nlp.sites.tau.ac.il\/commonsenseqa","unstructured":"CommonsenseQA. Visited in 2024. Retrieved from https:\/\/www.tau-nlp.sites.tau.ac.il\/commonsenseqa"},{"key":"e_1_3_1_11_2","article-title":"A performance evaluation of a quantized large language model on various smartphones","author":"\u00c7\u00f6pl\u00fc Tolga","year":"2023","unstructured":"Tolga \u00c7\u00f6pl\u00fc, Marc Loedi, Arto Bendiken, Mykhailo Makohin, Joshua J Bouw, and Stephen Cobb. 2023. A performance evaluation of a quantized large language model on various smartphones. arXiv:2312.12472. Retrieved from https:\/\/arxiv.org\/abs\/2312.12472","journal-title":"arXiv:2312.12472."},{"key":"e_1_3_1_12_2","unstructured":"NVIDIA Corporation. 2023. NVIDIA System Management Interface. 
Retrieved from https:\/\/docs.nvidia.com\/deploy\/nvidia-smi\/index.html"},{"key":"e_1_3_1_13_2","volume-title":"Proceedings of the NeurIPS 2023 Workshop on Backdoors in Deep Learning-The Good, the Bad, and the Ugly","author":"Deng Chunyuan","year":"2023","unstructured":"Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. 2023. Benchmark probing: Investigating data leakage in large language models. In Proceedings of the NeurIPS 2023 Workshop on Backdoors in Deep Learning-The Good, the Bad, and the Ugly."},{"key":"e_1_3_1_14_2","first-page":"30318","article-title":"Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale","volume":"35","author":"Dettmers Tim","year":"2022","unstructured":"Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems 35 (2022), 30318\u201330332.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_15_2","article-title":"Bert: Pre-training of deep bidirectional transformers for language understanding","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Retrieved from https:\/\/arxiv.org\/abs\/1810.04805","journal-title":"arXiv:1810.04805."},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1148\/radiol.240320"},{"key":"e_1_3_1_17_2","volume-title":"https:\/\/github.com\/ejhusom\/neurips-edge-llm-challenge%-sampled\/","author":"Repo Experiment","unstructured":"Experiment Repo. Visited in 2025. 
Retrieved from https:\/\/github.com\/ejhusom\/neurips-edge-llm-challenge%-sampled\/"},{"key":"e_1_3_1_18_2","article-title":"Gptq: Accurate post-training quantization for generative pre-trained transformers","author":"Frantar Elias","year":"2022","unstructured":"Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv:2210.17323. Retrieved from https:\/\/arxiv.org\/abs\/2210.17323","journal-title":"arXiv:2210.17323."},{"key":"e_1_3_1_19_2","volume-title":"Proceedings of the 11th International Conference on Learning Representations","author":"Frantar Elias","year":"2022","unstructured":"Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. OPTQ: Accurate quantization for generative pre-trained transformers. In Proceedings of the 11th International Conference on Learning Representations."},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/OJCOMS.2024.3456549"},{"key":"e_1_3_1_21_2","article-title":"llama.cpp","unstructured":"ggml-org. Visited in 2025. llama.cpp. Retrieved from https:\/\/github.com\/ggml-org\/llama.cpp","journal-title":"Retrieved from https:\/\/github.com\/ggml-org\/llama.cpp"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1201\/9781003162810-13"},{"key":"e_1_3_1_23_2","volume-title":"Retrieved from https:\/\/github.com\/openai\/grade-school-math","unstructured":"GSM8K. Visited in 2024. Retrieved from https:\/\/github.com\/openai\/grade-school-math"},{"key":"e_1_3_1_24_2","article-title":"Designing efficient LLM accelerators for edge devices","author":"Haris Jude","year":"2024","unstructured":"Jude Haris, Rappy Saha, Wenhao Hu, and Jos\u00e9 Cano. 2024. Designing efficient LLM accelerators for edge devices. arXiv:2408.00462. 
Retrieved from https:\/\/arxiv.org\/abs\/2408.00462","journal-title":"arXiv:2408.00462."},{"key":"e_1_3_1_25_2","article-title":"Optimizing large language models through quantization: A comparative analysis of PTQ and QAT techniques","author":"Hasan Jahid","year":"2024","unstructured":"Jahid Hasan. 2024. Optimizing large language models through quantization: A comparative analysis of PTQ and QAT techniques. arXiv:2411.06084. Retrieved from https:\/\/arxiv.org\/abs\/2411.06084","journal-title":"arXiv:2411.06084."},{"key":"e_1_3_1_26_2","article-title":"I-LLM: Efficient integer-only inference for fully-quantized low-bit large language models","author":"Hu Xing","year":"2024","unstructured":"Xing Hu, Yuan Chen, Dawei Yang, Sifan Zhou, Zhihang Yuan, Jiangyong Yu, and Chen Xu. 2024. I-LLM: Efficient integer-only inference for fully-quantized low-bit large language models. arXiv:2405.17849. Retrieved from https:\/\/arxiv.org\/abs\/2405.17849","journal-title":"arXiv:2405.17849."},{"key":"e_1_3_1_27_2","article-title":"Billm: Pushing the limit of post-training quantization for llms","author":"Huang Wei","year":"2024","unstructured":"Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xiaojuan Qi. 2024. Billm: Pushing the limit of post-training quantization for llms. arXiv:2402.04291. Retrieved from https:\/\/arxiv.org\/abs\/2402.04291","journal-title":"arXiv:2402.04291."},{"key":"e_1_3_1_28_2","article-title":"How good are low-bit quantized llama3 models? An empirical study","author":"Huang Wei","year":"2024","unstructured":"Wei Huang, Xudong Ma, Haotong Qin, Xingyu Zheng, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, and Michele Magno. 2024. How good are low-bit quantized llama3 models? An empirical study. arXiv:2404.14047. 
Retrieved from https:\/\/arxiv.org\/abs\/2404.14047","journal-title":"arXiv:2404.14047."},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1007\/s44267-024-00070-x"},{"key":"e_1_3_1_30_2","volume-title":"Retrieved from https:\/\/github.com\/openai\/human-eval","unstructured":"HumanEval. Visited in 2024. Retrieved from https:\/\/github.com\/openai\/human-eval"},{"key":"e_1_3_1_31_2","article-title":"The price of prompting: Profiling energy use in large language models inference","author":"Husom Erik Johannes","year":"2024","unstructured":"Erik Johannes Husom, Arda Goknil, Lwin Khin Shar, and Sagar Sen. 2024. The price of prompting: Profiling energy use in large language models inference. arXiv:2407.16893. Retrieved from https:\/\/arxiv.org\/abs\/2407.16893","journal-title":"arXiv:2407.16893."},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2019.00447"},{"key":"e_1_3_1_33_2","article-title":"Compressing llms: The truth is rarely pure and never simple","author":"Jaiswal Ajay","year":"2023","unstructured":"Ajay Jaiswal, Zhe Gan, Xianzhi Du, Bowen Zhang, Zhangyang Wang, and Yinfei Yang. 2023. Compressing llms: The truth is rarely pure and never simple. arXiv:2310.01382. Retrieved from https:\/\/arxiv.org\/abs\/2310.01382","journal-title":"arXiv:2310.01382."},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/JETCAS.2024.3427421"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-acl.726"},{"key":"e_1_3_1_36_2","volume-title":"Retrieved from https:\/\/www.microsoft.com\/en-us\/research\/project\/joulemeter-computational-energy-measurement-and-optimization\/","unstructured":"Joulemeter. Visited in 2025. Retrieved from https:\/\/www.microsoft.com\/en-us\/research\/project\/joulemeter-computational-energy-measurement-and-optimization\/"},{"key":"e_1_3_1_37_2","volume-title":"Retrieved from https:\/\/www.joulescope.com\/","unstructured":"Joulescope. Visited in 2025. 
Retrieved from https:\/\/www.joulescope.com\/"},{"key":"e_1_3_1_38_2","article-title":"Decentralized LLM inference over edge networks with energy harvesting","author":"Khoshsirat Aria","year":"2024","unstructured":"Aria Khoshsirat, Giovanni Perin, and Michele Rossi. 2024. Decentralized LLM inference over edge networks with energy harvesting. arXiv:2408.15907. Retrieved from https:\/\/arxiv.org\/abs\/2408.15907","journal-title":"arXiv:2408.15907."},{"key":"e_1_3_1_39_2","article-title":"A comprehensive study on quantization techniques for large language models","author":"Lang Jiedong","year":"2024","unstructured":"Jiedong Lang, Zhehao Guo, and Shuyu Huang. 2024. A comprehensive study on quantization techniques for large language models. arXiv:2411.02530. Retrieved from https:\/\/arxiv.org\/abs\/2411.02530","journal-title":"arXiv:2411.02530."},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00655"},{"key":"e_1_3_1_41_2","article-title":"Evaluating quantized large language models","author":"Li Shiyao","year":"2024","unstructured":"Shiyao Li, Xuefei Ning, Luning Wang, Tengxuan Liu, Xiangsheng Shi, Shengen Yan, Guohao Dai, Huazhong Yang, and Yu Wang. 2024. Evaluating quantized large language models. arXiv:2402.18158. Retrieved from https:\/\/arxiv.org\/abs\/2402.18158","journal-title":"arXiv:2402.18158."},{"key":"e_1_3_1_42_2","article-title":"PalmBench: A comprehensive benchmark of compressed large language models on mobile platforms","author":"Li Yilong","year":"2024","unstructured":"Yilong Li, Jingyu Liu, Hao Zhang, M Badri Narayanan, Utkarsh Sharma, Shuai Zhang, Pan Hu, Yijing Zeng, Jayaram Raghuram, and Suman Banerjee. 2024. PalmBench: A comprehensive benchmark of compressed large language models on mobile platforms. arXiv:2410.05315. 
Retrieved from https:\/\/arxiv.org\/abs\/2410.05315","journal-title":"arXiv:2410.05315."},{"key":"e_1_3_1_43_2","article-title":"TPI-LLM: Serving 70B-scale LLMs efficiently on low-resource edge devices","author":"Li Zonghang","year":"2024","unstructured":"Zonghang Li, Wenjiao Feng, Mohsen Guizani, and Hongfang Yu. 2024. TPI-LLM: Serving 70B-scale LLMs efficiently on low-resource edge devices. arXiv:2410.00531. Retrieved from https:\/\/arxiv.org\/abs\/2410.00531","journal-title":"arXiv:2410.00531."},{"key":"e_1_3_1_44_2","article-title":"Arb-llm: Alternating refined binarizations for large language models","author":"Li Zhiteng","year":"2024","unstructured":"Zhiteng Li, Xianglong Yan, Tianao Zhang, Haotong Qin, Dong Xie, Jiang Tian, Linghe Kong, Yulun Zhang, and Xiaokang Yang. 2024. Arb-llm: Alternating refined binarizations for large language models. arXiv:2410.03129. Retrieved from https:\/\/arxiv.org\/abs\/2410.03129","journal-title":"arXiv:2410.03129."},{"key":"e_1_3_1_45_2","first-page":"74","volume-title":"Proceedings of the Text Summarization Branches Out","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out. 74\u201381."},{"key":"e_1_3_1_46_2","first-page":"87","article-title":"AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration","volume":"6","author":"Lin Ji","year":"2024","unstructured":"Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. 
Proceedings of Machine Learning and Systems 6 (2024), 87\u2013100.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_1_47_2","article-title":"Do emergent abilities exist in quantized large language models: An empirical study","author":"Liu Peiyu","year":"2023","unstructured":"Peiyu Liu, Zikang Liu, Ze-Feng Gao, Dawei Gao, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. 2023. Do emergent abilities exist in quantized large language models: An empirical study. arXiv:2307.08072. Retrieved from https:\/\/arxiv.org\/abs\/2307.08072","journal-title":"arXiv:2307.08072."},{"key":"e_1_3_1_48_2","volume-title":"Proceedings of the NeurIPS 2024 Competition Track","author":"Liu Shiwei","year":"2024","unstructured":"Shiwei Liu, Kai Han, Adriana Fernandez-Lopez, Ajay Kumar Jaiswal, Zahra Atashgahi, Boqian Wu, Edoardo Ponti, Callie Hao, Rebekka Burkholz, Olga Saukh, Lu Yin, Andreas Zinonos, Tianjin Huang, Jared Tanner, and Yunhe Wang. 2024. Edge-LLMs: Edge-device large language model competition. In Proceedings of the NeurIPS 2024 Competition Track."},{"key":"e_1_3_1_49_2","article-title":"Evaluating the generalization ability of quantized llms: Benchmark, analysis, and toolbox","author":"Liu Yijun","year":"2024","unstructured":"Yijun Liu, Yuan Meng, Fang Wu, Shenhao Peng, Hang Yao, Chaoyu Guan, Chen Tang, Xinzhu Ma, Zhi Wang, and Wenwu Zhu. 2024. Evaluating the generalization ability of quantized llms: Benchmark, analysis, and toolbox. arXiv:2406.12928. Retrieved from https:\/\/arxiv.org\/abs\/2406.12928","journal-title":"arXiv:2406.12928."},{"key":"e_1_3_1_50_2","volume-title":"Retrieved from https:\/\/github.com\/ggerganov\/llama.cpp","unstructured":"Llama.cpp. Visited in 2024. Retrieved from https:\/\/github.com\/ggerganov\/llama.cpp"},{"key":"e_1_3_1_51_2","article-title":"On the compressibility of quantized large language models","author":"Mao Yu","year":"2024","unstructured":"Yu Mao, Weilan Wang, Hongchao Du, Nan Guan, and Chun Jason Xue. 
2024. On the compressibility of quantized large language models. arXiv:2403.01384. Retrieved from https:\/\/arxiv.org\/abs\/2403.01384","journal-title":"arXiv:2403.01384."},{"key":"e_1_3_1_52_2","volume-title":"Retrieved from https:\/\/www.msoon.com\/high-voltage-power-monitor\/","unstructured":"Monsoon. Visited in 2025. Retrieved from https:\/\/www.msoon.com\/high-voltage-power-monitor\/"},{"key":"e_1_3_1_53_2","volume-title":"Retrieved from https:\/\/github.com\/ollama\/ollama","author":"framework Ollama","unstructured":"Ollama framework. Visited in 2025. Retrieved from https:\/\/github.com\/ollama\/ollama"},{"key":"e_1_3_1_54_2","volume-title":"Retrieved from https:\/\/ollama.com\/library","author":"Library Ollama","unstructured":"Ollama Library. Visited in 2025. Retrieved from https:\/\/ollama.com\/library"},{"key":"e_1_3_1_55_2","article-title":"Lut-gemm: Quantized matrix multiplication based on luts for efficient inference in large-scale generative language models","author":"Park Gunho","year":"2022","unstructured":"Gunho Park, Baeseong Park, Minsub Kim, Sungjae Lee, Jeonghoon Kim, Beomseok Kwon, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, and Dongsoo Lee. 2022. Lut-gemm: Quantized matrix multiplication based on luts for efficient inference in large-scale generative language models. arXiv:2206.09557. Retrieved from https:\/\/arxiv.org\/abs\/2206.09557","journal-title":"arXiv:2206.09557."},{"key":"e_1_3_1_56_2","article-title":"PyJoules: Python-based energy measurement library","unstructured":"PowerAPI. Visited in 2025. PyJoules: Python-based energy measurement library. 
Retrieved from https:\/\/github.com\/powerapi-ng\/pyJoules","journal-title":"Retrieved from https:\/\/github.com\/powerapi-ng\/pyJoules"},{"key":"e_1_3_1_57_2","volume-title":"Proceedings of the 41st International Conference on Machine Learning (ICML\u201924)","author":"Qin Haotong","year":"2024","unstructured":"Haotong Qin, Xudong Ma, Xingyu Zheng, Xiaoyang Li, Yang Zhang, Shouda Liu, Jie Luo, Xianglong Liu, and Michele Magno. 2024. Accurate LoRA-finetuning quantization of LLMs via information retention. In Proceedings of the 41st International Conference on Machine Learning (ICML\u201924). Article 1687, 19 pages."},{"key":"e_1_3_1_58_2","article-title":"Empirical guidelines for deploying LLMs onto resource-constrained edge devices","author":"Qin Ruiyang","year":"2024","unstructured":"Ruiyang Qin, Dancheng Liu, Zheyu Yan, Zhaoxuan Tan, Zixuan Pan, Zhenge Jia, Meng Jiang, Ahmed Abbasi, Jinjun Xiong, and Yiyu Shi. 2024. Empirical guidelines for deploying LLMs onto resource-constrained edge devices. arXiv:2406.03777. Retrieved from https:\/\/arxiv.org\/abs\/2406.03777","journal-title":"arXiv:2406.03777."},{"key":"e_1_3_1_59_2","article-title":"Mobile edge intelligence for large language models: A contemporary survey","author":"Qu Guanqiao","year":"2024","unstructured":"Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, and Kaibin Huang. 2024. Mobile edge intelligence for large language models: A contemporary survey. arXiv:2407.18921. Retrieved from https:\/\/arxiv.org\/abs\/2407.18921","journal-title":"arXiv:2407.18921."},{"key":"e_1_3_1_60_2","unstructured":"Alec Radford Karthik Narasimhan Tim Salimans and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. 
(2018)."},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICMLA58977.2023.00104"},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00045"},{"key":"e_1_3_1_63_2","volume-title":"Retrieved from https:\/\/github.com\/hubblo-org\/scaphandre","unstructured":"Scaphandre. Visited in 2025. Retrieved from https:\/\/github.com\/hubblo-org\/scaphandre"},{"key":"e_1_3_1_64_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i17.29860"},{"key":"e_1_3_1_65_2","article-title":"Edgeqat: Entropy and distribution guided quantization-aware training for the acceleration of lightweight llms on the edge","author":"Shen Xuan","year":"2024","unstructured":"Xuan Shen, Zhenglun Kong, Changdi Yang, Zhaoyang Han, Lei Lu, Peiyan Dong, Cheng Lyu, Chih-hsiang Li, Xuehang Guo, Zhihao Shu, Wei Niu, Miriam Leeser, Pu Zhao, and Yanzhi Wang. 2024. Edgeqat: Entropy and distribution guided quantization-aware training for the acceleration of lightweight llms on the edge. arXiv:2402.10787. Retrieved from https:\/\/arxiv.org\/abs\/2402.10787","journal-title":"arXiv:2402.10787."},{"key":"e_1_3_1_66_2","first-page":"31094","volume-title":"Proceedings of the ICML\u201923","author":"Sheng Ying","year":"2023","unstructured":"Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher R\u00e9, Ion Stoica, and Ce Zhang. 2023. Flexgen: High-throughput generative inference of large language models with a single gpu. In Proceedings of the ICML\u201923. 31094\u201331116."},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.1145\/3639475.3640097"},{"key":"e_1_3_1_68_2","article-title":"Mariogpt: Open-ended text2level generation through large language models","author":"Sudhakaran Shyam","year":"2023","unstructured":"Shyam Sudhakaran, Miguel Gonz\u00e1lez-Duque, Matthias Freiberger, Claire Glanois, Elias Najarro, and Sebastian Risi. 2023. 
Mariogpt: Open-ended text2level generation through large language models. Advances in Neural Information Processing Systems 36 (2023), 54123\u201354227.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_69_2","article-title":"Mobilebert: A compact task-agnostic bert for resource-limited devices","author":"Sun Zhiqing","year":"2020","unstructured":"Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. Mobilebert: A compact task-agnostic bert for resource-limited devices. arXiv:2004.02984. Retrieved from https:\/\/arxiv.org\/abs\/2004.02984","journal-title":"arXiv:2004.02984."},{"key":"e_1_3_1_70_2","article-title":"Mobilequant: Mobile-friendly quantization for on-device language models","author":"Tan Fuwen","year":"2024","unstructured":"Fuwen Tan, Royson Lee, \u0141ukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, and Brais Martinez. 2024. Mobilequant: Mobile-friendly quantization for on-device language models. arXiv:2408.13933. Retrieved from https:\/\/arxiv.org\/abs\/2408.13933","journal-title":"arXiv:2408.13933."},{"key":"e_1_3_1_71_2","doi-asserted-by":"publisher","DOI":"10.1109\/IWQoS61813.2024.10682928"},{"key":"e_1_3_1_72_2","article-title":"Llama: Open and efficient foundation language models","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models. arXiv:2302.13971. Retrieved from https:\/\/arxiv.org\/abs\/2302.13971","journal-title":"arXiv:2302.13971."},{"key":"e_1_3_1_73_2","volume-title":"Retrieved from https:\/\/github.com\/sylinrl\/TruthfulQA","unstructured":"TruthfulQA. Visited in 2024. 
Retrieved from https:\/\/github.com\/sylinrl\/TruthfulQA"},{"key":"e_1_3_1_74_2","article-title":"MLPerf power: Benchmarking the energy efficiency of machine learning systems from microwatts to megawatts for sustainable AI","author":"Tschand Arya","year":"2024","unstructured":"Arya Tschand, Arun Tejusve Raghunath Rajan, Sachin Idgunji, Anirban Ghosh, Jeremy Holleman, Csaba Kiraly, Pawan Ambalkar, Ritika Borkar, Ramesh Chukka, Trevor Cockrell, Oliver Curtis, Grigori Fursin, Miro Hodak, Hiwot Kassa, Anton Lokhmotov, Dejan Miskovic, Yuechao Pan, Manu Prasad Manmathan, Liz Raymond, Tom St. John, Arjun Suresh, Rowan Taubitz, Sean Zhan, Scott Wasson, David Kanter, and Vijay Janapa Reddi. 2024. MLPerf power: Benchmarking the energy efficiency of machine learning systems from microwatts to megawatts for sustainable AI. arXiv:2410.12032. Retrieved from https:\/\/arxiv.org\/abs\/2410.12032","journal-title":"arXiv:2410.12032."},{"key":"e_1_3_1_75_2","volume-title":"Retrieved from https:\/\/joy-it.net\/en\/products\/JT-TC66C","author":"Volt-\/Amperemeter USB","unstructured":"USB Volt-\/Amperemeter. Visited in 2025. Retrieved from https:\/\/joy-it.net\/en\/products\/JT-TC66C"},{"key":"e_1_3_1_76_2","first-page":"2832","volume-title":"Proceedings of the SIGIR \u201924","author":"Schaik Tempest A. van","year":"2024","unstructured":"Tempest A. van Schaik and Brittany Pugh. 2024. A field guide to automatic evaluation of LLM-generated summaries. In Proceedings of the SIGIR \u201924. ACM, 2832\u20132836."},{"key":"e_1_3_1_77_2","article-title":"Model compression and efficient inference for large language models: A survey","author":"Wang Wenxiao","year":"2024","unstructured":"Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, and Xiaofei He. 2024. Model compression and efficient inference for large language models: A survey. arXiv:2402.09748. 
Retrieved from https:\/\/arxiv.org\/abs\/2402.09748","journal-title":"arXiv:2402.09748."},{"key":"e_1_3_1_78_2","article-title":"T-mac: Cpu renaissance via table lookup for low-bit llm deployment on edge","author":"Wei Jianyu","year":"2024","unstructured":"Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, and Mao Yang. 2024. T-mac: Cpu renaissance via table lookup for low-bit llm deployment on edge. arXiv:2407.00088. Retrieved from https:\/\/arxiv.org\/abs\/2407.00088","journal-title":"arXiv:2407.00088."},{"key":"e_1_3_1_79_2","first-page":"17402","article-title":"Outlier suppression: Pushing the limit of low-bit transformer language models","volume":"35","author":"Wei Xiuying","year":"2022","unstructured":"Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, and Xianglong Liu. 2022. Outlier suppression: Pushing the limit of low-bit transformer language models. Advances in Neural Information Processing Systems 35 (2022), 17402\u201317414.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_80_2","article-title":"Exploring the potential of large language models for automation in technical customer service","author":"Wulf Jochen","year":"2024","unstructured":"Jochen Wulf and Juerg Meierhofer. 2024. Exploring the potential of large language models for automation in technical customer service. arXiv:2405.09161. Retrieved from https:\/\/arxiv.org\/abs\/2405.09161","journal-title":"arXiv:2405.09161."},{"key":"e_1_3_1_81_2","first-page":"38087","volume-title":"Proceedings of the ICML\u201923","author":"Xiao Guangxuan","year":"2023","unstructured":"Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. Smoothquant: Accurate and efficient post-training quantization for large language models. In Proceedings of the ICML\u201923. 
38087\u201338099."},{"key":"e_1_3_1_82_2","doi-asserted-by":"publisher","DOI":"10.1109\/COMST.2024.3353265"},{"key":"e_1_3_1_83_2","first-page":"27168","article-title":"Zeroquant: Efficient and affordable post-training quantization for large-scale transformers","volume":"35","author":"Yao Zhewei","year":"2022","unstructured":"Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. 2022. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. Advances in Neural Information Processing Systems 35 (2022), 27168\u201327183.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_84_2","article-title":"Llm as a system service on mobile devices","author":"Yin Wangsong","year":"2024","unstructured":"Wangsong Yin, Mengwei Xu, Yuanchun Li, and Xuanzhe Liu. 2024. Llm as a system service on mobile devices. arXiv:2403.11805. Retrieved from https:\/\/arxiv.org\/abs\/2403.11805","journal-title":"arXiv:2403.11805."},{"key":"e_1_3_1_85_2","doi-asserted-by":"publisher","DOI":"10.1145\/3649329.3658473"},{"key":"e_1_3_1_86_2","doi-asserted-by":"publisher","DOI":"10.1109\/IWCMC61514.2024.10592339"},{"key":"e_1_3_1_87_2","article-title":"EdgeShard: Efficient LLM inference via collaborative edge computing","author":"Zhang Mingjin","year":"2024","unstructured":"Mingjin Zhang, Jiannong Cao, Xiaoming Shen, and Zeyang Cui. 2024. EdgeShard: Efficient LLM inference via collaborative edge computing. arXiv:2405.14371. Retrieved from https:\/\/arxiv.org\/abs\/2405.14371","journal-title":"arXiv:2405.14371."},{"key":"e_1_3_1_88_2","article-title":"Can chatgpt understand too? A comparative study on chatgpt and fine-tuned bert","author":"Zhong Qihuang","year":"2023","unstructured":"Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2023. Can chatgpt understand too? A comparative study on chatgpt and fine-tuned bert. arXiv:2302.10198. 
Retrieved from https:\/\/arxiv.org\/abs\/2302.10198","journal-title":"arXiv:2302.10198."},{"key":"e_1_3_1_89_2","article-title":"A survey on efficient inference for large language models","author":"Zhou Zixuan","year":"2024","unstructured":"Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, and Yu Wang. 2024. A survey on efficient inference for large language models. arXiv:2404.14294. Retrieved from https:\/\/arxiv.org\/abs\/2404.14294","journal-title":"arXiv:2404.14294."}],"container-title":["ACM Transactions on Internet of Things"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3767742","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T21:23:35Z","timestamp":1763501015000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3767742"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,18]]},"references-count":88,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2025,11,30]]}},"alternative-id":["10.1145\/3767742"],"URL":"https:\/\/doi.org\/10.1145\/3767742","relation":{},"ISSN":["2691-1914","2577-6207"],"issn-type":[{"value":"2691-1914","type":"print"},{"value":"2577-6207","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,18]]},"assertion":[{"value":"2025-02-28","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-08-25","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-11-18","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}