{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,3]],"date-time":"2025-12-03T16:04:53Z","timestamp":1764777893900,"version":"3.46.0"},"publisher-location":"New York, NY, USA","reference-count":68,"publisher":"ACM","funder":[{"name":"Toyota Motor North America R&D Infotech Labs"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,12,3]]},"DOI":"10.1145\/3769102.3770614","type":"proceedings-article","created":{"date-parts":[[2025,12,3]],"date-time":"2025-12-03T16:00:41Z","timestamp":1764777641000},"page":"1-17","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["lm-Meter: Unveiling Runtime Inference Latency for On-Device Language Models"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8732-6200","authenticated-orcid":false,"given":"Haoxin","family":"Wang","sequence":"first","affiliation":[{"name":"Computer Science, Georgia State University, Atlanta, GA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-9396-5383","authenticated-orcid":false,"given":"Xiaolong","family":"Tu","sequence":"additional","affiliation":[{"name":"Computer Science, Georgia State University, Atlanta, GA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-5653-9814","authenticated-orcid":false,"given":"Hongyu","family":"Ke","sequence":"additional","affiliation":[{"name":"Computer Science, Georgia State University, Atlanta, GA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-6255-169X","authenticated-orcid":false,"given":"Huirong","family":"Chai","sequence":"additional","affiliation":[{"name":"Computer Science, Georgia State University, Atlanta, GA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4162-1423","authenticated-orcid":false,"given":"Dawei","family":"Chen","sequence":"additional","affiliation":[{"name":"Infotech Labs, Toyota Motor North America R&amp;D, Mountain View, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8291-5025","authenticated-orcid":false,"given":"Kyungtae","family":"Han","sequence":"additional","affiliation":[{"name":"Infotech Labs, Toyota Motor North America R&amp;D, Mountain View, CA, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,12,3]]},"reference":[{"key":"e_1_3_2_1_1_1","first-page":"1877","article-title":"Language models are few-shot learners","volume":"33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877\u20131901, 2020.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_2_1","volume-title":"Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774","author":"Achiam Josh","year":"2023","unstructured":"Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023."},{"key":"e_1_3_2_1_3_1","volume-title":"et al. Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:2302.13971","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023."},{"key":"e_1_3_2_1_4_1","volume-title":"Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948","author":"Guo Daya","year":"2025","unstructured":"Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025."},{"key":"e_1_3_2_1_5_1","volume-title":"Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437","author":"Liu Aixin","year":"2024","unstructured":"Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024."},{"key":"e_1_3_2_1_6_1","first-page":"4186","volume-title":"Proc. the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proc. the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171\u20134186, 2019."},{"issue":"8","key":"e_1_3_2_1_7_1","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.","journal-title":"OpenAI blog"},{"key":"e_1_3_2_1_8_1","volume-title":"Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374","author":"Chen Mark","year":"2021","unstructured":"Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021."},{"issue":"240","key":"e_1_3_2_1_9_1","first-page":"1","article-title":"Palm: Scaling language modeling with pathways","volume":"24","author":"Chowdhery Aakanksha","year":"2023","unstructured":"Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1\u2013113, 2023.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_1_10_1","volume-title":"Megatron-lm: Training multibillion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 
Megatron-lm: Training multibillion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019."},{"key":"e_1_3_2_1_11_1","volume-title":"Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318","author":"Chen Charlie","year":"2023","unstructured":"Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023."},{"key":"e_1_3_2_1_12_1","first-page":"16344","article-title":"Flashattention: Fast and memory-efficient exact attention with ioawareness","volume":"35","author":"Dao Tri","year":"2022","unstructured":"Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. Flashattention: Fast and memory-efficient exact attention with ioawareness. Advances in Neural Information Processing Systems, 35:16344\u201316359, 2022.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_13_1","volume-title":"The future of consumer edge-ai computing","author":"Laskaridis Stefanos","year":"2024","unstructured":"Stefanos Laskaridis, Stylianos I Venieris, Alexandros Kouris, Rui Li, and Nicholas D Lane. The future of consumer edge-ai computing. IEEE Pervasive Computing, 2024."},{"key":"e_1_3_2_1_14_1","unstructured":"OpenAI ChatGPT. https:\/\/openai.com\/index\/chatgpt\/. Accessed on June 2025."},{"key":"e_1_3_2_1_15_1","unstructured":"Google NotebookLM. https:\/\/notebooklm.google.com\/. Accessed on June 2025."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMC.2022.3179943"},{"key":"e_1_3_2_1_17_1","first-page":"1388","volume-title":"Proc. IEEE INFOCOM","author":"Wang Haoxin","year":"2020","unstructured":"Haoxin Wang and Jiang Xie. User preference based energy-aware mobile AR system with edge computing. In Proc. IEEE INFOCOM, pages 1379\u20131388, 2020."},{"key":"e_1_3_2_1_18_1","first-page":"6","volume-title":"Proc. IEEE ICC","author":"Wang Haoxin","year":"2017","unstructured":"Haoxin Wang, Jiang Xie, and Tao Han. V-handoff: A practical energy efficient handoff for 802.11 infrastructure networks. In Proc. IEEE ICC, pages 1\u20136, 2017."},{"key":"e_1_3_2_1_19_1","volume-title":"Carboncp: Carbon-aware dnn partitioning with conformal prediction for sustainable edge intelligence. arXiv preprint arXiv:2404.16970","author":"Ke Hongyu","year":"2024","unstructured":"Hongyu Ke, Wanxin Jin, and Haoxin Wang. Carboncp: Carbon-aware dnn partitioning with conformal prediction for sustainable edge intelligence. arXiv preprint arXiv:2404.16970, 2024."},{"key":"e_1_3_2_1_20_1","volume-title":"Forty-first International Conference on Machine Learning","author":"Liu Zechun","year":"2024","unstructured":"Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, et al. MobileLLM: Optimizing sub-billion parameter language models for on-device use cases. In Forty-first International Conference on Machine Learning, 2024."},{"key":"e_1_3_2_1_21_1","volume-title":"Nvidia H100 tensor core gpu. https:\/\/www.nvidia.com\/en-us\/data-center\/h100\/","author":"NVIDIA Corporation","year":"2025","unstructured":"NVIDIA Corporation. Nvidia H100 tensor core gpu. https:\/\/www.nvidia.com\/en-us\/data-center\/h100\/, 2025. Accessed: Feb. 2025."},{"key":"e_1_3_2_1_22_1","volume-title":"Meta will have 600k h100-equivalent gpus. 
https:\/\/x.com\/soumithchintala\/status\/1748074223187173724","author":"Chintala Soumith","year":"2024","unstructured":"Soumith Chintala. Meta will have 600k h100-equivalent gpus. https:\/\/x.com\/soumithchintala\/status\/1748074223187173724, 2024. Accessed: 2025-02-01."},{"key":"e_1_3_2_1_23_1","unstructured":"Health Insurance Portability and Accountability Act of 1996 (HIPAA). https:\/\/www.cdc.gov\/phlp\/php\/resources\/health-insurance-portability-and-accountability-act-of-1996-hipaa.html#:~:text=At%20a%20glance Rule%20to%20implement%20HIPAA%20requirements. Accessed on June 2025."},{"issue":"1","key":"e_1_3_2_1_24_1","doi-asserted-by":"crossref","first-page":"35","DOI":"10.1109\/MCAS.2024.3476008","article-title":"Collaborative hardware and software design in the era of large language models","volume":"25","author":"Guo Cong","year":"2025","unstructured":"Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai Li, and Yiran Chen. A survey: Collaborative hardware and software design in the era of large language models. IEEE Circuits and Systems Magazine, 25(1):35\u201357, 2025.","journal-title":"IEEE Circuits and Systems Magazine"},{"key":"e_1_3_2_1_25_1","volume-title":"Qwen technical report. arXiv preprint arXiv:2309.16609","author":"Bai Jinze","year":"2023","unstructured":"Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023."},{"key":"e_1_3_2_1_26_1","first-page":"93","volume-title":"Proc. the 19th Annual International Conference on Mobile Systems, Applications, and Services","author":"Zhang Li Lyna","year":"2021","unstructured":"Li Lyna Zhang, Shihao Han, Jianyu Wei, Ningxin Zheng, Ting Cao, Yuqing Yang, and Yunxin Liu. Nn-meter: Towards accurate latency prediction of deep-learning model inference on diverse edge devices. In Proc. the 19th Annual International Conference on Mobile Systems, Applications, and Services, pages 81\u201393, 2021."},{"key":"e_1_3_2_1_27_1","first-page":"1477","volume-title":"Proc. 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)","author":"Feng Chengquan","year":"2024","unstructured":"Chengquan Feng, Li Lyna Zhang, Yuanchi Liu, Jiahang Xu, Chengruidong Zhang, Zhiyuan Wang, Ting Cao, Mao Yang, and Haisheng Tan. LitePred: Transferable and scalable latency prediction for hardware-aware neural architecture search. In Proc. 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 1463\u20131477, 2024."},{"key":"e_1_3_2_1_28_1","first-page":"93","volume-title":"Proc. the Eighth ACM\/IEEE Symposium on Edge Computing","author":"Tu Xiaolong","year":"2023","unstructured":"Xiaolong Tu, Anik Mallik, Dawei Chen, Kyungtae Han, Onur Altintas, Haoxin Wang, and Jiang Xie. 
Unveiling energy efficiency in deep learning: Measurement, prediction, and scoring across edge devices. In Proc. the Eighth ACM\/IEEE Symposium on Edge Computing, pages 80\u201393, 2023."},{"key":"e_1_3_2_1_29_1","first-page":"959","volume-title":"Proc. IEEE ICC","author":"Mallik Anik","year":"2023","unstructured":"Anik Mallik, Haoxin Wang, Jiang Xie, Dawei Chen, and Kyungtae Han. EPAM: A predictive energy model for mobile AI. In Proc. IEEE ICC, pages 954\u2013959, 2023."},{"key":"e_1_3_2_1_30_1","volume-title":"Proc. Tackling Climate Change with Machine Learning: Workshop at NeurIPS 2023","author":"Tu Xiaolong","year":"2023","unstructured":"Xiaolong Tu, Anik Mallik, Haoxin Wang, and Jiang Xie. Deepen2023: Energy datasets for edge artificial intelligence. In Proc. Tackling Climate Change with Machine Learning: Workshop at NeurIPS 2023, 2023."},{"key":"e_1_3_2_1_31_1","first-page":"12","volume-title":"Proc. the 26th International Workshop on Mobile Computing Systems and Applications","author":"Tu Xiaolong","year":"2025","unstructured":"Xiaolong Tu, Dawei Chen, Kyungtae Han, Onur Altintas, and Haoxin Wang. Greenauto: An automated platform for sustainable AI model design on edge devices. In Proc. the 26th International Workshop on Mobile Computing Systems and Applications, pages 7\u201312, 2025."},{"key":"e_1_3_2_1_32_1","unstructured":"Leo Gao Jonathan Tow Baber Abbasi Stella Biderman Sid Black Anthony DiPofi Charles Foster Laurence Golding Jeffrey Hsu Alain Le Noac'h Haonan Li Kyle McDonell Niklas Muennighoff Chris Ociepa Jason Phang Laria Reynolds Hailey Schoelkopf Aviya Skowron Lintang Sutawika Eric Tang Anish Thite Ben Wang Kevin Wang and Andy Zou. The language model evaluation harness 07 2024."},{"key":"e_1_3_2_1_33_1","unstructured":"Android GPU Inspector (AGI). https:\/\/developer.android.com\/agi. Accessed on June 2025."},{"key":"e_1_3_2_1_34_1","unstructured":"Tracing SDK. https:\/\/perfetto.dev\/docs\/instrumentation\/tracing-sdk. Accessed on June 2025."},{"key":"e_1_3_2_1_35_1","first-page":"907","volume-title":"Proc. the 30th Annual International Conference on Mobile Computing and Networking (MobiCom)","author":"Laskaridis Stefanos","year":"2024","unstructured":"Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, and Hamed Haddadi. Melting point: Mobile evaluation of language transformers. In Proc. the 30th Annual International Conference on Mobile Computing and Networking (MobiCom), pages 890\u2013907, 2024."},{"key":"e_1_3_2_1_36_1","first-page":"137","volume-title":"Proc. the 21st ACM Conference on Embedded Networked Sensor Systems","author":"Chu Haolin","year":"2023","unstructured":"Haolin Chu, Xiaolong Zheng, Liang Liu, and Huadong Ma. nnPerf: Demystifying dnn runtime inference latency on mobile platforms. In Proc. the 21st ACM Conference on Embedded Networked Sensor Systems, page 125\u2013137, 2023."},{"key":"e_1_3_2_1_37_1","unstructured":"TFLite Model Benchmark Tool. https:\/\/github.com\/sourcecode369\/tensorflow-1\/blob\/master\/tensorflow\/lite\/tools\/benchmark\/README.md. Accessed on June 2025."},{"key":"e_1_3_2_1_38_1","unstructured":"MLC LLM: Universal LLM Deployment Engine With ML Compilation. https:\/\/llm.mlc.ai\/. Accessed on June 2025."},{"key":"e_1_3_2_1_39_1","unstructured":"TVM: open deep learning compiler stack. https:\/\/github.com\/apache\/tvm. Accessed on June 2025."},{"key":"e_1_3_2_1_40_1","unstructured":"LiteRT overview. https:\/\/ai.google.dev\/edge\/litert. Accessed on June 2025."},{"key":"e_1_3_2_1_41_1","unstructured":"Monsoon Power Monitor. 
https:\/\/www.msoon.com\/specifications. Accessed on June 2025."},{"key":"e_1_3_2_1_42_1","unstructured":"Open Neural Network Exchange. https:\/\/onnx.ai\/. Accessed on June 2025."},{"key":"e_1_3_2_1_43_1","unstructured":"Open-source software toolkit for optimizing and deploying deep learning models. https:\/\/github.com\/openvinotoolkit\/openvino. Accessed on June 2025."},{"key":"e_1_3_2_1_44_1","first-page":"2430","volume-title":"Proc. International Conference on Machine Learning","author":"Biderman Stella","unstructured":"Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In Proc. International Conference on Machine Learning, pages 2397\u20132430. PMLR, 2023."},{"key":"e_1_3_2_1_45_1","unstructured":"Smol Models. https:\/\/github.com\/huggingface\/smollm. Accessed on June 2025."},{"key":"e_1_3_2_1_46_1","volume-title":"Think you have solved question answering? try ARC, the AI2 reasoning challenge. arXiv:1803.05457v1","author":"Clark Peter","year":"2018","unstructured":"Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge. arXiv:1803.05457v1, 2018."},{"key":"e_1_3_2_1_47_1","volume-title":"Proc. the 57th Annual Meeting of the Association for Computational Linguistics","author":"Zellers Rowan","year":"2019","unstructured":"Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proc. the 57th Annual Meeting of the Association for Computational Linguistics, 2019."},{"key":"e_1_3_2_1_48_1","volume-title":"Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168","author":"Cobbe Karl","year":"2021","unstructured":"Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021."},{"key":"e_1_3_2_1_49_1","first-page":"134","volume-title":"Proc. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Agrawal Amey","year":"2024","unstructured":"Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency tradeoff in Ilm inference with sarathiserve. In Proc. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 117\u2013134, 2024."},{"key":"e_1_3_2_1_50_1","first-page":"19286","volume-title":"Proc. International Conference on Machine Learning","author":"Leviathan Yaniv","unstructured":"Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In Proc. International Conference on Machine Learning, pages 19274\u201319286. PMLR, 2023."},{"key":"e_1_3_2_1_51_1","first-page":"626","volume-title":"Proc. the 29th Symposium on Operating Systems Principles","author":"Kwon Woosuk","year":"2023","unstructured":"Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proc. 
the 29th Symposium on Operating Systems Principles, pages 611\u2013626, 2023."},{"key":"e_1_3_2_1_52_1","volume-title":"Oliver Hausd\u00f6rfer, and Alok Verma. Communication compression for tensor parallel LLM inference. arXiv preprint arXiv:2411.09510","author":"Hansen-Palmus Jan","year":"2024","unstructured":"Jan Hansen-Palmus, Michael Truong Le, Oliver Hausd\u00f6rfer, and Alok Verma. Communication compression for tensor parallel LLM inference. arXiv preprint arXiv:2411.09510, 2024."},{"key":"e_1_3_2_1_53_1","volume-title":"Scaling laws for neural language models. arXiv preprint arXiv:2001.08361","author":"Kaplan Jared","year":"2020","unstructured":"Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020."},{"key":"e_1_3_2_1_54_1","unstructured":"Loubna Ben Allal Anton Lozhkov Elie Bakouch Gabriel Mart\u00edn Bl\u00e1zquez Guilherme Penedo Lewis Tunstall Andr\u00e9s Marafioti Hynek Kydl\u00ed\u010dek Agust\u00edn Piqueres Lajar\u00edn Vaibhav Srivastav et al. SmolLM2: When smol goes big-data-centric training of a small language model. arXiv preprint arXiv:2502.02737 2025."},{"key":"e_1_3_2_1_55_1","first-page":"6","volume-title":"Proc. the Workshop on Edge and Mobile Foundation Models","author":"Li Xiang","year":"2024","unstructured":"Xiang Li, Zhenyan Lu, Dongqi Cai, Xiao Ma, and Mengwei Xu. Large language models on mobile devices: Measurements, analysis, and insights. In Proc. the Workshop on Edge and Mobile Foundation Models, pages 1\u20136, 2024."},{"key":"e_1_3_2_1_56_1","article-title":"Fast on-device llm inference with speculative decoding","author":"Xu Daliang","year":"2024","unstructured":"Daliang Xu, Wangsong Yin, Hao Zhang, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe Liu. Edgellm: Fast on-device llm inference with speculative decoding. IEEE Transactions on Mobile Computing, 2024.","journal-title":"IEEE Transactions on Mobile Computing"},{"key":"e_1_3_2_1_57_1","volume-title":"Understanding large language models in your pockets: Performance study on cots mobile devices. arXiv preprint arXiv:2410.03613","author":"Xiao Jie","year":"2024","unstructured":"Jie Xiao, Qianyi Huang, Xu Chen, and Chen Tian. Understanding large language models in your pockets: Performance study on cots mobile devices. arXiv preprint arXiv:2410.03613, 2024."},{"key":"e_1_3_2_1_58_1","volume-title":"Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137","author":"Shao Wenqi","year":"2023","unstructured":"Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137, 2023."},{"key":"e_1_3_2_1_59_1","first-page":"38099","volume-title":"International Conference on Machine Learning","author":"Xiao Guangxuan","unstructured":"Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087\u201338099. PMLR, 2023."},{"key":"e_1_3_2_1_60_1","volume-title":"Gptq: Accurate post-training quantization for generative pre-trained transformers. 
arXiv preprint arXiv:2210.17323","author":"Frantar Elias","year":"2022","unstructured":"Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022."},{"key":"e_1_3_2_1_61_1","first-page":"87","article-title":"Activation-aware weight quantization for on-device llm compression and acceleration","volume":"6","author":"Lin Ji","year":"2024","unstructured":"Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proc. Machine Learning and Systems, 6:87\u2013100, 2024.","journal-title":"Proc. Machine Learning and Systems"},{"key":"e_1_3_2_1_62_1","volume-title":"MiniLLM: Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543","author":"Gu Yuxian","year":"2023","unstructured":"Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543, 2023."},{"key":"e_1_3_2_1_63_1","first-page":"10337","volume-title":"International Conference on Machine Learning","author":"Frantar Elias","unstructured":"Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pages 10323\u201310337. PMLR, 2023."},{"key":"e_1_3_2_1_64_1","volume-title":"A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695","author":"Sun Mingjie","year":"2023","unstructured":"Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023."},{"key":"e_1_3_2_1_65_1","first-page":"21702","article-title":"On the structural pruning of large language models","volume":"36","author":"Ma Xinyin","year":"2023","unstructured":"Xinyin Ma, Gongfan Fang, and Xinchao Wang. LLM-pruner: On the structural pruning of large language models. Advances in Neural Information Processing Systems, 36:21702\u201321720, 2023.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_66_1","unstructured":"LLMFarm. https:\/\/github.com\/guinmoon\/LLMFarm. Accessed on June 2025."},{"key":"e_1_3_2_1_67_1","unstructured":"TensorRT-LLM. https:\/\/github.com\/NVIDIA\/TensorRT-LLM. Accessed on June 2025."},{"key":"e_1_3_2_1_68_1","unstructured":"llama.cpp. https:\/\/github.com\/ggml-org\/llama.cpp. 
Accessed on June 2025."}],"event":{"name":"SEC '25: Tenth ACM\/IEEE Symposium on Edge Computing","location":"the Hilton Arlington National Landing Arlington VA USA","acronym":"SEC '25","sponsor":["SIGMOBILE ACM Special Interest Group on Mobility of Systems, Users, Data and Computing","IEEE Computer Society"]},"container-title":["Proceedings of the Tenth ACM\/IEEE Symposium on Edge Computing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3769102.3770614","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,3]],"date-time":"2025-12-03T16:00:55Z","timestamp":1764777655000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3769102.3770614"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,3]]},"references-count":68,"alternative-id":["10.1145\/3769102.3770614","10.1145\/3769102"],"URL":"https:\/\/doi.org\/10.1145\/3769102.3770614","relation":{},"subject":[],"published":{"date-parts":[[2025,12,3]]},"assertion":[{"value":"2025-12-03","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}