{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,16]],"date-time":"2026-04-16T16:35:55Z","timestamp":1776357355758,"version":"3.51.2"},"reference-count":234,"publisher":"Association for Computing Machinery (ACM)","issue":"8","license":[{"start":{"date-parts":[[2025,3,23]],"date-time":"2025-03-23T00:00:00Z","timestamp":1742688000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["92467301, U23A20326, 62293511, and 62372414"],"award-info":[{"award-number":["92467301, U23A20326, 62293511, and 62372414"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100022963","name":"Key Research and Development Program of Zhejiang Province","doi-asserted-by":"crossref","award":["2025C01061 and 2025C01012"],"award-info":[{"award-number":["2025C01061 and 2025C01012"]}],"id":[{"id":"10.13039\/100022963","id-type":"DOI","asserted-by":"crossref"}]},{"name":"ZJUCSE-Enflame cloud and edge intelligence joint laboratory"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2025,8,31]]},"abstract":"<jats:p>Large language models (LLMs) have revolutionized natural language processing with their exceptional understanding, synthesizing, and reasoning capabilities. However, deploying LLMs on resource-constrained edge devices presents significant challenges due to computational limitations, memory constraints, and edge hardware heterogeneity. This survey provides a comprehensive overview of recent advancements in edge LLMs, covering the entire lifecycle\u2014from resource-efficient model design and pre-deployment strategies to runtime inference optimizations. It also explores on-device applications across various domains. 
By synthesizing state-of-the-art techniques and identifying future research directions, this survey bridges the gap between the immense potential of LLMs and the constraints of edge computing.<\/jats:p>","DOI":"10.1145\/3719664","type":"journal-article","created":{"date-parts":[[2025,2,24]],"date-time":"2025-02-24T11:02:46Z","timestamp":1740394966000},"page":"1-35","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":61,"title":["A Review on Edge Large Language Models: Design, Execution, and Applications"],"prefix":"10.1145","volume":"57","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-0706-623X","authenticated-orcid":false,"given":"Yue","family":"Zheng","sequence":"first","affiliation":[{"name":"Zhejiang University of Technology, Hangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9905-5823","authenticated-orcid":false,"given":"Yuhao","family":"Chen","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7058-0360","authenticated-orcid":false,"given":"Bin","family":"Qian","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2945-8344","authenticated-orcid":false,"given":"Xiufang","family":"Shi","sequence":"additional","affiliation":[{"name":"Zhejiang University of Technology, Hangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9542-7095","authenticated-orcid":false,"given":"Yuanchao","family":"Shu","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3155-3145","authenticated-orcid":false,"given":"Jiming","family":"Chen","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]}],"member":"320","published-online":{"date-parts":[[2025,3,23]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"Marah Abdin Sam Ade Jacobs Ammar Ahmad Awan Jyoti Aneja Ahmed Awadallah Hany Awadalla Nguyen Bach Amit Bahree Arash Bakhtiari Harkirat Behl et\u00a0al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. Retrieved from https:\/\/arxiv.org\/abs\/2404.14219"},{"key":"e_1_3_2_3_2","article-title":"Quantifying attention flow in transformers","author":"Abnar Samira","year":"2020","unstructured":"Samira Abnar and Willem Zuidema. 2020. Quantifying attention flow in transformers. In Proceedings of the ACL.","journal-title":"In Proceedings of the ACL."},{"key":"e_1_3_2_4_2","article-title":"GQA: Training generalized multi-query transformer models from multi-head checkpoints","author":"Ainslie Joshua","year":"2023","unstructured":"Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. 2023. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the EMNLP.","journal-title":"In Proceedings of the EMNLP."},{"key":"e_1_3_2_5_2","article-title":"Llm in a flash: Efficient large language model inference with limited memory","author":"Alizadeh Keivan","year":"2024","unstructured":"Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C. Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. 2024. Llm in a flash: Efficient large language model inference with limited memory. 
In Proceedings of the ACL.","journal-title":"In Proceedings of the ACL."},{"key":"e_1_3_2_6_2","article-title":"Fluctuation-based adaptive structured pruning for large language models","author":"An Yongqi","year":"2024","unstructured":"Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. 2024. Fluctuation-based adaptive structured pruning for large language models. In Proceedings of the AAAI.","journal-title":"In Proceedings of the AAAI."},{"key":"e_1_3_2_7_2","unstructured":"Yuvanesh Anand Zach Nussbaum Brandon Duderstadt Benjamin Schmidt and Andriy Mulyar. 2024. Introducing Llama 3.2. Retrieved December 5 2024 from https:\/\/www.llama.com\/docs\/model-cards-and-prompt-formats\/llama3_2\/"},{"key":"e_1_3_2_8_2","volume-title":"Proceedings of the ACM MobiSys","author":"Ananthanarayanan Ganesh","year":"2019","unstructured":"Ganesh Ananthanarayanan, Victor Bahl, Landon Cox, Alex Crown, Shadi Nogbahi, and Yuanchao Shu. 2019. Demo: Video analytics - killer app for edge computing. In Proceedings of the ACM MobiSys."},{"key":"e_1_3_2_9_2","unstructured":"Apple. 2023. Apple Debuts iPhone 15 and iPhone 15 Plus. Retrieved December 3 2024 from https:\/\/www.apple.com\/newsroom\/2023\/09\/apple-debuts-iphone-15-and-iphone-15-plus\/"},{"key":"e_1_3_2_10_2","unstructured":"Apple. 2023. Apple Introduces M2 Ultra. Retrieved December 3 2024 from https:\/\/www.apple.com\/newsroom\/2023\/06\/apple-introduces-m2-ultra\/"},{"key":"e_1_3_2_11_2","article-title":"SliceGPT: Compress large language models by deleting rows and columns","author":"Ashkboos Saleh","year":"2024","unstructured":"Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. 2024. SliceGPT: Compress large language models by deleting rows and columns. In Proceedings of the ICLR.","journal-title":"In Proceedings of the ICLR."},{"key":"e_1_3_2_12_2","article-title":"Fast and robust early-exiting framework for autoregressive language models with synchronized parallel decoding","author":"Bae Sangmin","year":"2023","unstructured":"Sangmin Bae, Jongwoo Ko, Hwanjun Song, and Se-Young Yun. 2023. Fast and robust early-exiting framework for autoregressive language models with synchronized parallel decoding. In Proceedings of the EMNLP.","journal-title":"In Proceedings of the EMNLP."},{"key":"e_1_3_2_13_2","unstructured":"Jinze Bai Shuai Bai Yunfei Chu Zeyu Cui Kai Dang Xiaodong Deng Yang Fan Wenbin Ge Yu Han Fei Huang et\u00a0al. 2023. Qwen technical report. Retrieved from https:\/\/arxiv.org\/abs\/2309.16609"},{"key":"e_1_3_2_14_2","article-title":"Ekya: Continuous learning of video analytics models on edge compute servers","author":"Bhardwaj Romil","year":"2022","unstructured":"Romil Bhardwaj, Zhengxu Xia, Ganesh Ananthanarayanan, Junchen Jiang, Yuanchao Shu, Nikolaos Karianakis, Kevin Hsieh, Paramvir Bahl, and Ion Stoica. 2022. Ekya: Continuous learning of video analytics models on edge compute servers. In Proceedings of the USENIX NSDI.","journal-title":"In Proceedings of the USENIX NSDI."},{"key":"e_1_3_2_15_2","article-title":"Pythia: A suite for analyzing large language models across training and scaling","author":"Biderman Stella","year":"2023","unstructured":"Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O\u2019Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et\u00a0al. 2023. Pythia: A suite for analyzing large language models across training and scaling. 
In Proceedings of the ICML.","journal-title":"In Proceedings of the ICML."},{"key":"e_1_3_2_16_2","article-title":"Understanding and overcoming the challenges of efficient transformer quantization","author":"Bondarenko Yelysei","year":"2021","unstructured":"Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. 2021. Understanding and overcoming the challenges of efficient transformer quantization. In Proceedings of the EMNLP.","journal-title":"In Proceedings of the EMNLP."},{"key":"e_1_3_2_17_2","article-title":"Petals: Collaborative inference and fine-tuning of large models","author":"Borzunov Alexander","year":"2023","unstructured":"Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Maksim Riabinin, Younes Belkada, Artem Chumachenko, Pavel Samygin, and Colin Raffel. 2023. Petals: Collaborative inference and fine-tuning of large models. In Proceedings of the ACL.","journal-title":"In Proceedings of the ACL."},{"key":"e_1_3_2_18_2","article-title":"Distributed inference and fine-tuning of large language models over the internet","author":"Borzunov Alexander","year":"2023","unstructured":"Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, and Colin Raffel. 2023. Distributed inference and fine-tuning of large language models over the internet. In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_19_2","article-title":"Language models are few-shot learners","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et\u00a0al. 2020. Language models are few-shot learners. In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_20_2","article-title":"MobiVQA: Efficient on-device visual question answering","author":"Cao Qingqing","year":"2022","unstructured":"Qingqing Cao, Prerna Khanna, Nicholas D. Lane, and Aruna Balasubramanian. 2022. MobiVQA: Efficient on-device visual question answering. In Proceedings of the ACM UbiComp.","journal-title":"In Proceedings of the ACM UbiComp."},{"key":"e_1_3_2_21_2","article-title":"ChatEval: Towards Better LLM-based evaluators through multi-agent debate","author":"Chan Chi-Min","year":"2024","unstructured":"Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2024. ChatEval: Towards Better LLM-based evaluators through multi-agent debate. In Proceedings of the ICLR.","journal-title":"In Proceedings of the ICLR."},{"key":"e_1_3_2_22_2","article-title":"Surgical feature-space decomposition of LLMs: Why, when and how?","author":"Chavan Arnav","year":"2024","unstructured":"Arnav Chavan, Nahush Lele, and Deepak Gupta. 2024. Surgical feature-space decomposition of LLMs: Why, when and how? In Proceedings of the ACL.","journal-title":"In Proceedings of the ACL."},{"key":"e_1_3_2_23_2","article-title":"MCC-KD: Multi-CoT consistent knowledge distillation","author":"Chen Hongzhan","year":"2023","unstructured":"Hongzhan Chen, Siyue Wu, Xiaojun Quan, Rui Wang, Ming Yan, and Ji Zhang. 2023. MCC-KD: Multi-CoT consistent knowledge distillation. 
In Proceedings of the EMNLP.","journal-title":"In Proceedings of the EMNLP."},{"key":"e_1_3_2_24_2","article-title":"Driving with llms: Fusing object-level vector modality for explainable autonomous driving","author":"Chen Long","year":"2024","unstructured":"Long Chen, Oleg Sinavski, Jan H\u00fcnermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, and Jamie Shotton. 2024. Driving with llms: Fusing object-level vector modality for explainable autonomous driving. In Proceedings of the IEEE ICRA.","journal-title":"In Proceedings of the IEEE ICRA."},{"key":"e_1_3_2_25_2","first-page":"353","article-title":"Graphwiz: An instruction-following language model for graph computational problems","author":"Chen Nuo","year":"2024","unstructured":"Nuo Chen, Yuhan Li, Jianheng Tang, and Jia Li. 2024. Graphwiz: An instruction-following language model for graph computational problems. In Proceedings of the ACM SIGKDD. 353\u2013364.","journal-title":"In Proceedings of the ACM SIGKDD."},{"key":"e_1_3_2_26_2","article-title":"DRONE: Data-aware low-rank compression for large NLP models","author":"Chen Patrick","year":"2021","unstructured":"Patrick Chen, Hsiang-Fu Yu, Inderjit Dhillon, and Cho-Jui Hsieh. 2021. DRONE: Data-aware low-rank compression for large NLP models. In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_27_2","article-title":"Exploring the feasibility of remote cardiac auscultation using earphones","author":"Chen Tao","year":"2024","unstructured":"Tao Chen, Yongjie Yang, Xiaoran Fan, Xiuzhen Guo, Jie Xiong, and Longfei Shangguan. 2024. Exploring the feasibility of remote cardiac auscultation using earphones. In Proceedings of the ACM MobiCom.","journal-title":"In Proceedings of the ACM MobiCom."},{"key":"e_1_3_2_28_2","unstructured":"Yuhao Chen Yuxuan Yan Qianqian Yang Yuanchao Shu Shibo He and Jiming Chen. 2023. Confidant: Customizing transformer-based LLMs via collaborative edge training. Retrieved from https:\/\/arxiv.org\/abs\/2311.13381"},{"key":"e_1_3_2_29_2","article-title":"DISCO: Distilling counterfactuals with large language models","author":"Chen Zeming","year":"2023","unstructured":"Zeming Chen, Qiyue Gao, Antoine Bosselut, Ashish Sabharwal, and Kyle Richardson. 2023. DISCO: Distilling counterfactuals with large language models. In Proceedings of the ACL.","journal-title":"In Proceedings of the ACL."},{"key":"e_1_3_2_30_2","article-title":"Adapting language models to compress contexts","author":"Chevalier Alexis","year":"2023","unstructured":"Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. 2023. Adapting language models to compress contexts. In Proceedings of the EMNLP.","journal-title":"In Proceedings of the EMNLP."},{"key":"e_1_3_2_31_2","unstructured":"CPU-Monkey. 2024. AI Performance (NPU) CPU Benchmark List. Retrieved July 18 2024 from https:\/\/www.cpu-monkey.com\/en\/cpu_benchmark-ai_benchmark"},{"key":"e_1_3_2_32_2","volume-title":"Proceedings of the ACM HotMobile","author":"Dai Yubin","year":"2025","unstructured":"Yubin Dai, Bin Qian, Yangkun Liu, Yuxuan Yan, and Yuanchao Shu. 2025. Eros: Real-time dense mapping made easy on mobile devices. In Proceedings of the ACM HotMobile."},{"key":"e_1_3_2_33_2","article-title":"LLM.int8(): 8-bit matrix multiplication for transformers at scale","author":"Dettmers Tim","year":"2022","unstructured":"Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. LLM.int8(): 8-bit matrix multiplication for transformers at scale. 
In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_34_2","article-title":"SpQR: A sparse-quantized representation for near-lossless LLM weight compression","author":"Dettmers Tim","year":"2023","unstructured":"Tim Dettmers, Ruslan A. Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. 2023. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. In Proceedings of the ICLR.","journal-title":"In Proceedings of the ICLR."},{"key":"e_1_3_2_35_2","unstructured":"NVIDIA Developer. 2024. Jetson Modules Support Ecosystem and Lineup. Retrieved July 16 2024 from https:\/\/developer.nvidia.com\/embedded\/jetson-modules"},{"key":"e_1_3_2_36_2","article-title":"A survey of natural language generation","author":"Dong Chenhe","year":"2022","unstructured":"Chenhe Dong, Yinghui Li, Haifan Gong, Miaoxin Chen, Junxin Li, Ying Shen, and Min Yang. 2022. A survey of natural language generation. ACM Computing Surveys 55, 8 (2022), 1\u201338.","journal-title":"ACM Computing Surveys"},{"key":"e_1_3_2_37_2","article-title":"Unified language model pre-training for natural language understanding and generation","author":"Dong Li","year":"2019","unstructured":"Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_38_2","article-title":"Distributed foundation models for multi-modal learning in 6G wireless networks","author":"Du Jun","year":"2024","unstructured":"Jun Du, Tianyi Lin, Chunxiao Jiang, Qianqian Yang, C. Faouzi Bader, and Zhu Han. 2024. Distributed foundation models for multi-modal learning in 6G wireless networks. IEEE Wireless Communications 31, 3 (2024), 20\u201330.","journal-title":"IEEE Wireless Communications"},{"key":"e_1_3_2_39_2","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Amy Yang Angela Fan et\u00a0al. 2024. The llama 3 herd of models. Retrieved from https:\/\/arxiv.org\/abs\/2407.21783"},{"key":"e_1_3_2_40_2","unstructured":"Darren Edge Ha Trinh Newman Cheng Joshua Bradley Alex Chao Apurva Mody Steven Truitt and Jonathan Larson. 2024. From local to global: A graph rag approach to query-focused summarization. Retrieved from https:\/\/arxiv.org\/abs\/2404.16130"},{"key":"e_1_3_2_41_2","article-title":"Extreme compression of large language models via additive quantization","author":"Egiazarian Vage","year":"2024","unstructured":"Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. 2024. Extreme compression of large language models via additive quantization. In Proceedings of the ICML.","journal-title":"In Proceedings of the ICML."},{"key":"e_1_3_2_42_2","unstructured":"Samsung Exynos. 2024. Exynos 2400 with 10-core CPU Gets Detailed After Galaxy S24 Launch. 
Retrieved December 11 2024 from https:\/\/www.sammobile.com\/news\/exynos-2400-10-core-cpu-amd-rdna3-xclipse-940-gpu-specs-detailed\/"},{"key":"e_1_3_2_43_2","article-title":"TaskFusion: An efficient transfer learning architecture with dual delta sparsity for multi-task natural language processing","author":"Fan Zichen","year":"2023","unstructured":"Zichen Fan, Qirui Zhang, Pierre Abillama, Sara Shoouri, Changwoo Lee, David Blaauw, Hun-Seok Kim, and Dennis Sylvester. 2023. TaskFusion: An efficient transfer learning architecture with dual delta sparsity for multi-task natural language processing. In Proceedings of the ACM\/IEEE ISCA.","journal-title":"In Proceedings of the ACM\/IEEE ISCA."},{"key":"e_1_3_2_44_2","article-title":"Optimal brain compression: A framework for accurate post-training quantization and pruning","author":"Frantar Elias","year":"2022","unstructured":"Elias Frantar and Dan Alistarh. 2022. Optimal brain compression: A framework for accurate post-training quantization and pruning. In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_45_2","article-title":"SparseGPT: Massive language models can be accurately pruned in one-shot","author":"Frantar Elias","year":"2023","unstructured":"Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive language models can be accurately pruned in one-shot. In Proceedings of the ICML.","journal-title":"In Proceedings of the ICML."},{"key":"e_1_3_2_46_2","article-title":"OPTQ: Accurate quantization for generative pre-trained transformers","author":"Frantar Elias","year":"2022","unstructured":"Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. OPTQ: Accurate quantization for generative pre-trained transformers. In Proceedings of the ICLR.","journal-title":"In Proceedings of the ICLR."},{"key":"e_1_3_2_47_2","unstructured":"NVIDIA GeForce. 2023. Compare Gaming Laptops: GeForce RTX 40 Series. Retrieved December 06 2024 from https:\/\/www.nvidia.com\/en-us\/geforce\/laptops\/compare\/"},{"key":"e_1_3_2_48_2","unstructured":"NVIDIA GeForce. 2023. Compare GeForce Graphics Cards. Retrieved December 06 2024 from https:\/\/www.nvidia.com\/en-us\/geforce\/graphics-cards\/compare\/"},{"key":"e_1_3_2_49_2","article-title":"Is attention better than matrix decomposition?","author":"Geng Zhengyang","year":"2021","unstructured":"Zhengyang Geng, Meng-Hao Guo, Hongxu Chen, Xia Li, Ke Wei, and Zhouchen Lin. 2021. Is attention better than matrix decomposition? In Proceedings of the ICLR.","journal-title":"In Proceedings of the ICLR."},{"key":"e_1_3_2_50_2","unstructured":"Georgi Gerganov. 2023. llama.cpp. Retrieved June 08 2024 from https:\/\/github.com\/ggerganov\/llama.cpp"},{"key":"e_1_3_2_51_2","unstructured":"Google. 2024. Google Tensor: The Brains Behind Pixel Phones. Retrieved December 06 2024 from https:\/\/store.google.com\/intl\/en_in\/ideas\/articles\/google-tensor-pixel-smartphone\/"},{"key":"e_1_3_2_52_2","article-title":"Power-bert: Accelerating bert inference via progressive word-vector elimination","author":"Goyal Saurabh","year":"2020","unstructured":"Saurabh Goyal, Anamitra Roy Choudhury, Saurabh Raje, Venkatesan Chakaravarthy, Yogish Sabharwal, and Ashish Verma. 2020. Power-bert: Accelerating bert inference via progressive word-vector elimination. In Proceedings of the ICML.","journal-title":"In Proceedings of the ICML."},{"key":"e_1_3_2_53_2","unstructured":"Albert Gu and Tri Dao. 2023. Mamba: Linear-time sequence modeling with selective state spaces. 
Retrieved from https:\/\/arxiv.org\/abs\/2312.00752"},{"key":"e_1_3_2_54_2","article-title":"MiniLLM: Knowledge distillation of large language models","author":"Gu Yuxian","year":"2023","unstructured":"Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. 2023. MiniLLM: Knowledge distillation of large language models. In Proceedings of the ICLR.","journal-title":"In Proceedings of the ICLR."},{"key":"e_1_3_2_55_2","article-title":"Anomalygpt: Detecting industrial anomalies using large vision-language models","author":"Gu Zhaopeng","year":"2024","unstructured":"Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. 2024. Anomalygpt: Detecting industrial anomalies using large vision-language models. In Proceedings of the AAAI.","journal-title":"In Proceedings of the AAAI."},{"key":"e_1_3_2_56_2","unstructured":"Suriya Gunasekar Yi Zhang Jyoti Aneja Caio C\u00e9sar Teodoro Mendes Allie Del Giorno Sivakanth Gopi Mojan Javaheripi Piero Kauffmann Gustavo de Rosa Olli Saarikivi et\u00a0al. 2023. Textbooks are all you need. Retrieved from https:\/\/arxiv.org\/abs\/2306.11644"},{"key":"e_1_3_2_57_2","article-title":"OliVe: Accelerating large language models via hardware-friendly outlier-victim pair quantization","author":"Guo Cong","year":"2023","unstructured":"Cong Guo, Jiaming Tang, Weiming Hu, Jingwen Leng, Chen Zhang, Fan Yang, Yunxin Liu, Minyi Guo, and Yuhao Zhu. 2023. OliVe: Accelerating large language models via hardware-friendly outlier-victim pair quantization. In Proceedings of the ACM\/IEEE ISCA.","journal-title":"In Proceedings of the ACM\/IEEE ISCA."},{"key":"e_1_3_2_58_2","article-title":"Ant: Exploiting adaptive numerical data type for low-bit deep neural network quantization","author":"Guo Cong","year":"2022","unstructured":"Cong Guo, Chen Zhang, Jingwen Leng, Zihan Liu, Fan Yang, Yunxin Liu, Minyi Guo, and Yuhao Zhu. 2022. Ant: Exploiting adaptive numerical data type for low-bit deep neural network quantization. In Proceedings of the IEEE\/ACM MICRO.","journal-title":"In Proceedings of the IEEE\/ACM MICRO."},{"key":"e_1_3_2_59_2","unstructured":"Daya Guo Dejian Yang Haowei Zhang Junxiao Song Ruoyu Zhang Runxin Xu Qihao Zhu Shirong Ma Peiyi Wang Xiao Bi et\u00a0al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. Retrieved from https:\/\/arxiv.org\/abs\/2501.12948"},{"key":"e_1_3_2_60_2","article-title":"STI: Turbocharge NLP inference at the edge via elastic pipelining","author":"Guo Liwei","year":"2023","unstructured":"Liwei Guo, Wonkyo Choe, and Felix Xiaozhu Lin. 2023. STI: Turbocharge NLP inference at the edge via elastic pipelining. In Proceedings of the ACM ASPLOS.","journal-title":"In Proceedings of the ACM ASPLOS."},{"key":"e_1_3_2_61_2","article-title":"EASTER: Learning to split transformers at the edge robustly","author":"Guo Xiaotian","year":"2024","unstructured":"Xiaotian Guo, Quan Jiang, Yixian Shen, Andy D. Pimentel, and Todor Stefanov. 2024. EASTER: Learning to split transformers at the edge robustly. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43, 11 (2024), 3626\u20133637.","journal-title":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems"},{"key":"e_1_3_2_62_2","article-title":"A real-world WebAgent with planning, long context understanding, and program synthesis","author":"Gur Izzeddin","year":"2024","unstructured":"Izzeddin Gur, Hiroki Furuta, Austin V. Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. 2024. 
A real-world WebAgent with planning, long context understanding, and program synthesis. In Proceedings of the ICLR.","journal-title":"In Proceedings of the ICLR."},{"key":"e_1_3_2_63_2","unstructured":"Awni Hannun Jagrit Digani Angelos Katharopoulos and Ronan Collobert. 2023. MLX: Efficient and Flexible Machine Learning on Apple Silicon. Retrieved June 08 2024 from https:\/\/github.com\/ml-explore"},{"key":"e_1_3_2_64_2","article-title":"Large language models (LLMs) inference offloading and resource allocation in cloud-edge computing: An active inference approach","author":"He Ying","year":"2024","unstructured":"Ying He, Jingcheng Fang, F. Richard Yu, and Victor C. Leung. 2024. Large language models (LLMs) inference offloading and resource allocation in cloud-edge computing: An active inference approach. IEEE Transactions on Mobile Computing 23, 12 (2024), 11253\u201311264.","journal-title":"IEEE Transactions on Mobile Computing"},{"key":"e_1_3_2_65_2","article-title":"A survey on recent approaches for natural language processing in low-resource scenarios","author":"Hedderich Michael A.","year":"2021","unstructured":"Michael A. Hedderich, Lukas Lange, Heike Adel, Jannik Str\u00f6tgen, and Dietrich Klakow. 2021. A survey on recent approaches for natural language processing in low-resource scenarios. In Proceedings of the NAACL.","journal-title":"In Proceedings of the NAACL."},{"key":"e_1_3_2_66_2","article-title":"Large language models are reasoning teachers","author":"Ho Namgyu","year":"2023","unstructured":"Namgyu Ho, Laura Schmid, and Se-Young Yun. 2023. Large language models are reasoning teachers. In Proceedings of the ACL.","journal-title":"In Proceedings of the ACL."},{"key":"e_1_3_2_67_2","article-title":"3d-llm: Injecting the 3d world into large language models","author":"Hong Yining","year":"2023","unstructured":"Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 2023. 3d-llm: Injecting the 3d world into large language models. In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_68_2","article-title":"A deep dive into large language models for automated bug localization and repair","author":"Hossain Soneya Binta","year":"2024","unstructured":"Soneya Binta Hossain, Nan Jiang, Qiang Zhou, Xiaopeng Li, Wen-Hao Chiang, Yingjun Lyu, Hoan Nguyen, and Omer Tripp. 2024. A deep dive into large language models for automated bug localization and repair. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1471\u20131493.","journal-title":"Proceedings of the ACM on Software Engineering"},{"key":"e_1_3_2_69_2","article-title":"Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes","author":"Hsieh Cheng-Yu","year":"2023","unstructured":"Cheng-Yu Hsieh, Chun-Liang Li, Chih-kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. 2023. Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes. In Findings of ACL.","journal-title":"In Findings of ACL."},{"key":"e_1_3_2_70_2","article-title":"Language model compression with weighted low-rank factorization","author":"Hsu Yen-Chang","year":"2022","unstructured":"Yen-Chang Hsu, Ting Hua, Sungen Chang, Qian Lou, Yilin Shen, and Hongxia Jin. 2022. Language model compression with weighted low-rank factorization. 
In Proceedings of the ICLR.","journal-title":"In Proceedings of the ICLR."},{"key":"e_1_3_2_71_2","article-title":"When the edge meets transformers: Distributed inference with transformer models","author":"Hu Chenghao","year":"2024","unstructured":"Chenghao Hu and Baochun Li. 2024. When the edge meets transformers: Distributed inference with transformer models. In Proceedings of the IEEE ICDCS.","journal-title":"In Proceedings of the IEEE ICDCS."},{"key":"e_1_3_2_72_2","article-title":"LoRA: Low-rank adaptation of large language models","author":"Hu Edward J.","year":"2021","unstructured":"Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. In Proceedings of the ICLR.","journal-title":"In Proceedings of the ICLR."},{"key":"e_1_3_2_73_2","article-title":"Inner monologue: Embodied reasoning through planning with language models","author":"Huang Wenlong","year":"2023","unstructured":"Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et\u00a0al. 2023. Inner monologue: Embodied reasoning through planning with language models. In Proceedings of the CoRL.","journal-title":"In Proceedings of the CoRL."},{"key":"e_1_3_2_74_2","unstructured":"Intel. 2020. Intel Neural Compressor. Retrieved December 27 2024 from https:\/\/github.com\/intel\/neural-compressor"},{"key":"e_1_3_2_75_2","unstructured":"Intel. 2022. Intel Launches 13th Gen Intel Core Processor Family Alongside New Intel Unison Solution. Retrieved December 11 2024 from https:\/\/www.intel.com\/content\/www\/us\/en\/newsroom\/news\/13th-gen-core-launch.html"},{"key":"e_1_3_2_76_2","article-title":"Spatula: Efficient cross-camera video analytics on large camera networks","author":"Jain Samvit","year":"2020","unstructured":"Samvit Jain, Xun Zhang, Yuhao Zhou, Ganesh Ananthanarayanan, Junchen Jiang, Yuanchao Shu, Paramvir Bahl, and Joseph Gonzalez. 2020. Spatula: Efficient cross-camera video analytics on large camera networks. In Proceedings of the IEEE\/ACM SEC.","journal-title":"In Proceedings of the IEEE\/ACM SEC."},{"key":"e_1_3_2_77_2","article-title":"Winclip: Zero-\/few-shot anomaly classification and segmentation","author":"Jeong Jongheon","year":"2023","unstructured":"Jongheon Jeong, Yang Zou, Taewan Kim, Dongqing Zhang, Avinash Ravichandran, and Onkar Dabeer. 2023. Winclip: Zero-\/few-shot anomaly classification and segmentation. In Proceedings of the IEEE\/CVF CVPR.","journal-title":"In Proceedings of the IEEE\/CVF CVPR."},{"key":"e_1_3_2_78_2","article-title":"On the distribution, sparsity, and inference-time quantization of attention values in transformers","author":"Ji Tianchu","year":"2021","unstructured":"Tianchu Ji, Shraddhan Jain, Michael Ferdman, Peter Milder, H. Andrew Schwartz, and Niranjan Balasubramanian. 2021. On the distribution, sparsity, and inference-time quantization of attention values in transformers. In Proceedings of the ACL-IJCNLP.","journal-title":"In Proceedings of the ACL-IJCNLP."},{"key":"e_1_3_2_79_2","article-title":"Feature-based low-rank compression of large language models via bayesian optimization","author":"Ji Yixin","year":"2024","unstructured":"Yixin Ji, Yang Xiang, Juntao Li, Wei Chen, Zhongyi Liu, Kehai Chen, and Min Zhang. 2024. Feature-based low-rank compression of large language models via bayesian optimization. 
In Findings of EMNLP.","journal-title":"In Findings of EMNLP."},{"key":"e_1_3_2_80_2","article-title":"LLMLingua: Compressing prompts for accelerated inference of large language models","author":"Jiang Huiqiang","year":"2023","unstructured":"Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. LLMLingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the EMNLP.","journal-title":"In Proceedings of the EMNLP."},{"key":"e_1_3_2_81_2","volume-title":"Proceedings of the ACM Workshop on Hot Topics in Video Analytics and Intelligent Edges","author":"Jiang Junchen","year":"2019","unstructured":"Junchen Jiang, Yuhao Zhou, Ganesh Ananthanarayanan, Yuanchao Shu, and Andrew A. Chien. 2019. Networked cameras are the new big data clusters. In Proceedings of the ACM Workshop on Hot Topics in Video Analytics and Intelligent Edges."},{"key":"e_1_3_2_82_2","doi-asserted-by":"crossref","DOI":"10.1145\/3447993.3483274","article-title":"Flexible high-resolution object detection on edge devices with tunable latency","author":"Jiang Shiqi","year":"2021","unstructured":"Shiqi Jiang, Zhiqi Lin, Yuanchun Li, Yuanchao Shu, and Yunxin Liu. 2021. Flexible high-resolution object detection on edge devices with tunable latency. In Proceedings of the ACM MobiCom.","journal-title":"In Proceedings of the ACM MobiCom."},{"key":"e_1_3_2_83_2","article-title":"Lion: Adversarial distillation of proprietary large language models","author":"Jiang Yuxin","year":"2023","unstructured":"Yuxin Jiang, Chunkit Chan, Mingyang Chen, and Wei Wang. 2023. Lion: Adversarial distillation of proprietary large language models. In Proceedings of the EMNLP.","journal-title":"In Proceedings of the EMNLP."},{"key":"e_1_3_2_84_2","article-title":"TinyBERT: Distilling BERT for natural language understanding","author":"Jiao Xiaoqi","year":"2020","unstructured":"Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for natural language understanding. In Proceedings of the EMNLP.","journal-title":"In Proceedings of the EMNLP."},{"key":"e_1_3_2_85_2","article-title":"A quantitative and qualitative evaluation of LLM-based explainable fault localization","author":"Kang Sungmin","year":"2024","unstructured":"Sungmin Kang, Gabin An, and Shin Yoo. 2024. A quantitative and qualitative evaluation of LLM-based explainable fault localization. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1424\u20131446.","journal-title":"Proceedings of the ACM on Software Engineering"},{"key":"e_1_3_2_86_2","unstructured":"Jared Kaplan Sam McCandlish Tom Henighan Tom B. Brown Benjamin Chess Rewon Child Scott Gray Alec Radford Jeffrey Wu and Dario Amodei. 2020. Scaling laws for neural language models. Retrieved from https:\/\/arxiv.org\/abs\/2001.08361"},{"key":"e_1_3_2_87_2","article-title":"RECL: Responsive resource-efficient continuous learning for video analytics","author":"Khani Mehrdad","year":"2023","unstructured":"Mehrdad Khani, Ganesh Ananthanarayanan, Kevin Hsieh, Junchen Jiang, Ravi Netravali, Yuanchao Shu, Mohammad Alizadeh, and Victor Bahl. 2023. RECL: Responsive resource-efficient continuous learning for video analytics. In Proceedings of the USENIX NSDI.","journal-title":"In Proceedings of the USENIX NSDI."},{"key":"e_1_3_2_88_2","article-title":"Language models can solve computer tasks","author":"Kim Geunwoo","year":"2024","unstructured":"Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2024. 
Language models can solve computer tasks. In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_89_2","article-title":"Length-adaptive transformer: Train once with length drop, use anytime with search","author":"Kim Gyuwan","year":"2021","unstructured":"Gyuwan Kim and Kyunghyun Cho. 2021. Length-adaptive transformer: Train once with length drop, use anytime with search. In Proceedings of the ACL-IJCNLP.","journal-title":"In Proceedings of the ACL-IJCNLP."},{"key":"e_1_3_2_90_2","article-title":"Token-scaled logit distillation for ternary weight generative language models","author":"Kim Minsoo","year":"2023","unstructured":"Minsoo Kim, Sihwa Lee, Janghwan Lee, Sukjin Hong, Du-Seong Chang, Wonyong Sung, and Jungwook Choi. 2023. Token-scaled logit distillation for ternary weight generative language models. In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_91_2","article-title":"20.5 C-Transformer: A 2.6-18.1 \\(\\mu\\) J\/Token homogeneous DNN-Transformer\/spiking-transformer processor with big-little network and implicit weight generation for large language models","author":"Kim Sangyeob","year":"2024","unstructured":"Sangyeob Kim, Sangjin Kim, Wooyoung Jo, Soyeon Kim, Seongyon Hong, and Hoi-Jun Yoo. 2024. 20.5 C-Transformer: A 2.6-18.1 \\(\\mu\\) J\/Token homogeneous DNN-Transformer\/spiking-transformer processor with big-little network and implicit weight generation for large language models. In Proceedings of the IEEE ISSCC.","journal-title":"In Proceedings of the IEEE ISSCC."},{"key":"e_1_3_2_92_2","article-title":"Learned token pruning for transformers","author":"Kim Sehoon","year":"2022","unstructured":"Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, and Kurt Keutzer. 2022. Learned token pruning for transformers. In Proceedings of the ACM SIGKDD.","journal-title":"In Proceedings of the ACM SIGKDD."},{"key":"e_1_3_2_93_2","article-title":"An empirical survey on long document summarization: Datasets, models, and metrics","author":"Koh Huan Yee","year":"2022","unstructured":"Huan Yee Koh, Jiaxin Ju, Ming Liu, and Shirui Pan. 2022. An empirical survey on long document summarization: Datasets, models, and metrics. ACM Computing Surveys (2022).","journal-title":"ACM Computing Surveys"},{"key":"e_1_3_2_94_2","article-title":"Accelerating inference for pretrained language models by unified multi-perspective early exiting","author":"Kong Jun","year":"2022","unstructured":"Jun Kong, Jin Wang, Liang-Chih Yu, and Xuejie Zhang. 2022. Accelerating inference for pretrained language models by unified multi-perspective early exiting. In Proceedings of the COLING.","journal-title":"In Proceedings of the COLING."},{"key":"e_1_3_2_95_2","article-title":"The optimal BERT surgeon: Scalable and accurate second-order pruning for large language models","author":"Kurtic Eldar","year":"2022","unstructured":"Eldar Kurtic, Daniel Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Benjamin Fineran, Michael Goin, and Dan Alistarh. 2022. The optimal BERT surgeon: Scalable and accurate second-order pruning for large language models. 
In Proceedings of the EMNLP.","journal-title":"In Proceedings of the EMNLP."},{"key":"e_1_3_2_96_2","article-title":"Efficient memory management for large language model serving with PagedAttention","author":"Kwon Woosuk","year":"2023","unstructured":"Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SOSP.","journal-title":"In Proceedings of the ACM SOSP."},{"key":"e_1_3_2_97_2","article-title":"Biomistral: A collection of open-source pretrained large language models for medical domains","author":"Labrak Yanis","year":"2024","unstructured":"Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. 2024. Biomistral: A collection of open-source pretrained large language models for medical domains. In Findings of ACL.","journal-title":"In Findings of ACL."},{"key":"e_1_3_2_98_2","article-title":"A survey on automatic generation of figurative language: From rule-based systems to large language models","author":"Lai Huiyuan","year":"2024","unstructured":"Huiyuan Lai and Malvina Nissim. 2024. A survey on automatic generation of figurative language: From rule-based systems to large language models. ACM Computing Surveys 56, 10 (2024), 1\u201334.","journal-title":"ACM Computing Surveys"},{"key":"e_1_3_2_99_2","doi-asserted-by":"crossref","DOI":"10.1145\/3448974","article-title":"Recurrent neural networks for edge intelligence: A survey","author":"Lalapura Varsha S.","year":"2022","unstructured":"Varsha S. Lalapura, J. Amudha, and Hariramn Selvamuruga Satheesh. 2022. Recurrent neural networks for edge intelligence: A survey. ACM Computing Surveys 54, 4, (2022), 1\u201338.","journal-title":"ACM Computing Surveys"},{"key":"e_1_3_2_100_2","article-title":"ALBERT: A lite BERT for self-supervised learning of language representations","author":"Lan Zhenzhong","year":"2019","unstructured":"Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of the ICLR.","journal-title":"In Proceedings of the ICLR."},{"key":"e_1_3_2_101_2","article-title":"MELTing point: Mobile evaluation of language transformers","author":"Laskaridis Stefanos","year":"2024","unstructured":"Stefanos Laskaridis, Kleomenis Kateveas, Lorenzo Minto, and Hamed Haddadi. 2024. MELTing point: Mobile evaluation of language transformers. In Proceedings of the ACM MobiCom.","journal-title":"In Proceedings of the ACM MobiCom."},{"key":"e_1_3_2_102_2","article-title":"OWQ: Outlier-aware weight quantization for efficient fine-tuning and inference of large language models","author":"Lee Changhun","year":"2024","unstructured":"Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim, and Eunhyeok Park. 2024. OWQ: Outlier-aware weight quantization for efficient fine-tuning and inference of large language models. In Proceedings of the AAAI.","journal-title":"In Proceedings of the AAAI."},{"key":"e_1_3_2_103_2","article-title":"An autonomous parallelization of transformer model inference on heterogeneous edge devices","author":"Lee Juhyeon","year":"2024","unstructured":"Juhyeon Lee, Insung Bahk, Hoseung Kim, Sinjin Jeong, Suyeon Lee, and Donghyun Min. 2024. An autonomous parallelization of transformer model inference on heterogeneous edge devices. 
In Proceedings of the ACM ICS.","journal-title":"In Proceedings of the ACM ICS."},{"key":"e_1_3_2_104_2","article-title":"Fast inference from transformers via speculative decoding","author":"Leviathan Yaniv","year":"2023","unstructured":"Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from transformers via speculative decoding. In Proceedings of the ICML.","journal-title":"In Proceedings of the ICML."},{"key":"e_1_3_2_105_2","article-title":"Efficient transformer-based large scale language representations using hardware-friendly block structured pruning","author":"Li Bingbing","year":"2020","unstructured":"Bingbing Li, Zhenglun Kong, Tianyun Zhang, Ji Li, Zhengang Li, Hang Liu, and Caiwen Ding. 2020. Efficient transformer-based large scale language representations using hardware-friendly block structured pruning. In Proceedings of the EMNLP.","journal-title":"In Proceedings of the EMNLP."},{"key":"e_1_3_2_106_2","article-title":"SheetCopilot: Bringing software productivity to the next level through large language models","author":"Li Hongxin","year":"2024","unstructured":"Hongxin Li, Jingran Su, Yuntao Chen, Qing Li, and ZHAO-XIANG ZHANG. 2024. SheetCopilot: Bringing software productivity to the next level through large language models. In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_107_2","article-title":"Symbolic chain-of-thought distillation: Small models can also \u201cThink\u201d step-by-step","author":"Li Liunian Harold","year":"2023","unstructured":"Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, and Yejin Choi. 2023. Symbolic chain-of-thought distillation: Small models can also \u201cThink\u201d step-by-step. In Proceedings of the ACL.","journal-title":"In Proceedings of the ACL."},{"key":"e_1_3_2_108_2","unstructured":"Yuanzhi Li S\u00e9bastien Bubeck Ronen Eldan Allie Del Giorno Suriya Gunasekar and Yin Tat Lee. 2023. Textbooks are all you need ii: Phi-1.5 technical report. Retrieved from https:\/\/arxiv.org\/abs\/2309.05463"},{"key":"e_1_3_2_109_2","unstructured":"Yuanchun Li Hao Wen Weijun Wang Xiangyu Li Yizhen Yuan Guohong Liu Jiacheng Liu Wenxing Xu Xiang Wang Yi Sun et\u00a0al. 2024. Personal llm agents: Insights and survey about the capability efficiency and security. Retrieved from https:\/\/arxiv.org\/abs\/2401.05459"},{"key":"e_1_3_2_110_2","article-title":"LoSparse: Structured compression of large language models based on low-rank and sparse approximation","author":"Li Yixiao","year":"2023","unstructured":"Yixiao Li, Yifan Yu, Qingru Zhang, Chen Liang, Pengcheng He, Weizhu Chen, and Tuo Zhao. 2023. LoSparse: Structured compression of large language models based on low-rank and sparse approximation. In Proceedings of the ICML.","journal-title":"In Proceedings of the ICML."},{"key":"e_1_3_2_111_2","article-title":"Less is more: Task-aware layer-wise distillation for language model compression","author":"Liang Chen","year":"2023","unstructured":"Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen, and Tuo Zhao. 2023. Less is more: Task-aware layer-wise distillation for language model compression. In Proceedings of the ICML.","journal-title":"In Proceedings of the ICML."},{"key":"e_1_3_2_112_2","unstructured":"Opher Lieber Barak Lenz Hofit Bata Gal Cohen Jhonathan Osin Itay Dalmedigos Erez Safahi Shaked Meirom Yonatan Belinkov Shai Shalev-Shwartz et\u00a0al. 2024. Jamba: A hybrid transformer-mamba language model. 
Retrieved from https:\/\/arxiv.org\/abs\/2403.19887"},{"key":"e_1_3_2_113_2","article-title":"AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration","author":"Lin Ji","year":"2024","unstructured":"Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. In Proceedings of the MLSys.","journal-title":"In Proceedings of the MLSys."},{"key":"e_1_3_2_114_2","unstructured":"Aixin Liu Bei Feng Bing Xue Bingxuan Wang Bochao Wu Chengda Lu Chenggang Zhao Chengqi Deng Chenyu Zhang Chong Ruan et\u00a0al. 2024. Deepseek-v3 technical report. Retrieved from https:\/\/arxiv.org\/abs\/2412.19437"},{"key":"e_1_3_2_115_2","article-title":"QLLM: Accurate and efficient low-bitwidth quantization for large language models","author":"Liu Jing","year":"2023","unstructured":"Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, and Bohan Zhuang. 2023. QLLM: Accurate and efficient low-bitwidth quantization for large language models. In Proceedings of the ICLR.","journal-title":"In Proceedings of the ICLR."},{"key":"e_1_3_2_116_2","article-title":"Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing","author":"Liu Pengfei","year":"2023","unstructured":"Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55, 9 (2023), 1\u201335.","journal-title":"ACM Computing Surveys"},{"key":"e_1_3_2_117_2","article-title":"FastBERT: A self-distilling BERT with adaptive inference time","author":"Liu Weijie","year":"2020","unstructured":"Weijie Liu, Peng Zhou, Zhiruo Wang, Zhe Zhao, Haotang Deng, and Qi Ju. 2020. FastBERT: A self-distilling BERT with adaptive inference time. In Proceedings of the ACL.","journal-title":"In Proceedings of the ACL."},{"key":"e_1_3_2_118_2","article-title":"Mobilellm: Optimizing sub-billion parameter language models for on-device use cases","author":"Liu Zechun","year":"2024","unstructured":"Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, et\u00a0al. 2024. Mobilellm: Optimizing sub-billion parameter language models for on-device use cases. In Proceedings of the ICML.","journal-title":"In Proceedings of the ICML."},{"key":"e_1_3_2_119_2","article-title":"Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture","author":"Lu Liqiang","year":"2021","unstructured":"Liqiang Lu, Yicheng Jin, Hangrui Bi, Zizhang Luo, Peng Li, Tao Wang, and Yun Liang. 2021. Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture. In Proceedings of the IEEE\/ACM MICRO.","journal-title":"In Proceedings of the IEEE\/ACM MICRO."},{"key":"e_1_3_2_120_2","article-title":"A multimodal generative AI copilot for human pathology","author":"Lu Ming Y.","year":"2024","unstructured":"Ming Y. Lu, Bowen Chen, Drew F. K. Williamson, Richard J. Chen, Melissa Zhao, Aaron K. Chow, Kenji Ikemura, Ahrong Kim, Dimitra Pouli, Ankush Patel, et\u00a0al. 2024. A multimodal generative AI copilot for human pathology. 
Nature 634, 8033 (2024), 466\u2013473.","journal-title":"Nature"},{"key":"e_1_3_2_121_2","article-title":"Turbo: Opportunistic enhancement for edge video analytics","author":"Lu Yan","year":"2022","unstructured":"Yan Lu, Shiqi Jiang, Ting Cao, and Yuanchao Shu. 2022. Turbo: Opportunistic enhancement for edge video analytics. In Proceedings of the ACM SenSys.","journal-title":"In Proceedings of the ACM SenSys."},{"key":"e_1_3_2_122_2","volume-title":"Proceedings of the ACM\/IEEE SEC","author":"Lu Yan","year":"2019","unstructured":"Yan Lu, Yuanchao Shu, Xu Tan, Yunxin Liu, Mengyu Zhou, Qi Chen, and Dan Pei. 2019. Collaborative learning between cloud and end devices: An empirical study on location prediction. In Proceedings of the ACM\/IEEE SEC."},{"key":"e_1_3_2_123_2","article-title":"Multi-view domain adaptive object detection on camera networks","author":"Lu Yan","year":"2023","unstructured":"Yan Lu, Zhun Zhong, and Yuanchao Shu. 2023. Multi-view domain adaptive object detection on camera networks. In Proceedings of the AAAI.","journal-title":"In Proceedings of the AAAI."},{"key":"e_1_3_2_124_2","article-title":"HPipe: Large language model pipeline parallelism for long context on heterogeneous cost-effective devices","author":"Ma Ruilong","year":"2024","unstructured":"Ruilong Ma, Xiang Yang, Jingyu Wang, Qi Qi, Haifeng Sun, Jing Wang, Zirui Zhuang, and Jianxin Liao. 2024. HPipe: Large language model pipeline parallelism for long context on heterogeneous cost-effective devices. In Proceedings of the NAACL-HLT.","journal-title":"In Proceedings of the NAACL-HLT."},{"key":"e_1_3_2_125_2","article-title":"LLM-Pruner: On the structural pruning of large language models","author":"Ma Xinyin","year":"2023","unstructured":"Xinyin Ma, Gongfan Fang, and Xinchao Wang. 2023. LLM-Pruner: On the structural pruning of large language models. In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_126_2","article-title":"Coco-agent: A comprehensive cognitive mllm agent for smartphone gui automation","author":"Ma Xinbei","year":"2024","unstructured":"Xinbei Ma, Zhuosheng Zhang, and Hai Zhao. 2024. Coco-agent: A comprehensive cognitive mllm agent for smartphone gui automation. In Findings of ACL.","journal-title":"In Findings of ACL."},{"key":"e_1_3_2_127_2","article-title":"Delight: Deep and light-weight transformer","author":"Mehta Sachin","year":"2021","unstructured":"Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2021. Delight: Deep and light-weight transformer. In Proceedings of the ICLR.","journal-title":"In Proceedings of the ICLR."},{"key":"e_1_3_2_128_2","unstructured":"Sachin Mehta Mohammad Hossein Sekhavat Qingqing Cao Maxwell Horton Yanzi Jin Chenfan Sun Iman Mirzadeh Mahyar Najibi Dmitry Belenko Peter Zatloukal et\u00a0al. 2024. OpenELM: An efficient language model family with open-source training and inference framework. Retrieved from https:\/\/arxiv.org\/abs\/2404.14619"},{"key":"e_1_3_2_129_2","unstructured":"Microsoft. 2018. ONNX Runtime is a Cross-platform Inference and Training Machine-learning Accelerator. Retrieved December 29 2024 from https:\/\/github.com\/microsoft\/onnxruntime"},{"key":"e_1_3_2_130_2","unstructured":"Microsoft. 2023. Microsoft Copilot. Retrieved June 14 2024 from https:\/\/copilot.microsoft.com\/"},{"key":"e_1_3_2_131_2","unstructured":"S\u00e9bastien Bubeck and Mojan Javaheripi. 2023. Phi-2: The Surprising Power of Small Language Models. 
Retrieved June 01 2024 from https:\/\/www.microsoft.com\/en-us\/research\/blog\/phi-2-the-surprising-power-of-small-language-models\/"},{"key":"e_1_3_2_132_2","article-title":"DNNFusion: Accelerating deep neural networks execution with advanced operator fusion","author":"Niu Wei","year":"2021","unstructured":"Wei Niu, Jiexiong Guan, Yanzhi Wang, Gagan Agrawal, and Bin Ren. 2021. DNNFusion: Accelerating deep neural networks execution with advanced operator fusion. In Proceedings of the ACM PLDI.","journal-title":"In Proceedings of the ACM PLDI."},{"key":"e_1_3_2_133_2","article-title":"SmartMem: Layout transformation elimination and adaptation for efficient DNN execution on mobile","author":"Niu Wei","year":"2024","unstructured":"Wei Niu, Md Musfiqur Rahman Sanim, Zhihao Shu, Jiexiong Guan, Xipeng Shen, Miao Yin, Gagan Agrawal, and Bin Ren. 2024. SmartMem: Layout transformation elimination and adaptation for efficient DNN execution on mobile. In Proceedings of the ACM ASPLOS.","journal-title":"In Proceedings of the ACM ASPLOS."},{"key":"e_1_3_2_134_2","unstructured":"NVIDIA. 2023. TensorRT-LLM. Retrieved June 09 2024 from https:\/\/github.com\/NVIDIA\/TensorRT-LLM"},{"key":"e_1_3_2_135_2","unstructured":"NVIDIA. 2024. Minitron-4B-Base. Retrieved December 19 2024 from https:\/\/huggingface.co\/nvidia\/Minitron-4B-Base"},{"key":"e_1_3_2_136_2","unstructured":"OpenAI. 2022. Introducing ChatGPT. Retrieved July 05 2024 from https:\/\/openai.com\/index\/chatgpt\/"},{"key":"e_1_3_2_137_2","article-title":"Training language models to follow instructions with human feedback","author":"Ouyang Long","year":"2022","unstructured":"Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et\u00a0al. 2022. Training language models to follow instructions with human feedback. In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_138_2","article-title":"Gemel: Model merging for memory-efficient, real-time video analytics at the edge","author":"Padmanabhan Arthi","year":"2023","unstructured":"Arthi Padmanabhan, Neil Agarwal, Anand Iyer, Ganesh Ananthanarayanan, Yuanchao Shu, Nikolaos Karianakis, Guoqing Harry Xu, and Ravi Netravali. 2023. Gemel: Model merging for memory-efficient, real-time video analytics at the edge. In Proceedings of the USENIX NSDI.","journal-title":"In Proceedings of the USENIX NSDI."},{"key":"e_1_3_2_139_2","article-title":"Propagating knowledge updates to lms through distillation","author":"Padmanabhan Shankar","year":"2024","unstructured":"Shankar Padmanabhan, Yasumasa Onoe, Michael Zhang, Greg Durrett, and Eunsol Choi. 2024. Propagating knowledge updates to lms through distillation. In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_140_2","article-title":"VLP: Vision language planning for autonomous driving","author":"Pan Chenbin","year":"2024","unstructured":"Chenbin Pan, Burhaneddin Yaman, Tommaso Nesti, Abhirup Mallik, Alessandro G Allievi, Senem Velipasalar, and Liu Ren. 2024. VLP: Vision language planning for autonomous driving. In Proceedings of the IEEE\/CVF CVPR.","journal-title":"In Proceedings of the IEEE\/CVF CVPR."},{"key":"e_1_3_2_141_2","unstructured":"Jupinder Parmar Shrimai Prabhumoye Joseph Jennings Mostofa Patwary Sandeep Subramanian Dan Su Chen Zhu Deepak Narayanan Aastha Jhunjhunwala Ayush Dattagupta et\u00a0al. 2024. Nemotron-4 15B technical report. 
Retrieved from https:\/\/arxiv.org\/abs\/2402.16819"},{"key":"e_1_3_2_142_2","unstructured":"Raspberry Pi. 2019. Raspberry Pi 4 on Sale Now. Retrieved December 29 2024 from https:\/\/www.raspberrypi.com\/news\/raspberry-pi-4-on-sale-now-from-35\/"},{"key":"e_1_3_2_143_2","unstructured":"Pytorch. 2023. ExecuTorch. Retrieved December 12 2024 from https:\/\/github.com\/pytorch\/executorch"},{"key":"e_1_3_2_144_2","article-title":"Interactive continual learning: Fast and slow thinking","author":"Qi Biqing","year":"2024","unstructured":"Biqing Qi, Xinquan Chen, Junqi Gao, Dong Li, Jianxing Liu, Ligang Wu, and Bowen Zhou. 2024. Interactive continual learning: Fast and slow thinking. In Proceedings of the IEEE\/CVF CVPR.","journal-title":"In Proceedings of the IEEE\/CVF CVPR."},{"key":"e_1_3_2_145_2","unstructured":"Guanqiao Qu Qiyuan Chen Wei Wei Zheng Lin Xianhao Chen and Kaibin Huang. 2024. Mobile edge intelligence for large language models: A contemporary survey. Retrieved from https:\/\/arxiv.org\/abs\/2407.18921"},{"key":"e_1_3_2_146_2","unstructured":"Qualcomm. 2024. Snapdragon 8 Series Mobile Platforms | Qualcomm. Retrieved July 17 2024 from https:\/\/www.qualcomm.com\/products\/mobile\/snapdragon\/smartphones\/snapdragon-8-series-mobile-platforms"},{"key":"e_1_3_2_147_2","unstructured":"Brian Rakowski. 2023. Pixel 8 Pro \u2013 the First Smartphone with AI Built in \u2013 is Now Running Gemini Nano Plus More AI Updates Coming to the Pixel Portfolio. Retrieved December 09 2024 from https:\/\/blog.google\/products\/pixel\/pixel-feature-drop-december-2023\/"},{"key":"e_1_3_2_148_2","article-title":"ODSearch: Fast and resource efficient on-device natural language search for fitness trackers\u2019 data","author":"Rawassizadeh Reza","year":"2023","unstructured":"Reza Rawassizadeh and Yi Rong. 2023. ODSearch: Fast and resource efficient on-device natural language search for fitness trackers\u2019 data. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, 4 (2023), 1\u201325.","journal-title":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"},{"key":"e_1_3_2_149_2","unstructured":"Yangjun Ruan Chris J. Maddison and Tatsunori Hashimoto. 2024. Observational scaling laws and the predictability of language model performance. Retrieved from https:\/\/arxiv.org\/abs\/2405.10938"},{"key":"e_1_3_2_150_2","article-title":"Matrix compression via randomized low rank and low precision factorization","author":"Saha Rajarshi","year":"2023","unstructured":"Rajarshi Saha, Varun Srivastava, and Mert Pilanci. 2023. Matrix compression via randomized low rank and low precision factorization. In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_151_2","article-title":"Movement pruning: Adaptive sparsity by fine-tuning","author":"Sanh Victor","year":"2020","unstructured":"Victor Sanh, Thomas Wolf, and Alexander Rush. 2020. Movement pruning: Adaptive sparsity by fine-tuning. In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_152_2","article-title":"OmniQuant: Omnidirectionally calibrated quantization for large language models","author":"Shao Wenqi","year":"2023","unstructured":"Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. 2023. OmniQuant: Omnidirectionally calibrated quantization for large language models. 
In Proceedings of the ICLR.","journal-title":"In Proceedings of the ICLR."},{"key":"e_1_3_2_153_2","article-title":"Q-BERT: Hessian based ultra low precision quantization of BERT","author":"Shen Sheng","year":"2020","unstructured":"Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2020. Q-BERT: Hessian based ultra low precision quantization of BERT. In Proceedings of the AAAI.","journal-title":"In Proceedings of the AAAI."},{"key":"e_1_3_2_154_2","article-title":"Agile-quant: Activation-guided quantization for faster inference of LLMs on the edge","author":"Shen Xuan","year":"2024","unstructured":"Xuan Shen, Peiyan Dong, Lei Lu, Zhenglun Kong, Zhengang Li, Ming Lin, Chao Wu, and Yanzhi Wang. 2024. Agile-quant: Activation-guided quantization for faster inference of LLMs on the edge. In Proceedings of the AAAI.","journal-title":"In Proceedings of the AAAI."},{"key":"e_1_3_2_155_2","article-title":"FlexGen: High-throughput generative inference of large language models with a single GPU","author":"Sheng Ying","year":"2023","unstructured":"Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Re, Ion Stoica, and Ce Zhang. 2023. FlexGen: High-throughput generative inference of large language models with a single GPU. In Proceedings of the ICML.","journal-title":"In Proceedings of the ICML."},{"key":"e_1_3_2_156_2","article-title":"Distilling reasoning capabilities into smaller language models","author":"Shridhar Kumar","year":"2023","unstructured":"Kumar Shridhar, Alessandro Stolfo, and Mrinmaya Sachan. 2023. Distilling reasoning capabilities into smaller language models. In Proceedings of the ACL.","journal-title":"In Proceedings of the ACL."},{"key":"e_1_3_2_157_2","article-title":"Efficient acceleration of deep learning inference on resource-constrained edge devices: A review","author":"Shuvo Md Maruf Hossain","year":"2022","unstructured":"Md Maruf Hossain Shuvo, Syed Kamrul Islam, Jianlin Cheng, and Bashir I. Morshed. 2022. Efficient acceleration of deep learning inference on resource-constrained edge devices: A review. Proceedings of the IEEE 111, 1 (2022), 42\u201391.","journal-title":"Proceedings of the IEEE"},{"key":"e_1_3_2_158_2","article-title":"PowerInfer: Fast large language model serving with a consumer-grade GPU","author":"Song Yixin","year":"2024","unstructured":"Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. 2024. PowerInfer: Fast large language model serving with a consumer-grade GPU. In Proceedings of the ACM SOSP.","journal-title":"In Proceedings of the ACM SOSP."},{"key":"e_1_3_2_159_2","doi-asserted-by":"crossref","DOI":"10.1109\/TVLSI.2023.3282046","article-title":"X-former: In-memory acceleration of transformers","author":"Sridharan Shrihari","year":"2023","unstructured":"Shrihari Sridharan, Jacob R. Stevens, Kaushik Roy, and Anand Raghunathan. 2023. X-former: In-memory acceleration of transformers. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 31, 8 (2023), 1223\u20131233.","journal-title":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems"},{"key":"e_1_3_2_160_2","article-title":"RoFormer: Enhanced transformer with rotary position embedding","author":"Su Jianlin","year":"2024","unstructured":"Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2024. RoFormer: Enhanced transformer with rotary position embedding. 
Neurocomputing 568, 1 (2024), 1\u201314.","journal-title":"Neurocomputing"},{"key":"e_1_3_2_161_2","article-title":"A simple and effective pruning approach for large language models","author":"Sun Mingjie","year":"2024","unstructured":"Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. 2024. A simple and effective pruning approach for large language models. In Proceedings of the ICLR.","journal-title":"In Proceedings of the ICLR."},{"key":"e_1_3_2_162_2","article-title":"SpecTr: Fast speculative decoding via optimal transport","author":"Sun Ziteng","year":"2024","unstructured":"Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, and Felix Yu. 2024. SpecTr: Fast speculative decoding via optimal transport. In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_163_2","article-title":"MobileBERT: A compact task-agnostic BERT for resource-limited devices","author":"Sun Zhiqing","year":"2020","unstructured":"Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. MobileBERT: A compact task-agnostic BERT for resource-limited devices. In Proceedings of the ACL.","journal-title":"In Proceedings of the ACL."},{"key":"e_1_3_2_164_2","article-title":"EdgeBERT: Sentence-level energy optimizations for latency-aware multi-task NLP inference","author":"Tambe Thierry","year":"2021","unstructured":"Thierry Tambe, Coleman Hooper, Lillian Pentecost, Tianyu Jia, En-Yu Yang, Marco Donato, Victor Sanh, Paul Whatmough, Alexander M. Rush, David Brooks, and Gu-Yeon Wei. 2021. EdgeBERT: Sentence-level energy optimizations for latency-aware multi-task NLP inference. In Proceedings of the IEEE\/ACM MICRO.","journal-title":"In Proceedings of the IEEE\/ACM MICRO."},{"key":"e_1_3_2_165_2","article-title":"Algorithm-hardware co-design of adaptive floating-point encodings for resilient deep learning inference","author":"Tambe Thierry","year":"2020","unstructured":"Thierry Tambe, En-Yu Yang, Zishen Wan, Yuntian Deng, Vijay Janapa Reddi, Alexander Rush, David Brooks, and Gu-Yeon Wei. 2020. Algorithm-hardware co-design of adaptive floating-point encodings for resilient deep learning inference. In Proceedings of the ACM\/IEEE DAC.","journal-title":"In Proceedings of the ACM\/IEEE DAC."},{"key":"e_1_3_2_166_2","article-title":"22.9 A 12nm 18.1 TFLOPs\/W sparse transformer processor with entropy-based early exit, mixed-precision predication and fine-grained power management","author":"Tambe Thierry","year":"2023","unstructured":"Thierry Tambe, Jeff Zhang, Coleman Hooper, Tianyu Jia, Paul N. Whatmough, Joseph Zuckerman, Maico Cassel Dos Santos, Erik Jens Loscalzo, Davide Giri, Kenneth Shepard, et\u00a0al. 2023. 22.9 A 12nm 18.1 TFLOPs\/W sparse transformer processor with entropy-based early exit, mixed-precision predication and fine-grained power management. In Proceedings of the IEEE ISSCC.","journal-title":"In Proceedings of the IEEE ISSCC."},
{"key":"e_1_3_2_168_2","first-page":"491","article-title":"GraphGPT: Graph instruction tuning for large language models","author":"Tang Jiabin","year":"2024","unstructured":"Jiabin Tang, Yuhao Yang, Wei Wei, Lei Shi, Lixin Su, Suqi Cheng, Dawei Yin, and Chao Huang. 2024. GraphGPT: Graph instruction tuning for large language models. In Proceedings of the ACM SIGIR. 491\u2013500.","journal-title":"In Proceedings of the ACM SIGIR."},{"key":"e_1_3_2_169_2","unstructured":"Gemini Team Rohan Anil Sebastian Borgeaud Jean-Baptiste Alayrac Jiahui Yu Radu Soricut Johan Schalkwyk Andrew M. Dai Anja Hauth Katie Millican et\u00a0al. 2023. Gemini: A family of highly capable multimodal models. Retrieved from https:\/\/arxiv.org\/abs\/2312.11805"},{"key":"e_1_3_2_170_2","unstructured":"Gemma Team Thomas Mesnard Cassidy Hardin Robert Dadashi Surya Bhupatiraju Shreya Pathak Laurent Sifre Morgane Rivi\u00e8re Mihir Sanjay Kale Juliette Love et\u00a0al. 2024. Gemma: Open models based on Gemini research and technology. Retrieved from https:\/\/arxiv.org\/abs\/2403.08295"},{"key":"e_1_3_2_171_2","unstructured":"Gemma Team Morgane Riviere Shreya Pathak Pier Giuseppe Sessa Cassidy Hardin Surya Bhupatiraju L\u00e9onard Hussenot Thomas Mesnard Bobak Shahriari Alexandre Ram\u00e9 et\u00a0al. 2024. Gemma 2: Improving open language models at a practical size. Retrieved from https:\/\/arxiv.org\/abs\/2408.00118"},{"key":"e_1_3_2_172_2","unstructured":"MLC Team. 2023. MLC LLM. Retrieved June 06 2024 from https:\/\/github.com\/mlc-ai\/mlc-llm"},{"key":"e_1_3_2_173_2","unstructured":"Qualcomm Team. 2024. Snapdragon 8 Gen 3 Mobile Platform. Retrieved December 03 2024 from https:\/\/www.qualcomm.com\/products\/mobile\/snapdragon\/smartphones\/snapdragon-8-series-mobile-platforms\/snapdragon-8-gen-3-mobile-platform"},{"key":"e_1_3_2_174_2","article-title":"Large-scale deterministic networks: Architecture, enabling technologies, case study and future directions","author":"Tian Wenbin","year":"2024","unstructured":"Wenbin Tian, Chaojie Gu, Miao Guo, Shibo He, Jiawen Kang, Dusit Niyato, and Jiming Chen. 2024. Large-scale deterministic networks: Architecture, enabling technologies, case study and future directions. IEEE Network.","journal-title":"IEEE Network"},{"key":"e_1_3_2_175_2","unstructured":"Xiaoyu Tian Junru Gu Bailin Li Yicheng Liu Yang Wang Zhiyong Zhao Kun Zhan Peng Jia Xianpeng Lang and Hang Zhao. 2024. DriveVLM: The convergence of autonomous driving and large vision-language models. Retrieved from https:\/\/arxiv.org\/abs\/2402.12289"},{"key":"e_1_3_2_176_2","article-title":"Dialogue summarization with mixture of experts based on large language models","author":"Tian Yuanhe","year":"2024","unstructured":"Yuanhe Tian, Fei Xia, and Yan Song. 2024. Dialogue summarization with mixture of experts based on large language models. In Proceedings of the ACL.","journal-title":"In Proceedings of the ACL."},{"key":"e_1_3_2_177_2","unstructured":"Hugo Touvron Thibaut Lavril Gautier Izacard Xavier Martinet Marie-Anne Lachaux Timoth\u00e9e Lacroix Baptiste Rozi\u00e8re Naman Goyal Eric Hambro Faisal Azhar et\u00a0al. 2023. LLaMA: Open and efficient foundation language models. 
Retrieved from https:\/\/arxiv.org\/abs\/2302.13971"},{"key":"e_1_3_2_178_2","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale et\u00a0al. 2023. Llama 2: Open foundation and fine-tuned chat models. Retrieved from https:\/\/arxiv.org\/abs\/2307.09288"},{"key":"e_1_3_2_179_2","doi-asserted-by":"crossref","DOI":"10.1162\/tacl_a_00577","article-title":"Efficient methods for natural language processing: A survey","author":"Treviso Marcos","year":"2023","unstructured":"Marcos Treviso, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Colin Raffel, et al. 2023. Efficient methods for natural language processing: A survey. Transactions of the Association for Computational Linguistics 11 (2023), 826\u2013860.","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"e_1_3_2_180_2","article-title":"QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks","author":"Tseng Albert","year":"2024","unstructured":"Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. 2024. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. In Proceedings of the ICML.","journal-title":"In Proceedings of the ICML."},{"key":"e_1_3_2_181_2","article-title":"A 28nm 15.59 \\(\\mu\\)J\/token full-digital bitline-transpose CIM-based sparse transformer accelerator with pipeline\/parallel reconfigurable modes","author":"Tu Fengbin","year":"2022","unstructured":"Fengbin Tu, Zihan Wu, Yiqi Wang, Ling Liang, Liu Liu, Yufei Ding, Leibo Liu, Shaojun Wei, Yuan Xie, and Shouyi Yin. 2022. A 28nm 15.59 \\(\\mu\\)J\/token full-digital bitline-transpose CIM-based sparse transformer accelerator with pipeline\/parallel reconfigurable modes. In Proceedings of the IEEE ISSCC.","journal-title":"In Proceedings of the IEEE ISSCC."},{"key":"e_1_3_2_182_2","article-title":"16.1 MulTCIM: A 28nm \\(2.24 \\mu \\mathrm{J}\\)\/token attention-token-bit hybrid sparse digital CIM-based accelerator for multimodal transformers","author":"Tu Fengbin","year":"2023","unstructured":"Fengbin Tu, Zihan Wu, Yiqi Wang, Weiwei Wu, Leibo Liu, Yang Hu, Shaojun Wei, and Shouyi Yin. 2023. 16.1 MulTCIM: A 28nm \\(2.24 \\mu \\mathrm{J}\\)\/token attention-token-bit hybrid sparse digital CIM-based accelerator for multimodal transformers. In Proceedings of the IEEE ISSCC.","journal-title":"In Proceedings of the IEEE ISSCC."},{"key":"e_1_3_2_183_2","article-title":"AccelTran: A sparsity-aware accelerator for dynamic inference with transformers","author":"Tuli Shikhar","year":"2023","unstructured":"Shikhar Tuli and Niraj K. Jha. 2023. AccelTran: A sparsity-aware accelerator for dynamic inference with transformers. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42, 11 (2023), 4038\u20134051.","journal-title":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems"},{"key":"e_1_3_2_184_2","article-title":"Attention is all you need","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. 
In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_185_2","article-title":"Knowledge fusion of large language models","author":"Wan Fanqi","year":"2024","unstructured":"Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi. 2024. Knowledge fusion of large language models. In Proceedings of the ICLR.","journal-title":"In Proceedings of the ICLR."},{"key":"e_1_3_2_186_2","article-title":"Efficient large language models: A survey","author":"Wan Zhongwei","year":"2024","unstructured":"Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Jiachen Liu, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury, and Mi Zhang. 2024. Efficient large language models: A survey. Transactions on Machine Learning Research 1, 1 (2024), 1\u201367.","journal-title":"Transactions on Machine Learning Research"},{"key":"e_1_3_2_187_2","article-title":"SpAtten: Efficient sparse attention architecture with cascade token and head pruning","author":"Wang Hanrui","year":"2021","unstructured":"Hanrui Wang, Zhekai Zhang, and Song Han. 2021. SpAtten: Efficient sparse attention architecture with cascade token and head pruning. In Proceedings of the IEEE HPCA.","journal-title":"In Proceedings of the IEEE HPCA."},{"issue":"4","key":"e_1_3_2_188_2","doi-asserted-by":"crossref","first-page":"4420","DOI":"10.1109\/JIOT.2024.3485748","article-title":"End-to-end multitarget flexible job shop scheduling with deep reinforcement learning","volume":"12","author":"Wang Rongkai","year":"2025","unstructured":"Rongkai Wang, Yiyang Jing, Chaojie Gu, Shibo He, and Jiming Chen. 2025. End-to-end multitarget flexible job shop scheduling with deep reinforcement learning. IEEE Internet of Things Journal 12, 4 (2025), 4420\u20134434.","journal-title":"IEEE Internet of Things Journal"},{"key":"e_1_3_2_189_2","article-title":"MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained transformers","author":"Wang Wenhui","year":"2021","unstructured":"Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, and Furu Wei. 2021. MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained transformers. In Proceedings of the ACL-IJCNLP.","journal-title":"In Proceedings of the ACL-IJCNLP."},{"key":"e_1_3_2_190_2","article-title":"MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers","author":"Wang Wenhui","year":"2020","unstructured":"Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_191_2","article-title":"Convergence of edge computing and deep learning: A comprehensive survey","author":"Wang Xiaofei","year":"2020","unstructured":"Xiaofei Wang, Yiwen Han, Victor C. M. Leung, Dusit Niyato, Xueqiang Yan, and Xu Chen. 2020. Convergence of edge computing and deep learning: A comprehensive survey. IEEE Communications Surveys and Tutorials 22, 2 (2020), 869\u2013904.","journal-title":"IEEE Communications Surveys and Tutorials"},{"key":"e_1_3_2_192_2","article-title":"Tabi: An efficient multi-level inference system for large language models","author":"Wang Yiding","year":"2023","unstructured":"Yiding Wang, Kai Chen, Haisheng Tan, and Kun Guo. 2023. Tabi: An efficient multi-level inference system for large language models. 
In Proceedings of the ACM EuroSys.","journal-title":"In Proceedings of the ACM EuroSys."},{"key":"e_1_3_2_193_2","article-title":"End-edge-cloud collaborative computing for deep learning: A comprehensive survey","author":"Wang Yingchao","year":"2024","unstructured":"Yingchao Wang, Chen Yang, Shulin Lan, Liehuang Zhu, and Yan Zhang. 2024. End-edge-cloud collaborative computing for deep learning: A comprehensive survey. IEEE Communications Surveys and Tutorials 26, 4 (2024), 2647\u20132683.","journal-title":"IEEE Communications Surveys and Tutorials"},{"key":"e_1_3_2_194_2","article-title":"Fed-DFA: Federated distillation for heterogeneous model fusion through the adversarial lens","author":"Wang Zichen","year":"2025","unstructured":"Zichen Wang, Feng Yan, Tianyi Wang, Cong Wang, Yuanchao Shu, Peng Cheng, and Jiming Chen. 2025. Fed-DFA: Federated distillation for heterogeneous model fusion through the adversarial lens. In Proceedings of the AAAI.","journal-title":"In Proceedings of the AAAI."},{"key":"e_1_3_2_195_2","article-title":"AutoDroid: LLM-powered task automation in Android","author":"Wen Hao","year":"2024","unstructured":"Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2024. AutoDroid: LLM-powered task automation in Android. In Proceedings of the ACM MobiCom.","journal-title":"In Proceedings of the ACM MobiCom."},{"key":"e_1_3_2_196_2","article-title":"LaMini-LM: A diverse herd of distilled models from large-scale instructions","author":"Wu Minghao","year":"2024","unstructured":"Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, and Alham Aji. 2024. LaMini-LM: A diverse herd of distilled models from large-scale instructions. In Proceedings of the EACL.","journal-title":"In Proceedings of the EACL."},{"key":"e_1_3_2_197_2","article-title":"Sheared LLaMA: Accelerating language model pre-training via structured pruning","author":"Xia Mengzhou","year":"2024","unstructured":"Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. 2024. Sheared LLaMA: Accelerating language model pre-training via structured pruning. In Proceedings of the ICLR.","journal-title":"In Proceedings of the ICLR."},{"key":"e_1_3_2_198_2","article-title":"Structured pruning learns compact and accurate models","author":"Xia Mengzhou","year":"2022","unstructured":"Mengzhou Xia, Zexuan Zhong, and Danqi Chen. 2022. Structured pruning learns compact and accurate models. In Proceedings of the ACL.","journal-title":"In Proceedings of the ACL."},{"key":"e_1_3_2_199_2","article-title":"Language models meet world models: Embodied experiences enhance language models","author":"Xiang Jiannan","year":"2024","unstructured":"Jiannan Xiang, Tianhua Tao, Yi Gu, Tianmin Shu, Zirui Wang, Zichao Yang, and Zhiting Hu. 2024. Language models meet world models: Embodied experiences enhance language models. In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_200_2","article-title":"SmoothQuant: Accurate and efficient post-training quantization for large language models","author":"Xiao Guangxuan","year":"2023","unstructured":"Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. SmoothQuant: Accurate and efficient post-training quantization for large language models. 
In Proceedings of the ICML.","journal-title":"In Proceedings of the ICML."},{"key":"e_1_3_2_201_2","article-title":"A survey on model compression and acceleration for pretrained language models","author":"Xu Canwen","year":"2023","unstructured":"Canwen Xu and Julian McAuley. 2023. A survey on model compression and acceleration for pretrained language models. In Proceedings of the AAAI.","journal-title":"In Proceedings of the AAAI."},{"key":"e_1_3_2_202_2","article-title":"EdgeLLM: Fast on-device LLM inference with speculative decoding","author":"Xu Daliang","year":"2024","unstructured":"Daliang Xu, Wangsong Yin, Hao Zhang, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, and Xuanzhe Liu. 2024. EdgeLLM: Fast on-device LLM inference with speculative decoding. IEEE Transactions on Mobile Computing 1, 1 (2024), 1\u201318.","journal-title":"IEEE Transactions on Mobile Computing"},{"key":"e_1_3_2_203_2","article-title":"MESEN: Exploit multimodal data to design unimodal human activity recognition with few labels","author":"Xu Lilin","year":"2023","unstructured":"Lilin Xu, Chaojie Gu, Rui Tan, Shibo He, and Jiming Chen. 2023. MESEN: Exploit multimodal data to design unimodal human activity recognition with few labels. In Proceedings of the ACM SenSys.","journal-title":"In Proceedings of the ACM SenSys."},{"key":"e_1_3_2_204_2","article-title":"GesturePrint: Enabling user identification for mmWave-based gesture recognition systems","author":"Xu Lilin","year":"2024","unstructured":"Lilin Xu, Keyi Wang, Chaojie Gu, Xiuzhen Guo, Shibo He, and Jiming Chen. 2024. GesturePrint: Enabling user identification for mmWave-based gesture recognition systems. In Proceedings of the IEEE ICDCS.","journal-title":"In Proceedings of the IEEE ICDCS."},{"key":"e_1_3_2_205_2","article-title":"BESA: Pruning large language models with blockwise parameter-efficient sparsity allocation","author":"Xu Peng","year":"2024","unstructured":"Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, and Ping Luo. 2024. BESA: Pruning large language models with blockwise parameter-efficient sparsity allocation. In Proceedings of the ICLR.","journal-title":"In Proceedings of the ICLR."},{"key":"e_1_3_2_206_2","article-title":"Mental-LLM: Leveraging large language models for mental health prediction via online text data","author":"Xu Xuhai","year":"2024","unstructured":"Xuhai Xu, Bingsheng Yao, Yuanzhe Dong, Saadia Gabriel, Hong Yu, James Hendler, Marzyeh Ghassemi, Anind K. Dey, and Dakuo Wang. 2024. Mental-LLM: Leveraging large language models for mental health prediction via online text data. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 1 (2024), 1\u201332.","journal-title":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"},{"key":"e_1_3_2_207_2","unstructured":"Yuxuan Yan Qianqian Yang Shunpu Tang and Zhiguo Shi. 2024. FeDeRA: Efficient fine-tuning of language models in federated learning leveraging weight decomposition. Retrieved from https:\/\/arxiv.org\/abs\/2404.18848"},{"key":"e_1_3_2_208_2","unstructured":"An Yang Baosong Yang Binyuan Hui Bo Zheng Bowen Yu Chang Zhou Chengpeng Li Chengyuan Li Dayiheng Liu Fei Huang et\u00a0al. 2024. Qwen2 technical report. Retrieved from https:\/\/arxiv.org\/abs\/2407.10671"},{"key":"e_1_3_2_209_2","article-title":"Large language models for test-free fault localization","author":"Yang Aidan Z. H.","year":"2024","unstructured":"Aidan Z. H. 
Yang, Claire Le Goues, Ruben Martins, and Vincent Hellendoorn. 2024. Large language models for test-free fault localization. In Proceedings of the ACM\/IEEE ICSE.","journal-title":"In Proceedings of the ACM\/IEEE ICSE."},{"key":"e_1_3_2_210_2","article-title":"MentaLLaMA: Interpretable mental health analysis on social media with large language models","author":"Yang Kailai","year":"2024","unstructured":"Kailai Yang, Tianlin Zhang, Ziyan Kuang, Qianqian Xie, Jimin Huang, and Sophia Ananiadou. 2024. MentaLLaMA: Interpretable mental health analysis on social media with large language models. In Proceedings of the ACM WWW.","journal-title":"In Proceedings of the ACM WWW."},{"key":"e_1_3_2_211_2","article-title":"MAF: Exploring mobile acoustic field for hand-to-face gesture interactions","author":"Yang Yongjie","year":"2024","unstructured":"Yongjie Yang, Tao Chen, Yujing Huang, Xiuzhen Guo, and Longfei Shangguan. 2024. MAF: Exploring mobile acoustic field for hand-to-face gesture interactions. In Proceedings of the ACM CHI.","journal-title":"In Proceedings of the ACM CHI."},{"key":"e_1_3_2_212_2","article-title":"ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers","author":"Yao Zhewei","year":"2022","unstructured":"Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. 2022. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_213_2","article-title":"Boost transformer-based language models with GPU-friendly sparsity and quantization","author":"Yu Chong","year":"2023","unstructured":"Chong Yu, Tao Chen, and Zhongxue Gan. 2023. Boost transformer-based language models with GPU-friendly sparsity and quantization. In Findings of ACL.","journal-title":"In Findings of ACL."},{"key":"e_1_3_2_214_2","article-title":"Mobile foundation model as firmware","author":"Yuan Jinliang","year":"2024","unstructured":"Jinliang Yuan, Chen Yang, Dongqi Cai, Shihe Wang, Xin Yuan, Zeling Zhang, Xiang Li, Dingge Zhang, Hanzi Mei, Xianqing Jia, et\u00a0al. 2024. Mobile foundation model as firmware. In Proceedings of the ACM MobiCom.","journal-title":"In Proceedings of the ACM MobiCom."},{"key":"e_1_3_2_215_2","article-title":"GOBO: Quantizing attention-based NLP models for low latency and energy efficient inference","author":"Zadeh Ali Hadi","year":"2020","unstructured":"Ali Hadi Zadeh, Isak Edo, Omar Mohamed Awad, and Andreas Moshovos. 2020. GOBO: Quantizing attention-based NLP models for low latency and energy efficient inference. In Proceedings of the IEEE\/ACM MICRO.","journal-title":"In Proceedings of the IEEE\/ACM MICRO."},{"key":"e_1_3_2_216_2","article-title":"Mokey: Enabling narrow fixed-point inference for out-of-the-box floating-point transformer models","author":"Zadeh Ali Hadi","year":"2022","unstructured":"Ali Hadi Zadeh, Mostafa Mahmoud, Ameer Abdelhadi, and Andreas Moshovos. 2022. Mokey: Enabling narrow fixed-point inference for out-of-the-box floating-point transformer models. In Proceedings of the ACM\/IEEE ISCA.","journal-title":"In Proceedings of the ACM\/IEEE ISCA."},{"key":"e_1_3_2_217_2","article-title":"ConsistentEE: A consistent and hardness-guided early exiting method for accelerating language models inference","author":"Zeng Ziqian","year":"2024","unstructured":"Ziqian Zeng, Yihuai Hong, Hongliang Dai, Huiping Zhuang, and Cen Chen. 2024. 
ConsistentEE: A consistent and hardness-guided early exiting method for accelerating language models inference. In Proceedings of the AAAI.","journal-title":"In Proceedings of the AAAI."},{"key":"e_1_3_2_218_2","article-title":"Lifting the curse of capacity gap in distilling language models","author":"Zhang Chen","year":"2023","unstructured":"Chen Zhang, Yang Yang, Jiahao Liu, Jingang Wang, Yunsen Xian, Benyou Wang, and Dawei Song. 2023. Lifting the curse of capacity gap in distilling language models. In Proceedings of the ACL.","journal-title":"In Proceedings of the ACL."},{"key":"e_1_3_2_219_2","article-title":"LlamaTouch: A faithful and scalable testbed for mobile UI automation task evaluation","author":"Zhang Li","year":"2024","unstructured":"Li Zhang, Shihe Wang, Xianqing Jia, Zhihan Zheng, Yunhe Yan, Longxi Gao, Yuanchun Li, and Mengwei Xu. 2024. LlamaTouch: A faithful and scalable testbed for mobile UI automation task evaluation. In Proceedings of the ACM UIST.","journal-title":"In Proceedings of the ACM UIST."},{"key":"e_1_3_2_220_2","article-title":"LoRAPrune: Structured pruning meets low-rank parameter-efficient fine-tuning","author":"Zhang Mingyang","year":"2024","unstructured":"Mingyang Zhang, Hao Chen, Chunhua Shen, Zhen Yang, Linlin Ou, Xinyi Yu, and Bohan Zhuang. 2024. LoRAPrune: Structured pruning meets low-rank parameter-efficient fine-tuning. In Findings of ACL.","journal-title":"In Findings of ACL."},{"key":"e_1_3_2_221_2","unstructured":"Susan Zhang Stephen Roller Naman Goyal Mikel Artetxe Moya Chen Shuohui Chen Christopher Dewan Mona Diab Xian Li Xi Victoria Lin et\u00a0al. 2022. OPT: Open pre-trained transformer language models. Retrieved from https:\/\/arxiv.org\/abs\/2205.01068"},{"key":"e_1_3_2_222_2","article-title":"TernaryBERT: Distillation-aware ultra-low bit BERT","author":"Zhang Wei","year":"2020","unstructured":"Wei Zhang, Lu Hou, Yichun Yin, Lifeng Shang, Xiao Chen, Xin Jiang, and Qun Liu. 2020. TernaryBERT: Distillation-aware ultra-low bit BERT. In Proceedings of the EMNLP.","journal-title":"In Proceedings of the EMNLP."},{"key":"e_1_3_2_223_2","article-title":"Resource management in mobile edge computing: A comprehensive survey","author":"Zhang Xiaojie","year":"2023","unstructured":"Xiaojie Zhang and Saptarshi Debroy. 2023. Resource management in mobile edge computing: A comprehensive survey. ACM Computing Surveys 55, 131 (2023), 1\u201337.","journal-title":"ACM Computing Surveys"},{"key":"e_1_3_2_224_2","article-title":"Beyond the cloud: Edge inference for generative large language models in wireless networks","author":"Zhang Xinyuan","year":"2024","unstructured":"Xinyuan Zhang, Jiangtian Nie, Yudong Huang, Gaochang Xie, Zehui Xiong, Jiang Liu, Dusit Niyato, and Xuemin Sherman Shen. 2024. Beyond the cloud: Edge inference for generative large language models in wireless networks. IEEE Transactions on Wireless Communications 24, 1 (2024), 643\u2013658.","journal-title":"IEEE Transactions on Wireless Communications"},{"key":"e_1_3_2_225_2","article-title":"Plug-and-play: An efficient post-training pruning method for large language models","author":"Zhang Yingtao","year":"2024","unstructured":"Yingtao Zhang, Haoli Bai, Haokun Lin, Jialin Zhao, Lu Hou, and Carlo Vittorio Cannistraci. 2024. Plug-and-play: An efficient post-training pruning method for large language models. 
In Proceedings of the ICLR.","journal-title":"In Proceedings of the ICLR."},{"key":"e_1_3_2_226_2","article-title":"Vulcan: Automatic query planning for live ML analytics","author":"Zhang Yiwen","year":"2024","unstructured":"Yiwen Zhang, Xumiao Zhang, Ganesh Ananthanarayanan, Anand Iyer, Yuanchao Shu, Victor Bahl, Z. Morley Mao, and Mosharaf Chowdhury. 2024. Vulcan: Automatic query planning for live ML analytics. In Proceedings of the USENIX NSDI.","journal-title":"In Proceedings of the USENIX NSDI."},{"key":"e_1_3_2_227_2","article-title":"Q-Hitter: A better token oracle for efficient LLM inference via sparse-quantized KV cache","author":"Zhang Zhenyu","year":"2024","unstructured":"Zhenyu Zhang, Shiwei Liu, Runjin Chen, Bhavya Kailkhura, Beidi Chen, and Atlas Wang. 2024. Q-Hitter: A better token oracle for efficient LLM inference via sparse-quantized KV cache. In Proceedings of the MLSys.","journal-title":"In Proceedings of the MLSys."},{"key":"e_1_3_2_228_2","article-title":"LinguaLinked: Distributed large language model inference on mobile devices","author":"Zhao Junchen","year":"2024","unstructured":"Junchen Zhao, Yurun Song, Simeng Liu, Ian Harris, and Sangeetha Abdu Jyothi. 2024. LinguaLinked: Distributed large language model inference on mobile devices. In Proceedings of the ACL.","journal-title":"In Proceedings of the ACL."},{"key":"e_1_3_2_229_2","article-title":"DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving","author":"Zhong Yinmin","year":"2024","unstructured":"Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In Proceedings of the USENIX OSDI.","journal-title":"In Proceedings of the USENIX OSDI."},{"key":"e_1_3_2_230_2","article-title":"TransPIM: A memory-based acceleration via software-hardware co-design for transformer","author":"Zhou Minxuan","year":"2022","unstructured":"Minxuan Zhou, Weihong Xu, Jaeyoung Kang, and Tajana Rosing. 2022. TransPIM: A memory-based acceleration via software-hardware co-design for transformer. In Proceedings of the IEEE HPCA.","journal-title":"In Proceedings of the IEEE HPCA."},{"key":"e_1_3_2_231_2","article-title":"AnomalyCLIP: Object-agnostic prompt learning for zero-shot anomaly detection","author":"Zhou Qihang","year":"2024","unstructured":"Qihang Zhou, Guansong Pang, Yu Tian, Shibo He, and Jiming Chen. 2024. AnomalyCLIP: Object-agnostic prompt learning for zero-shot anomaly detection. In Proceedings of the ICLR.","journal-title":"In Proceedings of the ICLR."},{"key":"e_1_3_2_232_2","article-title":"BERT loses patience: Fast and robust inference with early exit","author":"Zhou Wangchunshu","year":"2020","unstructured":"Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, and Furu Wei. 2020. BERT loses patience: Fast and robust inference with early exit. In Proceedings of the NeurIPS.","journal-title":"In Proceedings of the NeurIPS."},{"key":"e_1_3_2_233_2","article-title":"Do LLMs understand visual anomalies? Uncovering LLM\u2019s capabilities in zero-shot anomaly detection","author":"Zhu Jiaqi","year":"2024","unstructured":"Jiaqi Zhu, Shaofeng Cai, Fang Deng, Beng Chin Ooi, and Junran Wu. 2024. Do LLMs understand visual anomalies? Uncovering LLM\u2019s capabilities in zero-shot anomaly detection. 
In Proceedings of the ACM MM.","journal-title":"In Proceedings of the ACM MM."},{"key":"e_1_3_2_234_2","article-title":"LeeBERT: Learned early exit for BERT with cross-level optimization","author":"Zhu Wei","year":"2021","unstructured":"Wei Zhu. 2021. LeeBERT: Learned early exit for BERT with cross-level optimization. In Proceedings of the ACL.","journal-title":"In Proceedings of the ACL."},{"key":"e_1_3_2_235_2","article-title":"Towards an on-device agent for text rewriting","author":"Zhu Yun","year":"2024","unstructured":"Yun Zhu, Yinxiao Liu, Felix Stahlberg, Shankar Kumar, Yu-Hui Chen, Liangchen Luo, Lei Shu, Renjie Liu, Jindong Chen, and Lei Meng. 2024. Towards an on-device agent for text rewriting. In Findings of NAACL.","journal-title":"In Findings of NAACL."}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3719664","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3719664","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T18:43:22Z","timestamp":1750272202000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3719664"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,23]]},"references-count":234,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2025,8,31]]}},"alternative-id":["10.1145\/3719664"],"URL":"https:\/\/doi.org\/10.1145\/3719664","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,3,23]]},"assertion":[{"value":"2024-09-10","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-02-14","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-03-23","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}