{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,17]],"date-time":"2026-05-17T09:09:51Z","timestamp":1779008991493,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":104,"publisher":"ACM","license":[{"start":{"date-parts":[[2026,5,10]],"date-time":"2026-05-10T00:00:00Z","timestamp":1778371200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/legalcode"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62325201"],"award-info":[{"award-number":["62325201"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"NSFC","award":["62522202"],"award-info":[{"award-number":["62522202"]}]},{"name":"Beijing Natural Science Foundation","award":["L253005"],"award-info":[{"award-number":["L253005"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,5,11]]},"DOI":"10.1145\/3774906.3800479","type":"proceedings-article","created":{"date-parts":[[2026,5,8]],"date-time":"2026-05-08T14:20:14Z","timestamp":1778250014000},"page":"377-391","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["An Efficient Context Management System for On-Device LLMaaS"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-6242-4368","authenticated-orcid":false,"given":"Wangsong","family":"Yin","sequence":"first","affiliation":[{"name":"Peking University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6271-6993","authenticated-orcid":false,"given":"Mengwei","family":"Xu","sequence":"additional","affiliation":[{"name":"BUPT, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1591-2526","authenticated-orcid":false,"given":"Yuanchun","family":"Li","sequence":"additional","affiliation":[{"name":"Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7908-8484","authenticated-orcid":false,"given":"Xuanzhe","family":"Liu","sequence":"additional","affiliation":[{"name":"Peking University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2026,5,10]]},"reference":[{"key":"e_1_3_3_2_2_2","unstructured":"2024. AICore. https:\/\/developer.android.com\/ml\/aicore."},{"key":"e_1_3_3_2_3_2","unstructured":"2024. Andriod low-memory killer. https:\/\/developer.android.com\/topic\/performance\/memory-management#low-memory_killer."},{"key":"e_1_3_3_2_4_2","unstructured":"2024. Gboard Smart Reply. https:\/\/developers.google.com\/ml-kit\/language\/smart-reply."},{"key":"e_1_3_3_2_5_2","unstructured":"2024. Glarity. https:\/\/glarity.app\/."},{"key":"e_1_3_3_2_6_2","unstructured":"2024. GPT-based email writer. https:\/\/hix.ai\/ai-email-writer-email-generator."},{"key":"e_1_3_3_2_7_2","unstructured":"2024. GPT4-Turbo. https:\/\/platform.openai.com\/docs\/models\/gpt-4-and-gpt-4-turbo."},{"key":"e_1_3_3_2_8_2","unstructured":"2024. Jetson Orin NX. https:\/\/www.nvidia.com\/en-us\/autonomous-machines\/embedded-systems\/jetson-orin\/."},{"key":"e_1_3_3_2_9_2","unstructured":"2024. Jetson TX2. https:\/\/developer.nvidia.com\/embedded\/jetson-tx2."},{"key":"e_1_3_3_2_10_2","unstructured":"2024. Large Language Models On-Device with MediaPipe and TensorFlow Lite. https:\/\/developers.googleblog.com\/2024\/03\/running-large-language-models-on-device-with-mediapipe-andtensorflow-lite.html."},{"key":"e_1_3_3_2_11_2","unstructured":"2024. Llama.cpp. https:\/\/github.com\/ggerganov\/llama.cpp."},{"key":"e_1_3_3_2_12_2","unstructured":"2024. LLM-based AI-Assistant. https:\/\/github.com\/avsrma\/LLM-based-AI-Assistant."},{"key":"e_1_3_3_2_13_2","unstructured":"2024. LLM customer service and support. https:\/\/www.databricks.com\/solutions\/accelerators\/llms-customer-service-and-support."},{"key":"e_1_3_3_2_14_2","unstructured":"2024. LLM telegram chatbot. https:\/\/github.com\/Fatal3xcept10n\/LLM-Telegram-Chatbot."},{"key":"e_1_3_3_2_15_2","unstructured":"2024. LM Deploy. https:\/\/github.com\/InternLM\/lmdeploy\/tree\/main."},{"key":"e_1_3_3_2_16_2","unstructured":"2024. MI14 smartphone. https:\/\/en.wikipedia.org\/wiki\/Xiaomi_14."},{"key":"e_1_3_3_2_17_2","unstructured":"2024. News Summarization with LLM. https:\/\/github.com\/KillerStrike17\/News-Summarization-with-LLM."},{"key":"e_1_3_3_2_18_2","unstructured":"2024. Pickle. https:\/\/docs.python.org\/3\/library\/pickle.html."},{"key":"e_1_3_3_2_19_2","unstructured":"2024. Pickle-in-Cpp. https:\/\/github.com\/Usama-Azad\/Pickle-in-Cpp."},{"key":"e_1_3_3_2_20_2","unstructured":"2024. Snapdragon 8 gen 3 mobile platform product brief. https:\/\/docs.qualcomm.com\/bundle\/publicresource\/87-71408-1_REV_C_Snapdragon_8_gen_3_Mobile_Platform_Product_Brief.pdf."},{"key":"e_1_3_3_2_21_2","unstructured":"2024. zRAM. https:\/\/en.wikipedia.org\/wiki\/Zram."},{"key":"e_1_3_3_2_22_2","unstructured":"Reyna Abhyankar Zijian He Vikranth Srivatsa Hao Zhang and Yiying Zhang. 2024. APIServe: Efficient API Support for Large-Language Model Inferencing. arxiv:https:\/\/arXiv.org\/abs\/2402.01869\u00a0[cs.LG]"},{"key":"e_1_3_3_2_23_2","unstructured":"OpenAI:\u00a0Josh Achiam Steven Adler Sandhini Agarwal Lama Ahmad Ilge Akkaya et\u00a0al. 2023. GPT-4 Technical Report. arxiv:https:\/\/arXiv.org\/abs\/2303.08774\u00a0[cs.CL]"},{"key":"e_1_3_3_2_24_2","doi-asserted-by":"crossref","unstructured":"Keivan Alizadeh Iman Mirzadeh Dmitry Belenko Karen Khatamifard Minsik Cho Carlo C\u00a0Del Mundo Mohammad Rastegari and Mehrdad Farajtabar. 2024. LLM in a flash: Efficient Large Language Model Inference with Limited Memory. arxiv:https:\/\/arXiv.org\/abs\/2312.11514\u00a0[cs.CL]","DOI":"10.18653\/v1\/2024.acl-long.678"},{"key":"e_1_3_3_2_25_2","unstructured":"Ebtesam Almazrouei Hamza Alobeidli Abdulaziz Alshamsi Alessandro Cappelli Ruxandra Cojocaru M\u00e9rouane Debbah \u00c9tienne Goffinet Daniel Hesslow Julien Launay Quentin Malartic Daniele Mazzotta Badreddine Noune Baptiste Pannier and Guilherme Penedo. 2023. The Falcon Series of Open Language Models. arxiv:https:\/\/arXiv.org\/abs\/2311.16867\u00a0[cs.CL]"},{"key":"e_1_3_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/2684822.2685302"},{"key":"e_1_3_3_2_27_2","unstructured":"Dzmitry Bahdanau Kyunghyun Cho and Yoshua Bengio. 2016. Neural Machine Translation by Jointly Learning to Align and Translate. arxiv:https:\/\/arXiv.org\/abs\/1409.0473\u00a0[cs.CL]"},{"key":"e_1_3_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-4717"},{"key":"e_1_3_3_2_29_2","unstructured":"Tom\u00a0B. Brown Benjamin Mann Nick Ryder et\u00a0al. 2020. Language Models are Few-Shot Learners. arxiv:https:\/\/arXiv.org\/abs\/2005.14165\u00a0[cs.CL]"},{"key":"e_1_3_3_2_30_2","unstructured":"Le Chen Dahu Feng Erhu Feng Yingrui Wang Rong Zhao Yubin Xia Pinjie Xu and Haibo Chen. 2025. Characterizing Mobile SoC for Accelerating Heterogeneous LLM Inference. arxiv:https:\/\/arXiv.org\/abs\/2501.14794\u00a0[cs.DC] https:\/\/arxiv.org\/abs\/2501.14794"},{"key":"e_1_3_3_2_31_2","doi-asserted-by":"crossref","unstructured":"Qiwei Chen Huan Zhao Wei Li Pipei Huang and Wenwu Ou. 2019. Behavior Sequence Transformer for E-commerce Recommendation in Alibaba. arxiv:https:\/\/arXiv.org\/abs\/1905.06874\u00a0[cs.IR]","DOI":"10.1145\/3326937.3341261"},{"key":"e_1_3_3_2_32_2","doi-asserted-by":"publisher","unstructured":"Weiduo Chen Xiaoshe Dong Fan Zhang Bowen Li Yufei Wang and Qiang Wang. 2025. ATP: Achieving Throughput Peak for DNN Training via Smart GPU Memory Management. ACM Trans. Archit. Code Optim. 22 1 Article 21 (March 2025) 27\u00a0pages. 10.1145\/3701996","DOI":"10.1145\/3701996"},{"key":"e_1_3_3_2_33_2","unstructured":"Wei-Lin Chiang Zhuohan Li Zi Lin Ying Sheng Zhanghao Wu Hao Zhang Lianmin Zheng Siyuan Zhuang Yonghao Zhuang Joseph\u00a0E. Gonzalez Ion Stoica and Eric\u00a0P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https:\/\/lmsys.org\/blog\/2023-03-30-vicuna\/"},{"key":"e_1_3_3_2_34_2","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arxiv:https:\/\/arXiv.org\/abs\/1810.04805\u00a0[cs.CL]"},{"key":"e_1_3_3_2_35_2","unstructured":"Qingxiu Dong Lei Li Damai Dai Ce Zheng Zhiyong Wu Baobao Chang Xu Sun Jingjing Xu Lei Li and Zhifang Sui. 2023. A Survey on In-context Learning. arxiv:https:\/\/arXiv.org\/abs\/2301.00234\u00a0[cs.CL]"},{"key":"e_1_3_3_2_36_2","unstructured":"Elias Frantar Saleh Ashkboos Torsten Hoefler and Dan Alistarh. 2023. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arxiv:https:\/\/arXiv.org\/abs\/2210.17323\u00a0[cs.LG]"},{"key":"e_1_3_3_2_37_2","doi-asserted-by":"publisher","unstructured":"Thomas\u00a0Mesnard Gemma\u00a0Team Cassidy Hardin Robert Dadashi Surya Bhupatiraju Laurent Sifre Morgane Rivi\u00e8re Mihir\u00a0Sanjay Kale Juliette Love Pouya Tafti L\u00e9onard Hussenot and et al.2024. Gemma. (2024). 10.34740\/KAGGLE\/M\/3301","DOI":"10.34740\/KAGGLE\/M\/3301"},{"key":"e_1_3_3_2_38_2","unstructured":"In Gim Guojun Chen Seung seob Lee Nikhil Sarda Anurag Khandelwal and Lin Zhong. 2023. Prompt Cache: Modular Attention Reuse for Low-Latency Inference. arxiv:https:\/\/arXiv.org\/abs\/2311.04934\u00a0[cs.CL]"},{"key":"e_1_3_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-5409"},{"key":"e_1_3_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3575693.3575698"},{"key":"e_1_3_3_2_41_2","doi-asserted-by":"publisher","unstructured":"Weichao Guo Kang Chen Huan Feng Yongwei Wu Rui Zhang and Weimin Zheng. 2016. MARS : Mobile Application Relaunching Speed-Up through Flash-Aware Page Swapping. IEEE Trans. Comput. 65 3 (2016) 916\u2013928. 10.1109\/TC.2015.2428692","DOI":"10.1109\/TC.2015.2428692"},{"key":"e_1_3_3_2_42_2","unstructured":"Song Han Huizi Mao and William\u00a0J. Dally. 2016. Deep Compression: Compressing Deep Neural Networks with Pruning Trained Quantization and Huffman Coding. arxiv:https:\/\/arXiv.org\/abs\/1510.00149\u00a0[cs.CV]"},{"key":"e_1_3_3_2_43_2","unstructured":"Dan Hendrycks Collin Burns Steven Basart Andy Zou Mantas Mazeika Dawn Song and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding. arxiv:https:\/\/arXiv.org\/abs\/2009.03300\u00a0[cs.CY]"},{"key":"e_1_3_3_2_44_2","first-page":"1693","volume-title":"NIPS","author":"Hermann Karl\u00a0Moritz","year":"2015","unstructured":"Karl\u00a0Moritz Hermann, Tom\u00e1s Kocisk\u00fd, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching Machines to Read and Comprehend. In NIPS. 1693\u20131701. http:\/\/papers.nips.cc\/paper\/5945-teaching-machines-to-read-and-comprehend"},{"key":"e_1_3_3_2_45_2","unstructured":"Edward\u00a0J. Hu Yelong Shen Phillip Wallis Zeyuan Allen-Zhu Yuanzhi Li Shean Wang Lu Wang and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arxiv:https:\/\/arXiv.org\/abs\/2106.09685\u00a0[cs.CL]"},{"key":"e_1_3_3_2_46_2","doi-asserted-by":"publisher","unstructured":"Sang-Hoon Kim Jinkyu Jeong and Jin-Soo Kim. 2017. Application-Aware Swapping for Mobile Systems. ACM Trans. Embed. Comput. Syst. 16 5s Article 182 (sep 2017) 19\u00a0pages. 10.1145\/3126509","DOI":"10.1145\/3126509"},{"key":"e_1_3_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_3_2_48_2","first-page":"873","volume-title":"2020 USENIX Annual Technical Conference (USENIX ATC 20)","author":"Lebeck Niel","year":"2020","unstructured":"Niel Lebeck, Arvind Krishnamurthy, Henry\u00a0M. Levy, and Irene Zhang. 2020. End the Senseless Killing: Improving Memory Management for Mobile Operating Systems. In 2020 USENIX Annual Technical Conference (USENIX ATC 20). USENIX Association, 873\u2013887. https:\/\/www.usenix.org\/conference\/atc20\/presentation\/lebeck"},{"key":"e_1_3_3_2_49_2","unstructured":"Ruihao Li Shagnik Pal Vineeth\u00a0Narayan Pullu Prasoon Sinha Jeeho Ryoo Lizy\u00a0K. John and Neeraja\u00a0J. Yadwadkar. 2025. MIRAGE: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving. arxiv:https:\/\/arXiv.org\/abs\/2507.11507\u00a0[cs.OS] https:\/\/arxiv.org\/abs\/2507.11507"},{"key":"e_1_3_3_2_50_2","unstructured":"Yuanchun Li Hao Wen Weijun Wang Xiangyu Li Yizhen Yuan Guohong Liu Jiacheng Liu Wenxing Xu Xiang Wang Yi Sun Rui Kong Yile Wang Hanfei Geng Jian Luan Xuefeng Jin Zilong Ye Guanjing Xiong Fan Zhang Xiang Li Mengwei Xu Zhijun Li Peng Li Yang Liu Ya-Qin Zhang and Yunxin Liu. 2024. Personal LLM Agents: Insights and Survey about the Capability Efficiency and Security. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2401.05459 (2024)."},{"key":"e_1_3_3_2_51_2","unstructured":"Yujun Lin Haotian Tang Shang Yang Zhekai Zhang Guangxuan Xiao Chuang Gan and Song Han. 2024. QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving. arxiv:https:\/\/arXiv.org\/abs\/2405.04532"},{"key":"e_1_3_3_2_52_2","unstructured":"Haotian Liu Chunyuan Li Qingyang Wu and Yong\u00a0Jae Lee. 2023. Visual Instruction Tuning. arxiv:https:\/\/arXiv.org\/abs\/2304.08485\u00a0[cs.CV]"},{"key":"e_1_3_3_2_53_2","doi-asserted-by":"crossref","unstructured":"Yuhan Liu Hanchen Li Yihua Cheng Siddhant Ray Yuyang Huang Qizheng Zhang Kuntai Du Jiayi Yao Shan Lu Ganesh Ananthanarayanan Michael Maire Henry Hoffmann Ari Holtzman and Junchen Jiang. 2024. CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving. arxiv:https:\/\/arXiv.org\/abs\/2310.07240","DOI":"10.1145\/3651890.3672274"},{"key":"e_1_3_3_2_54_2","unstructured":"Shuming Ma Hongyu Wang Lingxiao Ma Lei Wang Wenhui Wang Shaohan Huang Li Dong Ruiping Wang Jilong Xue and Furu Wei. 2024. The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. arxiv:https:\/\/arXiv.org\/abs\/2402.17764\u00a0[cs.CL]"},{"key":"e_1_3_3_2_55_2","unstructured":"Zehong Ma Longhui Wei Feng Wang Shiliang Zhang and Qi Tian. 2025. MagCache: Fast Video Generation with Magnitude-Aware Cache. arxiv:https:\/\/arXiv.org\/abs\/2506.09045\u00a0[cs.CV] https:\/\/arxiv.org\/abs\/2506.09045"},{"key":"e_1_3_3_2_56_2","volume-title":"Forty-second International Conference on Machine Learning","author":"Ma Zehong","year":"2025","unstructured":"Zehong Ma, Shiliang Zhang, Longhui Wei, and Qi Tian. 2025. Efficient Multi-modal Long Context Learning for Training-free Adaptation. In Forty-second International Conference on Machine Learning. https:\/\/openreview.net\/forum?id=6Rvs8jluQP"},{"key":"e_1_3_3_2_57_2","unstructured":"Sourab Mangrulkar Sylvain Gugger Lysandre Debut Younes Belkada Sayak Paul and Benjamin Bossan. 2022. PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods. https:\/\/github.com\/huggingface\/peft."},{"key":"e_1_3_3_2_58_2","unstructured":"Stephen Merity Caiming Xiong James Bradbury and Richard Socher. 2016. Pointer Sentinel Mixture Models. arxiv:https:\/\/arXiv.org\/abs\/1609.07843\u00a0[cs.CL]"},{"key":"e_1_3_3_2_59_2","doi-asserted-by":"crossref","unstructured":"Sewon Min Xinxi Lyu Ari Holtzman Mikel Artetxe Mike Lewis Hannaneh Hajishirzi and Luke Zettlemoyer. 2022. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? arxiv:https:\/\/arXiv.org\/abs\/2202.12837\u00a0[cs.CL]","DOI":"10.18653\/v1\/2022.emnlp-main.759"},{"key":"e_1_3_3_2_60_2","volume-title":"mllm","author":"team mllm","year":"2023","unstructured":"mllm team. 2023. mllm. https:\/\/github.com\/UbiquitousLearning\/mllm"},{"key":"e_1_3_3_2_61_2","doi-asserted-by":"crossref","unstructured":"Shashi Narayan Shay\u00a0B. Cohen and Mirella Lapata. 2018. Don\u2019t Give Me the Details Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. ArXiv abs\/1808.08745 (2018).","DOI":"10.18653\/v1\/D18-1206"},{"key":"e_1_3_3_2_62_2","unstructured":"Charles Packer Sarah Wooders Kevin Lin Vivian Fang Shishir\u00a0G. Patil Ion Stoica and Joseph\u00a0E. Gonzalez. 2024. MemGPT: Towards LLMs as Operating Systems. arxiv:https:\/\/arXiv.org\/abs\/2310.08560\u00a0[cs.AI] https:\/\/arxiv.org\/abs\/2310.08560"},{"key":"e_1_3_3_2_63_2","unstructured":"Zhuoshi Pan Qianhui Wu Huiqiang Jiang Menglin Xia Xufang Luo Jue Zhang Qingwei Lin Victor R\u00fchle Yuqing Yang Chin-Yew Lin H.\u00a0Vicky Zhao Lili Qiu and Dongmei Zhang. 2024. LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression. arxiv:https:\/\/arXiv.org\/abs\/2403.12968"},{"key":"e_1_3_3_2_64_2","doi-asserted-by":"publisher","DOI":"10.1145\/2493432.2493490"},{"key":"e_1_3_3_2_65_2","unstructured":"Reiner Pope Sholto Douglas Aakanksha Chowdhery Jacob Devlin James Bradbury Anselm Levskaya Jonathan Heek Kefan Xiao Shivani Agrawal and Jeff Dean. 2022. Efficiently Scaling Transformer Inference. arxiv:https:\/\/arXiv.org\/abs\/2211.05102\u00a0[cs.LG]"},{"key":"e_1_3_3_2_66_2","doi-asserted-by":"publisher","unstructured":"Bozidar Radunovic and Jean-Yves Le\u00a0Boudec. 2007. A Unified Framework for Max-Min and Min-Max Fairness With Applications. IEEE\/ACM Transactions on Networking 15 5 (2007) 1073\u20131083. 10.1109\/TNET.2007.896231","DOI":"10.1109\/TNET.2007.896231"},{"key":"e_1_3_3_2_67_2","doi-asserted-by":"crossref","unstructured":"Pranav Rajpurkar Jian Zhang Konstantin Lopyrev and Percy Liang. 2016. SQuAD: 100 000+ Questions for Machine Comprehension of Text. arxiv:https:\/\/arXiv.org\/abs\/1606.05250\u00a0[cs.CL]","DOI":"10.18653\/v1\/D16-1264"},{"key":"e_1_3_3_2_68_2","unstructured":"Leming Shen Qiang Yang Xinyu Huang Zijing Ma and Yuanqing Zheng. 2025. GPIoT: Tailoring Small Language Models for IoT Program Synthesis and Development. arxiv:https:\/\/arXiv.org\/abs\/2503.00686\u00a0[cs.SE] https:\/\/arxiv.org\/abs\/2503.00686"},{"key":"e_1_3_3_2_69_2","unstructured":"Leming Shen Qiang Yang Yuanqing Zheng and Mo Li. 2025. AutoIOT: LLM-Driven Automated Natural Language Programming for AIoT Applications. arxiv:https:\/\/arXiv.org\/abs\/2503.05346\u00a0[cs.CL] https:\/\/arxiv.org\/abs\/2503.05346"},{"key":"e_1_3_3_2_70_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D13-1170"},{"key":"e_1_3_3_2_71_2","unstructured":"Yixin Song Zeyu Mi Haotong Xie and Haibo Chen. 2023. PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU. arxiv:https:\/\/arXiv.org\/abs\/2312.12456\u00a0[cs.LG]"},{"key":"e_1_3_3_2_72_2","unstructured":"Yi Su Yuechi Zhou Quantong Qiu Juntao Li Qingrong Xia Ping Li Xinyu Duan Zhefeng Wang and Min Zhang. 2025. Accurate KV Cache Quantization with Outlier Tokens Tracing. arxiv:https:\/\/arXiv.org\/abs\/2505.10938\u00a0[cs.CL] https:\/\/arxiv.org\/abs\/2505.10938"},{"key":"e_1_3_3_2_73_2","unstructured":"Gemini Team Rohan Anil Sebastian Borgeaud Yonghui Wu Jean-Baptiste Alayrac Jiahui Yu et\u00a0al. 2023. Gemini: A Family of Highly Capable Multimodal Models. arxiv:https:\/\/arXiv.org\/abs\/2312.11805\u00a0[cs.CL]"},{"key":"e_1_3_3_2_74_2","volume-title":"MLC-LLM","author":"team MLC","year":"2023","unstructured":"MLC team. 2023. MLC-LLM. https:\/\/github.com\/mlc-ai\/mlc-llm"},{"key":"e_1_3_3_2_75_2","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei et\u00a0al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arxiv:https:\/\/arXiv.org\/abs\/2307.09288\u00a0[cs.CL]"},{"key":"e_1_3_3_2_76_2","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan\u00a0N. Gomez Lukasz Kaiser and Illia Polosukhin. 2023. Attention Is All You Need. arxiv:https:\/\/arXiv.org\/abs\/1706.03762\u00a0[cs.CL]"},{"key":"e_1_3_3_2_77_2","doi-asserted-by":"crossref","unstructured":"Bryan Wang Gang Li and Yang Li. 2023. Enabling Conversational Interaction with Mobile UI using Large Language Models. arxiv:https:\/\/arXiv.org\/abs\/2209.08655\u00a0[cs.HC]","DOI":"10.1145\/3544548.3580895"},{"key":"e_1_3_3_2_78_2","doi-asserted-by":"publisher","unstructured":"Bryan Wang Gang Li and Yang Li. 2023. Enabling Conversational Interaction with Mobile UI using Large Language Models(CHI \u201923). Association for Computing Machinery New York NY USA. 10.1145\/3544548.3580895","DOI":"10.1145\/3544548.3580895"},{"key":"e_1_3_3_2_79_2","unstructured":"Hao Wen Yuanchun Li Guohong Liu Shanhui Zhao Tao Yu Toby Jia-Jun Li Shiqi Jiang Yunhao Liu Yaqin Zhang and Yunxin Liu. 2023. Empowering llm to use smartphone for intelligent task automation. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2308.15272 (2023)."},{"key":"e_1_3_3_2_80_2","doi-asserted-by":"crossref","unstructured":"Hao Wen Yuanchun Li Guohong Liu Shanhui Zhao Tao Yu Toby Jia-Jun Li Shiqi Jiang Yunhao Liu Yaqin Zhang and Yunxin Liu. 2024. AutoDroid: LLM-powered Task Automation in Android. arxiv:https:\/\/arXiv.org\/abs\/2308.15272","DOI":"10.1145\/3636534.3649379"},{"key":"e_1_3_3_2_81_2","unstructured":"Hao Wen Hongming Wang Jiaxuan Liu and Yuanchun Li. 2024. DroidBot-GPT: GPT-powered UI Automation for Android. arxiv:https:\/\/arXiv.org\/abs\/2304.07061\u00a0[cs.SE]"},{"key":"e_1_3_3_2_82_2","doi-asserted-by":"publisher","unstructured":"Ian\u00a0H. Witten Radford\u00a0M. Neal and John\u00a0G. Cleary. 1987. Arithmetic coding for data compression. Commun. ACM 30 6 (jun 1987) 520\u2013540. 10.1145\/214762.214771","DOI":"10.1145\/214762.214771"},{"key":"e_1_3_3_2_83_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"e_1_3_3_2_84_2","unstructured":"Shengqiong Wu Hao Fei Leigang Qu Wei Ji and Tat-Seng Chua. 2023. NExT-GPT: Any-to-Any Multimodal LLM. arxiv:https:\/\/arXiv.org\/abs\/2309.05519\u00a0[cs.AI]"},{"key":"e_1_3_3_2_85_2","unstructured":"Guangxuan Xiao Ji Lin Mickael Seznec Hao Wu Julien Demouth and Song Han. 2023. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. arxiv:https:\/\/arXiv.org\/abs\/2211.10438\u00a0[cs.CL]"},{"key":"e_1_3_3_2_86_2","unstructured":"Guangxuan Xiao Yuandong Tian Beidi Chen Song Han and Mike Lewis. 2023. Efficient Streaming Language Models with Attention Sinks. arXiv (2023)."},{"key":"e_1_3_3_2_87_2","unstructured":"Le Xiao and Xiaolin Chen. 2023. Enhancing LLM with Evolutionary Fine Tuning for News Summary Generation. arxiv:https:\/\/arXiv.org\/abs\/2307.02839\u00a0[cs.CL]"},{"key":"e_1_3_3_2_88_2","unstructured":"Daliang Xu Wangsong Yin Xin Jin Ying Zhang Shiyun Wei Mengwei Xu and Xuanzhe Liu. 2023. LLMCad: Fast and Scalable On-device Large Language Model Inference. arxiv:https:\/\/arXiv.org\/abs\/2309.04255\u00a0[cs.NI]"},{"key":"e_1_3_3_2_89_2","doi-asserted-by":"publisher","unstructured":"Daliang Xu Wangsong Yin Hao Zhang Xin Jin Ying Zhang Shiyun Wei Mengwei Xu and Xuanzhe Liu. 2025. EdgeLLM: Fast On-Device LLM Inference With Speculative Decoding. IEEE Transactions on Mobile Computing 24 4 (2025) 3256\u20133273. 10.1109\/TMC.2024.3513457","DOI":"10.1109\/TMC.2024.3513457"},{"key":"e_1_3_3_2_90_2","unstructured":"Huatao Xu Liying Han Qirui Yang Mo Li and Mani Srivastava. 2024. Penetrative AI: Making LLMs Comprehend the Physical World. arxiv:https:\/\/arXiv.org\/abs\/2310.09605\u00a0[cs.AI]"},{"key":"e_1_3_3_2_91_2","unstructured":"Mengwei Xu Wangsong Yin Dongqi Cai Rongjie Yi et\u00a0al. 2024. A Survey of Resource-efficient LLM and Multimodal Foundation Models. arxiv:https:\/\/arXiv.org\/abs\/2401.08092\u00a0[cs.LG]"},{"key":"e_1_3_3_2_92_2","unstructured":"Yi Xu Ziming Mao Xiangxi Mo Shu Liu and Ion Stoica. 2024. Pie: Pooling CPU Memory for LLM Inference. arxiv:https:\/\/arXiv.org\/abs\/2411.09317\u00a0[cs.LG] https:\/\/arxiv.org\/abs\/2411.09317"},{"key":"e_1_3_3_2_93_2","doi-asserted-by":"publisher","DOI":"10.1145\/2307636.2307648"},{"key":"e_1_3_3_2_94_2","doi-asserted-by":"crossref","unstructured":"Bufang Yang Lixing He Neiwen Ling Zhenyu Yan Guoliang Xing Xian Shuai Xiaozhe Ren and Xin Jiang. 2023. EdgeFM: Leveraging Foundation Model for Open-set Learning on the Edge. arxiv:https:\/\/arXiv.org\/abs\/2311.10986\u00a0[cs.LG]","DOI":"10.1145\/3625687.3625793"},{"key":"e_1_3_3_2_95_2","unstructured":"Rongjie Yi Liwei Guo Shiyun Wei Ao Zhou Shangguang Wang and Mengwei Xu. 2023. EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models. arxiv:https:\/\/arXiv.org\/abs\/2308.14352\u00a0[cs.LG]"},{"key":"e_1_3_3_2_96_2","first-page":"521","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Yu Gyeong-In","year":"2022","unstructured":"Gyeong-In Yu, Joo\u00a0Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 521\u2013538. https:\/\/www.usenix.org\/conference\/osdi22\/presentation\/yu"},{"key":"e_1_3_3_2_97_2","unstructured":"Jinliang Yuan Chen Yang Dongqi Cai Shihe Wang Xin Yuan Zeling Zhang Xiang Li Dingge Zhang Hanzi Mei Xianqing Jia Shangguang Wang and Mengwei Xu. 2023. Rethinking Mobile AI Ecosystem in the LLM Era. arxiv:https:\/\/arXiv.org\/abs\/2308.14363\u00a0[cs.AI]"},{"key":"e_1_3_3_2_98_2","unstructured":"Manzil Zaheer Guru Guruganesh Avinava Dubey Joshua Ainslie Chris Alberti Santiago Ontanon Philip Pham Anirudh Ravula Qifan Wang Li Yang and Amr Ahmed. 2021. Big Bird: Transformers for Longer Sequences. arxiv:https:\/\/arXiv.org\/abs\/2007.14062\u00a0[cs.LG]"},{"key":"e_1_3_3_2_99_2","unstructured":"Susan Zhang Stephen Roller Naman Goyal Mikel Artetxe Moya Chen Shuohui Chen Christopher Dewan Mona Diab Xian Li Xi\u00a0Victoria Lin Todor Mihaylov Myle Ott Sam Shleifer Kurt Shuster Daniel Simig Punit\u00a0Singh Koura Anjali Sridhar Tianlu Wang and Luke Zettlemoyer. 2022. OPT: Open Pre-trained Transformer Language Models. arxiv:https:\/\/arXiv.org\/abs\/2205.01068\u00a0[cs.CL]"},{"key":"e_1_3_3_2_100_2","volume-title":"NIPS","author":"Zhang Xiang","year":"2015","unstructured":"Xiang Zhang, Junbo\u00a0Jake Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. In NIPS."},{"key":"e_1_3_3_2_101_2","unstructured":"Zhenyu Zhang Ying Sheng Tianyi Zhou Tianlong Chen Lianmin Zheng Ruisi Cai Zhao Song Yuandong Tian Christopher R\u00e9 Clark Barrett Zhangyang Wang and Beidi Chen. 2023. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. arxiv:https:\/\/arXiv.org\/abs\/2306.14048\u00a0[cs.LG]"},{"key":"e_1_3_3_2_102_2","unstructured":"Yilong Zhao Chien-Yu Lin Kan Zhu Zihao Ye Lequn Chen Size Zheng Luis Ceze Arvind Krishnamurthy Tianqi Chen and Baris Kasikci. 2023. Atom: Low-bit Quantization for Efficient and Accurate LLM Serving. arxiv:https:\/\/arXiv.org\/abs\/2310.19102\u00a0[cs.LG]"},{"key":"e_1_3_3_2_103_2","unstructured":"Lianmin Zheng Liangsheng Yin Zhiqiang Xie Jeff Huang Chuyue Sun Cody\u00a0Hao Yu Shiyi Cao Christos Kozyrakis Ion Stoica Joseph\u00a0E. Gonzalez Clark Barrett and Ying Sheng. 2023. Efficiently Programming Large Language Models using SGLang. arxiv:https:\/\/arXiv.org\/abs\/2312.07104\u00a0[cs.AI]"},{"key":"e_1_3_3_2_104_2","doi-asserted-by":"publisher","DOI":"10.1145\/2627369.2627647"},{"key":"e_1_3_3_2_105_2","doi-asserted-by":"publisher","DOI":"10.1145\/3061639.3062317"}],"event":{"name":"SenSys '26: ACM\/IEEE International Conference on Embedded Artificial Intelligence and Sensing Systems","location":"Saint Malo France","acronym":"SenSys '26","sponsor":["SIGBED ACM Special Interest Group on Embedded Systems","SIGMOBILE ACM Special Interest Group on Mobility of Systems, Users, Data and Computing","IEEE CS"]},"container-title":["Proceedings of the 2026 ACM\/IEEE International Conference on Embedded Artificial Intelligence and Sensing Systems"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3774906.3800479","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,17]],"date-time":"2026-05-17T08:35:28Z","timestamp":1779006928000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3774906.3800479"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,5,10]]},"references-count":104,"alternative-id":["10.1145\/3774906.3800479","10.1145\/3774906"],"URL":"https:\/\/doi.org\/10.1145\/3774906.3800479","relation":{},"subject":[],"published":{"date-parts":[[2026,5,10]]},"assertion":[{"value":"2026-05-10","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}