{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,3]],"date-time":"2026-07-03T09:02:38Z","timestamp":1783069358653,"version":"3.54.6"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"4","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Des. Autom. Electron. Syst."],"published-print":{"date-parts":[[2025,7,31]]},"abstract":"<jats:p>Over the past few years, large language models (LLMs) have demonstrated remarkable performance and versatility across a variety of complex tasks. However, their deployment has been challenged by their substantial model size and computational requirements. Pruning is a effective approach to make the model parameters sparse, thereby acquire inference acceleration. While not everyone requires training or fine-tuning large models, the diverse range of applications necessitates the deployment of LLMs on different devices. Model pruning and compression have emerged as areas of deep research interest to address these challenges. In consideration of versatility and practicality, we have designed a hardware-aware pruning process for general-purpose hardware\/edge devices to enable efficient deployment and inference of LLMs. Instead of considering sparse ratio alone, we are motivated to design a pruning framework that incorporates genuine inference speed-up sensitivity from each pruning structure. Moreover, our framework breaks the layer-by-layer pruning setting and fuse several layers into one pruning stage to allow cross-layer optimization. Apart from that, we hold pragmatism by conducting compilation optimization during pruning. This step is critical because most sparsity patterns barely show distinct speed acceleration with corresponding dataflow and memory optimization. Our process operates within a post-training framework, obviating the need for additional training and thereby reducing resource requirements, while ensuring diverse inference speed and accuracy requirements on hardware.<\/jats:p>","DOI":"10.1145\/3744244","type":"journal-article","created":{"date-parts":[[2025,6,14]],"date-time":"2025-06-14T06:49:22Z","timestamp":1749883762000},"page":"1-18","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["HAPE: Hardware-Aware LLM Pruning For Efficient On-Device Inference Optimization"],"prefix":"10.1145","volume":"30","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9501-9254","authenticated-orcid":false,"given":"Wenqian","family":"Zhao","sequence":"first","affiliation":[{"name":"The Chinese University of Hong Kong","place":["Hong Kong, Hong Kong"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-6820-7064","authenticated-orcid":false,"given":"Lancheng","family":"Zou","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong","place":["Hong Kong, Hong Kong"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-8179-0996","authenticated-orcid":false,"given":"Zixiao","family":"Wang","sequence":"additional","affiliation":[{"name":"Computer Science and Engineering, The Chinese University of Hong Kong","place":["Hong Kong, Hong Kong"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7994-6290","authenticated-orcid":false,"given":"Xufeng","family":"Yao","sequence":"additional","affiliation":[{"name":"Computer Science and Engineering, CUHK","place":["Hong Kong, Hong Kong"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6406-4810","authenticated-orcid":false,"given":"Bei","family":"Yu","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong Department of Computer Science and Engineering","place":["Hong Kong, Hong Kong"]}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2025,7,9]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11023-020-09548-1"},{"key":"e_1_3_1_3_2","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic Sergey Edunov and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288."},{"key":"e_1_3_1_4_2","volume-title":"Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL\u201919)","author":"Kenton Jacob Devlin Ming-Wei Chang","year":"2019","unstructured":"Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL\u201919)."},{"key":"e_1_3_1_5_2","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Annual Conference on Neural Information Processing Systems 30, 1 (2017), 5998\u20136008.","journal-title":"Annual Conference on Neural Information Processing Systems"},{"key":"e_1_3_1_6_2","volume-title":"Proceedings of the ACM Symposium on Operating Systems Principles (SOSP\u201923)","author":"Kwon Woosuk","year":"2023","unstructured":"Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP\u201923)."},{"key":"e_1_3_1_7_2","unstructured":"Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691. 2023."},{"key":"e_1_3_1_8_2","first-page":"4163","volume-title":"Findings of the Association for Computational Linguistics: EMNLP","author":"Kurtic Eldar","year":"2022","unstructured":"Eldar Kurtic, Daniel Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Benjamin Fineran, Michael Goin, and Dan Alistarh. 2022. The optimal BERT surgeon: Scalable and accurate second-order pruning for large language models. In Findings of the Association for Computational Linguistics: EMNLP. 4163\u20134181."},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD51958.2021.9643472"},{"key":"e_1_3_1_10_2","doi-asserted-by":"crossref","first-page":"4163","DOI":"10.18653\/v1\/2020.findings-emnlp.372","volume-title":"Findings of the Association for Computational Linguistics: EMNLP","author":"Jiao Xiaoqi","year":"2020","unstructured":"Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP. 4163\u20134174."},{"key":"e_1_3_1_11_2","article-title":"Progressively knowledge distillation via re-parameterizing diffusion reverse process","author":"Yao Xufeng","year":"2024","unstructured":"Xufeng Yao, Fanbin Lu, Yuechen Zhang, Xinyun Zhang, Wenqian Zhao, and Bei Yu. 2024. Progressively knowledge distillation via re-parameterizing diffusion reverse process. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI\u201924).","journal-title":"Proceedings of the AAAI Conference on Artificial Intelligence (AAAI\u201924)."},{"key":"e_1_3_1_12_2","first-page":"4334","volume-title":"Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL\u201921)","author":"Bai Haoli","year":"2021","unstructured":"Haoli Bai, Wei Zhang, Lu Hou, Lifeng Shang, Jin Jin, Xin Jiang, Qun Liu, Michael Lyu, and Irwin King. 2021. BinaryBERT: Pushing the limit of BERT quantization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL\u201921). 4334\u20134348."},{"key":"e_1_3_1_13_2","article-title":"Quantization via distillation and contrastive learning","author":"Pei Zehua","year":"2023","unstructured":"Zehua Pei, Xufeng Yao, Wenqian Zhao, and Bei Yu. 2023. Quantization via distillation and contrastive learning. IEEE Transactions on Neural Networks and Learning Systems 35, 1 (2023), 17164\u201317176.","journal-title":"IEEE Transactions on Neural Networks and Learning Systems"},{"key":"e_1_3_1_14_2","first-page":"5798","volume-title":"Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL\u201921)","author":"Ye Deming","year":"2021","unstructured":"Deming Ye, Yankai Lin, Yufei Huang, and Maosong Sun. 2021. TR-BERT: Dynamic token reduction for accelerating BERT inference. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL\u201921). 5798\u20135809."},{"key":"e_1_3_1_15_2","first-page":"2246","volume-title":"Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL\u201920)","author":"Xin Ji","year":"2020","unstructured":"Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. 2020. DeeBERT: Dynamic early exiting for accelerating BERT inference. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL\u201920). 2246\u20132251."},{"key":"e_1_3_1_16_2","unstructured":"Tim Dettmers Mike Lewis Younes Belkada and Luke Zettlemoyer. 2022. Llm. int8 (): 8-bit matrix multiplication for transformers at scale. CoRR abs\/2208.07339. 2022."},{"key":"e_1_3_1_17_2","article-title":"Learning both weights and connections for efficient neural network","volume":"28","author":"Han Song","year":"2015","unstructured":"Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. Annual Conference on Neural Information Processing Systems 28, 1 (2015), 1135\u20131143.","journal-title":"Annual Conference on Neural Information Processing Systems"},{"key":"e_1_3_1_18_2","first-page":"9782","article-title":"Dynabert: Dynamic bert with adaptive width and depth","volume":"33","author":"Hou Lu","year":"2020","unstructured":"Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. 2020. Dynabert: Dynamic bert with adaptive width and depth. Annual Conference on Neural Information Processing Systems 33, 1 (2020), 9782\u20139793.","journal-title":"Annual Conference on Neural Information Processing Systems"},{"key":"e_1_3_1_19_2","first-page":"24101","article-title":"A fast post-training pruning framework for transformers","volume":"35","author":"Kwon Woosuk","year":"2022","unstructured":"Woosuk Kwon, Sehoon Kim, Michael W. Mahoney, Joseph Hassoun, Kurt Keutzer, and Amir Gholami. 2022. A fast post-training pruning framework for transformers. Annual Conference on Neural Information Processing Systems 35, 1 (2022), 24101\u201324116.","journal-title":"Annual Conference on Neural Information Processing Systems"},{"key":"e_1_3_1_20_2","unstructured":"Elias Frantar Saleh Ashkboos Torsten Hoefler and Dan Alistarh. 2022. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323."},{"key":"e_1_3_1_21_2","first-page":"1909","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201920)","author":"Kwon Se Jung","year":"2020","unstructured":"Se Jung Kwon, Dongsoo Lee, Byeongwook Kim, Parichay Kapoor, Baeseong Park, and Gu-Yeon Wei. 2020. Structured compression by weight encryption for unstructured pruning and quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201920). 1909\u20131918."},{"key":"e_1_3_1_22_2","first-page":"1","volume-title":"Proceedings of the ACM\/IEEE Design Automation Conference (DAC\u201920)","author":"Chen Xizi","year":"2020","unstructured":"Xizi Chen, Jingyang Zhu, Jingbo Jiang, and Chi-Ying Tsui. 2020. Tight compression: Compressing CNN model tightly through unstructured pruning and simulated annealing based permutation. In Proceedings of the ACM\/IEEE Design Automation Conference (DAC\u201920). IEEE, 1\u20136."},{"key":"e_1_3_1_23_2","first-page":"243","volume-title":"Proceedings of the ACM Great Lakes Symposium on VLSI (GLSVLSI\u201922)","author":"Yu Tianyang","year":"2022","unstructured":"Tianyang Yu, Bi Wu, Ke Chen, Chenggang Yan, and Weiqiang Liu. 2022. Data stream oriented fine-grained sparse CNN accelerator with efficient unstructured pruning strategy. In Proceedings of the ACM Great Lakes Symposium on VLSI (GLSVLSI\u201922). 243\u2013248."},{"key":"e_1_3_1_24_2","unstructured":"Mingbao Lin Rongrong Ji Yuxin Zhang Baochang Zhang Yongjian Wu and Yonghong Tian. 2020. Channel pruning via automatic structure search. arXiv preprint arXiv:2001.08565."},{"key":"e_1_3_1_25_2","first-page":"14913","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201921)","author":"Wang Zi","year":"2021","unstructured":"Zi Wang, Chengcheng Li, and Xiangyang Wang. 2021. Convolutional neural network pruning with structural redundancy reduction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201921). 14913\u201314922."},{"key":"e_1_3_1_26_2","unstructured":"Aojun Zhou Yukun Ma Junnan Zhu Jianbo Liu Zhijie Zhang Kun Yuan Wenxiu Sun and Hongsheng Li. 2021. Learning n: m fine-grained structured sparse neural networks from scratch. arXiv preprint arXiv:2102.04010."},{"issue":"4","key":"e_1_3_1_27_2","first-page":"3999","article-title":"1xn pattern for pruning convolutional neural networks","volume":"45","author":"Lin Mingbao","year":"2022","unstructured":"Mingbao Lin, Yuxin Zhang, Yuchao Li, Bohong Chen, Fei Chao, Mengdi Wang, Shen Li, Yonghong Tian, and Rongrong Ji. 2022. 1xn pattern for pruning convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 4 (2022), 3999\u20134008.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_28_2","unstructured":"Yuezhou Hu Kang Zhao Weiyu Huang Jianfei Chen and Jun Zhu. 2024. Accelerating transformer pre-training with 2: 4 sparsity. arXiv preprint arXiv:2404.01847."},{"key":"e_1_3_1_29_2","first-page":"7197","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Nagel Markus","year":"2020","unstructured":"Markus Nagel, Rana Ali Amjad, Mart Van Baalen, Christos Louizos, and Tijmen Blankevoort. 2020. Up or down? Adaptive rounding for post-training quantization. In Proceedings of the International Conference on Machine Learning. PMLR, 7197\u20137206."},{"key":"e_1_3_1_30_2","unstructured":"Xinyin Ma Gongfan Fang and Xinchao Wang. 2023. Llm-pruner: On the structural pruning of large language models. Advances in Neural Information Processing Systems 36 (2023) 21702\u201321720."},{"key":"e_1_3_1_31_2","doi-asserted-by":"crossref","unstructured":"Philippe Tillet Hsiang-Tsung Kung and David Cox. 2019. Triton: An intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. 10\u201319.","DOI":"10.1145\/3315508.3329973"},{"key":"e_1_3_1_32_2","first-page":"2924","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Clark Christopher","year":"2019","unstructured":"Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes\/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2924\u20132936."},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6239"},{"key":"e_1_3_1_34_2","doi-asserted-by":"crossref","first-page":"4791","DOI":"10.18653\/v1\/P19-1472","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Zellers Rowan","year":"2019","unstructured":"Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 4791\u20134800."},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6399"},{"key":"e_1_3_1_36_2","unstructured":"Peter Clark Isaac Cowhey Oren Etzioni Tushar Khot Ashish Sabharwal Carissa Schoenick and Oyvind Tafjord. 2018. Think you have solved question answering? try arc the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457."},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1260"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","unstructured":"Leo Gao Jonathan Tow Stella Biderman Sid Black Anthony DiPofi Charles Foster Laurence Golding Jeffrey Hsu Kyle McDonell Niklas Muennighoff et al. 2021. A Framework for Few-shot Language Model Evaluation. (Sept.2021). DOI:10.5281\/zenodo.5371628","DOI":"10.5281\/zenodo.5371628"},{"key":"e_1_3_1_39_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Merity Stephen","year":"2022","unstructured":"Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2022. Pointer sentinel mixture models. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.5555\/972470.972475"},{"key":"e_1_3_1_41_2","unstructured":"Meta. 2024. Meta Llama 3. (2024). Retrieved April 5 2024 from https:\/\/github.com\/meta-llama\/llama3"}],"container-title":["ACM Transactions on Design Automation of Electronic Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3744244","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,9]],"date-time":"2025-07-09T12:21:10Z","timestamp":1752063670000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3744244"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,9]]},"references-count":40,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2025,7,31]]}},"alternative-id":["10.1145\/3744244"],"URL":"https:\/\/doi.org\/10.1145\/3744244","relation":{},"ISSN":["1084-4309","1557-7309"],"issn-type":[{"value":"1084-4309","type":"print"},{"value":"1557-7309","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,9]]},"assertion":[{"value":"2024-07-08","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-04-09","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-07-09","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}