{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,3]],"date-time":"2026-04-03T03:42:53Z","timestamp":1775187773811,"version":"3.50.1"},"reference-count":121,"publisher":"Springer Science and Business Media LLC","issue":"9","license":[{"start":{"date-parts":[[2025,8,1]],"date-time":"2025-08-01T00:00:00Z","timestamp":1754006400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,8,1]],"date-time":"2025-08-01T00:00:00Z","timestamp":1754006400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Complex Intell. Syst."],"published-print":{"date-parts":[[2025,9]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>The rapid advancement of large language models (LLMs) has driven significant progress in natural language processing (NLP) and related domains. However, their deployment remains constrained by challenges related to computation, memory, and energy efficiency\u2014particularly in real-world applications. This work presents a comprehensive review of state-of-the-art compression techniques, including pruning, quantization, knowledge distillation, and neural architecture search (NAS), which collectively aim to reduce model size, enhance inference speed, and lower energy consumption while maintaining performance. A robust evaluation framework is introduced, incorporating traditional metrics, such as accuracy and perplexity (PPL), alongside advanced criteria including latency-accuracy trade-offs, parameter efficiency, multi-objective Pareto optimization, and fairness considerations. This study further highlights trends and challenges, such as fairness-aware compression, robustness against adversarial attacks, and hardware-specific optimizations. Additionally, NAS-driven strategies are explored as a means to design task-aware, hardware-adaptive architectures that enhance LLM compression efficiency. Hybrid and adaptive methods are also examined to dynamically optimize computational efficiency across diverse deployment scenarios. This work not only synthesizes recent advancements and identifies open problems but also proposes a structured research roadmap to guide the development of efficient, scalable, and equitable LLMs. By bridging the gap between compression research and real-world deployment, this study offers actionable insights for optimizing LLMs across a range of environments, including mobile devices and large-scale cloud infrastructures.<\/jats:p>","DOI":"10.1007\/s40747-025-02019-z","type":"journal-article","created":{"date-parts":[[2025,8,1]],"date-time":"2025-08-01T07:03:57Z","timestamp":1754031837000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["A review of state-of-the-art techniques for large language model compression"],"prefix":"10.1007","volume":"11","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6390-9340","authenticated-orcid":false,"given":"Pierre V.","family":"Dantas","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6235-4272","authenticated-orcid":false,"given":"Lucas C.","family":"Cordeiro","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3095-0042","authenticated-orcid":false,"given":"Waldir S. 
S.","family":"Junior","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,8,1]]},"reference":[{"key":"2019_CR1","unstructured":"Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R (2020) Albert: a lite BERT for self-supervised learning of language representations. In: International conference on learning representations. https:\/\/openreview.net\/forum?id=H1eA7AEtvS"},{"key":"2019_CR2","doi-asserted-by":"publisher","unstructured":"Lian S, Zhao K, Liu X, Lei X, Yang B, Zhang W, Wang K, Liu Z (2024) What is the best model? Application-driven evaluation for large language models. Springer Nature Singapore, pp 67\u201379. https:\/\/doi.org\/10.1007\/978-981-97-9437-9_6","DOI":"10.1007\/978-981-97-9437-9_6"},{"issue":"27","key":"2019_CR3","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.2311878121","volume":"121","author":"Y Bahri","year":"2024","unstructured":"Bahri Y, Dyer E, Kaplan J, Lee J, Sharma U (2024) Explaining neural scaling laws. Proc Natl Acad Sci 121(27):e2311878121. https:\/\/doi.org\/10.1073\/pnas.2311878121","journal-title":"Proc Natl Acad Sci"},{"key":"2019_CR4","doi-asserted-by":"publisher","unstructured":"Zhou Z, Ning X, Hong K et al (2024) A survey on efficient inference for large language models. https:\/\/doi.org\/10.48550\/arxiv.2404.14294","DOI":"10.48550\/arxiv.2404.14294"},{"key":"2019_CR5","doi-asserted-by":"publisher","unstructured":"Tan M, Chen B, Pang R, Vasudevan V, Sandler M, Howard A, Le Quoc V (2019) MnasNet: platform-aware neural architecture search for mobile. In 2019 IEEE\/CVF conference on computer vision and pattern recognition (CVPR). IEEE, pp 2815\u20132823. https:\/\/doi.org\/10.48550\/arxiv.1807.11626","DOI":"10.48550\/arxiv.1807.11626"},{"key":"2019_CR6","doi-asserted-by":"publisher","unstructured":"Lang J, Guo Z, Huang S (2024) A comprehensive study on quantization techniques for large language models. In: 2024 4th international conference on artificial intelligence, robotics, and communication (ICAIRC). IEEE, pp 224\u2013231. https:\/\/doi.org\/10.1109\/icairc64177.2024.10899941","DOI":"10.1109\/icairc64177.2024.10899941"},{"key":"2019_CR7","doi-asserted-by":"publisher","unstructured":"Howard A, Sandler M, Chen B, Wang W, Chen L-C, Tan M, Chu G, Vasudevan V, Zhu Y, Pang R, Adam H, Le Quoc V (2019) Searching for MobileNetV3. In 2019 IEEE\/CVF international conference on computer vision (ICCV). IEEE, pp 1314\u20131324. https:\/\/doi.org\/10.1109\/iccv.2019.00140","DOI":"10.1109\/iccv.2019.00140"},{"key":"2019_CR8","doi-asserted-by":"publisher","unstructured":"Wan Z, Wang X, Liu C, Alam S et al (2023) Efficient large language models: a survey. https:\/\/doi.org\/10.48550\/arxiv.2312.03863","DOI":"10.48550\/arxiv.2312.03863"},{"key":"2019_CR9","doi-asserted-by":"publisher","unstructured":"Ukil Arijit, Sahu Ishan, Biswas Mridul, Pal Arpan, Majumdar Angshul (April 2024) Structured Lottery Ticket Hypothesis for Effective Deep Neural Network Model Size Reduction. In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), page 330\u2013334. IEEE. https:\/\/doi.org\/10.1109\/icasspw62465.2024.10625908","DOI":"10.1109\/icasspw62465.2024.10625908"},{"key":"2019_CR10","doi-asserted-by":"publisher","unstructured":"Gholami A, Kim S, Dong Z, Yao Z, Mahoney MW, Keutzer K (2022) A survey of quantization methods for efficient neural network inference. Chapman and Hall\/CRC, pp 291\u2013326. 
https:\/\/doi.org\/10.1201\/9781003162810-13","DOI":"10.1201\/9781003162810-13"},{"key":"2019_CR11","doi-asserted-by":"publisher","unstructured":"Wu C, Wu F, Huang Y (2021) One teacher is enough? Pre-trained language model distillation from multiple teachers. In: Findings of the association for computational linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/v1\/2021.findings-acl.387","DOI":"10.18653\/v1\/2021.findings-acl.387"},{"key":"2019_CR12","doi-asserted-by":"publisher","DOI":"10.1016\/j.sysarc.2023.102990","volume":"144","author":"KT Chitty-Venkata","year":"2023","unstructured":"Chitty-Venkata KT, Mittal S, Emani M, Vishwanath V, Somani AK (2023) A survey of techniques for optimizing transformer inference. J Syst Architect 144:102990. https:\/\/doi.org\/10.1016\/j.sysarc.2023.102990","journal-title":"J Syst Architect"},{"issue":"4","key":"2019_CR13","doi-asserted-by":"publisher","DOI":"10.1002\/widm.1507","volume":"13","author":"R Verdecchia","year":"2023","unstructured":"Verdecchia R, Sallou J, Cruz L (2023) A systematic review of Green AI. WIREs Data Min Knowl Discov 13(4):e1507. https:\/\/doi.org\/10.1002\/widm.1507","journal-title":"WIREs Data Min Knowl Discov"},{"issue":"1","key":"2019_CR14","doi-asserted-by":"publisher","first-page":"127","DOI":"10.1109\/jssc.2016.2616357","volume":"52","author":"Y-H Chen","year":"2017","unstructured":"Chen Y-H, Krishna T, Emer JS, Sze V (2017) Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J Solid-State Circuits 52(1):127\u2013138. https:\/\/doi.org\/10.1109\/jssc.2016.2616357","journal-title":"IEEE J Solid-State Circuits"},{"issue":"2","key":"2019_CR15","doi-asserted-by":"publisher","first-page":"134","DOI":"10.11627\/jksie.2024.47.2.134","volume":"47","author":"S-G Kim","year":"2024","unstructured":"Kim S-G, Noh K, Hahn H, Choi BK (2024) Instruction fine-tuning and LoRA combined approach for optimizing large language models. J Soc Korea Ind Syst Eng 47(2):134\u2013146. https:\/\/doi.org\/10.11627\/jksie.2024.47.2.134","journal-title":"J Soc Korea Ind Syst Eng"},{"key":"2019_CR16","doi-asserted-by":"publisher","unstructured":"LeCun Y, Denker J, Solla S (1989) Optimal brain damage. In: Touretzky D (ed) Advances in neural information processing systems, vol 2. Morgan-Kaufmann. https:\/\/doi.org\/10.5555\/109230.109298","DOI":"10.5555\/109230.109298"},{"key":"2019_CR17","doi-asserted-by":"publisher","unstructured":"Hassibi B, Stork DG, Wolff GJ (1993) Optimal brain surgeon and general network pruning. In: IEEE international conference on neural networks. IEEE, pp 293\u2013299. https:\/\/doi.org\/10.1109\/icnn.1993.298572","DOI":"10.1109\/icnn.1993.298572"},{"key":"2019_CR18","doi-asserted-by":"publisher","unstructured":"Han S, Pool J, Tran J, Dally WJ (2015) Learning both weights and connections for efficient neural networks. In: Proceedings of the 29th international conference on neural information processing systems\u2014volume 1, NIPS\u201915. MIT Press, Cambridge, pp 1135\u20131143. https:\/\/doi.org\/10.5555\/2969239.2969366","DOI":"10.5555\/2969239.2969366"},{"key":"2019_CR19","doi-asserted-by":"publisher","unstructured":"Lin J, Rao Y, Lu J, Zhou J (2017) Runtime neural pruning. In: Proceedings of the 31st international conference on neural information processing systems, NIPS\u201917. Curran Associates Inc., Red Hook, pp 2178\u20132188. 
https:\/\/doi.org\/10.5555\/3294771.3294979","DOI":"10.5555\/3294771.3294979"},{"issue":"1","key":"2019_CR20","doi-asserted-by":"publisher","first-page":"61","DOI":"10.1109\/tnn.2008.2005605","volume":"20","author":"F Scarselli","year":"2009","unstructured":"Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G (2009) The graph neural network model. IEEE Trans Neural Netw 20(1):61\u201380. https:\/\/doi.org\/10.1109\/tnn.2008.2005605","journal-title":"IEEE Trans Neural Netw"},{"key":"2019_CR21","doi-asserted-by":"publisher","unstructured":"Liu J (2022) Distilling graph neural networks. Springer International Publishing, pp 131\u2013151. https:\/\/doi.org\/10.1007\/978-3-031-16174-2_7","DOI":"10.1007\/978-3-031-16174-2_7"},{"key":"2019_CR22","doi-asserted-by":"publisher","unstructured":"Guo Z, Zhang C, Fan Y, Tian Y, Zhang C, Chawla NV (2023) Boosting graph neural networks via adaptive knowledge distillation. In: Proceedings of the AAAI conference on artificial intelligence, vol 37, no 6, pp 7793\u20137801. https:\/\/doi.org\/10.1609\/aaai.v37i6.25944","DOI":"10.1609\/aaai.v37i6.25944"},{"issue":"3","key":"2019_CR23","doi-asserted-by":"publisher","first-page":"26","DOI":"10.1007\/s13735-024-00334-8","volume":"13","author":"P Kumar","year":"2024","unstructured":"Kumar P (2024) Adversarial attacks and defenses for large language models (LLMs): methods, frameworks & challenges. Int J Multimed Inf Retr 13(3):26. https:\/\/doi.org\/10.1007\/s13735-024-00334-8","journal-title":"Int J Multimed Inf Retr"},{"key":"2019_CR24","doi-asserted-by":"publisher","unstructured":"Taveekitworachai P, Suntichaikul P, Nukoolkit C, Thawonmas R (2024) Speed up! Cost-effective large language model for ADAS via knowledge distillation. In: 2024 IEEE intelligent vehicles symposium (IV). IEEE, pp 1933\u20131938. https:\/\/doi.org\/10.1109\/iv55156.2024.10588799","DOI":"10.1109\/iv55156.2024.10588799"},{"issue":"6","key":"2019_CR25","doi-asserted-by":"publisher","first-page":"1446","DOI":"10.1109\/72.471364","volume":"6","author":"G Dundar","year":"1995","unstructured":"Dundar G, Rose K (1995) The effects of quantization on multilayer neural networks. IEEE Trans Neural Netw 6(6):1446\u20131451. https:\/\/doi.org\/10.1109\/72.471364","journal-title":"IEEE Trans Neural Netw"},{"key":"2019_CR26","doi-asserted-by":"publisher","unstructured":"Withagen H (1994) Reducing the effect of quantization by weight scaling. In: Proceedings of 1994 IEEE international conference on neural networks (ICNN-94), vol 4, ICNN-94. IEEE, pp 2128\u20132130. https:\/\/doi.org\/10.1109\/icnn.1994.374544","DOI":"10.1109\/icnn.1994.374544"},{"issue":"1","key":"2019_CR27","doi-asserted-by":"publisher","first-page":"269","DOI":"10.1145\/2654822.2541967","volume":"42","author":"T Chen","year":"2014","unstructured":"Chen T, Du Z, Sun N, Wang J, Wu C, Chen Y, Temam O (2014) DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM SIGARCH Comput Archit News 42(1):269\u2013284. https:\/\/doi.org\/10.1145\/2654822.2541967","journal-title":"ACM SIGARCH Comput Archit News"},{"key":"2019_CR28","doi-asserted-by":"publisher","DOI":"10.48550\/arxiv.1510.00149","author":"S Han","year":"2015","unstructured":"Han S, Mao H, Dally WJ (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. 
https:\/\/doi.org\/10.48550\/arxiv.1510.00149"},{"key":"2019_CR29","doi-asserted-by":"publisher","unstructured":"Rastegari M, Ordonez V, Redmon J, Farhadi A (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. Springer International Publishing, pp 525\u2013542. https:\/\/doi.org\/10.1007\/978-3-319-46493-0_32","DOI":"10.1007\/978-3-319-46493-0_32"},{"key":"2019_CR30","doi-asserted-by":"publisher","unstructured":"Jacob B, Kligys S, Chen B, Zhu M, Tang M, Howard A, Adam H, Kalenichenko D (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In: 2018 IEEE\/CVF conference on computer vision and pattern recognition. IEEE, pp 2704\u20132713. https:\/\/doi.org\/10.1109\/cvpr.2018.00286","DOI":"10.1109\/cvpr.2018.00286"},{"key":"2019_CR31","doi-asserted-by":"publisher","unstructured":"Courbariaux M, Hubara I, Soudry D, El-Yaniv R, Bengio Y (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. https:\/\/doi.org\/10.48550\/arxiv.1602.02830","DOI":"10.48550\/arxiv.1602.02830"},{"key":"2019_CR32","doi-asserted-by":"publisher","unstructured":"Ding J, Wu J, Wu H (2017) Three-means ternary quantization. Springer International Publishing, pp 235\u2013245. https:\/\/doi.org\/10.1007\/978-3-319-70096-0_25","DOI":"10.1007\/978-3-319-70096-0_25"},{"issue":"4","key":"2019_CR33","doi-asserted-by":"publisher","first-page":"485","DOI":"10.1109\/jproc.2020.2976475","volume":"108","author":"BL Deng","year":"2020","unstructured":"Deng BL, Li G, Han S, Shi L, Xie Y (2020) Model compression and hardware acceleration for neural networks: a comprehensive survey. Proc IEEE 108(4):485\u2013532. https:\/\/doi.org\/10.1109\/jproc.2020.2976475","journal-title":"Proc IEEE"},{"key":"2019_CR34","doi-asserted-by":"publisher","first-page":"139113","DOI":"10.1109\/access.2024.3465631","volume":"12","author":"U Bibi","year":"2024","unstructured":"Bibi U, Mazhar M, Sabir D, Butt M, Hassan A, Ghazanfar MA, Khan AA, Abdul W (2024) Advances in pruning and quantization for natural language processing. IEEE Access 12:139113\u2013139128. https:\/\/doi.org\/10.1109\/access.2024.3465631","journal-title":"IEEE Access"},{"key":"2019_CR35","doi-asserted-by":"publisher","DOI":"10.1016\/j.csi.2024.103906","volume":"92","author":"\u00c1D Reguero","year":"2025","unstructured":"Reguero \u00c1D, Mart\u00ednez-Fern\u00e1ndez S, Verdecchia R (2025) Energy-efficient neural network training through runtime layer freezing, model quantization, and early stopping. Comput Stand Interfaces 92:103906. https:\/\/doi.org\/10.1016\/j.csi.2024.103906","journal-title":"Comput Stand Interfaces"},{"key":"2019_CR36","doi-asserted-by":"publisher","unstructured":"Hinton G, Vinyals O, Dean J (2015) Distilling the knowledge in a neural network. https:\/\/doi.org\/10.48550\/arxiv.1503.02531","DOI":"10.48550\/arxiv.1503.02531"},{"key":"2019_CR37","doi-asserted-by":"publisher","unstructured":"Xu K, Rui L, Li Y, Gu L (2020) Feature normalized knowledge distillation for image classification. Springer International Publishing, pp 664\u2013680. 
https:\/\/doi.org\/10.1007\/978-3-030-58595-2_40","DOI":"10.1007\/978-3-030-58595-2_40"},{"issue":"2","key":"2019_CR38","doi-asserted-by":"publisher","first-page":"99","DOI":"10.1162\/106365602320169811","volume":"10","author":"KO Stanley","year":"2002","unstructured":"Stanley KO, Miikkulainen R (2002) Evolving neural networks through augmenting topologies. Evol Comput 10(2):99\u2013127. https:\/\/doi.org\/10.1162\/106365602320169811","journal-title":"Evol Comput"},{"key":"2019_CR39","doi-asserted-by":"publisher","unstructured":"Nakai K, Matsubara T, Uehara K (2020) Att-DARTS: differentiable neural architecture search for attention. In: 2020 international joint conference on neural networks (IJCNN). IEEE, pp 1\u20138. https:\/\/doi.org\/10.1109\/ijcnn48605.2020.9207447","DOI":"10.1109\/ijcnn48605.2020.9207447"},{"key":"2019_CR40","doi-asserted-by":"publisher","unstructured":"Kim Y, Li Y, Park H, Venkatesha Y, Yin R, Panda P (2022) Exploring lottery ticket hypothesis in spiking neural networks. Springer Nature Switzerland, pp 102\u2013120. https:\/\/doi.org\/10.1007\/978-3-031-19775-8_7","DOI":"10.1007\/978-3-031-19775-8_7"},{"key":"2019_CR41","doi-asserted-by":"publisher","unstructured":"Wang Y, Qin Y, Liu L, Wei S, Yin S (2021) HPPU: an energy-efficient sparse DNN training processor with hybrid weight pruning. In: 2021 IEEE 3rd international conference on artificial intelligence circuits and systems (AICAS). IEEE, pp 1\u20134. https:\/\/doi.org\/10.1109\/aicas51828.2021.9458410","DOI":"10.1109\/aicas51828.2021.9458410"},{"issue":"6","key":"2019_CR42","doi-asserted-by":"publisher","first-page":"1789","DOI":"10.1007\/s11263-021-01453-z","volume":"129","author":"J Gou","year":"2021","unstructured":"Gou J, Yu B, Maybank SJ, Tao D (2021) Knowledge distillation: a survey. Int J Comput Vis 129(6):1789\u20131819. https:\/\/doi.org\/10.1007\/s11263-021-01453-z","journal-title":"Int J Comput Vis"},{"issue":"120","key":"2019_CR43","first-page":"1","volume":"23","author":"W Fedus","year":"2022","unstructured":"Fedus W, Zoph B, Shazeer N (2022) Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. J Mach Learn Res 23(120):1\u201339 (http:\/\/jmlr.org\/papers\/v23\/21-0998.html)","journal-title":"J Mach Learn Res"},{"key":"2019_CR44","doi-asserted-by":"publisher","unstructured":"Choukroun Y, Kravchik E, Yang F, Kisilev P (2019) Low-bit quantization of neural networks for efficient inference. In: 2019 IEEE\/CVF international conference on computer vision workshop (ICCVW). IEEE. https:\/\/doi.org\/10.1109\/iccvw.2019.00363","DOI":"10.1109\/iccvw.2019.00363"},{"key":"2019_CR45","unstructured":"Lewis P, Perez E, Piktus A, Petroni F et\u00a0al (2020) Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Larochelle H,\u00a0Ranzato M,\u00a0Hadsell R, Balcan MF,\u00a0Lin H (eds) Advances in neural information processing systems, vol 33. Curran Associates, Inc, pp 9459\u20139474. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2020\/file\/6b493230205f780e1bc26945df7481e5-Paper.pdf"},{"key":"2019_CR46","unstructured":"Dao T, Fu D, Ermon S, Rudra A, R\u00e9 C (2022) FlashAttention: fast and memory-efficient exact attention with IO-awareness. In:\u00a0Koyejo S,\u00a0Mohamed S,\u00a0Agarwal A,\u00a0Belgrave D,\u00a0Cho K,\u00a0Oh A (eds) Advances in neural information processing systems, vol 35. Curran Associates, Inc, pp 16344\u201316359. 
https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2022\/file\/67d57c32e20fd0a7a302cb81d36e40d5-Paper-Conference.pdf"},{"key":"2019_CR47","doi-asserted-by":"publisher","unstructured":"Bai Y, Jones A et\u00a0al (2022) Training a helpful and harmless assistant with reinforcement learning from human feedback. https:\/\/doi.org\/10.48550\/arxiv.2204.05862","DOI":"10.48550\/arxiv.2204.05862"},{"key":"2019_CR48","doi-asserted-by":"publisher","unstructured":"Bai Y, Kadavath S, Kundu S, Askell A et\u00a0al (2022) Constitutional AI: harmlessness from AI feedback. https:\/\/doi.org\/10.48550\/arxiv.2212.08073","DOI":"10.48550\/arxiv.2212.08073"},{"key":"2019_CR49","unstructured":"Anthropic (2024) Claude 3 technical overview. https:\/\/www.anthropic.com\/index\/claude-3-family"},{"key":"2019_CR50","doi-asserted-by":"publisher","unstructured":"Kim B-K, Kim G, Kim T-H, Castells T, Choi S, Shin J, Song H-K (2024) Shortened LLaMA: depth pruning for large language models with comparison of retraining methods. https:\/\/doi.org\/10.48550\/arxiv.2402.02834","DOI":"10.48550\/arxiv.2402.02834"},{"key":"2019_CR51","doi-asserted-by":"publisher","unstructured":"Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L (2023) QLoRA: efficient finetuning of quantized LLMs. https:\/\/doi.org\/10.48550\/arxiv.2305.14314","DOI":"10.48550\/arxiv.2305.14314"},{"key":"2019_CR52","doi-asserted-by":"publisher","unstructured":"Lin J, Tang J, Tang H, Yang S, Chen W-M, Wang W-C, Xiao G, Dang X, Gan C, Han S (2023) AWQ: activation-aware weight quantization for LLM compression and acceleration. https:\/\/doi.org\/10.48550\/arxiv.2306.00978","DOI":"10.48550\/arxiv.2306.00978"},{"key":"2019_CR53","doi-asserted-by":"publisher","unstructured":"Muralidharan S, Sreenivas ST, Joshi R, Chochowski M, Patwary M, Shoeybi M, Catanzaro B, Kautz J, Molchanov P (2024) Compact language models via pruning and knowledge distillation. https:\/\/doi.org\/10.48550\/arxiv.2407.14679","DOI":"10.48550\/arxiv.2407.14679"},{"key":"2019_CR54","unstructured":"Anil R, Dai AM et al (2023) PaLM 2 technical report"},{"key":"2019_CR55","doi-asserted-by":"publisher","unstructured":"Hsieh C-Y, Li C-L, Yeh C-K, Nakhost H, Fujii Y, Ratner A, Krishna R, Lee C-Y, Pfister T (2023) Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes. In: Findings of the association for computational linguistics: ACL 2023. Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/v1\/2023.findings-acl.507","DOI":"10.18653\/v1\/2023.findings-acl.507"},{"key":"2019_CR56","unstructured":"Papers with Code (2024) Papers with code: state-of-the-art leaderboards for machine learning. https:\/\/paperswithcode.com\/"},{"key":"2019_CR57","unstructured":"MLCommons (2024) MLPerf: benchmarking machine learning performance. https:\/\/mlcommons.org\/"},{"key":"2019_CR58","unstructured":"Hugging Face (2024) Hugging Face model hub: pre-trained models and benchmarks for NLP. https:\/\/huggingface.co\/models"},{"key":"2019_CR59","unstructured":"Stanford\u00a0Center for Research\u00a0on Foundation\u00a0Models (2024) HELM: holistic evaluation of language models. https:\/\/crfm.stanford.edu\/helm\/"},{"key":"2019_CR60","unstructured":"Neural Magic (2024) SparseZoo: pre-trained sparse models and benchmarks for deep learning. https:\/\/sparsezoo.neuralmagic.com\/"},{"key":"2019_CR61","unstructured":"Stanford University (2024) DAWNBench: end-to-end deep learning benchmark and competition. 
https:\/\/dawn.cs.stanford.edu\/benchmark\/"},{"key":"2019_CR62","doi-asserted-by":"publisher","unstructured":"Li Z, Song Z (2024) Structured pruning strategy based on interpretable machine learning. In: 2024 5th international conference on computer engineering and application (ICCEA). IEEE, pp 801\u2013804. https:\/\/doi.org\/10.1109\/iccea62105.2024.10603526","DOI":"10.1109\/iccea62105.2024.10603526"},{"issue":"12","key":"2019_CR63","doi-asserted-by":"publisher","first-page":"10558","DOI":"10.1109\/tpami.2024.3447085","volume":"46","author":"H Cheng","year":"2024","unstructured":"Cheng H, Zhang M, Shi JQ (2024) A survey on deep neural network pruning: taxonomy, comparison, analysis, and recommendations. IEEE Trans Pattern Anal Mach Intell 46(12):10558\u201310578. https:\/\/doi.org\/10.1109\/tpami.2024.3447085","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"issue":"3","key":"2019_CR64","doi-asserted-by":"publisher","first-page":"243","DOI":"10.1145\/3007787.3001163","volume":"44","author":"S Han","year":"2016","unstructured":"Han S, Liu X, Mao H, Pu J, Pedram A, Horowitz MA, Dally WJ (2016) EIE: efficient inference engine on compressed deep neural network. ACM SIGARCH Comput Archit News 44(3):243\u2013254. https:\/\/doi.org\/10.1145\/3007787.3001163","journal-title":"ACM SIGARCH Comput Archit News"},{"key":"2019_CR65","doi-asserted-by":"publisher","unstructured":"Wang Y, Ma Z, Yang C (2023) A new mixed precision quantization algorithm for neural networks based on reinforcement learning. In: 2023 IEEE 6th international conference on pattern recognition and artificial intelligence (PRAI). IEEE, pp 1016\u20131020. https:\/\/doi.org\/10.1109\/prai59366.2023.10331945","DOI":"10.1109\/prai59366.2023.10331945"},{"issue":"12","key":"2019_CR66","doi-asserted-by":"publisher","first-page":"2295","DOI":"10.1109\/jproc.2017.2761740","volume":"105","author":"V Sze","year":"2017","unstructured":"Sze V, Chen Y-H, Yang T-J, Emer JS (2017) Efficient processing of deep neural networks: a tutorial and survey. Proc IEEE 105(12):2295\u20132329. https:\/\/doi.org\/10.1109\/jproc.2017.2761740","journal-title":"Proc IEEE"},{"key":"2019_CR67","doi-asserted-by":"publisher","unstructured":"Cai H, Zhu L, Han S (2018) ProxylessNAS: direct neural architecture search on target task and hardware. https:\/\/doi.org\/10.48550\/arxiv.1812.00332","DOI":"10.48550\/arxiv.1812.00332"},{"key":"2019_CR68","doi-asserted-by":"publisher","unstructured":"Zhu K, He Y-Y, Wu J (2023) Quantized feature distillation for network quantization. In: Proceedings of the AAAI conference on artificial intelligence, vol 37, no 9, pp 11452\u201311460. https:\/\/doi.org\/10.1609\/aaai.v37i9.26354","DOI":"10.1609\/aaai.v37i9.26354"},{"key":"2019_CR69","doi-asserted-by":"publisher","unstructured":"Jiao X, Yin Y, Shang L, Jiang X, Chen X, Li L, Wang F, Liu Q (2020) TinyBERT: distilling BERT for natural language understanding. In: Findings of the association for computational linguistics: EMNLP 2020. Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/v1\/2020.findings-emnlp.372","DOI":"10.18653\/v1\/2020.findings-emnlp.372"},{"issue":"8","key":"2019_CR70","doi-asserted-by":"publisher","first-page":"2798","DOI":"10.1007\/s11263-024-02002-0","volume":"132","author":"Y Liu","year":"2024","unstructured":"Liu Y, Cao J, Li B, Hu W, Ding J, Li L, Maybank S (2024) Cross-architecture knowledge distillation. Int J Comput Vis 132(8):2798\u20132824. 
https:\/\/doi.org\/10.1007\/s11263-024-02002-0","journal-title":"Int J Comput Vis"},{"key":"2019_CR71","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2021.107958","volume":"239","author":"SK Kang","year":"2022","unstructured":"Kang SK, Lee D, Kweon W, Yu H (2022) Personalized knowledge distillation for recommender system. Knowl-Based Syst 239:107958. https:\/\/doi.org\/10.1016\/j.knosys.2021.107958","journal-title":"Knowl-Based Syst"},{"key":"2019_CR72","doi-asserted-by":"publisher","unstructured":"Yan W, Liu A, Huang Z, Zhang S, Van Gool L (2021) Neural architecture search as sparse supernet. In: Proceedings of the AAAI conference on artificial intelligence, vol 35, no 12, pp 10379\u201310387. https:\/\/doi.org\/10.1609\/aaai.v35i12.17243","DOI":"10.1609\/aaai.v35i12.17243"},{"key":"2019_CR73","doi-asserted-by":"publisher","unstructured":"Yang Y, Shen Z, Li H, Lin Z (2023) Optimization-inspired manual architecture design and neural architecture search. Sci China Inf Sci. https:\/\/doi.org\/10.1007\/s11432-021-3527-7","DOI":"10.1007\/s11432-021-3527-7"},{"issue":"1","key":"2019_CR74","doi-asserted-by":"publisher","first-page":"231","DOI":"10.1007\/s00521-024-10445-2","volume":"37","author":"A Cassimon","year":"2024","unstructured":"Cassimon A, Mercelis S, Mets K (2024) Scalable reinforcement learning-based neural architecture search. Neural Comput Appl 37(1):231\u2013261. https:\/\/doi.org\/10.1007\/s00521-024-10445-2","journal-title":"Neural Comput Appl"},{"key":"2019_CR75","doi-asserted-by":"publisher","unstructured":"Dong N, Xu M, Liang X, Jiang Y, Dai W, Xing E (2019) Neural architecture search for adversarial medical image segmentation. Springer International Publishing, pp 828\u2013836. https:\/\/doi.org\/10.1007\/978-3-030-32226-7_92","DOI":"10.1007\/978-3-030-32226-7_92"},{"issue":"5s","key":"2019_CR76","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3609385","volume":"22","author":"H Mousavi","year":"2023","unstructured":"Mousavi H, Loni M, Alibeigi M, Daneshtalab M (2023) DASS: differentiable architecture search for sparse neural networks. ACM Trans Embed Comput Syst 22(5s):1\u201321. https:\/\/doi.org\/10.1145\/3609385","journal-title":"ACM Trans Embed Comput Syst"},{"key":"2019_CR77","doi-asserted-by":"publisher","unstructured":"Wei Z, Wang X, Zhu W (2021) AutoIAS: automatic integrated architecture searcher for click-trough rate prediction. In: Proceedings of the 30th ACM international conference on information & knowledge management, CIKM\u201921. ACM, pp 2101\u20132110. https:\/\/doi.org\/10.1145\/3459637.3482234","DOI":"10.1145\/3459637.3482234"},{"issue":"1","key":"2019_CR78","doi-asserted-by":"publisher","first-page":"423","DOI":"10.1007\/s11063-021-10638-z","volume":"54","author":"AA Nevzorov","year":"2021","unstructured":"Nevzorov AA, Perchenko SV, Stankevich DA (2021) Truncation: a new approach to neural network reduction. Neural Process Lett 54(1):423\u2013435. https:\/\/doi.org\/10.1007\/s11063-021-10638-z","journal-title":"Neural Process Lett"},{"key":"2019_CR79","doi-asserted-by":"publisher","unstructured":"Zafrir O, Boudoukh G, Izsak P, Wasserblat M (2019) Q8BERT: quantized 8Bit BERT. In: 2019 5th workshop on energy efficient machine learning and cognitive computing\u2014NeurIPS edition (EMC2-NIPS). IEEE, pp 36\u201339. 
https:\/\/doi.org\/10.1109\/emc2-nips53020.2019.00016","DOI":"10.1109\/emc2-nips53020.2019.00016"},{"key":"2019_CR80","doi-asserted-by":"publisher","DOI":"10.1145\/3699518","author":"C Yang","year":"2024","unstructured":"Yang C, Zhu Y, Lu W, Wang Y, Chen Q, Gao C, Yan B, Chen Y (2024) Survey on knowledge distillation for large language models: methods, evaluation, and application. ACM Trans Intell Syst Technol. https:\/\/doi.org\/10.1145\/3699518","journal-title":"ACM Trans Intell Syst Technol"},{"key":"2019_CR81","doi-asserted-by":"publisher","unstructured":"Widmann T, Merkle F, Nocker M, Sch\u00f6ttle P (2023) Pruning for power: optimizing energy efficiency in IoT with neural network pruning. Springer Nature Switzerland, pp 251\u2013263. https:\/\/doi.org\/10.1007\/978-3-031-34204-2_22","DOI":"10.1007\/978-3-031-34204-2_22"},{"issue":"3","key":"2019_CR82","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1145\/3698365.3698367","volume":"4","author":"Y Zhao","year":"2024","unstructured":"Zhao Y, Guo T (2024) Carbon-efficient neural architecture search. ACM SIGEnergy Energy Inform Rev 4(3):3\u20139. https:\/\/doi.org\/10.1145\/3698365.3698367","journal-title":"ACM SIGEnergy Energy Inform Rev"},{"key":"2019_CR83","doi-asserted-by":"publisher","unstructured":"Yuan Y, Shi J, Zhang Z, Chen K, Zhang J, Stoico V, Malavolta I (2024) The impact of knowledge distillation on the energy consumption and runtime efficiency of NLP models. In: Proceedings of the IEEE\/ACM 3rd international conference on AI engineering\u2014software engineering for AI, CAIN 2024. ACM, pp 129\u2013133. https:\/\/doi.org\/10.1145\/3644815.3644966","DOI":"10.1145\/3644815.3644966"},{"key":"2019_CR84","unstructured":"TensorFlow (2024) TensorFlow model optimization toolkit (TF-MOT). https:\/\/www.tensorflow.org\/model_optimization"},{"issue":"2","key":"2019_CR85","doi-asserted-by":"publisher","first-page":"1733","DOI":"10.1109\/tcc.2022.3160129","volume":"11","author":"Z Tao","year":"2023","unstructured":"Tao Z, Xia Q, Cheng S, Li Q (2023) An efficient and robust cloud-based deep learning with knowledge distillation. IEEE Trans Cloud Comput 11(2):1733\u20131745. https:\/\/doi.org\/10.1109\/tcc.2022.3160129","journal-title":"IEEE Trans Cloud Comput"},{"key":"2019_CR86","doi-asserted-by":"publisher","unstructured":"Kim J, Chang S, Kwak N (2021) PQK: model compression via pruning, quantization, and knowledge distillation. In: Interspeech 2021. ISCA. https:\/\/doi.org\/10.21437\/interspeech.2021-248","DOI":"10.21437\/interspeech.2021-248"},{"key":"2019_CR87","doi-asserted-by":"publisher","unstructured":"Harma SB, Chakraborty A et\u00a0al (2024) Effective interplay between sparsity and quantization: from theory to practice. https:\/\/doi.org\/10.48550\/arxiv.2405.20935","DOI":"10.48550\/arxiv.2405.20935"},{"key":"2019_CR88","doi-asserted-by":"publisher","unstructured":"Wang T, Wang K, Cai H, Lin J, Liu Z, Wang H, Lin Y, Han S (2020) APQ: joint search for network architecture, pruning and quantization policy. In: 2020 IEEE\/CVF conference on computer vision and pattern recognition (CVPR). IEEE, pp 2075\u20132084. https:\/\/doi.org\/10.1109\/cvpr42600.2020.00215","DOI":"10.1109\/cvpr42600.2020.00215"},{"issue":"22","key":"2019_CR89","doi-asserted-by":"publisher","first-page":"11804","DOI":"10.1007\/s10489-024-05747-w","volume":"54","author":"PV Dantas","year":"2024","unstructured":"Dantas PV, Sabino da Silva W, Cordeiro LC, Carvalho CB (2024) A comprehensive review of model compression techniques in machine learning. 
Appl Intell 54(22):11804\u201311844. https:\/\/doi.org\/10.1007\/s10489-024-05747-w","journal-title":"Appl Intell"},{"key":"2019_CR90","doi-asserted-by":"publisher","unstructured":"Zmora N, Jacob G, Zlotnik L, Elharar B, Novik G (2019) Neural network distiller: a Python package for DNN compression research. https:\/\/doi.org\/10.48550\/arxiv.1910.12232","DOI":"10.48550\/arxiv.1910.12232"},{"key":"2019_CR91","doi-asserted-by":"publisher","unstructured":"He Y, Lin J, Liu Z, Wang H, Li L-J, Han S (2018) AMC: AutoML for model compression and acceleration on mobile devices. Springer International Publishing, pp 815\u2013832. https:\/\/doi.org\/10.1007\/978-3-030-01234-2_48","DOI":"10.1007\/978-3-030-01234-2_48"},{"key":"2019_CR92","doi-asserted-by":"publisher","unstructured":"Lin J, Chen W-M, Lin Y, Cohn J, Gan C, Han S (2020) MCUNet: tiny deep learning on IoT devices. https:\/\/doi.org\/10.48550\/arxiv.2007.10319","DOI":"10.48550\/arxiv.2007.10319"},{"key":"2019_CR93","doi-asserted-by":"publisher","first-page":"1556","DOI":"10.1162\/tacl_a_00704","volume":"12","author":"X Zhu","year":"2024","unstructured":"Zhu X, Li J, Liu Y, Ma C, Wang W (2024) A survey on model compression for large language models. Trans Assoc Comput Linguist 12:1556\u20131577. https:\/\/doi.org\/10.1162\/tacl_a_00704","journal-title":"Trans Assoc Comput Linguist"},{"key":"2019_CR94","doi-asserted-by":"publisher","unstructured":"Huang J, Zhang J, Wang Q, Han W, Zhang Y (2024) Exploring advanced methodologies in security evaluation for large language models. Springer Nature Singapore, pp 135\u2013150. https:\/\/doi.org\/10.1007\/978-981-97-4519-7_10","DOI":"10.1007\/978-981-97-4519-7_10"},{"key":"2019_CR95","doi-asserted-by":"publisher","unstructured":"Brodersen KH, Ong CS, Stephan KE, Buhmann JM (2010) The balanced accuracy and its posterior distribution. In: 2010 20th international conference on pattern recognition. IEEE, pp 3121\u20133124. https:\/\/doi.org\/10.1109\/icpr.2010.764","DOI":"10.1109\/icpr.2010.764"},{"key":"2019_CR96","doi-asserted-by":"publisher","unstructured":"Yacouby R, Axman D (2020) Probabilistic extension of precision, recall, and F1 score for more thorough evaluation of classification models. In: Proceedings of the first workshop on evaluation and comparison of NLP systems. Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/v1\/2020.eval4nlp-1.9","DOI":"10.18653\/v1\/2020.eval4nlp-1.9"},{"issue":"1","key":"2019_CR97","doi-asserted-by":"publisher","first-page":"183","DOI":"10.1080\/00949655.2023.2238235","volume":"94","author":"Y Liu","year":"2023","unstructured":"Liu Y, Li Y, Xie D (2023) Implications of imbalanced datasets for empirical ROC-AUC estimation in binary classification tasks. J Stat Comput Simul 94(1):183\u2013203. https:\/\/doi.org\/10.1080\/00949655.2023.2238235","journal-title":"J Stat Comput Simul"},{"key":"2019_CR98","doi-asserted-by":"publisher","unstructured":"Deng S, Wu L, Shi G, Zhang H, Hu W, Dong R (2021) Emotion class-wise aware loss for image emotion classification. Springer International Publishing, pp 553\u2013564. https:\/\/doi.org\/10.1007\/978-3-030-93046-2_47","DOI":"10.1007\/978-3-030-93046-2_47"},{"key":"2019_CR99","doi-asserted-by":"publisher","first-page":"291","DOI":"10.1162\/tacl_a_00461","volume":"10","author":"L Xue","year":"2022","unstructured":"Xue L, Barua A, Constant N, Al-Rfou R, Narang S, Kale M, Roberts A, Raffel C (2022) ByT5: towards a token-free future with pre-trained byte-to-byte models. 
Trans Assoc Comput Linguist 10:291\u2013306. https:\/\/doi.org\/10.1162\/tacl_a_00461","journal-title":"Trans Assoc Comput Linguist"},{"key":"2019_CR100","doi-asserted-by":"publisher","first-page":"1768","DOI":"10.1016\/j.procs.2023.01.155","volume":"218","author":"S Kumar","year":"2023","unstructured":"Kumar S, Solanki A (2023) A natural language processing system using CWS pipeline for extraction of linguistic features. Proc Comput Sci 218:1768\u20131777. https:\/\/doi.org\/10.1016\/j.procs.2023.01.155","journal-title":"Proc Comput Sci"},{"key":"2019_CR101","doi-asserted-by":"publisher","DOI":"10.1016\/j.jss.2023.111741","volume":"203","author":"M Evtikhiev","year":"2023","unstructured":"Evtikhiev M, Bogomolov E, Sokolov Y, Bryksin T (2023) Out of the BLEU: how should we assess quality of the code generation models? J Syst Softw 203:111741. https:\/\/doi.org\/10.1016\/j.jss.2023.111741","journal-title":"J Syst Softw"},{"key":"2019_CR102","doi-asserted-by":"publisher","unstructured":"Kim JY, Jo SH, Sang-hyun H, Kim KH, Kang YJ, Jeong SC (2024) Comparison of AI model serving efficiency: response time and memory usage analysis. In: Human factors in design, engineering, and computing, AHFE Hawaii, vol 159. AHFE International. https:\/\/doi.org\/10.54941\/ahfe1005580","DOI":"10.54941\/ahfe1005580"},{"issue":"7","key":"2019_CR103","doi-asserted-by":"publisher","first-page":"5113","DOI":"10.1007\/s10462-020-09816-7","volume":"53","author":"T Choudhary","year":"2020","unstructured":"Choudhary T, Mishra V, Goswami A, Sarangapani J (2020) A comprehensive survey on model compression and acceleration. Artif Intell Rev 53(7):5113\u20135155. https:\/\/doi.org\/10.1007\/s10462-020-09816-7","journal-title":"Artif Intell Rev"},{"key":"2019_CR104","doi-asserted-by":"publisher","unstructured":"Zhao P, Zhang J, Peng B, Wang L, Wei Y, Liu Y, Liu L (2023) ARBiBench: benchmarking adversarial robustness of binarized neural networks. https:\/\/doi.org\/10.48550\/arxiv.2312.13575","DOI":"10.48550\/arxiv.2312.13575"},{"issue":"5","key":"2019_CR105","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2021.102642","volume":"58","author":"K Makhlouf","year":"2021","unstructured":"Makhlouf K, Zhioua S, Palamidessi C (2021) Machine learning fairness notions: bridging the gap with real-world applications. Inf Process Manag 58(5):102642. https:\/\/doi.org\/10.1016\/j.ipm.2021.102642","journal-title":"Inf Process Manag"},{"key":"2019_CR106","doi-asserted-by":"publisher","unstructured":"Na S, Jeong G, Ahn BH, Young J, Krishna T, Kim H (2024) Understanding performance implications of LLM inference on CPUs. In: 2024 IEEE international symposium on workload characterization (IISWC), pp 169\u2013180. https:\/\/doi.org\/10.1109\/IISWC63097.2024.00024","DOI":"10.1109\/IISWC63097.2024.00024"},{"key":"2019_CR107","doi-asserted-by":"publisher","unstructured":"Wei W, Ren X, Tang J, Wang Q, Su L, Cheng S, Wang J, Yin D, Huang C (2024) LLMRec: large language models with graph augmentation for recommendation. In: Proceedings of the 17th ACM international conference on web search and data mining, WSDM\u201924. ACM, pp 806\u2013815. https:\/\/doi.org\/10.1145\/3616855.3635853","DOI":"10.1145\/3616855.3635853"},{"key":"2019_CR108","doi-asserted-by":"publisher","unstructured":"Zhao X, Liu H, Fan W, Liu H, Tang J, Wang C (2021) AutoLoss: automated loss function search in recommendations. In: Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, KDD\u201921. ACM, pp 3959\u20133967. 
https:\/\/doi.org\/10.1145\/3447548.3467208","DOI":"10.1145\/3447548.3467208"},{"key":"2019_CR109","doi-asserted-by":"publisher","DOI":"10.1016\/j.ijmedinf.2024.105764","volume":"195","author":"MJ Hasan","year":"2025","unstructured":"Hasan MJ, Rahman F, Mohammed N (2025) OptimCLM: optimizing clinical language models for predicting patient outcomes via knowledge distillation, pruning and quantization. Int J Med Inform 195:105764. https:\/\/doi.org\/10.1016\/j.ijmedinf.2024.105764","journal-title":"Int J Med Inform"},{"key":"2019_CR110","doi-asserted-by":"publisher","unstructured":"Bellamy RKE, Dey K, Hind M, Hoffman SC, Houde S, Kannan K, Lohia P, Martino J, Mehta S, Mojsilovic A, Nagar S, Natesan Ramamurthy K, Richards J, Saha D, Sattigeri P, Singh M, Varshney KR, Zhang Y (2019) AI fairness 360: an extensible toolkit for detecting and mitigating algorithmic bias. IBM J Res Dev 63(4\/5):4:1\u20134:15. https:\/\/doi.org\/10.1147\/jrd.2019.2942287","DOI":"10.1147\/jrd.2019.2942287"},{"key":"2019_CR111","doi-asserted-by":"publisher","unstructured":"Kamal M, Talbert D (2024) Beyond size and accuracy: the impact of model compression on fairness. In: The international FLAIRS conference proceedings, p 37. https:\/\/doi.org\/10.32473\/flairs.37.1.135617","DOI":"10.32473\/flairs.37.1.135617"},{"key":"2019_CR112","doi-asserted-by":"publisher","unstructured":"Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J (2019) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. https:\/\/doi.org\/10.48550\/arxiv.1901.08746","DOI":"10.48550\/arxiv.1901.08746"},{"key":"2019_CR113","unstructured":"Mosca E, Szigeti F, Tragianni S, Gallagher D, Groh G (2022) SHAP-based explanation methods: a review for NLP interpretability. In: Proceedings\u2014international conference on computational linguistics COLING, vol 29, no 1, pp 4593\u20134603. https:\/\/aclanthology.org\/2022.coling-1.406"},{"key":"2019_CR114","doi-asserted-by":"publisher","unstructured":"Said A, Yahyaoui A, Abdellatif T (2024) HIPAA and GDPR compliance in IoT healthcare systems. Springer Nature Switzerland, pp 198\u2013209. https:\/\/doi.org\/10.1007\/978-3-031-55729-3_16","DOI":"10.1007\/978-3-031-55729-3_16"},{"key":"2019_CR115","doi-asserted-by":"publisher","unstructured":"He X, Pal S, Amarnath A, Feng S, Park D-H, Rovinski A, Ye H, Chen Y, Dreslinski R, Mudge T (2020) Sparse-TPU: adapting systolic arrays for sparse matrices. In: Proceedings of the 34th ACM international conference on supercomputing, ICS \u201920. ACM. https:\/\/doi.org\/10.1145\/3392717.3392751","DOI":"10.1145\/3392717.3392751"},{"issue":"1","key":"2019_CR116","doi-asserted-by":"publisher","first-page":"10081","DOI":"10.1038\/s41598-025-94205-9","volume":"15","author":"T Suwannaphong","year":"2025","unstructured":"Suwannaphong T, Jovan F, Craddock I, McConville R (2025) Optimising TinyML with quantization and distillation of transformer and mamba models for indoor localisation on edge devices. Sci Rep 15(1):10081. https:\/\/doi.org\/10.1038\/s41598-025-94205-9","journal-title":"Sci Rep"},{"key":"2019_CR117","unstructured":"Pool J (2020) Accelerating sparsity in the NVIDIA Ampere architecture. https:\/\/developer.download.nvidia.com\/video\/gputechconf\/gtc\/2020\/presentations\/s22085-accelerating-sparsity-in-the-nvidia-ampere-architecture%E2%80%8B.pdf. 
Accessed 25 May 2025"},{"key":"2019_CR118","doi-asserted-by":"publisher","unstructured":"Zhang H, XiaolongShi XS, Sun J, Sun G (2024) Structured pruning for large language models using coupled components elimination and minor fine-tuning. In: Findings of the association for computational linguistics: NAACL 2024. Association for Computational Linguistics, pp 1\u201312. https:\/\/doi.org\/10.18653\/v1\/2024.findings-naacl.1","DOI":"10.18653\/v1\/2024.findings-naacl.1"},{"key":"2019_CR119","unstructured":"Kurti\u0107 E, Marques A, Kurtz M, Alistarh D, Pandit S (2025) 2:4 Sparse LLaMA: smaller models for efficient GPU inference. https:\/\/developers.redhat.com\/articles\/2025\/02\/28\/24-sparse-llama-smaller-models-efficient-gpu-inference. Accessed 25 May 2025"},{"key":"2019_CR120","doi-asserted-by":"publisher","unstructured":"Gondimalla A, Chesnut N, Thottethodi M, Vijaykumar TN (2019) SparTen: a sparse tensor accelerator for convolutional neural networks. In: Proceedings of the 52nd annual IEEE\/ACM international symposium on microarchitecture, MICRO \u201952. ACM, pp 151\u2013165. https:\/\/doi.org\/10.1145\/3352460.3358291","DOI":"10.1145\/3352460.3358291"},{"key":"2019_CR121","doi-asserted-by":"publisher","unstructured":"Wang H, Ma S, Wang R, Wei F (2024) Q-sparse: all large language models can be fully sparsely-activated. https:\/\/doi.org\/10.48550\/arxiv.2407.10969","DOI":"10.48550\/arxiv.2407.10969"}],"container-title":["Complex &amp; Intelligent Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-025-02019-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s40747-025-02019-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-025-02019-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,31]],"date-time":"2025-08-31T05:26:25Z","timestamp":1756617985000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s40747-025-02019-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,1]]},"references-count":121,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2025,9]]}},"alternative-id":["2019"],"URL":"https:\/\/doi.org\/10.1007\/s40747-025-02019-z","relation":{},"ISSN":["2199-4536","2198-6053"],"issn-type":[{"value":"2199-4536","type":"print"},{"value":"2198-6053","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,8,1]]},"assertion":[{"value":"14 November 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 July 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 August 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"All authors certify that they have no affiliations or involvement in any organization or entity with any financial or non-financial interest in the subject matter or materials discussed in this manuscript.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"Adherence 
to the ethical standards and principles of informed consent, we confirm that all data used in this manuscript were collected and used with due regard for ethical norms. The study was conducted transparently, and all authors made substantial contributions as detailed in the Authors Contribution Statement section. We affirm our unwavering commitment to ethical research practices and informed consent, ensuring that this research aligns with established ethical guidelines.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethical and informed consent for the data used"}}],"article-number":"407"}}