{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,10]],"date-time":"2026-04-10T16:04:40Z","timestamp":1775837080262,"version":"3.50.1"},"reference-count":90,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2025,3,20]],"date-time":"2025-03-20T00:00:00Z","timestamp":1742428800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Robot. AI"],"abstract":"<jats:p>The exceptional performance of general-purpose large models has driven various industries to focus on developing domain-specific models. However, large models are not only time-consuming and labor-intensive during the training phase but also have very high hardware requirements during the inference phase, such as large memory and high computational power. These requirements pose considerable challenges for the practical deployment of large models. As these challenges intensify, model compression has become a vital research focus to address these limitations. This paper presents a comprehensive review of the evolution of model compression techniques, from their inception to future directions. To meet the urgent demand for efficient deployment, we delve into several compression methods\u2014such as quantization, pruning, low-rank decomposition, and knowledge distillation\u2014emphasizing their fundamental principles, recent advancements, and innovative strategies. By offering insights into the latest developments and their implications for practical applications, this review serves as a valuable technical resource for researchers and practitioners, providing a range of strategies for model deployment and laying the groundwork for future advancements in model compression.<\/jats:p>","DOI":"10.3389\/frobt.2025.1518965","type":"journal-article","created":{"date-parts":[[2025,3,20]],"date-time":"2025-03-20T08:20:09Z","timestamp":1742458809000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":29,"title":["A survey of model compression techniques: past, present, and future"],"prefix":"10.3389","volume":"12","author":[{"given":"Defu","family":"Liu","sequence":"first","affiliation":[]},{"given":"Yixiao","family":"Zhu","sequence":"additional","affiliation":[]},{"given":"Zhe","family":"Liu","sequence":"additional","affiliation":[]},{"given":"Yi","family":"Liu","sequence":"additional","affiliation":[]},{"given":"Changlin","family":"Han","sequence":"additional","affiliation":[]},{"given":"Jinkai","family":"Tian","sequence":"additional","affiliation":[]},{"given":"Ruihao","family":"Li","sequence":"additional","affiliation":[]},{"given":"Wei","family":"Yi","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2025,3,20]]},"reference":[{"key":"B1","article-title":"On-policy distillation of language models: learning from self-generated mistakes","author":"Agarwal","year":"2024"},{"key":"B2","doi-asserted-by":"publisher","first-page":"3090","DOI":"10.1109\/icpr48806.2021.9412897","article-title":"Neuron-based network pruning based on majority voting","author":"Alqahtani","year":"2021","journal-title":"2020 25th Int. Conf. Pattern Recognit. (ICPR)"},{"key":"B3","first-page":"254","article-title":"Stronger generalization bounds for deep nets via a compression approach","author":"Arora","year":"2018","journal-title":"Int. Conf. Mach. Learn."},{"key":"B4","article-title":"Dual lottery ticket hypothesis","author":"Bai","year":"2022"},{"key":"B5","first-page":"837","article-title":"Self pruning Gaussian synapse networks for behavior based robots","author":"Becerra","year":"2002"},{"key":"B6","article-title":"Language models are few-shot learners","author":"Brown","year":"2020","journal-title":"arXiv"},{"key":"B7","article-title":"ProxylessNAS: direct neural architecture search on target task and hardware","author":"Cai","year":"2019"},{"key":"B8","doi-asserted-by":"publisher","first-page":"200336","DOI":"10.1016\/j.iswa.2024.200336","article-title":"Claude 2.0 large language model: tackling a real-world classification problem with a new iterative prompt engineering approach","volume":"21","author":"Caruccio","year":"2024","journal-title":"Intelligent Syst. Appl."},{"key":"B9","first-page":"15834","article-title":"The lottery ticket hypothesis for pre-trained bert networks","volume":"33","author":"Chen","year":"2020"},{"key":"B10","first-page":"26609","article-title":"The elastic lottery ticket hypothesis","volume":"34","author":"Chen","year":"2021","journal-title":"arXiv"},{"key":"B11","first-page":"10944","article-title":"Teq: trainable equivalent transformation for quantization of llms","volume":"2310","author":"Cheng","year":"2023","journal-title":"arXiv"},{"key":"B12","article-title":"Binaryconnect: training deep neural networks with binary weights during propagations","volume":"28","author":"Courbariaux","year":"2015","journal-title":"Adv. neural Inf. Process. Syst."},{"key":"B13","article-title":"Proving the lottery ticket hypothesis for convolutional neural networks","author":"Da Cunha","year":"2022"},{"key":"B14","article-title":"Predicting parameters in deep learning","volume":"26","author":"Denil","year":"2013","journal-title":"Adv. neural Inf. Process. Syst."},{"key":"B15","article-title":"Exploiting linear structure within convolutional networks for efficient evaluation","author":"Denton","year":"2014","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"B16","article-title":"GPT3.int8: 8-bit matrix multiplication for transformers at scale","author":"Dettmers","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"B17","article-title":"Qlora: efficient finetuning of quantized llms","volume":"36","author":"Dettmers","year":"2024","journal-title":"arXiv"},{"key":"B18","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1423","article-title":"Bert: pre-training of deep bidirectional transformers for language understanding","author":"Devlin","year":"2018","journal-title":"arXiv"},{"key":"B19","article-title":"The lottery ticket hypothesis: finding sparse, trainable neural networks","author":"Frankle","year":"2019"},{"key":"B20","doi-asserted-by":"publisher","first-page":"3259","DOI":"10.5555\/3524938.3525243","article-title":"Linear mode connectivity and the lottery ticket hypothesis","author":"Frankle","year":"2020","journal-title":"Int. Conf. Mach. Learn."},{"key":"B21","article-title":"OPTQ: accurate quantization for generative pre-trained transformers","author":"Frantar","year":"2023"},{"key":"B22","doi-asserted-by":"publisher","first-page":"762","DOI":"10.1109\/cvpr46437.2021.00082","article-title":"The lottery ticket hypothesis for object recognition","author":"Girish","year":"2021","journal-title":"Proc. IEEE\/CVF Conf. Comput. Vis. pattern Recognit."},{"key":"B23","article-title":"MiniLLM: knowledge distillation of large language models","author":"Gu","year":"2024"},{"key":"B24","article-title":"Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding","author":"Han","year":"2016"},{"key":"B25","first-page":"1135","article-title":"Learning both weights and connections for efficient neural networks","volume":"1","author":"Han","year":"2015","journal-title":"Proc. 28th Int. Conf. Neural Inf. Process. Syst. -"},{"key":"B26","article-title":"Comparing biases for minimal network construction with back-propagation","volume":"1","author":"Hanson","year":"1988","journal-title":"Adv. neural Inf. Process. Syst."},{"key":"B27","article-title":"Second order derivatives for network pruning: optimal brain surgeon","volume":"5","author":"Hassibi","year":"1992","journal-title":"Adv. neural Inf. Process. Syst."},{"key":"B28","first-page":"245","article-title":"Reshaping deep neural network for fast decoding by node-pruning","author":"He","year":"2014"},{"key":"B29","article-title":"Distilling the knowledge in a neural network","author":"Hinton","year":"2015","journal-title":"arXiv"},{"key":"B30","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1607.03250","article-title":"Network trimming: a data-driven neuron pruning approach towards efficient deep architectures","author":"Hu","year":"2016","journal-title":"arXiv"},{"key":"B31","doi-asserted-by":"publisher","DOI":"10.5555\/3157382.3157557","article-title":"Binarized neural networks","volume":"29","author":"Hubara","year":"2016","journal-title":"arXiv"},{"key":"B32","doi-asserted-by":"publisher","first-page":"2704","DOI":"10.1109\/cvpr.2018.00286","article-title":"Quantization and training of neural networks for efficient integer-arithmetic-only inference","author":"Jacob","year":"2018","journal-title":"Proc. IEEE Conf. Comput. Vis. pattern Recognit."},{"key":"B33","first-page":"88.1","article-title":"Speeding up convolutional neural networks with low rank expansions","author":"Jaderberg","year":"2014"},{"key":"B34","article-title":"Just chop: embarrassingly simple llm compression","author":"Jha","year":"2024","journal-title":"arXiv"},{"key":"B35","doi-asserted-by":"publisher","first-page":"248","DOI":"10.1109\/CVPR.2009.5206848","article-title":"Imagenet: a large-scale hierarchical image database","author":"Jia","year":"2009","journal-title":"2009 IEEE Conf. Comput. Vis. pattern Recognit."},{"key":"B36","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2308.09723","article-title":"Finequant: unlocking efficiency with fine-grained weight-only quantization for llms","author":"Kim","year":"2023","journal-title":"arXiv"},{"key":"B37","article-title":"Quantizing deep convolutional networks for efficient inference: a whitepaper","author":"Krishnamoorthi","year":"2018","journal-title":"arXiv"},{"key":"B38","article-title":"Imagenet classification with deep convolutional neural networks","volume":"25","author":"Krizhevsky","year":"2012","journal-title":"Adv. neural Inf. Process. Syst."},{"key":"B39","doi-asserted-by":"publisher","first-page":"4163","DOI":"10.18653\/v1\/2022.emnlp-main.279","article-title":"The optimal bert surgeon: scalable and accurate second-order pruning for large language models","author":"Kurtic","year":"2022","journal-title":"arXiv"},{"key":"B40","doi-asserted-by":"publisher","first-page":"2554","DOI":"10.1109\/cvpr.2016.280","article-title":"Fast convnets using group-wise brain damage","author":"Lebedev","year":"2016","journal-title":"Proc. IEEE Conf. Comput. Vis. pattern Recognit."},{"key":"B41","article-title":"Optimal brain damage","volume":"2","author":"LeCun","year":"1989","journal-title":"Adv. neural Inf. Process. Syst."},{"key":"B42","article-title":"Ternary weight networks","author":"Li","year":"2016","journal-title":"arXiv"},{"key":"B43","article-title":"Pruning filters for efficient convnets","author":"Li","year":"2017"},{"key":"B44","article-title":"Dividemix: learning with noisy labels as semi-supervised learning","author":"Li","year":"2020"},{"key":"B45","article-title":"Loftq: LoRA-fine-tuning-aware quantization for large language models","author":"Li","year":"2024"},{"key":"B46","first-page":"87","article-title":"Awq: activation-aware weight quantization for on-device llm compression and acceleration","volume":"6","author":"Lin","year":"2024","journal-title":"Proc. Mach. Learn. Syst."},{"key":"B47","first-page":"7021","article-title":"Group Fisher pruning for practical network compression","author":"Liu","year":"2021","journal-title":"Int. Conf. Mach. Learn."},{"key":"B48","article-title":"Rethinking the value of network pruning","author":"Liu","year":"2019"},{"key":"B49","doi-asserted-by":"publisher","first-page":"1131","DOI":"10.1109\/cvpr.2017.126","article-title":"Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification","author":"Lu","year":"2017","journal-title":"2017 IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR)"},{"key":"B50","article-title":"An entropy-based pruning method for cnn compression","author":"Luo","year":"2017","journal-title":"arXiv"},{"key":"B51","doi-asserted-by":"publisher","first-page":"2525","DOI":"10.1109\/tpami.2018.2858232","article-title":"Thinet: pruning cnn filters for a thinner net","volume":"41","author":"Luo","year":"2019","journal-title":"IEEE Trans. Pattern Analysis Mach. Intell."},{"key":"B52","article-title":"Diversity networks: neural network compression using determinantal point processes","author":"Mariet","year":"2016"},{"key":"B53","doi-asserted-by":"publisher","first-page":"2383","DOI":"10.1038\/s41467-018-04316-3","article-title":"Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science","volume":"9","author":"Mocanu","year":"2018","journal-title":"Nat. Commun."},{"key":"B54","article-title":"Skeletonization: a technique for trimming the fat from a network via relevance assessment","volume":"1","author":"Mozer","year":"1988","journal-title":"Adv. neural Inf. Process. Syst."},{"key":"B55","doi-asserted-by":"publisher","first-page":"7197","DOI":"10.5555\/3524938.3525605","article-title":"Up or down? adaptive rounding for post-training quantization","author":"Nagel","year":"2020","journal-title":"Proc. 37th Int. Conf. Mach. Learn."},{"key":"B56","doi-asserted-by":"publisher","first-page":"1325","DOI":"10.1109\/iccv.2019.00141","article-title":"Data-free quantization through weight equalization and bias correction","author":"Nagel","year":"2019","journal-title":"Proc. IEEE\/CVF Int. Conf. Comput. Vis."},{"key":"B57","article-title":"Gpt-4 technical report","author":"Achiam","year":"2024","journal-title":"arXiv"},{"key":"B58","doi-asserted-by":"publisher","first-page":"488","DOI":"10.1109\/34.391394","article-title":"Deformable kernels for early vision","volume":"17","author":"Perona","year":"1995","journal-title":"IEEE Trans. Pattern Analysis Mach. Intell."},{"key":"B59","doi-asserted-by":"publisher","first-page":"2163","DOI":"10.1109\/access.2015.2494536","article-title":"Channel-level acceleration of deep face representations","volume":"3","author":"Polyak","year":"2015","journal-title":"IEEE Access"},{"key":"B60","article-title":"Improving language understanding by generative pre-training","author":"Radford","year":"2018","journal-title":"arXiv"},{"key":"B61","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford","year":"2019","journal-title":"arXiv"},{"key":"B62","doi-asserted-by":"publisher","first-page":"525","DOI":"10.1007\/978-3-319-46493-0_32","article-title":"Xnor-net: imagenet classification using binary convolutional neural networks","author":"Rastegari","year":"2016","journal-title":"Eur. Conf. Comput. Vis."},{"key":"B63","doi-asserted-by":"publisher","first-page":"740","DOI":"10.1109\/72.248452","article-title":"Pruning algorithms-a survey","volume":"4","author":"Reed","year":"1993","journal-title":"IEEE Trans. Neural Netw."},{"key":"B64","doi-asserted-by":"crossref","first-page":"2754","DOI":"10.1109\/CVPR.2013.355","article-title":"Learning separable filters","author":"Rigamonti","year":"2013","journal-title":"2013 IEEE Conf. Comput. Vis. Pattern Recognit."},{"key":"B65","doi-asserted-by":"publisher","first-page":"481","DOI":"10.1088\/0954-898x_4_4_005","article-title":"Pruning divide and conquer networks","volume":"4","author":"Romaniuk","year":"1993","journal-title":"Netw. Comput. Neural Syst."},{"key":"B66","doi-asserted-by":"crossref","first-page":"6655","DOI":"10.1109\/ICASSP.2013.6638949","article-title":"Low-rank matrix factorization for deep neural network training with high-dimensional output targets","author":"Sainath","year":"2013","journal-title":"2013 IEEE Int. Conf. Acoust. Speech Signal Process."},{"key":"B67","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00474","article-title":"Mobilenetv2: inverted residuals and linear bottlenecks","author":"Sandler","year":"2019","journal-title":"arXiv Prepr. arXiv 1801.04381"},{"key":"B68","article-title":"Very deep convolutional networks for large-scale image recognition","author":"Simonyan","year":"2015"},{"key":"B69","first-page":"31.1","article-title":"Data-free parameter pruning for deep neural networks","author":"Srinivas","year":"2015"},{"key":"B70","doi-asserted-by":"publisher","first-page":"1981","DOI":"10.1080\/01621459.2021.1895175","article-title":"Consistent sparse deep learning: theory and computation","volume":"117","author":"Sun","year":"2022","journal-title":"J. Am. Stat. Assoc."},{"key":"B71","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2307.09288","article-title":"Llama 2: open foundation and fine-tuned chat models","author":"Touvron","year":"2023","journal-title":"arXiv Prepr. arXiv:2307.09288"},{"key":"B72","doi-asserted-by":"publisher","first-page":"17402","DOI":"10.5555\/3600270.3601535","article-title":"Outlier suppression: pushing the limit of low-bit transformer language models","volume":"35","author":"Wei","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"B73","doi-asserted-by":"publisher","first-page":"105","DOI":"10.1016\/b978-1-4832-1448-1.50016-0","article-title":"Back-propagation, weight-elimination and time series prediction","author":"Weigend","year":"","journal-title":"Connect. models"},{"key":"B74","first-page":"2374","article-title":"Generalization by weight-elimination applied to currency exchange rate prediction","author":"Weigend","year":""},{"key":"B75","doi-asserted-by":"publisher","DOI":"10.5555\/3157096.3157329","article-title":"Learning structured sparsity in deep neural networks","volume":"29","author":"Wen","year":"2016","journal-title":"arXiv"},{"key":"B76","doi-asserted-by":"publisher","first-page":"4820","DOI":"10.1109\/cvpr.2016.521","article-title":"Quantized convolutional neural networks for mobile devices","author":"Wu","year":"2016","journal-title":"Proc. IEEE Conf. Comput. Vis. pattern Recognit."},{"key":"B77","doi-asserted-by":"publisher","DOI":"10.5555\/3157096.3157329","article-title":"Zeroquant-fp: a leap forward in llms post-training w4a8 quantization using floating-point formats","author":"Wu","year":"2023","journal-title":"arXiv"},{"key":"B78","first-page":"38087","article-title":"Smoothquant: accurate and efficient post-training quantization for large language models","author":"Xiao","year":"2023","journal-title":"Int. Conf. Mach. Learn."},{"key":"B79","article-title":"Onebit: towards extremely low-bit large language models","author":"Xu","year":"","journal-title":"arXiv Prepr. arXiv:2402.11295"},{"key":"B80","article-title":"QA-loRA: quantization-aware low-rank adaptation of large language models","author":"Xu","year":""},{"key":"B81","doi-asserted-by":"publisher","first-page":"436","DOI":"10.1038\/nature14539","article-title":"Deep learning","volume":"521","author":"Yann","year":"2015","journal-title":"nature"},{"key":"B82","article-title":"Zeroquant-v2: exploring post-training quantization in llms from comprehensive study to low rank compensation","author":"Yao","year":"2023","journal-title":"arXiv"},{"key":"B83","doi-asserted-by":"publisher","first-page":"27168","DOI":"10.5555\/3600270.3602240","article-title":"Zeroquant: efficient and affordable post-training quantization for large-scale transformers","volume":"35","author":"Yao","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"B84","article-title":"Rptq: reorder-based post-training quantization for large language models","author":"Yuan","year":"2023","journal-title":"arXiv"},{"key":"B85","article-title":"Glm-130b: an open bilingual pre-trained model","author":"Zeng","year":"2023"},{"key":"B86","first-page":"1","article-title":"Integer or floating point? new outlooks for low-bit quantization on large language models","author":"Zhang","year":"2024"},{"key":"B87","doi-asserted-by":"publisher","first-page":"201","DOI":"10.1007\/s11263-021-01543-y","article-title":"Towards compact 1-bit cnns via bayesian learning","volume":"130","author":"Zhao","year":"2021","journal-title":"Int. J. Comput. Vis."},{"key":"B88","article-title":"Incremental network quantization: towards lossless cnns with low-precision weights","author":"Zhou","year":"2017"},{"key":"B89","first-page":"662","article-title":"Less is more: towards compact cnns","author":"Zhou","year":""},{"key":"B90","article-title":"Dorefa-net: training low bitwidth convolutional neural networks with low bitwidth gradients","author":"Zhou","year":"","journal-title":"arXiv"}],"container-title":["Frontiers in Robotics and AI"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frobt.2025.1518965\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,20]],"date-time":"2025-03-20T08:20:57Z","timestamp":1742458857000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frobt.2025.1518965\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,20]]},"references-count":90,"alternative-id":["10.3389\/frobt.2025.1518965"],"URL":"https:\/\/doi.org\/10.3389\/frobt.2025.1518965","relation":{},"ISSN":["2296-9144"],"issn-type":[{"value":"2296-9144","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,3,20]]},"article-number":"1518965"}}