{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T01:12:10Z","timestamp":1760058730598,"version":"build-2065373602"},"reference-count":62,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2025,4,21]],"date-time":"2025-04-21T00:00:00Z","timestamp":1745193600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100003661","name":"Ministry of Trade, Industry and Energy (MOTIE) and Korea Institute for Advancement of Technology (KIAT) through the International Cooperative R&amp;D program","doi-asserted-by":"publisher","award":["P0025661","RS-2022-00155885"],"award-info":[{"award-number":["P0025661","RS-2022-00155885"]}],"id":[{"id":"10.13039\/501100003661","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100010418","name":"Institute of Information &amp; communications Technology Planning &amp; Evaluation (IITP) grant funded by the Korea government (MSIT)","doi-asserted-by":"publisher","award":["P0025661","RS-2022-00155885"],"award-info":[{"award-number":["P0025661","RS-2022-00155885"]}],"id":[{"id":"10.13039\/501100010418","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Future Internet"],"abstract":"<jats:p>Modern large language models (LLMs) achieve state-of-the-art performance through architectural advancements but require high computational costs for inference. Post-training quantization is a widely adopted approach to reduce these costs by quantizing weights and activations to lower precision, such as INT8. However, we identify a critical challenge in activation quantization for GLU (Gated Linear Unit) variants, which are commonly used in the feed-forward networks of modern LLMs like the LLaMA family. Specifically, severe local quantization errors arise due to excessively large activation magnitudes, which we refer to as activation spikes, leading to significant degradation in model performance. Our analysis reveals a systematic pattern of these spikes: they predominantly occur in the FFN (feed-forward network) layers at the early and late layers of the model and are concentrated on a small subset of tokens rather than being uniformly distributed across a token sequence. To mitigate this issue, we propose two empirical methods: Quantization-free Module (QFeM) and Quantization-free Prefix (QFeP), which isolate activation spikes during quantization. Extensive experiments demonstrated that our methods effectively improve activation quantization, particularly in coarse-grained quantization schemes, enhancing the performance of LLMs with GLU variants and addressing the limitations of existing quantization techniques. 
The code for implementing our methods and reproducing the experiments is publicly available in our GitHub repository.<\/jats:p>","DOI":"10.3390\/fi17040185","type":"journal-article","created":{"date-parts":[[2025,4,21]],"date-time":"2025-04-21T20:38:26Z","timestamp":1745267906000},"page":"185","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Mitigating Quantization Errors Due to Activation Spikes in Gated Linear Unit-Based Large Language Models"],"prefix":"10.3390","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7687-026X","authenticated-orcid":false,"given":"Jaewoo","family":"Yang","sequence":"first","affiliation":[{"name":"Department of Applied Artificial Intelligence, Hanyang University at Ansan, Ansan 15588, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-0779-9778","authenticated-orcid":false,"given":"Hayun","family":"Kim","sequence":"additional","affiliation":[{"name":"Department of Applied Artificial Intelligence, Hanyang University at Ansan, Ansan 15588, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4587-4337","authenticated-orcid":false,"given":"Junyung","family":"Ji","sequence":"additional","affiliation":[{"name":"Department of Applied Artificial Intelligence, Hanyang University at Ansan, Ansan 15588, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3049-035X","authenticated-orcid":false,"given":"Younghoon","family":"Kim","sequence":"additional","affiliation":[{"name":"Department of Applied Artificial Intelligence, Hanyang University at Ansan, Ansan 15588, Republic of Korea"}]}],"member":"1968","published-online":{"date-parts":[[2025,4,21]]},"reference":[{"unstructured":"Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., and Dong, Z. (2023). A survey of large language models. arXiv.","key":"ref_1"},{"unstructured":"Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., and Metzler, D. (2022). Emergent abilities of large language models. arXiv.","key":"ref_2"},{"unstructured":"Shazeer, N. (2020). Glu variants improve transformer. arXiv.","key":"ref_3"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"127063","DOI":"10.1016\/j.neucom.2023.127063","article-title":"Roformer: Enhanced transformer with rotary position embedding","volume":"568","author":"Su","year":"2024","journal-title":"Neurocomputing"},{"doi-asserted-by":"crossref","unstructured":"Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebr\u00f3n, F., and Sanghai, S. (2023). Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv.","key":"ref_5","DOI":"10.18653\/v1\/2023.emnlp-main.298"},{"unstructured":"Jiang, A.Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D.S., Casas, D.d.l., Hanna, E.B., and Bressand, F. (2024). Mixtral of experts. arXiv.","key":"ref_6"},{"unstructured":"Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozi\u00e8re, B., Goyal, N., Hambro, E., and Azhar, F. (2023). Llama: Open and efficient foundation language models. arXiv.","key":"ref_7"},{"doi-asserted-by":"crossref","unstructured":"Narang, S., Chung, H.W., Tay, Y., Fedus, L., Fevry, T., Matena, M., Malkan, K., Fiedel, N., Shazeer, N., and Lan, Z. (2021, January 7\u201311). Do Transformer Modifications Transfer Across Implementations and Applications?. 
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic.","key":"ref_8","DOI":"10.18653\/v1\/2021.emnlp-main.465"},{"doi-asserted-by":"crossref","unstructured":"Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., and Kalenichenko, D. (2018, January 18\u201323). Quantization and training of neural networks for efficient integer-arithmetic-only inference. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","key":"ref_9","DOI":"10.1109\/CVPR.2018.00286"},{"doi-asserted-by":"crossref","unstructured":"Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M.W., and Keutzer, K. (2022). A survey of quantization methods for efficient neural network inference. Low-Power Computer Vision, Chapman and Hall\/CRC.","key":"ref_10","DOI":"10.1201\/9781003162810-13"},{"unstructured":"Nagel, M., Fournarakis, M., Amjad, R.A., Bondarenko, Y., Van Baalen, M., and Blankevoort, T. (2021). A white paper on neural network quantization. arXiv.","key":"ref_11"},{"key":"ref_12","first-page":"30318","article-title":"Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale","volume":"35","author":"Dettmers","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"unstructured":"Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. (2023, January 23\u201329). SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA.","key":"ref_13"},{"key":"ref_14","first-page":"34278","article-title":"Intriguing properties of quantization at scale","volume":"36","author":"Ahmadian","year":"2023","journal-title":"Adv. Neural Inf. Process. Syst."},{"unstructured":"Bouamor, H., Pino, J., and Bali, K. (2023, January 6\u201310). Outlier Suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore.","key":"ref_15"},{"key":"ref_16","first-page":"75067","article-title":"Quantizable transformers: Removing outliers by helping attention heads do nothing","volume":"36","author":"Bondarenko","year":"2023","journal-title":"Adv. Neural Inf. Process. Syst."},{"unstructured":"Sun, M., Chen, X., Kolter, J.Z., and Liu, Z. (2024). Massive Activations in Large Language Models. arXiv.","key":"ref_17"},{"key":"ref_18","first-page":"196","article-title":"Atom: Low-bit quantization for efficient and accurate llm serving","volume":"6","author":"Zhao","year":"2024","journal-title":"Proc. Mach. Learn. Syst."},{"unstructured":"Shao, W., Chen, M., Zhang, Z., Xu, P., Zhao, L., Li, Z., Zhang, K., Gao, P., Qiao, Y., and Luo, P. (2023). Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv.","key":"ref_19"},{"key":"ref_20","first-page":"27168","article-title":"Zeroquant: Efficient and affordable post-training quantization for large-scale transformers","volume":"35","author":"Yao","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. 
arXiv.","key":"ref_21"},{"key":"ref_22","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford","year":"2019","journal-title":"OpenAI Blog"},{"doi-asserted-by":"crossref","unstructured":"Kovaleva, O., Kulshreshtha, S., Rogers, A., and Rumshisky, A. (2021, January 1\u20136). BERT Busters: Outlier Dimensions that Disrupt Transformers. Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online Event.","key":"ref_23","DOI":"10.18653\/v1\/2021.findings-acl.300"},{"doi-asserted-by":"crossref","unstructured":"Bondarenko, Y., Nagel, M., and Blankevoort, T. (2021, January 7\u201311). Understanding and Overcoming the Challenges of Efficient Transformer Quantization. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online, Punta Cana, Dominican Republic.","key":"ref_24","DOI":"10.18653\/v1\/2021.emnlp-main.627"},{"unstructured":"Zong, C., Xia, F., Li, W., and Navigli, R. (2021, January 1\u20136). Positional Artefacts Propagate Through Masked Language Model Embeddings. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual Event.","key":"ref_25"},{"unstructured":"Moens, M.F., Huang, X., Specia, L., and Yih, S.W.T. (2021, January 7\u201311). All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Virtual Event, Punta Cana, Dominican Republic.","key":"ref_26"},{"unstructured":"Goldberg, Y., Kozareva, Z., and Zhang, Y. (2022, January 7\u201311). Outlier Dimensions that Disrupt Transformers are Driven by Frequency. Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates.","key":"ref_27"},{"key":"ref_28","first-page":"1877","article-title":"Language models are few-shot learners","volume":"33","author":"Brown","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"unstructured":"Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., and Lin, X.V. (2022). Opt: Open pre-trained transformer language models. arXiv.","key":"ref_29"},{"unstructured":"Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. (2023). Efficient streaming language models with attention sinks. arXiv.","key":"ref_30"},{"unstructured":"Inui, K., Jiang, J., Ng, V., and Wan, X. (2019, January 3\u20137). Revealing the Dark Secrets of BERT. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China.","key":"ref_31"},{"unstructured":"Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. (2022). Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv.","key":"ref_32"},{"unstructured":"Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M.W., and Keutzer, K. (2023). Squeezellm: Dense-and-sparse quantization. arXiv.","key":"ref_33"},{"unstructured":"Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., and Han, S. (2023). Awq: Activation-aware weight quantization for llm compression and acceleration. 
arXiv.","key":"ref_34"},{"key":"ref_35","first-page":"4396","article-title":"Quip: 2-bit quantization of large language models with guarantees","volume":"196","author":"Chee","year":"2023","journal-title":"Adv. Neural Inf. Process. Syst."},{"unstructured":"Yao, Z., Li, C., Wu, X., Youn, S., and He, Y. (2023). A comprehensive study on post-training quantization for large language models. arXiv.","key":"ref_36"},{"unstructured":"Dettmers, T., Svirschevski, R., Egiazarian, V., Kuznedelev, D., Frantar, E., Ashkboos, S., Borzunov, A., Hoefler, T., and Alistarh, D. (2023). Spqr: A sparse-quantized representation for near-lossless llm weight compression. arXiv.","key":"ref_37"},{"doi-asserted-by":"crossref","unstructured":"Liu, R., Bai, H., Lin, H., Li, Y., Gao, H., Xu, Z., Hou, L., Yao, J., and Yuan, C. (2024). IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact. arXiv.","key":"ref_38","DOI":"10.18653\/v1\/2024.findings-acl.460"},{"doi-asserted-by":"crossref","unstructured":"Son, S., Park, W., Han, W., Kim, K., and Lee, J. (2024). Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization. arXiv.","key":"ref_39","DOI":"10.18653\/v1\/2024.emnlp-main.134"},{"unstructured":"Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T. (2020, January 13\u201318). On layer normalization in the transformer architecture. Proceedings of the International Conference on Machine Learning. PMLR, Virtual Event.","key":"ref_40"},{"unstructured":"Baevski, A., and Auli, M. (May, January 30). Adaptive Input Representations for Neural Language Modeling. Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada.","key":"ref_41"},{"key":"ref_42","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"J. Mach. Learn. Res."},{"unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 11\u201314). Identity mappings in deep residual networks. Proceedings of the Computer Vision\u2013ECCV 2016: 14th European Conference, Amsterdam, The Netherlands. Proceedings, Part IV 14.","key":"ref_43"},{"key":"ref_44","first-page":"6000","article-title":"Attention is all you need","volume":"11","author":"Vaswani","year":"2017","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_45","first-page":"606","article-title":"Efficiently scaling transformer inference","volume":"5","author":"Pope","year":"2023","journal-title":"Proc. Mach. Learn. Syst."},{"doi-asserted-by":"crossref","unstructured":"Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. (2019). fairseq: A fast, extensible toolkit for sequence modeling. arXiv.","key":"ref_46","DOI":"10.18653\/v1\/N19-4009"},{"unstructured":"Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., and Bhosale, S. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv.","key":"ref_47"},{"unstructured":"Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., and Fan, A. (2024). The llama 3 herd of models. arXiv.","key":"ref_48"},{"unstructured":"Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., Casas, D.d.l., Bressand, F., Lengyel, G., Lample, G., and Saulnier, L. (2023). Mistral 7B. 
arXiv.","key":"ref_49"},{"doi-asserted-by":"crossref","unstructured":"Kim, D., Park, C., Kim, S., Lee, W., Song, W., Kim, Y., Kim, H., Kim, Y., Lee, H., and Kim, J. (2023). Solar 10.7 b: Scaling large language models with simple yet effective depth up-scaling. arXiv.","key":"ref_50","DOI":"10.18653\/v1\/2024.naacl-industry.3"},{"unstructured":"Team, G., Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivi\u00e8re, M., Kale, M.S., and Love, J. (2024). Gemma: Open Models Based on Gemini Research and Technology. arXiv.","key":"ref_51"},{"doi-asserted-by":"crossref","unstructured":"Bisk, Y., Zellers, R., Gao, J., and Choi, Y. (2020, January 7\u201312). Piqa: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA.","key":"ref_52","DOI":"10.1609\/aaai.v34i05.6239"},{"doi-asserted-by":"crossref","unstructured":"Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N.Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fern\u00e1ndez, R. (2016, January 7\u201312). The LAMBADA dataset: Word prediction requiring a broad discourse context. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany.","key":"ref_53","DOI":"10.18653\/v1\/P16-1144"},{"unstructured":"Korhonen, A., Traum, D., and M\u00e0rquez, L. (August, January 28). HellaSwag: Can a Machine Really Finish Your Sentence?. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy.","key":"ref_54"},{"doi-asserted-by":"crossref","unstructured":"Sakaguchi, K., Bras, R.L., Bhagavatula, C., and Choi, Y. (2019). WinoGrande: An Adversarial Winograd Schema Challenge at Scale. arXiv.","key":"ref_55","DOI":"10.1609\/aaai.v34i05.6399"},{"unstructured":"Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., and Le Noac\u2019h, A. (2021, September 02). A Framework for Few-Shot Language Model Evaluation. Available online: https:\/\/github.com\/EleutherAI\/lm-evaluation-harness.","key":"ref_56"},{"unstructured":"Merity, S., Xiong, C., Bradbury, J., and Socher, R. (2016). Pointer sentinel mixture models. arXiv.","key":"ref_57"},{"unstructured":"Bellagente, M., Tow, J., Mahan, D., Phung, D., Zhuravinskyi, M., Adithyan, R., Baicoianu, J., Brooks, B., Cooper, N., and Datta, A. (2024). Stable LM 2 1.6 B Technical Report. arXiv.","key":"ref_58"},{"unstructured":"Team, M.N. (2023, May 05). Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs. Available online: https:\/\/www.databricks.com\/blog\/mpt-7b.","key":"ref_59"},{"unstructured":"Biderman, S., Schoelkopf, H., Anthony, Q.G., Bradley, H., O\u2019Brien, K., Hallahan, E., Khan, M.A., Purohit, S., Prashanth, U.S., and Raff, E. (2023, January 23\u201329). Pythia: A suite for analyzing large language models across training and scaling. Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA.","key":"ref_60"},{"unstructured":"Almazrouei, E., Alobeidli, H., Alshamsi, A., Cappelli, A., Cojocaru, R., Debbah, M., Goffinet, \u00c9., Hesslow, D., Launay, J., and Malartic, Q. (2023). The falcon series of open language models. arXiv.","key":"ref_61"},{"unstructured":"Mojan, J., and S\u00e9bastien, B. (2023, December 12). Phi-2: The Surprising Power of Small Language Models. 
Available online: https:\/\/www.microsoft.com\/en-us\/research\/blog\/phi-2-the-surprising-power-of-small-language-models\/.","key":"ref_62"}],"container-title":["Future Internet"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-5903\/17\/4\/185\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T17:18:57Z","timestamp":1760030337000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-5903\/17\/4\/185"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,4,21]]},"references-count":62,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2025,4]]}},"alternative-id":["fi17040185"],"URL":"https:\/\/doi.org\/10.3390\/fi17040185","relation":{},"ISSN":["1999-5903"],"issn-type":[{"type":"electronic","value":"1999-5903"}],"subject":[],"published":{"date-parts":[[2025,4,21]]}}}
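As a companion to the abstract in the record above, the following is a minimal, self-contained NumPy sketch, not taken from the paper or its repository, of why a single activation spike inflates coarse-grained (per-tensor) INT8 quantization error and why isolating spiked values in full precision, the intuition behind the proposed QFeM/QFeP methods, recovers accuracy. The tensor size, spike magnitude, and spike threshold below are synthetic assumptions chosen only for illustration.

```python
# Illustrative sketch (assumptions: synthetic Gaussian activations, a single
# hypothetical spike of magnitude 500, and a hypothetical spike threshold of 10).
import numpy as np

def quantize_int8_per_tensor(x: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor INT8 quantization followed by dequantization."""
    scale = np.abs(x).max() / 127.0          # one shared scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=4096)        # typical FFN activations
acts_spiked = acts.copy()
acts_spiked[0] = 500.0                        # one activation spike (synthetic)

for name, x in [("no spike", acts), ("with spike", acts_spiked)]:
    err = np.mean((x - quantize_int8_per_tensor(x)) ** 2)
    print(f"{name:10s}  max|x|={np.abs(x).max():7.1f}  MSE={err:.6f}")

# "Quantization-free" intuition: keep spiked entries in full precision and
# quantize only the rest, so the shared scale is no longer dominated by the spike.
mask = np.abs(acts_spiked) > 10.0             # hypothetical spike threshold
mixed = acts_spiked.copy()
mixed[~mask] = quantize_int8_per_tensor(acts_spiked[~mask])
print(f"isolated    MSE={np.mean((acts_spiked - mixed) ** 2):.6f}")
```

With these synthetic numbers, the per-tensor MSE jumps by several orders of magnitude once the spike dominates the shared quantization scale and returns to the spike-free level when the spiked entries are excluded from quantization; the paper's actual methods and experiments are in the authors' GitHub repository referenced in the abstract.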