{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,16]],"date-time":"2026-06-16T23:36:27Z","timestamp":1781652987221,"version":"3.54.5"},"reference-count":45,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2024,2,8]],"date-time":"2024-02-08T00:00:00Z","timestamp":1707350400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["2022YFF0711900"],"award-info":[{"award-number":["2022YFF0711900"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["No. 61831022, No.62276264, and No. 62306087"],"award-info":[{"award-number":["No. 61831022, No.62276264, and No. 62306087"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Yunnan Provincial Major Science and Technology Special Plan Projects","award":["202202AD080004"],"award-info":[{"award-number":["202202AD080004"]}]},{"DOI":"10.13039\/501100004739","name":"Youth Innovation Promotion Association CAS","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100004739","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100007129","name":"Natural Science Foundation of Shandong Province","doi-asserted-by":"crossref","award":["ZR2023QF154"],"award-info":[{"award-number":["ZR2023QF154"]}],"id":[{"id":"10.13039\/501100007129","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. 
Process."],"published-print":{"date-parts":[[2024,2,29]]},"abstract":"<jats:p>\n            Knowledge distillation is widely used in pre-trained language model compression, which can transfer knowledge from a cumbersome model to a lightweight one. Though knowledge distillation based model compression has achieved promising performance, we observe that explanations between the teacher model and the student model are not consistent. We argue that the student model should study not only the predictions of the teacher model but also the internal reasoning process. To this end, we propose Explanation Guided Knowledge Distillation (EGKD) in this article, which utilizes explanations to represent the thinking process and improve knowledge distillation. To obtain explanations in our distillation framework, we select three typical explanation methods rooted in different mechanisms, namely\n            <jats:italic>gradient-based<\/jats:italic>\n            ,\n            <jats:italic>perturbation-based<\/jats:italic>\n            , and\n            <jats:italic>feature selection<\/jats:italic>\n            methods. Then, to improve computational efficiency, we propose different optimization strategies to utilize the explanations obtained by these three different explanation methods, which could provide the student model with better learning guidance. Experimental results on GLUE demonstrate that leveraging explanations can improve the performance of the student model. 
Moreover, our EGKD could also be applied to model compression with different architectures.\n          <\/jats:p>","DOI":"10.1145\/3639364","type":"journal-article","created":{"date-parts":[[2023,12,29]],"date-time":"2023-12-29T22:04:48Z","timestamp":1703887488000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Explanation Guided Knowledge Distillation for Pre-trained Language Model Compression"],"prefix":"10.1145","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2816-6486","authenticated-orcid":false,"given":"Zhao","family":"Yang","sequence":"first","affiliation":[{"name":"School of Artificial Intelligence, University of Chinese Academy of Sciences, China and The Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9905-9501","authenticated-orcid":false,"given":"Yuanzhe","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence, University of Chinese Academy of Sciences, China and The Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5200-2265","authenticated-orcid":false,"given":"Dianbo","family":"Sui","sequence":"additional","affiliation":[{"name":"School of Computer Science, Harbin Institute of Technology at Weihai, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-0188-7385","authenticated-orcid":false,"given":"Yiming","family":"Ju","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence, University of Chinese Academy of Sciences, China and The Laboratory of Cognition and Decision 
Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3370-2263","authenticated-orcid":false,"given":"Jun","family":"Zhao","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence, University of Chinese Academy of Sciences, China and The Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6083-8433","authenticated-orcid":false,"given":"Kang","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence, University of Chinese Academy of Sciences, China and The Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2024,2,8]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6229"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.263"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-56979-0_3"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1284"},{"key":"e_1_3_2_6_2","volume-title":"TAC","author":"Bentivogli Luisa","year":"2009","unstructured":"Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC."},{"key":"e_1_3_2_7_2","article-title":"On the opportunities and risks of foundation models","author":"Bommasani Rishi","year":"2021","unstructured":"Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. 
Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et\u00a0al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021). https:\/\/arxiv.org\/abs\/2108.07258","journal-title":"arXiv preprint arXiv:2108.07258"},{"key":"e_1_3_2_8_2","article-title":"Adversarial training for improving model robustness? Look at both prediction and interpretation","author":"Chen Hanjie","year":"2022","unstructured":"Hanjie Chen and Yangfeng Ji. 2022. Adversarial training for improving model robustness? Look at both prediction and interpretation. arXiv preprint arXiv:2203.12709 (2022). https:\/\/arxiv.org\/abs\/2203.12709","journal-title":"arXiv preprint arXiv:2203.12709"},{"key":"e_1_3_2_9_2","unstructured":"Zihan Chen, Hongbo Zhang, Xiaoji Zhang, and Leqi Zhao. 2018. Quora question pairs. URL https:\/\/www.kaggle.com\/c\/quora-question-pairs (2018)."},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1423"},{"key":"e_1_3_2_11_2","first-page":"2024","volume-title":"Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19\u201324, 2016 (JMLR Workshop and Conference Proceedings)","volume":"48","author":"Diamos Greg","year":"2016","unstructured":"Greg Diamos, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse H. Engel, Awni Y. Hannun, and Sanjeev Satheesh. 2016. Persistent RNNs: Stashing recurrent weights on-chip. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19\u201324, 2016 (JMLR Workshop and Conference Proceedings), Maria-Florina Balcan and Kilian Q. Weinberger (Eds.), Vol. 48. JMLR.org, 2024\u20132033. http:\/\/proceedings.mlr.press\/v48\/diamos16.html"},{"key":"e_1_3_2_12_2","volume-title":"Proceedings of the Third International Workshop on Paraphrasing (IWP\u201905)","author":"Dolan William B.","year":"2005","unstructured":"William B. 
Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP\u201905). https:\/\/aclanthology.org\/I05-5002"},{"key":"e_1_3_2_13_2","doi-asserted-by":"crossref","first-page":"968","DOI":"10.18653\/v1\/2021.findings-acl.84","volume-title":"Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021","author":"Feng Steven Y.","year":"2021","unstructured":"Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. 2021. A survey of data augmentation approaches for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 968\u2013988."},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3487045"},{"key":"e_1_3_2_15_2","article-title":"Distilling the knowledge in a neural network","author":"Hinton Geoffrey","year":"2015","unstructured":"Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015). https:\/\/arxiv.org\/abs\/1503.02531","journal-title":"arXiv preprint arXiv:1503.02531"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-long.417"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.372"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1011"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.242"},{"key":"e_1_3_2_20_2","article-title":"Understanding neural networks through representation erasure","author":"Li Jiwei","year":"2016","unstructured":"Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220 (2016). 
https:\/\/arxiv.org\/abs\/1612.08220","journal-title":"arXiv preprint arXiv:1612.08220"},{"key":"e_1_3_2_21_2","first-page":"4765","volume-title":"Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4\u20139, 2017, Long Beach, CA, USA","author":"Lundberg Scott M.","year":"2017","unstructured":"Scott M. Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4\u20139, 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.). 4765\u20134774. https:\/\/proceedings.neurips.cc\/paper\/2017\/hash\/8a20a8621978632d76c43dfd28b67767-Abstract.html"},{"key":"e_1_3_2_22_2","first-page":"14014","volume-title":"Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8\u201314, 2019, Vancouver, BC, Canada","author":"Michel Paul","year":"2019","unstructured":"Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8\u201314, 2019, Vancouver, BC, Canada, Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d\u2019Alch\u00e9-Buc, Emily B. Fox, and Roman Garnett (Eds.). 14014\u201314024. 
https:\/\/proceedings.neurips.cc\/paper\/2019\/hash\/2c601ad9d2ff9bc8b282670cdd54f69f-Abstract.html"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"e_1_3_2_24_2","article-title":"Evaluating explanations: How much do explanations from the teacher aid students?","author":"Pruthi Danish","year":"2020","unstructured":"Danish Pruthi, Bhuwan Dhingra, Livio Baldini Soares, Michael Collins, Zachary C. Lipton, Graham Neubig, and William W. Cohen. 2020. Evaluating explanations: How much do explanations from the teacher aid students? arXiv preprint arXiv:2012.00893 (2020). https:\/\/arxiv.org\/abs\/2012.00893","journal-title":"arXiv preprint arXiv:2012.00893"},{"issue":"8","key":"e_1_3_2_25_2","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et\u00a0al. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.","journal-title":"OpenAI Blog"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1264"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939778"},{"key":"e_1_3_2_28_2","article-title":"DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter","author":"Sanh Victor","year":"2019","unstructured":"Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019). https:\/\/arxiv.org\/abs\/1910.01108","journal-title":"arXiv preprint arXiv:1910.01108"},{"key":"e_1_3_2_29_2","volume-title":"Proc. of AAAI","author":"Shen Sheng","year":"2020","unstructured":"Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2020. Q-BERT: Hessian based ultra low precision quantization of BERT. In Proc. 
of AAAI."},{"key":"e_1_3_2_30_2","article-title":"Deep inside convolutional networks: Visualising image classification models and saliency maps","author":"Simonyan Karen","year":"2013","unstructured":"Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013). https:\/\/arxiv.org\/abs\/1312.6034","journal-title":"arXiv preprint arXiv:1312.6034"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-long.415"},{"key":"e_1_3_2_32_2","article-title":"SmoothGrad: Removing noise by adding noise","author":"Smilkov Daniel","year":"2017","unstructured":"Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Vi\u00e9gas, and Martin Wattenberg. 2017. SmoothGrad: Removing noise by adding noise. arXiv preprint arXiv:1706.03825 (2017). https:\/\/arxiv.org\/abs\/1706.03825","journal-title":"arXiv preprint arXiv:1706.03825"},{"key":"e_1_3_2_33_2","first-page":"1631","volume-title":"Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing","author":"Socher Richard","year":"2013","unstructured":"Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, USA, 1631\u20131642. 
https:\/\/aclanthology.org\/D13-1170"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1441"},{"key":"e_1_3_2_35_2","first-page":"3319","volume-title":"Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6\u201311 August 2017 (Proceedings of Machine Learning Research)","volume":"70","author":"Sundararajan Mukund","year":"2017","unstructured":"Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6\u201311 August 2017 (Proceedings of Machine Learning Research), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. PMLR, 3319\u20133328. http:\/\/proceedings.mlr.press\/v70\/sundararajan17a.html"},{"key":"e_1_3_2_36_2","article-title":"Distilling task-specific knowledge from BERT into simple neural networks","author":"Tang Raphael","year":"2019","unstructured":"Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. 2019. Distilling task-specific knowledge from BERT into simple neural networks. arXiv preprint arXiv:1903.12136 (2019). https:\/\/arxiv.org\/abs\/1903.12136","journal-title":"arXiv preprint arXiv:1903.12136"},{"key":"e_1_3_2_37_2","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_38_2","volume-title":"7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6\u20139, 2019","author":"Wang Alex","year":"2019","unstructured":"Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 
2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6\u20139, 2019. OpenReview.net. https:\/\/openreview.net\/forum?id=rJ4km2R5t7"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.496"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1670"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.5951\/TCM.8.9.0524"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1101"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00046"},{"key":"e_1_3_2_44_2","article-title":"Beyond preserved accuracy: Evaluating loyalty and robustness of BERT compression","author":"Xu Canwen","year":"2021","unstructured":"Canwen Xu, Wangchunshu Zhou, Tao Ge, Ke Xu, Julian McAuley, and Furu Wei. 2021. Beyond preserved accuracy: Evaluating loyalty and robustness of BERT compression. arXiv preprint arXiv:2109.03228 (2021). https:\/\/arxiv.org\/abs\/2109.03228","journal-title":"arXiv preprint arXiv:2109.03228"},{"key":"e_1_3_2_45_2","article-title":"Can explanations be useful for calibrating black box models?","author":"Ye Xi","year":"2021","unstructured":"Xi Ye and Greg Durrett. 2021. Can explanations be useful for calibrating black box models? arXiv preprint arXiv:2110.07586 (2021). https:\/\/arxiv.org\/abs\/2110.07586","journal-title":"arXiv preprint arXiv:2110.07586"},{"key":"e_1_3_2_46_2","article-title":"Q8BERT: Quantized 8Bit BERT","author":"Zafrir Ofir","year":"2019","unstructured":"Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. 2019. Q8BERT: Quantized 8Bit BERT. arXiv preprint arXiv:1910.06188 (2019). 
https:\/\/arxiv.org\/abs\/1910.06188","journal-title":"arXiv preprint arXiv:1910.06188"}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3639364","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3639364","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T22:54:10Z","timestamp":1750287250000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3639364"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,8]]},"references-count":45,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,2,29]]}},"alternative-id":["10.1145\/3639364"],"URL":"https:\/\/doi.org\/10.1145\/3639364","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"value":"2375-4699","type":"print"},{"value":"2375-4702","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,2,8]]},"assertion":[{"value":"2023-04-11","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-12-14","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-02-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}