{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,4]],"date-time":"2026-02-04T21:00:54Z","timestamp":1770238854447,"version":"3.49.0"},"reference-count":46,"publisher":"Association for Computing Machinery (ACM)","issue":"FSE","license":[{"start":{"date-parts":[[2024,7,12]],"date-time":"2024-07-12T00:00:00Z","timestamp":1720742400000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["CNS-2120386"],"award-info":[{"award-number":["CNS-2120386"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100009226","name":"National Security Agency","doi-asserted-by":"publisher","award":["NCAE-C-002-2021"],"award-info":[{"award-number":["NCAE-C-002-2021"]}],"id":[{"id":"10.13039\/100009226","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Softw. Eng."],"published-print":{"date-parts":[[2024,7,12]]},"abstract":"<jats:p>\n                    Pre-trained Large Language Models (LLM) have achieved remarkable successes in several domains. However, code-oriented LLMs are often heavy in computational complexity, and quadratically with the length of the input code sequence. Toward simplifying the input program of an LLM, the state-of-the-art approach has the strategies to filter the input code tokens based on the attention scores given by the LLM. The decision to simplify the input program should not rely on the attention patterns of an LLM, as these patterns are influenced by both the model architecture and the pre-training dataset. Since the model and dataset are part of the solution domain, not the problem domain where the input program belongs, the outcome may differ when the model is pre-trained on a different dataset. We propose S\n                    <jats:sc>lim<\/jats:sc>\n                    C\n                    <jats:sc>ode<\/jats:sc>\n                    , a model-agnostic code simplification solution for LLMs that depends on the nature of input code tokens. As an empirical study on the LLMs including CodeBERT, CodeT5, and GPT-4 for two main tasks: code search and summarization, we reported that 1) the removal ratio of code has a linear-like relation with the saving ratio on training time, 2) the impact of categorized tokens on code simplification can vary significantly, 3) the impact of categorized tokens on code simplification is task-specific but model-agnostic, and 4) the above findings hold for the paradigm\u2013prompt engineering and interactive in-context learning. 
The empirical results showed that S<jats:sc>lim<\/jats:sc>C<jats:sc>ode<\/jats:sc> can improve the state-of-the-art technique by 9.46% and 5.15% in terms of MRR and BLEU score on code search and code summarization, respectively. More importantly, S<jats:sc>lim<\/jats:sc>C<jats:sc>ode<\/jats:sc> is 133 times faster than the state-of-the-art approach. Additionally, S<jats:sc>lim<\/jats:sc>C<jats:sc>ode<\/jats:sc> can reduce the cost of invoking GPT-4 by up to 24% per API query, while still producing results comparable to those obtained with the original code. 
With this result, we call for a new direction on code-based, model-agnostic code simplification solutions to further empower LLMs.\n                  <\/jats:p>","DOI":"10.1145\/3643753","type":"journal-article","created":{"date-parts":[[2024,7,12]],"date-time":"2024-07-12T10:22:09Z","timestamp":1720779729000},"page":"586-608","source":"Crossref","is-referenced-by-count":7,"title":["Natural Is the Best: Model-Agnostic Code Simplification for Pre-trained Large Language Models"],"prefix":"10.1145","volume":"1","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9876-5823","authenticated-orcid":false,"given":"Yan","family":"Wang","sequence":"first","affiliation":[{"name":"Central University of Finance and Economics, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-6845-2721","authenticated-orcid":false,"given":"Xiaoning","family":"Li","sequence":"additional","affiliation":[{"name":"Central University of Finance and Economics, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-7962-6090","authenticated-orcid":false,"given":"Tien N.","family":"Nguyen","sequence":"additional","affiliation":[{"name":"University of Texas at Dallas, Dallas, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5777-7759","authenticated-orcid":false,"given":"Shaohua","family":"Wang","sequence":"additional","affiliation":[{"name":"Central University of Finance and Economics, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2906-0598","authenticated-orcid":false,"given":"Chao","family":"Ni","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-8189-2040","authenticated-orcid":false,"given":"Ling","family":"Ding","sequence":"additional","affiliation":[{"name":"Central University of Finance and Economics, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2024,7,12]]},"reference":[{"key":"e_1_3_1_2_2","article-title":"Unified pre-training for program understanding and generation","author":"Ahmad Wasi Uddin","year":"2021","unstructured":"Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified pre-training for program understanding and generation. arXiv preprint arXiv:2103.06333 (2021).","journal-title":"arXiv preprint arXiv:2103.06333"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/3510003.3510049"},{"key":"e_1_3_1_4_2","volume-title":"6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings","author":"Allamanis Miltiadis","year":"2018","unstructured":"Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2018. Learning to Represent Programs with Graphs. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net. https:\/\/openreview.net\/forum?id=BJOFETxR-"},{"key":"e_1_3_1_5_2","unstructured":"ChatGPT [n. d.]. OpenAI. https:\/\/openai.com\/."},{"key":"e_1_3_1_6_2","article-title":"PyMT5: multi-mode translation of natural language and Python code with transformers","author":"Clement Colin B","year":"2020","unstructured":"Colin B Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, and Neel Sundaresan. 2020. PyMT5: multi-mode translation of natural language and Python code with transformers. 
arXiv preprint arXiv:2010.03150 (2020).","journal-title":"arXiv preprint arXiv:2010.03150"},{"key":"e_1_3_1_7_2","article-title":"Bert: Pre-training of deep bidirectional transformers for language understanding","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).","journal-title":"arXiv preprint arXiv:1810.04805"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.139"},{"key":"e_1_3_1_9_2","article-title":"Unixcoder: Unified cross-modal pre-training for code representation","author":"Guo Daya","year":"2022","unstructured":"Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. Unixcoder: Unified cross-modal pre-training for code representation. arXiv preprint arXiv:2203.03850 (2022).","journal-title":"arXiv preprint arXiv:2203.03850"},{"key":"e_1_3_1_10_2","volume-title":"9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021","author":"Guo Daya","year":"2021","unstructured":"Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. GraphCodeBERT: Pre-training Code Representations with Data Flow. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net. https:\/\/openreview.net\/forum?id=jLoC4ez43PZ"},{"key":"e_1_3_1_11_2","article-title":"Codesearchnet challenge: Evaluating the state of semantic code search","author":"Husain Hamel","year":"2019","unstructured":"Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Codesearchnet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436 (2019).","journal-title":"arXiv preprint arXiv:1909.09436"},{"key":"e_1_3_1_12_2","unstructured":"JAST2DyPDG [n. d.]. JAST2DyPDG. https:\/\/github.com\/hpnog\/javaDependenceGraph."},{"key":"e_1_3_1_13_2","first-page":"54","volume-title":"Uncertainty in Artificial Intelligence","author":"Jiang Xue","year":"2021","unstructured":"Xue Jiang, Zhuoran Zheng, Chen Lyu, Liang Li, and Lei Lyu. 2021. Treebert: A tree-based pre-trained model for programming language. In Uncertainty in Artificial Intelligence. PMLR, 54\u201363."},{"key":"e_1_3_1_14_2","article-title":"Pre-trained contextual embedding of source code","author":"Kanade Aditya","year":"2019","unstructured":"Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2019. Pre-trained contextual embedding of source code. arXiv preprint arXiv:2001.00059 (2019).","journal-title":"arXiv preprint arXiv:2001.00059"},{"key":"e_1_3_1_15_2","article-title":"Scelmo: Source code embeddings from language models","author":"Karampatsis Rafael-Michael","year":"2020","unstructured":"Rafael-Michael Karampatsis and Charles Sutton. 2020. Scelmo: Source code embeddings from language models. arXiv preprint arXiv:2004.13214 (2020).","journal-title":"arXiv preprint arXiv:2004.13214"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASE51524.2021.9678927"},{"key":"e_1_3_1_17_2","unstructured":"Latitude [n. d.]. Latitude. 
https:\/\/www.cnbc.com\/2023\/03\/13\/chatgpt-and-generative-ai-are-booming-but-at-a-very-expensive-price.html."},{"key":"e_1_3_1_18_2","article-title":"Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension","author":"Lewis Mike","year":"2019","unstructured":"Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).","journal-title":"arXiv preprint arXiv:1910.13461"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3377811.3380345"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE43902.2021.00060"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE43902.2021.00067"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3468264.3468597"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3510003.3510177"},{"key":"e_1_3_1_24_2","article-title":"CodeReviewer: Pre-training for automating code review activities","author":"Li Zhiyu","year":"2022","unstructured":"Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, et al. 2022. CodeReviewer: Pre-training for automating code review activities. arXiv e-prints (2022), arXiv\u20132203.","journal-title":"arXiv e-prints"},{"key":"e_1_3_1_25_2","article-title":"Roberta: A robustly optimized bert pretraining approach","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).","journal-title":"arXiv preprint arXiv:1907.11692"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","unstructured":"Antonio Mastropaolo Simone Scalabrino Nathan Cooper David Nader Palacio Denys Poshyvanyk Rocco Oliveto and Gabriele Bavota. 2021. Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks. In 2021 IEEE\/ACM 43rd International Conference on Software Engineering (ICSE). 336\u2013347. https:\/\/doi.org\/10.1109\/ICSE43902.2021.00041 10.1109\/ICSE43902.2021.00041","DOI":"10.1109\/ICSE43902.2021.00041"},{"key":"e_1_3_1_27_2","unstructured":"Maven [n. d.]. AST Maven. https:\/\/mvnrepository.com\/artifact."},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3377811.3380926"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3540250.3549165"},{"key":"e_1_3_1_30_2","article-title":"Defect Identification, Categorization, and Repair: Better Together","author":"Ni Chao","year":"2022","unstructured":"Chao Ni, Kaiwen Yang, Xin Xia, David Lo, Xiang Chen, and Xiaohu Yang. 2022. Defect Identification, Categorization, and Repair: Better Together. arXiv preprint arXiv:2204.04856 (2022).","journal-title":"arXiv preprint arXiv:2204.04856"},{"key":"e_1_3_1_31_2","article-title":"CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis","author":"Nijkamp Erik","year":"2022","unstructured":"Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. 
arXiv preprint (2022).","journal-title":"arXiv preprint"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2022\/775"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2303.08774"},{"key":"e_1_3_1_34_2","unstructured":"OPENAI-Pricing [n. d.]. OPENAI-Pricing. https:\/\/openai.com\/pricing."},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","unstructured":"Matteo Paltenghi and Michael Pradel. 2021. Thinking Like a Developer? Comparing the Attention of Humans with Neural Models of Code. In 2021 36th IEEE\/ACM International Conference on Automated Software Engineering (ASE). 867\u2013879. https:\/\/doi.org\/10.1109\/ASE51524.2021.9678712 10.1109\/ASE51524.2021.9678712","DOI":"10.1109\/ASE51524.2021.9678712"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3468264.3468539"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00349"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3468264.3468545"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3510003.3510050"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/MSR.2019.00058"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE48619.2023.00189"},{"key":"e_1_3_1_42_2","article-title":"Syncobert: Syntax-guided multi-modal contrastive pre-training for code representation","author":"Wang Xin","year":"2021","unstructured":"Xin Wang, Yasheng Wang, Fei Mi, Pingyi Zhou, Yao Wan, Xiao Liu, Li Li, Hao Wu, Jin Liu, and Xin Jiang. 2021. Syncobert: Syntax-guided multi-modal contrastive pre-training for code representation. arXiv preprint arXiv:2108.04556 (2021).","journal-title":"arXiv preprint arXiv:2108.04556"},{"key":"e_1_3_1_43_2","article-title":"Codet5+: Open code large language models for code understanding and generation","author":"Wang Yue","year":"2023","unstructured":"Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. 2023. Codet5+: Open code large language models for code understanding and generation. arXiv preprint arXiv:2305.07922 (2023).","journal-title":"arXiv preprint arXiv:2305.07922"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","unstructured":"Yue Wang Weishi Wang Shafiq Joty and Steven C.H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics Online and Punta Cana Dominican Republic 8696\u20138708. https:\/\/doi.org\/10.18653\/v1\/2021.emnlp-main.685 10.18653\/v1\/2021.emnlp-main.685","DOI":"10.18653\/v1\/2021.emnlp-main.685"},{"key":"e_1_3_1_45_2","article-title":"Xlnet: Generalized autoregressive pretraining for language understanding","volume":"32","author":"Yang Zhilin","year":"2019","unstructured":"Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. 
Advances in neural information processing systems 32 (2019).","journal-title":"Advances in neural information processing systems"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/32.988498"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/3540250.3549094"}],"container-title":["Proceedings of the ACM on Software Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3643753","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3643753","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3643753","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,4]],"date-time":"2026-02-04T08:07:28Z","timestamp":1770192448000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3643753"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,12]]},"references-count":46,"journal-issue":{"issue":"FSE","published-print":{"date-parts":[[2024,7,12]]}},"alternative-id":["10.1145\/3643753"],"URL":"https:\/\/doi.org\/10.1145\/3643753","relation":{},"ISSN":["2994-970X"],"issn-type":[{"value":"2994-970X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,12]]}}}