{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,10]],"date-time":"2026-03-10T15:25:13Z","timestamp":1773156313307,"version":"3.50.1"},"reference-count":112,"publisher":"Association for Computing Machinery (ACM)","issue":"1","funder":[{"DOI":"10.13039\/501100001809","name":"Natural Science Foundation of China","doi-asserted-by":"crossref","award":["U20B2047, 62072421, 62002334, 62102386, and 62121002"],"award-info":[{"award-number":["U20B2047, 62072421, 62002334, 62102386, and 62121002"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Open Fund of Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation","award":["CSSAE-2021-007"],"award-info":[{"award-number":["CSSAE-2021-007"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Softw. Eng. Methodol."],"published-print":{"date-parts":[[2026,1,31]]},"abstract":"<jats:p>\n                    Analyzing the behavior of cryptographic functions in stripped binaries is a challenging but essential task, which is crucial in software security fields such as malware analysis and legacy code inspection. However, the inherent high logical complexity of cryptographic algorithms makes their analysis more difficult than that of ordinary code, and the general absence of symbolic information in binaries exacerbates this challenge. Existing methods for cryptographic algorithm identification frequently rely on data or structural pattern matching, which limits their generality and effectiveness while requiring substantial manual effort. 
In response to these challenges, we present\n                    <jats:italic toggle=\"yes\">F<\/jats:italic>\n                    igure\n                    <jats:italic toggle=\"yes\">o<\/jats:italic>\n                    ut the\n                    <jats:italic toggle=\"yes\">C<\/jats:italic>\n                    ryptographic functions (FoC), a novel framework that leverages Large Language Models (LLMs) to identify and analyze cryptographic functions in stripped binaries.\n                  <\/jats:p>\n                  <jats:p>\n                    In FoC, we first build an LLM-based generative model (\n                    <jats:italic toggle=\"yes\">FoC-BinLLM<\/jats:italic>\n                    ) to summarize the semantics of cryptographic functions in natural language form, which is intuitively readable to analysts. Subsequently, based on the semantic insights provided by FoC-BinLLM, we further develop a binary code similarity detection model (\n                    <jats:italic toggle=\"yes\">FoC-Sim<\/jats:italic>\n                    ), which allows analysts to effectively retrieve similar implementations of unknown cryptographic functions from a library of known cryptographic functions. The predictions of generative models like FoC-BinLLM inherently struggle to reflect minor alterations in binary code, such as those introduced by vulnerability patches. In contrast, the change-sensitive representations generated by FoC-Sim compensate for this shortcoming to some extent. To support the development and evaluation of these models, and to facilitate further research in this domain, we also construct a comprehensive cryptographic binary dataset and introduce an automatic method to create semantic labels for extensive binary functions. Our evaluation results are promising. FoC-BinLLM outperforms ChatGPT by 14.61% on the ROUGE-L score, demonstrating superior capability in summarizing the semantics of cryptographic functions. 
FoC-Sim also surpasses previous best methods with a 52% higher Recall@1 in retrieving similar cryptographic functions. Beyond these metrics, our method has proven its practical utility in real-world scenarios, including cryptographic-related virus analysis and 1-day vulnerability detection.\n                  <\/jats:p>","DOI":"10.1145\/3731449","type":"journal-article","created":{"date-parts":[[2025,4,22]],"date-time":"2025-04-22T11:53:12Z","timestamp":1745322792000},"page":"1-38","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["FoC: Figure Out the Cryptographic Functions in Stripped Binaries with LLMs"],"prefix":"10.1145","volume":"35","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-6660-9947","authenticated-orcid":false,"given":"Xiuwei","family":"Shang","sequence":"first","affiliation":[{"name":"University of Science and Technology of China, Hefei, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0651-6617","authenticated-orcid":false,"given":"Guoqiang","family":"Chen","sequence":"additional","affiliation":[{"name":"QI\u2010ANXIN Technology Research Institute, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3992-9509","authenticated-orcid":false,"given":"Shaoyin","family":"Cheng","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, China and Anhui Province Key Laboratory of Digital Security, Hefei, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8554-6365","authenticated-orcid":false,"given":"Shikai","family":"Guo","sequence":"additional","affiliation":[{"name":"Dalian Maritime University, The Dalian Key Laboratory of Artificial Intelligence, Dalian, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-1554-2556","authenticated-orcid":false,"given":"Yanming","family":"Zhang","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5576-6108","authenticated-orcid":false,"given":"Weiming","family":"Zhang","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, China and Anhui Province Key Laboratory of Digital Security, Hefei, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4417-9316","authenticated-orcid":false,"given":"Nenghai","family":"Yu","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, China and Anhui Province Key Laboratory of Digital Security, Hefei, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,12,11]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"Hex-Rays SA. 2023. IDA Pro. Retrieved from https:\/\/www.hex-rays.com\/products\/ida"},{"key":"e_1_3_2_3_2","unstructured":"NationalSecurityAgency. 2023. Ghidra. Retrieved from https:\/\/github.com\/NationalSecurityAgency\/ghidra"},{"key":"e_1_3_2_4_2","unstructured":"polymorf. 2022. findcrypt-yara. Retrieved from https:\/\/github.com\/polymorf\/findcrypt-yara"},{"key":"e_1_3_2_5_2","unstructured":"Sirmabus. 2015. Ida_signsrch. Retrieved from https:\/\/github.com\/nihilus\/IDA_Signsrch"},{"key":"e_1_3_2_6_2","first-page":"200","volume-title":"Proceedings of the 14th European Symposium on Research in Computer Security (ESORICS \u201909)","author":"Wang Zhi","year":"2009","unstructured":"Zhi Wang, Xuxian Jiang, Weidong Cui, Xinyuan Wang, and Mike Grace. 2009. ReFormat: Automatic reverse engineering of encrypted messages. 
In Proceedings of the 14th European Symposium on Research in Computer Security (ESORICS \u201909). Springer, 200\u2013215."},{"key":"e_1_3_2_7_2","first-page":"41","volume-title":"Proceedings of the 14th International Symposium on Recent Advances in Intrusion Detection (RAID\u00a0\u201911)","author":"Gr\u00f6bert Felix","year":"2011","unstructured":"Felix Gr\u00f6bert, Carsten Willems, and Thorsten Holz. 2011. Automated identification of cryptographic primitives in binary programs. In Proceedings of the 14th International Symposium on Recent Advances in Intrusion Detection (RAID\u00a0\u201911). Springer, 41\u201360."},{"key":"e_1_3_2_8_2","article-title":"Detection of cryptographic algorithms with grap","author":"Benedetti L\u00e9onard","year":"2017","unstructured":"L\u00e9onard Benedetti, Aur\u00e9lien Thierry, and Julien Francq. 2017. Detection of cryptographic algorithms with grap. Cryptology ePrint Archive.","journal-title":"Cryptology ePrint Archive"},{"key":"e_1_3_2_9_2","first-page":"412","volume-title":"Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security","author":"Li Juanru","year":"2018","unstructured":"Juanru Li, Zhiqiang Lin, Juan Caballero, Yuanyuan Zhang, and Dawu Gu. 2018. K-Hunt: Pinpointing insecure cryptographic keys from execution traces. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, 412\u2013425."},{"key":"e_1_3_2_10_2","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1109\/ICSSA45270.2018.00015","volume-title":"Proceedings of the 2018 International Conference on Software Security and Assurance (ICSSA)","author":"Kochberger Patrick","year":"2018","unstructured":"Patrick Kochberger and Florian Seitl. 2018. Detecting cryptography through IR visualization. In Proceedings of the 2018 International Conference on Software Security and Assurance (ICSSA). 
IEEE, 25\u201329."},{"key":"e_1_3_2_11_2","first-page":"101","volume-title":"Proceedings of the International Conference on Information Security and Cryptology","author":"Zhao Ruoxu","year":"2013","unstructured":"Ruoxu Zhao, Dawu Gu, Juanru Li, and Yuanyuan Zhang. 2013. Automatic detection and analysis of encrypted messages in malware. In Proceedings of the International Conference on Information Security and Cryptology. Springer, 101\u2013117."},{"issue":"8","key":"e_1_3_2_12_2","first-page":"2628","article-title":"Binary code level cyclic feature recognition of cryptographic algorithm","volume":"35","author":"Li Jizhong","year":"2014","unstructured":"Jizhong Li, Liehui Jiang, and Hu Shu. 2014. Binary code level cyclic feature recognition of cryptographic algorithm. Computer Engineering and Design 35, 8 (2014), 2628\u20132632.","journal-title":"Computer Engineering and Design"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/2714576.2714639"},{"key":"e_1_3_2_14_2","doi-asserted-by":"crossref","first-page":"169","DOI":"10.1145\/2382196.2382217","volume-title":"Proceedings of the 2012 ACM Conference on Computer and Communications Security","author":"Calvet Joan","year":"2012","unstructured":"Joan Calvet, Jos\u00e9 M. Fernandez, and Jean-Yves Marion. 2012. Aligot: Cryptographic function identification in obfuscated binary programs. In Proceedings of the 2012 ACM Conference on Computer and Communications Security, 169\u2013182."},{"key":"e_1_3_2_15_2","first-page":"921","volume-title":"Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP)","author":"Xu Dongpeng","year":"2017","unstructured":"Dongpeng Xu, Jiang Ming, and Dinghao Wu. 2017. Cryptographic function detection in obfuscated binaries via bit-precise symbolic loop mapping. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP). 
IEEE, 921\u2013937."},{"key":"e_1_3_2_16_2","first-page":"555","volume-title":"Proceedings of the 30th USENIX Security Symposium (USENIX Security \u201921)","author":"Meijer Carlo","year":"2021","unstructured":"Carlo Meijer, Veelasha Moonsamy, and Jos Wetzels. 2021. Where\u2019s crypto? Automated identification and classification of proprietary cryptographic primitives in binary code. In Proceedings of the 30th USENIX Security Symposium (USENIX Security \u201921), 555\u2013572."},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/3632742"},{"key":"e_1_3_2_18_2","first-page":"260","volume-title":"Proceedings of the 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)","author":"Al-Kaswan Ali","year":"2023","unstructured":"Ali Al-Kaswan, Toufique Ahmed, Maliheh Izadi, Anand Ashok Sawant, Premkumar Devanbu, and Arie van Deursen. 2023. Extending source code pre-trained language models to summarise decompiled binaries. In Proceedings of the 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 260\u2013271."},{"key":"e_1_3_2_19_2","first-page":"774","volume-title":"Proceedings of the 2023 38th IEEE\/ACM International Conference on Automated Software Engineering (ASE)","author":"Xiong Jiaqi","year":"2023","unstructured":"Jiaqi Xiong, Guoqiang Chen, Kejiang Chen, Han Gao, Shaoyin Cheng, and Weiming Zhang. 2023. HexT5: Unified pre-training for stripped binary code information inference. In Proceedings of the 2023 38th IEEE\/ACM International Conference on Automated Software Engineering (ASE). IEEE, 774\u2013786"},{"key":"e_1_3_2_20_2","unstructured":"Mark Chen Jerry Tworek Heewoo Jun Qiming Yuan Henrique Ponde De Oliveira Pinto Jared Kaplan Harri Edwards Yuri Burda Nicholas Joseph Greg Brockman et al. 2021. Evaluating large language models trained on code. arXiv:2107.03374. 
Retrieved from https:\/\/arxiv.org\/abs\/2107.03374"},{"key":"e_1_3_2_21_2","unstructured":"Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 billion parameter autoregressive language model. Retrieved from https:\/\/github.com\/kingoflolz\/mesh-transformer-jax"},{"key":"e_1_3_2_22_2","doi-asserted-by":"crossref","unstructured":"Sid Black Stella Biderman Eric Hallahan Quentin Anthony Leo Gao Laurence Golding Horace He Connor Leahy Kyle McDonell Jason Phang et al. 2021. GPT-NeoX-20B: An open-source autoregressive language model. arXiv:2204.06745. Retrieved from https:\/\/arxiv.org\/abs\/2204.06745","DOI":"10.18653\/v1\/2022.bigscience-1.9"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3520312.3534862"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2017.2655046"},{"key":"e_1_3_2_25_2","first-page":"309","volume-title":"Proceedings of the 16th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA \u201919)","author":"Massarelli Luca","year":"2019","unstructured":"Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Roberto Baldoni, and Leonardo Querzoni. 2019. Safe: Self-attentive function embeddings for binary similarity. In Proceedings of the 16th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA \u201919). Springer, 309\u2013329."},{"key":"e_1_3_2_26_2","volume-title":"Proceedings of the 2020 Network and Distributed Systems Security Symposium (NDSS)","author":"Duan Yue","year":"2020","unstructured":"Yue Duan, Xuezixiang Li, Jinghan Wang, and Heng Yin. 2020. DeepBinDiff: Learning program-wide code representations for binary diffing. 
In Proceedings of the 2020 Network and Distributed Systems Security Symposium (NDSS)."},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/3460120.3484587"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2022.3231621"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3533767.3534367"},{"issue":"1","key":"e_1_3_2_30_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3604611","article-title":"Asteria-Pro: Enhancing deep learning-based binary code similarity detection by incorporating domain knowledge","volume":"33","author":"Yang Shouguo","year":"2023","unstructured":"Shouguo Yang, Chaopeng Dong, Yang Xiao, Yiran Cheng, Zhiqiang Shi, Zhi Li, and Limin Sun. 2023. Asteria-Pro: Enhancing deep learning-based binary code similarity detection by incorporating domain knowledge. ACM Transactions on Software Engineering and Methodology 33, 1 (2023), 1\u201340.","journal-title":"ACM Transactions on Software Engineering and Methodology"},{"key":"e_1_3_2_31_2","first-page":"2014","article-title":"Efficient and secure elliptic curve cryptography implementation of curve p-256","volume":"66","author":"Adalier Mehmet","year":"2015","unstructured":"Mehmet Adalier and Antara Teknik. 2015. Efficient and secure elliptic curve cryptography implementation of curve p-256. In Proceedings of the Workshop on Elliptic Curve Cryptography Standards, Vol. 66, 2014\u20132017.","journal-title":"Proceedings of the Workshop on Elliptic Curve Cryptography Standards"},{"key":"e_1_3_2_32_2","article-title":"Advanced encryption standard (AES)","author":"NIST FIPS Pub","year":"2001","unstructured":"NIST FIPS Pub. 2001. Advanced encryption standard (AES). Federal Information Processing Standards Publication 197(441):0311.","journal-title":"Federal Information Processing Standards Publication"},{"key":"e_1_3_2_33_2","unstructured":"CVE-2014-0160. 2013. Available from MITRE CVE-ID CVE-2014-0160. 
Retrieved December 3 2013 from https:\/\/nvd.nist.gov\/vuln\/detail\/CVE-2014-0160"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.3390\/info9090231"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jisa.2021.103088"},{"key":"e_1_3_2_36_2","unstructured":"Ilfak Guilfanov. 2006. FindCrypt2. Retrieved from https:\/\/hex-rays.com\/blog\/findcrypt2\/"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1088\/1742-6596\/1856\/1\/012015"},{"key":"e_1_3_2_38_2","unstructured":"OpenSSL. 2023. OpenSSL. Retrieved from https:\/\/github.com\/openssl\/openssl"},{"key":"e_1_3_2_39_2","doi-asserted-by":"crossref","unstructured":"Yue Wang Weishi Wang Shafiq Joty Steven and C. H. Hoi. 2021. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv:2109.00859. Retrieved from https:\/\/arxiv.org\/abs\/2109.00859","DOI":"10.18653\/v1\/2021.emnlp-main.685"},{"key":"e_1_3_2_40_2","unstructured":"Wayne Xin Zhao Kun Zhou Junyi Li Tianyi Tang Xiaolei Wang Yupeng Hou Yingqian Min Beichen Zhang Junjie Zhang Zican Dong et al. 2023. A survey of large language models. arXiv:2303.18223. Retrieved from https:\/\/arxiv.org\/abs\/2303.18223"},{"key":"e_1_3_2_41_2","unstructured":"Ben Mann N. Ryder M. Subbiah J. Kaplan P. Dhariwal A. Neelakantan P. Shyam G. Sastry A. Askell S. Agarwal et al. 2020. Language models are few-shot learners. arXiv:2005.14165. Retrieved from https:\/\/arxiv.org\/abs\/2005.14165"},{"issue":"240","key":"e_1_3_2_42_2","first-page":"1","article-title":"PaLM: Scaling language modeling with pathways","volume":"24","author":"Chowdhery Aakanksha","year":"2023","unstructured":"Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. PaLM: Scaling language modeling with pathways. 
Journal of Machine Learning Research 24, 240 (2023), 1\u2013113.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_43_2","unstructured":"Hugo Touvron Thibaut Lavril Gautier Izacard Xavier Martinet Marie-Anne Lachaux Timoth\u00e9e Lacroix Baptiste Rozi\u00e8re Naman Goyal Eric Hambro Faisal Azhar et al. 2023. LLaMA: Open and efficient foundation language models. arXiv:2302.13971. Retrieved from https:\/\/arxiv.org\/abs\/2302.13971"},{"key":"e_1_3_2_44_2","doi-asserted-by":"crossref","unstructured":"Yue Wang Hung Le Akhilesh Deepak Gotmare Nghi D. Q. Bui Junnan Li and Steven C. H. Hoi. 2023. CodeT5+: Open code large language models for code understanding and generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing 1069\u20131088.","DOI":"10.18653\/v1\/2023.emnlp-main.68"},{"key":"e_1_3_2_45_2","unstructured":"Ziyang Luo Can Xu Pu Zhao Qingfeng Sun Xiubo Geng Wenxiang Hu Chongyang Tao Jing Ma Qingwei Lin and Daxin Jiang. 2023. WizardCoder: Empowering code large language models with Evol-Instruct. arXiv:2306.08568. Retrieved from https:\/\/arxiv.org\/abs\/arXiv:2306.08568"},{"key":"e_1_3_2_46_2","unstructured":"Baptiste Roziere Jonas Gehring Fabian Gloeckle Sten Sootla Itai Gat Xiaoqing Ellen Tan Yossi Adi Jingyu Liu Tal Remez J\u00e9r\u00e9my Rapin et al. 2023. Code Llama: Open foundation models for code. arXiv:2308.12950. Retrieved from https:\/\/arxiv.org\/abs\/2308.12950"},{"key":"e_1_3_2_47_2","unstructured":"Jie Qin Jie Wu Weifeng Chen Yuxi Ren Huixia Li Hefeng Wu Xuefeng Xiao Rui Wang and Shilei Wen. 2024. DiffusionGPT: LLM-driven text-to-image generation system. arXiv:2401.10061. Retrieved from https:\/\/arxiv.org\/abs\/2401.10061"},{"key":"e_1_3_2_48_2","article-title":"Free-Bloom: Zero-shot text-to-video generator with LLM director and LDM animator","volume":"36","author":"Huang Hanzhuo","year":"2024","unstructured":"Hanzhuo Huang, Yufan Feng, Cheng Shi, Lan Xu, Jingyi Yu, and Sibei Yang. 
2024. Free-Bloom: Zero-shot text-to-video generator with LLM director and LDM animator. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 36.","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_49_2","unstructured":"Xin Jin Jonathan Larson Weiwei Yang and Zhiqiang Lin. 2023. Binary code summarization: Benchmarking ChatGPT\/GPT-4 and other large language models. arXiv:2312.09601. Retrieved from https:\/\/arxiv.org\/abs\/2312.09601"},{"key":"e_1_3_2_50_2","unstructured":"I. UNIX International. 2010. Dwarf Debugging Information Format Version 4. Retrieved from https:\/\/dwarfstd.org\/doc\/DWARF4.pdf"},{"key":"e_1_3_2_51_2","doi-asserted-by":"crossref","unstructured":"Ruoyu Zhang Yanzeng Li Yongliang Ma Ming Zhou and Lei Zou. 2023. LLMaAA: Making large language models as active annotators. arXiv:2310.19596. Retrieved from https:\/\/arxiv.org\/abs\/2310.19596","DOI":"10.18653\/v1\/2023.findings-emnlp.872"},{"key":"e_1_3_2_52_2","doi-asserted-by":"crossref","unstructured":"Ruixuan Xiao Yiwen Dong Junbo Zhao Runze Wu Minmin Lin Gang Chen and Haobo Wang. 2023. FreeAL: Towards human-free active learning in the era of large language models. arXiv:2311.15614. Retrieved from https:\/\/arxiv.org\/abs\/2311.15614","DOI":"10.18653\/v1\/2023.emnlp-main.896"},{"key":"e_1_3_2_53_2","first-page":"27730","article-title":"Training language models to follow instructions with human feedback","volume":"35","author":"Ouyang Long","year":"2022","unstructured":"Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 
35, 27730\u201327744.","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_54_2","article-title":"CodeGEN: An open large language model for code with multi-turn program synthesis","author":"Nijkamp Erik","year":"2023","unstructured":"Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2023. CodeGEN: An open large language model for code with multi-turn program synthesis. In Proceedings of the International Conference on Learning Representations (ICLR).","journal-title":"Proceedings of the International Conference on Learning Representations (ICLR)"},{"key":"e_1_3_2_55_2","unstructured":"Hamel Husain Ho-Hsiang Wu Tiferet Gazit Miltiadis Allamanis and Marc Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv:1909.09436. Retrieved from https:\/\/arxiv.org\/abs\/1909.09436"},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1126\/science.abq1158"},{"key":"e_1_3_2_57_2","first-page":"2099","volume-title":"Proceedings of the 31st USENIX Security Symposium (USENIX Security \u201922)","author":"Marcelli Andrea","year":"2022","unstructured":"Andrea Marcelli, Mariano Graziano, Xabier Ugarte-Pedrero, Yanick Fratantonio, Mohamad Mansouri, and Davide Balzarotti. 2022. How machine learning is solving the binary function similarity problem. In Proceedings of the 31st USENIX Security Symposium (USENIX Security \u201922), 2099\u20132116."},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1145\/1653662.1653737"},{"key":"e_1_3_2_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2019.2956932"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1145\/3133956.3134018"},{"key":"e_1_3_2_61_2","unstructured":"Matthew Henderson Rami Al-Rfou Brian Strope Yun-Hsuan Sung L\u00e1szl\u00f3 Luk\u00e1cs Ruiqi Guo Sanjiv Kumar Balint Miklos and Ray Kurzweil. 2017. 
Efficient natural language response suggestion for smart reply. arXiv:1705.00652. Retrieved from https:\/\/arxiv.org\/abs\/1705.00652"},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/3548606.3560612"},{"key":"e_1_3_2_63_2","doi-asserted-by":"publisher","DOI":"10.1145\/3428293"},{"key":"e_1_3_2_64_2","unstructured":"Andrei Z. Broder. 1997. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171). IEEE 21\u201329."},{"issue":"2","key":"e_1_3_2_65_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3434280","article-title":"Why my code summarization model does not work: Code comment improvement with category prediction","volume":"30","author":"Chen Qiuyuan","year":"2021","unstructured":"Qiuyuan Chen, Xin Xia, Han Hu, David Lo, and Shanping Li. 2021. Why my code summarization model does not work: Code comment improvement with category prediction. ACM Transactions on Software Engineering and Methodology 30, 2 (2021), 1\u201329.","journal-title":"ACM Transactions on Software Engineering and Methodology"},{"key":"e_1_3_2_66_2","doi-asserted-by":"publisher","DOI":"10.1145\/3550150"},{"key":"e_1_3_2_67_2","doi-asserted-by":"publisher","DOI":"10.1145\/3631975"},{"key":"e_1_3_2_68_2","first-page":"311","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. 
In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311\u2013318."},{"key":"e_1_3_2_69_2","first-page":"65","volume-title":"Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization","author":"Banerjee Satanjeev","year":"2005","unstructured":"Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization, 65\u201372."},{"key":"e_1_3_2_70_2","doi-asserted-by":"publisher","DOI":"10.1145\/3377811.3380383"},{"key":"e_1_3_2_71_2","first-page":"74","volume-title":"Text Summarization Branches Out","author":"Lin (Ed.) Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin (Ed.). 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out. Association for Computational Linguistics, 74\u201381."},{"key":"e_1_3_2_72_2","article-title":"VulHawk: Cross-architecture vulnerability detection with entropy-based binary code search","author":"Luo Zhenhao","year":"2023","unstructured":"Zhenhao Luo, Pengfei Wang, Baosheng Wang, Yong Tang, Wei Xie, Xu Zhou, Danjun Liu, and Kai Lu. 2023. VulHawk: Cross-architecture vulnerability detection with entropy-based binary code search. 
In Proceedings of the Network and Distributed System Security (NDSS) Symposium.","journal-title":"Proceedings of the Network and Distributed System Security (NDSS) Symposium"},{"key":"e_1_3_2_73_2","doi-asserted-by":"publisher","DOI":"10.1109\/DSN48987.2021.00036"},{"key":"e_1_3_2_74_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2022.3187689"},{"key":"e_1_3_2_75_2","first-page":"951","volume-title":"Proceedings of the 2015 IEEE\/ACM 37th IEEE International Conference on Software Engineering","volume":"2","author":"Maletic Jonathan I.","year":"2015","unstructured":"Jonathan I. Maletic and Michael L. Collard. 2015. Exploration, analysis, and manipulation of source code using SRCML. In Proceedings of the 2015 IEEE\/ACM 37th IEEE International Conference on Software Engineering, Vol. 2, 951\u2013952."},{"key":"e_1_3_2_76_2","unstructured":"Facebook Inc. 2023. PyTorch. Retrieved from https:\/\/pytorch.org"},{"key":"e_1_3_2_77_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"e_1_3_2_78_2","unstructured":"Microsoft. 2023. DeepSpeed. Retrieved from https:\/\/github.com\/microsoft\/DeepSpeed\/"},{"key":"e_1_3_2_79_2","unstructured":"Albert Q. Jiang Alexandre Sablayrolles Antoine Roux Arthur Mensch Blanche Savary Chris Bamford Devendra Singh Chaplot Diego de las Casas Emma Bou Hanna Florian Bressand et al. 2024. Mixtral of experts. arXiv:2401.04088. Retrieved from https:\/\/arxiv.org\/abs\/2401.04088"},{"key":"e_1_3_2_80_2","doi-asserted-by":"crossref","first-page":"2375","DOI":"10.1109\/SP46215.2023.10179439","volume-title":"Proceedings of the 2023 IEEE Symposium on Security and Privacy (SP)","author":"Patrick-Evans James","year":"2023","unstructured":"James Patrick-Evans, Moritz Dannehl, and Johannes Kinder. 2023. XFL: Naming functions in binaries with extreme multi-label learning. In Proceedings of the 2023 IEEE Symposium on Security and Privacy (SP). 
IEEE, 2375\u20132390."},{"key":"e_1_3_2_81_2","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939756"},{"key":"e_1_3_2_82_2","doi-asserted-by":"publisher","DOI":"10.1145\/3264820.3264821"},{"key":"e_1_3_2_83_2","doi-asserted-by":"publisher","DOI":"10.1109\/SP.2019.00003"},{"key":"e_1_3_2_84_2","first-page":"3835","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Li Yujia","year":"2019","unstructured":"Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, and Pushmeet Kohli. 2019. Graph matching networks for learning the similarity of graph structured objects. In Proceedings of the International Conference on Machine Learning. PMLR, 3835\u20133845."},{"key":"e_1_3_2_85_2","doi-asserted-by":"publisher","DOI":"10.1145\/3691620.3695070"},{"key":"e_1_3_2_86_2","first-page":"887","volume-title":"Proceedings of the 27th USENIX Security Symposium (USENIX Security \u201918)","author":"Zhang Hang","year":"2018","unstructured":"Hang Zhang and Zhiyun Qian. 2018. Precise and accurate patch presence test for binaries. In Proceedings of the 27th USENIX Security Symposium (USENIX Security \u201918), 887\u2013902."},{"key":"e_1_3_2_87_2","doi-asserted-by":"publisher","DOI":"10.1145\/3395363.3397361"},{"key":"e_1_3_2_88_2","doi-asserted-by":"publisher","DOI":"10.1145\/3604608"},{"key":"e_1_3_2_89_2","doi-asserted-by":"crossref","first-page":"88","DOI":"10.1109\/DSN48063.2020.00028","volume-title":"Proceedings of the 2020 50th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks (DSN)","author":"Chen Ligeng","year":"2020","unstructured":"Ligeng Chen, Zhongling He, and Bing Mao. 2020. CATI: Context-assisted type inference from stripped binaries. In Proceedings of the 2020 50th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks (DSN). 
IEEE, 88\u201398."},{"key":"e_1_3_2_90_2","first-page":"413","volume-title":"Proceedings of the 31st USENIX Security Symposium (USENIX Security \u201922)","author":"Vadayath Jayakrishna","year":"2022","unstructured":"Jayakrishna Vadayath, Moritz Eckert, Kyle Zeng, Nicolaas Weideman, Gokulkrishna Praveen Menon, Yanick Fratantonio, Davide Balzarotti, Adam Doup\u00e9, Tiffany Bao, Ruoyu Wang, et al. 2022. Arbiter: Bridging the static and dynamic divide in vulnerability discovery on binary programs. In Proceedings of the 31st USENIX Security Symposium (USENIX Security \u201922), 413\u2013430."},{"key":"e_1_3_2_91_2","doi-asserted-by":"publisher","DOI":"10.1145\/3238147.3240480"},{"key":"e_1_3_2_92_2","first-page":"7339","volume-title":"Proceedings of the 32nd USENIX Security Symposium (USENIX Security \u201923)","author":"Wang Junzhe","year":"2023","unstructured":"Junzhe Wang, Matthew Sharp, Chuxiong Wu, Qiang Zeng, and Lannan Luo. 2023. Can a deep learning model for one architecture be used for others? Retargeted-architecture binary code analysis. In Proceedings of the 32nd USENIX Security Symposium (USENIX Security \u201923), 7339\u20137356."},{"key":"e_1_3_2_93_2","unstructured":"Intel Corporation. 2023. Pin\u2014A Dynamic Binary Instrumentation Tool. Retrieved from https:\/\/www.intel.com\/content\/www\/us\/en\/developer\/articles\/tool\/pin-a-dynamic-binary-instrumentation-tool.html"},{"key":"e_1_3_2_94_2","unstructured":"Valgrind. 2024. Retrieved from https:\/\/valgrind.org\/"},{"key":"e_1_3_2_95_2","doi-asserted-by":"publisher","DOI":"10.1145\/3460319.3464804"},{"key":"e_1_3_2_96_2","doi-asserted-by":"crossref","unstructured":"Vikram Nitin Anthony Saieva Baishakhi Ray and Gail Kaiser. 2021. Direct: A transformer-based model for decompiled variable name recovery. 
In Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog \u201921), 48.","DOI":"10.18653\/v1\/2021.nlp4prog-1.6"},{"key":"e_1_3_2_97_2","doi-asserted-by":"crossref","first-page":"813","DOI":"10.1109\/SP40001.2021.00051","volume-title":"Proceedings of the 2021 IEEE Symposium on Security and Privacy (SP)","author":"Zhang Zhuo","year":"2021","unstructured":"Zhuo Zhang, Yapeng Ye, Wei You, Guanhong Tao, Wen-Chuan Lee, Yonghwi Kwon, Yousra Aafer, and Xiangyu Zhang. 2021. OSPREY: Recovery of variable and data structure via probabilistic analysis for stripped binary. In Proceedings of the 2021 IEEE Symposium on Security and Privacy (SP). IEEE, 813\u2013832."},{"key":"e_1_3_2_98_2","first-page":"4327","volume-title":"Proceedings of the 31st USENIX Security Symposium (USENIX Security \u201922)","author":"Chen Qibin","year":"2022","unstructured":"Qibin Chen, Jeremy Lacomis, Edward J. Schwartz, Claire Le Goues, Graham Neubig, and Bogdan Vasilescu. 2022. Augmenting decompiler output with learned variable names and types. In Proceedings of the 31st USENIX Security Symposium (USENIX Security \u201922), 4327\u20134343."},{"key":"e_1_3_2_99_2","doi-asserted-by":"publisher","DOI":"10.1145\/3243734.3243866"},{"key":"e_1_3_2_100_2","doi-asserted-by":"publisher","DOI":"10.14722\/ndss.2024.24401"},{"key":"e_1_3_2_101_2","unstructured":"Nan Jiang, Chengxiao Wang, Kevin Liu, Xiangzhe Xu, Lin Tan, Xiangyu Zhang, and Petr Babkin. 2024. Nova: Generative language models for assembly code with hierarchical attention and contrastive learning. arXiv:2311.13721. Retrieved from https:\/\/arxiv.org\/abs\/2311.13721"},{"key":"e_1_3_2_102_2","unstructured":"Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, Y. K. Li, et al. 2024. DeepSeek-Coder: When the large language model meets programming\u2014The rise of code intelligence. arXiv:2401.14196. 
Retrieved from https:\/\/arxiv.org\/abs\/2401.14196"},{"key":"e_1_3_2_103_2","unstructured":"Hanzhuo Tan, Qi Luo, Jing Li, and Yuqun Zhang. 2024. LLM4Decompile: Decompiling binary code with large language models. arXiv:2403.05286. Retrieved from https:\/\/arxiv.org\/abs\/2403.05286"},{"key":"e_1_3_2_104_2","unstructured":"Huaqing.AI. 2024. Machine language model MLM. Retrieved from https:\/\/mlm01.com\/"},{"key":"e_1_3_2_105_2","unstructured":"Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2023. Large language models for software engineering: A systematic literature review. arXiv:2308.10620. Retrieved from https:\/\/arxiv.org\/abs\/2308.10620"},{"key":"e_1_3_2_106_2","unstructured":"Quanjun Zhang, Chunrong Fang, Yang Xie, Yaxin Zhang, Yun Yang, Weisong Sun, Shengcheng Yu, and Zhenyu Chen. 2023. A survey on large language models for software engineering. arXiv:2312.15223. Retrieved from https:\/\/arxiv.org\/abs\/2312.15223"},{"key":"e_1_3_2_107_2","unstructured":"Giriprasad Sridhara and Sourav Mazumdar. 2023. ChatGPT: A study on its utility for ubiquitous software engineering tasks. arXiv:2305.16837. Retrieved from https:\/\/arxiv.org\/abs\/2305.16837"},{"key":"e_1_3_2_108_2","unstructured":"Shantanu Mandal, Adhrik Chethan, Vahid Janfaza, S. M. Mahmud, Todd A. Anderson, Javier Turek, Jesmin Jahan Tithi, and Abdullah Muzahid. 2023. Large language models based automatic synthesis of software specifications. arXiv:2304.09181. Retrieved from https:\/\/arxiv.org\/abs\/2304.09181"},{"key":"e_1_3_2_109_2","unstructured":"Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. 2023. Self-collaboration code generation via ChatGPT. arXiv:2304.07590. 
Retrieved from https:\/\/arxiv.org\/abs\/2304.07590"},{"key":"e_1_3_2_110_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jss.2023.111623"},{"key":"e_1_3_2_111_2","doi-asserted-by":"crossref","first-page":"129","DOI":"10.1109\/ICSME55016.2022.00020","volume-title":"Proceedings of the 2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)","author":"Alhamed Mohammed","year":"2022","unstructured":"Mohammed Alhamed and Tim Storer. 2022. Evaluation of context-aware language models and experts for effort estimation of software maintenance issues. In Proceedings of the 2022 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 129\u2013138."},{"issue":"4","key":"e_1_3_2_112_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3705302","article-title":"Context-based transfer learning for structuring fault localization and program repair automation","volume":"34","author":"Zhang Lehuan","year":"2024","unstructured":"Lehuan Zhang, Shikai Guo, Yi Guo, Hui Li, Yu Chai, Rong Chen, Xiaochen Li, and He Jiang. 2024. Context-based transfer learning for structuring fault localization and program repair automation. ACM Transactions on Software Engineering and Methodology 34, 4 (2024), 1\u201332.","journal-title":"ACM Transactions on Software Engineering and Methodology"},{"issue":"140","key":"e_1_3_2_113_2","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. 
Journal of Machine Learning Research 21, 140 (2020), 1\u201367.","journal-title":"Journal of Machine Learning Research"}],"container-title":["ACM Transactions on Software Engineering and Methodology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3731449","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,11]],"date-time":"2025-12-11T15:57:09Z","timestamp":1765468629000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3731449"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,11]]},"references-count":112,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,1,31]]}},"alternative-id":["10.1145\/3731449"],"URL":"https:\/\/doi.org\/10.1145\/3731449","relation":{},"ISSN":["1049-331X","1557-7392"],"issn-type":[{"value":"1049-331X","type":"print"},{"value":"1557-7392","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,11]]},"assertion":[{"value":"2024-09-11","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-04-12","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-12-11","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}