{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,2]],"date-time":"2025-08-02T14:33:42Z","timestamp":1754145222520,"version":"3.41.2"},"reference-count":60,"publisher":"Association for Computing Machinery (ACM)","issue":"ISSTA","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Softw. Eng."],"published-print":{"date-parts":[[2025,6,22]]},"abstract":"<jats:p>Code obfuscation is a technique used to protect software by making it difficult to understand and reverse engineer. However, it can also be exploited for malicious purposes such as code plagiarism or developing malicious programs. Learning-based techniques have achieved great success with the help of supervised learning and labeled training sets. However, when faced with real-life environments involving privately developed and undisclosed obfuscators, these supervised learning methods often raise concerns about generalizability and robustness when facing unseen and unknown classes of obfuscation techniques.<\/jats:p>\n          <jats:p>This paper presents ALMOND, a novel zero-shot approach for detecting code obfuscation in binary executables. Unlike previous supervised learning methods, ALMOND does not require labeled obfuscated samples for training. Instead, it leverages a language model pre-trained only on unobfuscated assembly code to identify the linguistic deviations introduced by obfuscation. The key innovation is the use of \u201cerror-perplexity\u201d as a detection metric, which focuses on tokens the model fails to predict. Continuous Error Perplexity further enhances this to capture consecutive prediction errors characteristic of obfuscated sequences. Experiments show ALMOND achieves 96.3% accuracy on unseen obfuscation methods, outperforming supervised baselines. On real-world malware samples, it demonstrates an AUC of 0.869 and significantly outperforms the supervised-learning baseline. 
Our Dataset, pre-trained model, and code of evaluation will be available at https:\/\/github.com\/palmtreemodel\/ALMOND<\/jats:p>","DOI":"10.1145\/3728886","type":"journal-article","created":{"date-parts":[[2025,6,22]],"date-time":"2025-06-22T10:52:56Z","timestamp":1750589576000},"page":"366-387","source":"Crossref","is-referenced-by-count":0,"title":["ALMOND: Learning an Assembly Language Model for 0-Shot Code Obfuscation Detection"],"prefix":"10.1145","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-9713-3815","authenticated-orcid":false,"given":"Xuezixiang","family":"Li","sequence":"first","affiliation":[{"name":"University of California at Riverside, Riverside, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5189-7140","authenticated-orcid":false,"given":"Sheng","family":"Yu","sequence":"additional","affiliation":[{"name":"Deepbits Technology, Riverside, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8942-7742","authenticated-orcid":false,"given":"Heng","family":"Yin","sequence":"additional","affiliation":[{"name":"University of California at Riverside, Riverside, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,6,22]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/3564625.3567975"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3230833.3232823"},{"key":"e_1_2_1_3_1","volume-title":"Code obfuscation literature survey. CS701 Construction of compilers, 19","author":"Balakrishnan Arini","year":"2005","unstructured":"Arini Balakrishnan and Chloe Schulze. 2005. Code obfuscation literature survey. 
CS701 Construction of compilers, 19 (2005), 31."},{"key":"e_1_2_1_4_1","volume-title":"Fish Wang, and Chitta Baral.","author":"Banerjee Pratyay","year":"2021","unstructured":"Pratyay Banerjee, Kuntal Kumar Pal, Fish Wang, and Chitta Baral. 2021. Variable name recovery in decompiled binary code using constrained masked language modeling. arXiv preprint arXiv:2103.12801."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cose.2019.05.007"},{"key":"e_1_2_1_6_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/SP.2019.00003"},{"key":"e_1_2_1_8_1","unstructured":"Marius Dr\u0103goi Elena Burceanu Emanuela Haller Andrei Manolache and Florin Brad. 2022. AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection. Neural Information Processing Systems NeurIPS Datasets and Benchmarks Track."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3133956.3134015"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.5555\/177910.177914"},{"key":"e_1_2_1_11_1","article-title":"Enabling Obfuscation Detection in Binary Software through eXplainable AI","author":"Greco Claudia","year":"2024","unstructured":"Claudia Greco, Michele Ianni, Antonella Guzzo, and Giancarlo Fortino. 2024. Enabling Obfuscation Detection in Binary Software through eXplainable AI. IEEE Transactions on Emerging Topics in Computing.","journal-title":"IEEE Transactions on Emerging Topics in Computing."},{"key":"e_1_2_1_12_1","volume-title":"DEEPVSA: Facilitating Value-set Analysis with Deep Learning for Postmortem Program Analysis. 
In 28th USENIX Security Symposium (USENIX Security 19)","author":"Guo Wenbo","year":"2019","unstructured":"Wenbo Guo, Dongliang Mu, Xinyu Xing, Min Du, and Dawn Song. 2019. DEEPVSA: Facilitating Value-set Analysis with Deep Learning for Postmortem Program Analysis. In 28th USENIX Security Symposium (USENIX Security 19). USENIX Association, Santa Clara, CA. 1787\u20131804. isbn:978-1-939133-06-9 https:\/\/www.usenix.org\/conference\/usenixsecurity19\/presentation\/guo"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.2973023"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10994-014-5473-9"},{"key":"e_1_2_1_15_1","doi-asserted-by":"crossref","first-page":"102804","DOI":"10.1016\/j.cose.2022.102804","article-title":"IFAttn: Binary code similarity analysis based on interpretable features with attention","volume":"120","author":"Jiang Shuai","year":"2022","unstructured":"Shuai Jiang, Cai Fu, Yekui Qian, Shuai He, Jianqiang Lv, and Lansheng Han. 2022. IFAttn: Binary code similarity analysis based on interpretable features with attention. Computers & Security, 120 (2022), 102804.","journal-title":"Computers & Security"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3548606.3560612"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/SPRO.2015.10"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/SPRO.2015.14"},{"key":"e_1_2_1_19_1","volume-title":"AI 2004: Advances in Artificial Intelligence: 17th Australian Joint Conference on Artificial Intelligence, Cairns, Australia, December 4-6, 2004. Proceedings 17","author":"Kibriya Ashraf M","year":"2005","unstructured":"Ashraf M Kibriya, Eibe Frank, Bernhard Pfahringer, and Geoffrey Holmes. 2005. Multinomial naive bayes for text categorization revisited. In AI 2004: Advances in Artificial Intelligence: 17th Australian Joint Conference on Artificial Intelligence, Cairns, Australia, December 4-6, 2004. Proceedings 17. 
488\u2013499."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSSA.2017.29"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3510003.3510155"},{"key":"e_1_2_1_22_1","volume-title":"Zero-shot anomaly detection via batch normalization. Neural Information Processing Systems NeurIPS, 36","author":"Li Aodong","year":"2024","unstructured":"Aodong Li, Chen Qiu, Marius Kloft, Padhraic Smyth, Maja Rudolph, and Stephan Mandt. 2024. Zero-shot anomaly detection via batch normalization. Neural Information Processing Systems NeurIPS, 36 (2024)."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3460120.3484587"},{"key":"e_1_2_1_24_1","doi-asserted-by":"crossref","first-page":"1722","DOI":"10.3390\/electronics12071722","article-title":"Codeformer","volume":"12","author":"Liu Guangming","year":"2023","unstructured":"Guangming Liu, Xin Zhou, Jianmin Pang, Feng Yue, Wenfu Liu, and Junchao Wang. 2023. Codeformer: A GNN-Nested Transformer Model for Binary Code Similarity Detection. Electronics, 12, 7 (2023), 1722.","journal-title":"Electronics"},{"key":"e_1_2_1_25_1","volume-title":"Proceedings of the 2024 IEEE\/ACM 46th International Conference on Software Engineering: Companion Proceedings. 364\u2013365","author":"Liu Yilun","year":"2024","unstructured":"Yilun Liu, Shimin Tao, Weibin Meng, Feiyu Yao, Xiaofeng Zhao, and Hao Yang. 2024. Logprompt: Prompt engineering towards zero-shot and interpretable log analysis. In Proceedings of the 2024 IEEE\/ACM 46th International Conference on Software Engineering: Companion Proceedings. 364\u2013365."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSP.2007.48"},{"key":"e_1_2_1_27_1","volume-title":"Proceedings of the International Conference on Software Engineering Research and Practice (SERP06)","author":"Madou Matias","year":"2006","unstructured":"Matias Madou, Bertrand Anckaert, Bruno De Bus, Koen De Bosschere, Jan Cappaert, and Bart Preneel. 2006. 
On the effectiveness of source code transformations for binary obfuscation. In Proceedings of the International Conference on Software Engineering Research and Practice (SERP06). 527\u2013533."},{"volume-title":"Introduction to Information Retrieval","author":"Manning Christopher D.","key":"e_1_2_1_28_1","unstructured":"Christopher D. Manning, Prabhakar Raghavan, and Hinrich Sch\u00fctze. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK. isbn:978-0-521-86571-5 http:\/\/nlp.stanford.edu\/IR-book\/information-retrieval-book.html"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-22038-9_15"},{"key":"e_1_2_1_30_1","unstructured":"Tomas Mikolov. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781."},{"key":"e_1_2_1_31_1","volume-title":"Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26","author":"Mikolov Tomas","year":"2013","unstructured":"Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26 (2013)."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2018.07.066"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v30i1.10139"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/WFPST58552.2024.00034"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3468264.3468607"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2022.3231621"},{"key":"e_1_2_1_37_1","unstructured":"Raul Puri and Bryan Catanzaro. 2019. Zero-shot text classification with generative language models. arXiv preprint arXiv:1912.10165."},{"key":"e_1_2_1_38_1","unstructured":"Alec Radford Karthik Narasimhan Tim Salimans and Ilya Sutskever. 2018. 
Improving language understanding by generative pre-training."},{"key":"e_1_2_1_39_1","volume-title":"Language models are unsupervised multitask learners. OpenAI blog, 1, 8","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1, 8 (2019), 9."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2021.3052449"},{"key":"e_1_2_1_41_1","volume-title":"Learning representations by back-propagating errors. nature, 323, 6088","author":"Rumelhart David E","year":"1986","unstructured":"David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by back-propagating errors. nature, 323, 6088 (1986), 533\u2013536."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3015135.3015136"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3015135.3015136"},{"key":"e_1_2_1_44_1","volume-title":"Generalized learning vector quantization. Advances in neural information processing systems, 8","author":"Sato Atsushi","year":"1995","unstructured":"Atsushi Sato and Keiji Yamada. 1995. Generalized learning vector quantization. 
Advances in neural information processing systems, 8 (1995)."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/2886012"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2024.103958"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3488932.3497768"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1108\/eb026526"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-14081-5_23"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-06365-7_13"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3371307.3371313"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3533767.3534367"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.685"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3274694.3274726"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","unstructured":"Yonghui Wu Mike Schuster Zhifeng Chen Quoc V Le Mohammad Norouzi Wolfgang Macherey Maxim Krikun Yuan Cao Qin Gao and Klaus Macherey. 2016. Google\u2019s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144 https:\/\/doi.org\/10.48550\/arXiv.1609.08144 10.48550\/arXiv.1609.08144","DOI":"10.48550\/arXiv.1609.08144"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2917668"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/BWCCA.2010.85"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i01.5466"},{"key":"e_1_2_1_59_1","doi-asserted-by":"crossref","unstructured":"Jingqing Zhang Piyawat Lertvittayakumjorn and Yike Guo. 2019. Integrating semantic knowledge to tackle zero-shot text classification. 
arXiv preprint arXiv:1903.12626.","DOI":"10.18653\/v1\/N19-1108"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cose.2020.102072"}],"container-title":["Proceedings of the ACM on Software Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3728886","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,16]],"date-time":"2025-07-16T16:52:11Z","timestamp":1752684731000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3728886"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,22]]},"references-count":60,"journal-issue":{"issue":"ISSTA","published-print":{"date-parts":[[2025,6,22]]}},"alternative-id":["10.1145\/3728886"],"URL":"https:\/\/doi.org\/10.1145\/3728886","relation":{},"ISSN":["2994-970X"],"issn-type":[{"type":"electronic","value":"2994-970X"}],"subject":[],"published":{"date-parts":[[2025,6,22]]}}}