{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,24]],"date-time":"2026-02-24T10:25:21Z","timestamp":1771928721537,"version":"3.50.1"},"reference-count":50,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,3,21]],"date-time":"2024-03-21T00:00:00Z","timestamp":1710979200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Digital Threats"],"published-print":{"date-parts":[[2024,3,31]]},"abstract":"<jats:p>Real-world malware analysis consists of a complex pipeline of classifiers and data analysis\u2014from detection to classification of capabilities to retrieval of unique training samples from user systems. In this article, we aim to reduce the complexity of these pipelines through the use of low-dimensional metric embeddings of Windows PE files, which can be used in a variety of downstream applications, including malware detection, family classification, and malware attribute tagging. Specifically, we enrich labeling of malicious and benign PE files with computationally-expensive, disassembly-based malicious capabilities information. Using this enhanced labeling, we derive several different types of efficient metric embeddings utilizing an embedding neural network trained via contrastive loss, Spearman rank correlation, and combinations thereof. Our evaluation examines performance on a variety of transfer tasks performed on the EMBER and SOREL datasets, demonstrating that low-dimensional, computationally-efficient metric embeddings maintain performance with little decay. This offers the potential to quickly retrain for a variety of transfer tasks at significantly reduced overhead and complexity. 
We conclude with an examination of practical considerations for the use of our proposed embedding approach, such as robustness to adversarial evasion and introduction of task-specific auxiliary objectives to improve performance on mission critical tasks.<\/jats:p>","DOI":"10.1145\/3615669","type":"journal-article","created":{"date-parts":[[2023,8,16]],"date-time":"2023-08-16T12:12:52Z","timestamp":1692187972000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["Efficient Malware Analysis Using Metric Embeddings"],"prefix":"10.1145","volume":"5","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8945-9533","authenticated-orcid":false,"given":"Ethan M.","family":"Rudd","sequence":"first","affiliation":[{"name":"Mandiant Inc., Reston, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-5132-5377","authenticated-orcid":false,"given":"David","family":"Krisiloff","sequence":"additional","affiliation":[{"name":"Mandiant Inc., Reston, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-6921-1842","authenticated-orcid":false,"given":"Scott","family":"Coull","sequence":"additional","affiliation":[{"name":"Mandiant Inc., Reston, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-7807-7941","authenticated-orcid":false,"given":"Daniel","family":"Olszewski","sequence":"additional","affiliation":[{"name":"University of Florida, Gainesville, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9900-1972","authenticated-orcid":false,"given":"Edward","family":"Raff","sequence":"additional","affiliation":[{"name":"Booz Allen Hamilton, Columbia, 
USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6368-8696","authenticated-orcid":false,"given":"James","family":"Holt","sequence":"additional","affiliation":[{"name":"Laboratory for Physical Sciences, University of Maryland, College Park, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,3,21]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"Hyrum S. Anderson Anant Kharkar Bobby Filar David Evans and Phil Roth. 2018. Learning to evade static PE machine learning malware models via reinforcement learning. arXiv preprint arXiv:1801.08917 (2018)."},{"key":"e_1_3_2_3_2","unstructured":"Hyrum S. Anderson and Phil Roth. 2018. EMBER: An open dataset for training static pe malware machine learning models. arXiv preprint arXiv:1804.04637 (2018)."},{"key":"e_1_3_2_4_2","unstructured":"W. Ballenthin and M. Raabe. 2020. capa: Automatically identify malware capabilities. (2020). Retrieved from https:\/\/www.mandiant.com\/resources\/capa-automatically-identify-malware-capabilities. Accessed: 2022-08-05."},{"key":"e_1_3_2_5_2","first-page":"950","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Blondel Mathieu","year":"2020","unstructured":"Mathieu Blondel, Olivier Teboul, Quentin Berthet, and Josip Djolonga. 2020. Fast differentiable sorting and ranking. In Proceedings of the International Conference on Machine Learning. PMLR, 950\u2013959."},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/IMF.2013.18"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-39891-9_11"},{"key":"e_1_3_2_8_2","article-title":"On the database lookup problem of approximate matching","volume":"11","author":"Breitinger Frank","year":"2014","unstructured":"Frank Breitinger, Harald Baier, and Douglas White. 2014. On the database lookup problem of approximate matching. 
Digital Investigation 11, S1 (May 2014), S1\u2013S9. https:\/\/www.sciencedirect.com\/science\/article\/pii\/S1742287614000061","journal-title":"Digital Investigation"},{"issue":"2","key":"e_1_3_2_9_2","first-page":"155","article-title":"An efficient similarity digests database lookup\u2014A logarithmic divide & conquer approach","volume":"9","author":"Breitinger Frank","year":"2014","unstructured":"Frank Breitinger, Christian Rathgeb, and Harald Baier. 2014. An efficient similarity digests database lookup\u2014A logarithmic divide & conquer approach. The Journal of Digital Forensics, Security and Law (JDFSL) 9, 2 (2014), 155\u2013166. DOI:http:\/\/ojs.jdfsl.org\/index.php\/jdfsl\/article\/view\/276","journal-title":"The Journal of Digital Forensics, Security and Law (JDFSL)"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/2950290.2950350"},{"key":"e_1_3_2_11_2","doi-asserted-by":"crossref","first-page":"302","DOI":"10.1007\/978-3-031-17551-0_20","volume-title":"Proceedings of the Science of Cyber Security: 4th International Conference, SciSec 2022, Matsue, Japan, August 10\u201312, 2022, Revised Selected Papers","author":"Chen Xiao","year":"2022","unstructured":"Xiao Chen, Zhengwei Jiang, Shuwei Wang, Rongqi Jing, Chen Ling, and Qiuyun Wang. 2022. Malware detected and tell me why: An verifiable malware detection model with graph metric learning. In Proceedings of the Science of Cyber Security: 4th International Conference, SciSec 2022, Matsue, Japan, August 10\u201312, 2022, Revised Selected Papers. 
Springer, 302\u2013314."},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2021.3082330"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3473039"},{"key":"e_1_3_2_14_2","first-page":"452","volume-title":"Proceedings of the 2022 ACM on Asia Conference on Computer and Communications Security","author":"Dib Mirabelle","year":"2022","unstructured":"Mirabelle Dib, Sadegh Torabi, Elias Bou-Harb, Nizar Bouguila, and Chadi Assi. 2022. EVOLIoT: A self-supervised contrastive learning framework for detecting and characterizing evolving IoT malware variants. In Proceedings of the 2022 ACM on Asia Conference on Computer and Communications Security. 452\u2013466."},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939719"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/SP.2019.00003"},{"key":"e_1_3_2_17_2","first-page":"647","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Donahue Jeff","year":"2014","unstructured":"Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2014. Decaf: A deep convolutional activation feature for generic visual recognition. In Proceedings of the International Conference on Machine Learning. PMLR, 647\u2013655."},{"key":"e_1_3_2_18_2","first-page":"249","volume-title":"Proceedings of the 13th International Conference on Artificial Intelligence and Statistics","author":"Glorot Xavier","year":"2010","unstructured":"Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 249\u2013256."},{"key":"e_1_3_2_19_2","unstructured":"Richard Harang and Ethan M. Rudd. 2020. SOREL-20M: A large scale benchmark dataset for malicious PE detection. 
arXiv preprint arXiv:2012.07634 (2020)."},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-24261-3_7"},{"key":"e_1_3_2_21_2","first-page":"643","volume-title":"Proceedings of the ICISSP","author":"Jurecek Martin","year":"2021","unstructured":"Martin Jurecek, Olha Jureckov\u00e1, and R\u00f3bert L\u00f3rencz. 2021. Improving classification of malware families using learning a distance metric. In Proceedings of the ICISSP. 643\u2013652."},{"key":"e_1_3_2_22_2","first-page":"725","volume-title":"Proceedings of the ICISSP","author":"Jurecek Martin","year":"2020","unstructured":"Martin Jurecek and R\u00f3bert L\u00f3rencz. 2020. Distance metric learning using particle swarm optimization to improve static malware detection. In Proceedings of the ICISSP. 725\u2013732."},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2021.3094064"},{"issue":"9","key":"e_1_3_2_24_2","article-title":"Deep metric learning: A survey","volume":"11","author":"Kaya Mahmut","year":"2019","unstructured":"Mahmut Kaya and Hasan \u015eakir Bilge. 2019. Deep metric learning: A survey. Symmetry 11, 9 (2019), 1066.","journal-title":"Symmetry"},{"key":"e_1_3_2_25_2","volume-title":"Proceedings of the ICML Deep Learning Workshop","volume":"2","author":"Koch Gregory","year":"2015","unstructured":"Gregory Koch, Richard Zemel, Ruslan Salakhutdinov, et\u00a0al. 2015. Siamese neural networks for one-shot image recognition. In Proceedings of the ICML Deep Learning Workshop, Vol. 2. Lille, 0."},{"key":"e_1_3_2_26_2","first-page":"533","volume-title":"Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO)","author":"Kolosnjaji Bojan","year":"2018","unstructured":"Bojan Kolosnjaji, Ambra Demontis, Battista Biggio, Davide Maiorca, Giorgio Giacinto, Claudia Eckert, and Fabio Roli. 2018. Adversarial malware binaries: Evading deep learning for malware detection in executables. 
In Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO). IEEE, 533\u2013537."},{"key":"e_1_3_2_27_2","volume-title":"Proceedings of the CCS","author":"Li Xuezixiang","year":"2021","unstructured":"Xuezixiang Li, Yu Qu, and Heng Yin. 2021. PalmTree: Learning an assembly language model for instruction embedding. In Proceedings of the CCS."},{"key":"e_1_3_2_28_2","volume-title":"Proceedings of the 9th EAI International Conference on Digital Forensics and Cyber Crime (ICDF2C 2017)","author":"Lillis David","year":"2017","unstructured":"David Lillis, Frank Breitinger, and Mark Scanlon. 2017. Expediting MRSH-v2 approximate matching with hierarchical bloom filter trees. In Proceedings of the 9th EAI International Conference on Digital Forensics and Cyber Crime (ICDF2C 2017). Springer."},{"issue":"1","key":"e_1_3_2_29_2","first-page":"1","article-title":"FewM-HGCL: Few-shot malware variants detection via heterogeneous graph contrastive learning","author":"Liu Chen","year":"2022","unstructured":"Chen Liu, Bo Li, Jun Zhao, Ziyang Zhen, Xudong Liu, and Qunshi Zhang. 2022. FewM-HGCL: Few-shot malware variants detection via heterogeneous graph contrastive learning. IEEE Transactions on Dependable and Secure Computing 1 (2022), 1\u201318. https:\/\/ieeexplore.ieee.org\/document\/9928211\/citations#citations","journal-title":"IEEE Transactions on Dependable and Secure Computing"},{"key":"e_1_3_2_30_2","first-page":"309","volume-title":"Proceedings of the Detection of Intrusions and Malware, and Vulnerability Assessment","author":"Massarelli Luca","year":"2019","unstructured":"Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, and Roberto Baldoni. 2019. SAFE: Self-attentive function embeddings for binary similarity. In Proceedings of the Detection of Intrusions and Malware, and Vulnerability Assessment. 
309\u2013329."},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/CTC.2013.9"},{"key":"e_1_3_2_32_2","doi-asserted-by":"crossref","first-page":"1332","DOI":"10.1109\/SP40000.2020.00073","volume-title":"Proceedings of the 2020 IEEE Symposium on Security and Privacy (SP)","author":"Pierazzi Fabio","year":"2020","unstructured":"Fabio Pierazzi, Feargus Pendlebury, Jacopo Cortellazzi, and Lorenzo Cavallaro. 2020. Intriguing properties of adversarial ml attacks in the problem space. In Proceedings of the 2020 IEEE Symposium on Security and Privacy (SP). IEEE, 1332\u20131349."},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/3128572.3140446"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.diin.2017.12.004"},{"key":"e_1_3_2_35_2","first-page":"303","volume-title":"Proceedings of the 28th USENIX Security Symposium (USENIX Security 19)","author":"Rudd Ethan M.","year":"2019","unstructured":"Ethan M. Rudd, Felipe N. Ducau, Cody Wild, Konstantin Berlin, and Richard Harang. 2019. ALOHA: Auxiliary loss optimization for hypothesis augmentation. In Proceedings of the 28th USENIX Security Symposium (USENIX Security 19). 303\u2013320."},{"key":"e_1_3_2_36_2","first-page":"19","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Rudd Ethan M.","year":"2016","unstructured":"Ethan M. Rudd, Manuel G\u00fcnther, and Terrance E. Boult. 2016. Moon: A mixed objective optimization network for the recognition of facial attributes. In Proceedings of the European Conference on Computer Vision. Springer, 19\u201335."},{"key":"e_1_3_2_37_2","doi-asserted-by":"crossref","first-page":"21","DOI":"10.1145\/3494110.3528242","volume-title":"Proceedings of the 1st Workshop on Robust Malware Analysis","author":"Rudd Ethan M.","year":"2022","unstructured":"Ethan M. Rudd, Mohammad Saidur Rahman, and Philip Tully. 2022. Transformers for end-to-end InfoSec tasks: A feasibility study. 
In Proceedings of the 1st Workshop on Robust Malware Analysis. 21\u201331."},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298682"},{"key":"e_1_3_2_39_2","unstructured":"David Sculley Gary Holt Daniel Golovin Eugene Davydov Todd Phillips Dietmar Ebner Vinay Chaudhary and Michael Young. 2014. Machine learning: The high interest credit card of technical debt. (2014)."},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-45719-2_11"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.5555\/2831143.2831182"},{"key":"e_1_3_2_42_2","doi-asserted-by":"crossref","first-page":"990","DOI":"10.1145\/3488932.3497768","volume-title":"Proceedings of the 2022 ACM on Asia Conference on Computer and Communications Security","author":"Song Wei","year":"2022","unstructured":"Wei Song, Xuezixiang Li, Sadia Afroz, Deepali Garg, Dmitry Kuznetsov, and Heng Yin. 2022. MAB-Malware: A reinforcement learning framework for blackbox generation of adversarial malware. In Proceedings of the 2022 ACM on Asia Conference on Computer and Communications Security (Nagasaki, Japan). Association for Computing Machinery, 990\u20131003."},{"key":"e_1_3_2_43_2","doi-asserted-by":"crossref","first-page":"8","DOI":"10.1109\/SPW.2019.00015","volume-title":"Proceedings of the 2019 IEEE Security and Privacy Workshops (SPW)","author":"Suciu Octavian","year":"2019","unstructured":"Octavian Suciu, Scott E. Coull, and Jeffrey Johns. 2019. Exploring adversarial examples in malware detection. In Proceedings of the 2019 IEEE Security and Privacy Workshops (SPW). IEEE, 8\u201314."},{"issue":"11","key":"e_1_3_2_44_2","article-title":"Visualizing data using t-SNE.","volume":"9","author":"Maaten Laurens Van der","year":"2008","unstructured":"Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. 
Journal of Machine Learning Research 9, 11 (2008), 2579\u20132605.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_45_2","unstructured":"VirusTotal 2022. VirusTotal\u2014Stats. Retrieved from https:\/\/www.virustotal.com\/gui\/stats. Accessed: 2022-08-04."},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.diin.2013.08.003"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/3133956.3134018"},{"key":"e_1_3_2_48_2","volume-title":"Proceedings of the 4th Deep Learning and Security Workshop","author":"Yang Limin","year":"2021","unstructured":"Limin Yang, Arridhana Ciptadi, Ihar Laziuk, Ali Ahmadzadeh, and Gang Wang. 2021. BODMAS: An open dataset for learning based temporal analysis of PE malware. In Proceedings of the 4th Deep Learning and Security Workshop."},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/DSN48987.2021.00036"},{"key":"e_1_3_2_50_2","article-title":"An android malware detection and classification approach based on contrastive lerning","volume":"123","author":"Yang Shaojie","year":"2022","unstructured":"Shaojie Yang, Yongjun Wang, Haoran Xu, Fangliang Xu, and Mantun Chen. 2022. An android malware detection and classification approach based on contrastive lerning. Computers & Security 123, 1 (2022), 102915. 
https:\/\/www.sciencedirect.com\/science\/article\/pii\/S016740482200308X","journal-title":"Computers & Security"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.5555\/3489212.3489345"}],"container-title":["Digital Threats: Research and Practice"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3615669","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3615669","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:10:17Z","timestamp":1750295417000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3615669"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,21]]},"references-count":50,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,3,31]]}},"alternative-id":["10.1145\/3615669"],"URL":"https:\/\/doi.org\/10.1145\/3615669","relation":{},"ISSN":["2692-1626","2576-5337"],"issn-type":[{"value":"2692-1626","type":"print"},{"value":"2576-5337","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,3,21]]},"assertion":[{"value":"2022-12-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-07-17","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-03-21","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}