{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,2]],"date-time":"2026-01-02T07:31:51Z","timestamp":1767339111967,"version":"3.41.0"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,3,21]],"date-time":"2024-03-21T00:00:00Z","timestamp":1710979200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Digital Threats"],"published-print":{"date-parts":[[2024,3,31]]},"abstract":"<jats:p>\n            Once novel malware is detected, threat reports are written by security companies that discover it. The reports often vary in the terminology describing the behavior of the malware making comparisons of reports of the same malware from different companies difficult. To aid in the automated discovery of novel malware, it was recently proposed that novel malware could be detected by identifying behaviors. This assumes that a core set of behaviors are present in most, if not all, malware variants. However, there is a lack of malware datasets that are labeled with behaviors. Motivated by a need to label malware with a common set of behaviors, this work examines automating the process of labeling malware with behaviors identified in malware threat reports despite the variability of terminology. To do so, we examine several techniques from the natural language processing (NLP) domain. We find that most state-of-the-art word embedding NLP methods require large amounts of data and are trained on generic corpora of text data\u2014missing the nuances related to information security. To address this, we use simple feature selection techniques. 
We find that simple feature selection techniques generally outperform word embedding methods and achieve an increase of 6% in the\n            <jats:italic>F<\/jats:italic>\n            <jats:sub>.5<\/jats:sub>\n            -score over prior work when used to predict MITRE ATT&amp;CK tactics in threat reports. Our work indicates that feature selection, which has commonly been overlooked by sophisticated methods in NLP tasks, is beneficial for information security related tasks, where more sophisticated NLP methodologies are not able to pick out relevant information security terms.\n          <\/jats:p>","DOI":"10.1145\/3594553","type":"journal-article","created":{"date-parts":[[2023,5,17]],"date-time":"2023-05-17T12:00:43Z","timestamp":1684324843000},"page":"1-16","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["Improving Automated Labeling for ATT&amp;CK Tactics in Malware Threat Reports"],"prefix":"10.1145","volume":"5","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9425-3484","authenticated-orcid":false,"given":"Eva","family":"Domschot","sequence":"first","affiliation":[{"name":"New Mexico Institute of Mining and Technology, Socorro, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4343-535X","authenticated-orcid":false,"given":"Ramyaa","family":"Ramyaa","sequence":"additional","affiliation":[{"name":"New Mexico Institute of Mining and Technology, Socorro, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9167-9423","authenticated-orcid":false,"given":"Michael R.","family":"Smith","sequence":"additional","affiliation":[{"name":"Sandia National Laboratories, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,3,21]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"2019. Credential Access. 
Retrieved from https:\/\/attack.mitre.org\/tactics\/TA0006\/."},{"volume-title":"Webroot Threat Report","year":"2020","key":"e_1_3_2_3_2","unstructured":"Webroot. 2020. Webroot Threat Report. Technical Report."},{"key":"e_1_3_2_4_2","unstructured":"2021. Preventing WannaCry (WCRY) Ransomware Attacks Using Trend Micro Products. Retrieved from https:\/\/success.trendmicro.com\/dcx\/s\/solution\/1117391-preventing-wannacry-wcry-ransomware-attacks-using-trend-micro-products?language=en_US&sfdcIFrameOrigin=null."},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jnca.2011.01.002"},{"key":"e_1_3_2_6_2","unstructured":"Benjamin Ampel Sagar Samtani Steven Ullman and Benjamin Ampel. 2021. Linking Common vulnerabilities and exposures to the MITRE ATT&CK framework: A self-distillation approach. CoRR abs\/2108.01696 (2021). https:\/\/arxiv.org\/abs\/2108.01696."},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICELTICs56128.2022.9932097"},{"key":"e_1_3_2_8_2","series-title":"Proceedings of the 25th Irish Conference on Artificial Intelligence and Cognitive Science","first-page":"272","volume":"2086","author":"Barry James","year":"2017","unstructured":"James Barry. 2017. Sentiment analysis of online reviews using bag-of-words and LSTM approaches. In Proceedings of the 25th Irish Conference on Artificial Intelligence and Cognitive Science(CEUR Workshop Proceedings, Vol. 2086), John McAuley and Susan McKeever (Eds.). CEUR-WS.org, 272\u2013274."},{"key":"e_1_3_2_9_2","unstructured":"Kiran Blanda. 2016. Aptnotes. Retrieved from https:\/\/github.com\/aptnotes."},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1613\/jair.953"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.5555\/1892099"},{"key":"e_1_3_2_12_2","unstructured":"Exploit Database. 2003. Offensive security\u2019s Exploit Database Archive. 
Retrieved from https:\/\/www.exploit-db.com\/."},{"key":"e_1_3_2_13_2","unstructured":"Luca Demetrio Battista Biggio Giovanni Lagorio Fabio Roli and Alessandro Armando. 2019. Explaining vulnerabilities of deep learning to adversarial malware binaries. arXiv:1901.03583. Retrieved from http:\/\/arxiv.org\/abs\/1901.03583."},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/n19-1423"},{"key":"e_1_3_2_15_2","unstructured":"2021. Micro and Macro Averages for Imbalance Multiclass Classification. Retrieved from https:\/\/androidkt.com\/micro-macro-averages-for-imbalance-multiclass-classification\/."},{"key":"e_1_3_2_16_2","unstructured":"2022. allitems.txt.Z. Retrieved from https:\/\/cve.mitre.org\/."},{"key":"e_1_3_2_17_2","unstructured":"Noora Hyvarinen. 2015. The Dukes: 7 Years of Russian Cyber-Espionage. Retrieved from https:\/\/blog.f-secure.com\/the-dukes-7-years-of-russian-cyber-espionage\/."},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1108\/eb026526"},{"key":"e_1_3_2_19_2","unstructured":"Faith Karabiber. 2021. Learn Data Science\u2014Tutorials Books Courses and More. Retrieved from https:\/\/www.learndatasci.com\/glossary\/tf-idf-term-frequency-inverse-document-frequency\/."},{"key":"e_1_3_2_20_2","doi-asserted-by":"crossref","unstructured":"Alexander Katrompas and Vangelis Metsis. 2022. Enhancing LSTM models with self-attention and stateful training. In Intelligent Systems and Applications Kohei Arai (Ed.). Springer International Publishing Cham 217\u2013235.","DOI":"10.1007\/978-3-030-82193-7_14"},{"key":"e_1_3_2_21_2","unstructured":"Brian Krebs. 2009. Krebs on Security. Retrieved from https:\/\/krebsonsecurity.com\/."},{"issue":"5","key":"e_1_3_2_22_2","first-page":"9","article-title":"A survey on feature selection techniques and classification algorithms for efficient text classification","volume":"5","author":"Kumbhar Pradnya","year":"2016","unstructured":"Pradnya Kumbhar and Manisha Mali. 2016. 
A survey on feature selection techniques and classification algorithms for efficient text classification. Int. J. Sci. Res. 5, 5 (2016), 9.","journal-title":"Int. J. Sci. Res."},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3465481.3465758"},{"key":"e_1_3_2_24_2","article-title":"Automated retrieval of att&ck tactics and techniques for cyber threat reports","author":"Legoy Valentine","year":"2020","unstructured":"Valentine Legoy, Marco Caselli, Christin Seifert, and Andreas Peter. 2020. Automated retrieval of att&ck tactics and techniques for cyber threat reports. arXiv:2004.14322. Retrieved from https:\/\/arxiv.org\/abs\/2004.14322.","journal-title":"arXiv:2004.14322"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-1143"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.3115\/1118108.1118117"},{"key":"e_1_3_2_27_2","unstructured":"Infosecurity Magazine. 2019. Cybercrime Costs Global Economy $2.9M per Minute. Retrieved from https:\/\/www.infosecurity-magazine.com\/news\/cybercrime-costs-global-economy\/."},{"key":"e_1_3_2_28_2","first-page":"162","volume-title":"Proceedings of the 6th International Conference on Information and Communication Technology (ICoICT\u201918)","author":"Nurfikri Fahmi Salman","year":"2018","unstructured":"Fahmi Salman Nurfikri, Mohamad Syahrul Mubarok, et\u00a0al. 2018. News topic classification using mutual information and bayesian network. In Proceedings of the 6th International Conference on Information and Communication Technology (ICoICT\u201918). 
IEEE, 162\u2013166."},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3056442"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/584792.584911"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/BigData47090.2019.9005997"},{"key":"e_1_3_2_33_2","doi-asserted-by":"crossref","first-page":"49","DOI":"10.1145\/3411508.3421373","volume-title":"Proceedings of the 13th ACM Workshop on Artificial Intelligence and Security","author":"Smith Michael R.","year":"2020","unstructured":"Michael R. Smith, Nicholas T. Johnson, Joe B. Ingram, Armida J. Carbajal, Bridget I. Haus, Eva Domschot, Ramyaa Ramyaa, Christopher C. Lamb, Stephen J. Verzi, and W. Philip Kegelmeyer. 2020. Mind the gap: On bridging the semantic gap between machine learning and malware analysis. In Proceedings of the 13th ACM Workshop on Artificial Intelligence and Security. 49\u201360."},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2009.187"},{"key":"e_1_3_2_35_2","first-page":"6","article-title":"A study on mutual information-based feature selection for text categorization","volume":"3","author":"Xu Yan","year":"2007","unstructured":"Yan Xu, Gareth Jones, Jintao Li, Bin Wang, and Chunming Sun. 2007. A study on mutual information-based feature selection for text categorization. J. Comput. Inf. Syst. 3 (032007), 6.","journal-title":"J. Comput. Inf. Syst."},{"key":"e_1_3_2_36_2","first-page":"412","volume-title":"Proceedings of the 14th International Conference on Machine Learning (ICML\u201997)","author":"Yang Yiming","year":"1997","unstructured":"Yiming Yang and Jan O. Pedersen. 1997. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning (ICML\u201997). 
Morgan Kaufmann Publishers Inc., San Francisco, CA, 412\u2013420."},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3073559"},{"key":"e_1_3_2_38_2","unstructured":"Kim Zetter. 2014. DarkHotel: A Sophisticated New Hacking Attack Targets High-Profile Hotel Guests. Retrieved from https:\/\/www.wired.com\/2014\/11\/darkhotel-malware\/."},{"key":"e_1_3_2_39_2","series-title":"Proceedings of the 37th International Conference on Machine Learning","first-page":"11365","volume":"119","author":"Zhao Jingyu","year":"2020","unstructured":"Jingyu Zhao, Feiqing Huang, Jia Lv, Yanjie Duan, Zhen Qin, Guodong Li, and Guangjian Tian. 2020. Do RNN and LSTM have long memory? In Proceedings of the 37th International Conference on Machine Learning(Proceedings of Machine Learning Research, Vol. 119), Hal Daum\u00e9 III and Aarti Singh (Eds.). PMLR, 11365\u201311375."},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3459637.3482399"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/1007730.1007741"}],"container-title":["Digital Threats: Research and 
Practice"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3594553","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3594553","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:09:07Z","timestamp":1750183747000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3594553"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,21]]},"references-count":40,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,3,31]]}},"alternative-id":["10.1145\/3594553"],"URL":"https:\/\/doi.org\/10.1145\/3594553","relation":{},"ISSN":["2692-1626","2576-5337"],"issn-type":[{"type":"print","value":"2692-1626"},{"type":"electronic","value":"2576-5337"}],"subject":[],"published":{"date-parts":[[2024,3,21]]},"assertion":[{"value":"2022-05-30","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-03-23","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-03-21","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}