{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,1]],"date-time":"2026-02-01T06:56:15Z","timestamp":1769928975978,"version":"3.49.0"},"reference-count":36,"publisher":"Oxford University Press (OUP)","issue":"8","license":[{"start":{"date-parts":[[2019,11,17]],"date-time":"2019-11-17T00:00:00Z","timestamp":1573948800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,8,20]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Malware detection based on static features and without code disassembling is a challenging path of research. Obfuscation makes the static analysis of malware even more challenging. This paper extends static malware detection beyond byte level $n$-grams and detecting important strings. We propose a model (Byte2vec) with the capabilities of both binary file feature representation and feature selection for malware detection. Byte2vec embeds the semantic similarity of byte level codes into a feature vector (byte vector) and also into a context vector. The learned feature vectors of Byte2vec, using skip-gram with negative-sampling topology, are combined with byte-level term-frequency (tf) for malware detection. We also show that the distance between a feature vector and its corresponding context vector provides a useful measure to rank features. The top ranked features are successfully used for malware detection. We show that this feature selection algorithm is an unsupervised version of mutual information (MI). We test the proposed scheme on four freely available Android malware datasets including one obfuscated malware dataset. The model is trained only on clean APKs. 
The results show that the model outperforms MI in a low-dimensional feature space and is competitive with MI and other state-of-the-art models in higher dimensions. In particular, our tests show very promising results on a wide range of obfuscated malware with a false negative rate of only 0.3% and a false positive rate of 2.0%. The detection results on obfuscated malware show the advantage of the unsupervised feature selection algorithm compared with the MI-based method.<\/jats:p>","DOI":"10.1093\/comjnl\/bxz121","type":"journal-article","created":{"date-parts":[[2019,9,5]],"date-time":"2019-09-05T11:41:34Z","timestamp":1567683694000},"page":"1125-1138","source":"Crossref","is-referenced-by-count":8,"title":["Byte2vec: Malware Representation and Feature Selection for Android"],"prefix":"10.1093","volume":"63","author":[{"given":"Mahmood","family":"Yousefi-Azar","sequence":"first","affiliation":[{"name":"Department of Computing, Faculty of Science and Engineering, Macquarie University, Sydney, NSW, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Len","family":"Hamey","sequence":"additional","affiliation":[{"name":"Department of Computing, Faculty of Science and Engineering, Macquarie University, Sydney, NSW, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Vijay","family":"Varadharajan","sequence":"additional","affiliation":[{"name":"Faculty of Engineering and Built Environment, University of Newcastle"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shiping","family":"Chen","sequence":"additional","affiliation":[{"name":"Commonwealth Scientific and Industrial Research Organisation, CSIRO, Data61"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2019,11,17]]},"reference":[{"key":"2020081712164345000_ref1","doi-asserted-by":"crossref","first-page":"2742","DOI":"10.1109\/TMC.2017.2687918","article-title":"Cloud-based malware detection game for 
mobile devices with offloading","volume":"16","author":"Xiao","year":"2017","journal-title":"IEEE Trans. Mobile Comput."},{"key":"2020081712164345000_ref2","doi-asserted-by":"crossref","DOI":"10.14722\/ndss.2017.23353","article-title":"Mamadroid: Detecting android malware by building markov chains of behavioral models","volume-title":"24th Annual Network and Distributed System Security Symposium, NDSS 2017","author":"Mariconti","year":"2017"},{"key":"2020081712164345000_ref3","doi-asserted-by":"crossref","first-page":"108","DOI":"10.1007\/978-3-540-70542-0_6","article-title":"Learning and classification of malware behavior","volume-title":"Int. Conf. on Detection of Intrusions and Malware, and Vulnerability Assessment","author":"Rieck","year":"2008"},{"key":"2020081712164345000_ref4","first-page":"2721","article-title":"Learning to detect and classify malicious executables in the wild","volume":"7","author":"Kolter","year":"2006","journal-title":"J. Mach. Learn. Res."},{"key":"2020081712164345000_ref5","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1007\/s11416-016-0283-1","article-title":"An investigation of byte n-gram features for malware classification","volume":"14","author":"Raff","year":"2018","journal-title":"J. Computer Virol. 
Hacking Tech."},{"key":"2020081712164345000_ref6","first-page":"3111","article-title":"Distributed representations of words and phrases and their compositionality","volume-title":"Advances in Neural Information Processing Systems","author":"Mikolov","year":"2013"},{"key":"2020081712164345000_ref7","first-page":"2177","article-title":"Neural word embedding as implicit matrix factorization","volume-title":"Advances in Neural Information Processing Systems","author":"Levy","year":"2014"},{"key":"2020081712164345000_ref8","volume-title":"First Place Team: Say No to Overfitting","author":"Wang","year":"2015"},{"key":"2020081712164345000_ref9","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1162\/tacl_a_00051","article-title":"Enriching word vectors with subword information","volume":"5","author":"Bojanowski","year":"2017","journal-title":"TACL"},{"key":"2020081712164345000_ref10","volume-title":"Deep contextualized word representations. Proc. of NAACL","author":"Peters","year":"2018"},{"key":"2020081712164345000_ref11","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/SSDSE.2017.8071952","article-title":"Malware detection using machine learning based on word2vec embeddings of machine code instructions","volume-title":"2017 Siberian Symposium on Data Science and Engineering (SSDSE)","author":"Popov","year":"2017"},{"key":"2020081712164345000_ref12","doi-asserted-by":"crossref","first-page":"121","DOI":"10.1145\/3128572.3140442","article-title":"Learning the pe header, malware detection with minimal domain knowledge","volume-title":"Proc. of the 10th ACM Workshop on Artificial Intelligence and Security","author":"Raff","year":"2017"},{"key":"2020081712164345000_ref13","doi-asserted-by":"crossref","first-page":"S48","DOI":"10.1016\/j.diin.2018.01.007","article-title":"Maldozer: Automatic framework for android malware detection using deep learning","volume":"24","author":"Karbab","year":"2018","journal-title":"Digit. 
Invest."},{"key":"2020081712164345000_ref14","first-page":"533","article-title":"Adversarial malware binaries: Evading deep learning for malware detection in executables","volume-title":"26th European Signal Processing Conf., EUSIPCO 2018, Roma, Italy, September 3\u20137, 2018","author":"Kolosnjaji","year":"2018"},{"key":"2020081712164345000_ref15","article-title":"Semantic embeddings for program behavior patterns","volume-title":"CoRR","author":"Chistyakov","year":"2018"},{"key":"2020081712164345000_ref16","volume-title":"The elements of statistical learning: data mining, inference, and prediction","author":"Trevor","year":"2009"},{"key":"2020081712164345000_ref17","doi-asserted-by":"crossref","DOI":"10.1145\/3073559","article-title":"A survey on malware detection using data mining techniques","volume":"50","author":"Ye","year":"2017","journal-title":"ACM Comput. Surv. (CSUR)"},{"key":"2020081712164345000_ref18","doi-asserted-by":"crossref","first-page":"333","DOI":"10.1145\/1835804.1835848","article-title":"Unsupervised feature selection for multi-cluster data","volume-title":"Proc. of the 16th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining","author":"Cai","year":"2010"},{"key":"2020081712164345000_ref19","first-page":"568","article-title":"Learning latent byte-level feature representation for malware detection","volume-title":"Neural Information Processing\u201425th Int. Conf., ICONIP 2018, Siem Reap, Cambodia, December 13\u201316, 2018, Proceedings, Part IV, Lecture Notes in Computer Science","author":"Yousefi-Azar","year":"2018"},{"key":"2020081712164345000_ref20","doi-asserted-by":"crossref","first-page":"95","DOI":"10.3115\/v1\/W15-1513","article-title":"Combining distributed vector representations for words","volume-title":"Proc. 
of the 1st Workshop on Vector Space Modeling for Natural Language Processing","author":"Garten","year":"2015"},{"key":"2020081712164345000_ref21","first-page":"2873","article-title":"The strange geometry of skip-gram with negative sampling","volume-title":"Proc. of the 2017 Conf. on Empirical Methods in Natural Language Processing","author":"Mimno","year":"2017"},{"key":"2020081712164345000_ref22","volume-title":"Drebin: Effective and explainable detection of android malware in your pocket","author":"Arp","year":"2014"},{"key":"2020081712164345000_ref23","doi-asserted-by":"crossref","first-page":"49418","DOI":"10.1109\/ACCESS.2018.2864871","article-title":"Malytics: A malware detection scheme","volume":"6","author":"Yousefi-Azar","year":"2018","journal-title":"IEEE Access"},{"key":"2020081712164345000_ref24","doi-asserted-by":"crossref","first-page":"252","DOI":"10.1007\/978-3-319-60876-1_12","article-title":"Deep ground truth analysis of current android malware","volume-title":"Int. Conf. on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA\u201917)","author":"Wei","year":"2017"},{"key":"2020081712164345000_ref25","doi-asserted-by":"crossref","first-page":"16","DOI":"10.1016\/j.cose.2015.02.007","article-title":"Stealth attacks: An extended insight into the obfuscation effects on android malware","volume":"51","author":"Maiorca","year":"2015","journal-title":"Comput. 
Secur."},{"key":"2020081712164345000_ref26","first-page":"468","article-title":"Androzoo: Collecting millions of android apps for the research community","volume-title":"2016 IEEE\/ACM 13th Working Conference on Mining Software Repositories (MSR)","author":"Allix","year":"2016"},{"key":"2020081712164345000_ref27","first-page":"abs\/1808.03698","article-title":"Boost: Boosting smooth trees for partial effect estimation in nonlinear regressions","volume-title":"CoRR","author":"Fonseca","year":"2018"},{"key":"2020081712164345000_ref28","first-page":"2387","article-title":"Evasion and hardening of tree ensemble classifiers","volume-title":"Int. Conf. on Machine Learning","author":"Kantchelian","year":"2016"},{"key":"2020081712164345000_ref29","doi-asserted-by":"crossref","first-page":"317","DOI":"10.1016\/j.patcog.2018.07.023","article-title":"Wild patterns: Ten years after the rise of adversarial machine learning","volume":"84","author":"Biggio","year":"2018","journal-title":"Pattern Recognit."},{"key":"2020081712164345000_ref30","first-page":"625","article-title":"Transcend: Detecting concept drift in malware classification models","volume-title":"Proc. 
of the 26th Usenix Security Symposium (Usenix Security\u201917)","author":"Jordaney","year":"2017"},{"key":"2020081712164345000_ref31","first-page":"106","volume-title":"The problem of concept drift: definitions and related work","author":"Tsymbal","year":"2004"},{"key":"2020081712164345000_ref32","article-title":"Adversarial perturbations against deep neural networks for malware classification","volume-title":"CoRR","author":"Grosse","year":"2016"},{"key":"2020081712164345000_ref33","article-title":"Android malware detection based on factorization machine","volume-title":"CoRR","author":"Li","year":"2018"},{"key":"2020081712164345000_ref34","doi-asserted-by":"crossref","first-page":"2563","DOI":"10.1109\/TIFS.2018.2824250","article-title":"Coevolution of mobile malware and anti-malware","volume":"13","author":"Sen","year":"2018","journal-title":"IEEE Trans. Inf. Foren. Sec."},{"key":"2020081712164345000_ref35","doi-asserted-by":"crossref","first-page":"240","DOI":"10.1016\/j.future.2018.07.066","article-title":"Androdet: An adaptive android obfuscation detector","volume":"90","author":"Mirzaei","year":"2019","journal-title":"Future Gener. Comp. Syst."},{"key":"2020081712164345000_ref36","first-page":"1","article-title":"Android malware detection via graphlet sampling","volume":"12","author":"Gao","year":"2018","journal-title":"IEEE Trans. 
Mobile Comput."}],"container-title":["The Computer Journal"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/comjnl\/article-pdf\/63\/8\/1125\/33657142\/bxz121.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/comjnl\/article-pdf\/63\/8\/1125\/33657142\/bxz121.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,1,18]],"date-time":"2021-01-18T09:27:56Z","timestamp":1610962076000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/comjnl\/article\/63\/8\/1125\/5618685"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,11,17]]},"references-count":36,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2019,11,17]]},"published-print":{"date-parts":[[2020,8,20]]}},"URL":"https:\/\/doi.org\/10.1093\/comjnl\/bxz121","relation":{},"ISSN":["0010-4620","1460-2067"],"issn-type":[{"value":"0010-4620","type":"print"},{"value":"1460-2067","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020,8]]},"published":{"date-parts":[[2019,11,17]]}}}