{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,14]],"date-time":"2026-03-14T17:54:58Z","timestamp":1773510898859,"version":"3.50.1"},"reference-count":77,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2021,5,8]],"date-time":"2021-05-08T00:00:00Z","timestamp":1620432000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Discovery Early Career Researcher Award","award":["DE200100016"],"award-info":[{"award-number":["DE200100016"]}]},{"name":"European Union's Horizon 2020 research and innovation program","award":["830892"],"award-info":[{"award-number":["830892"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61702045 and 62072046"],"award-info":[{"award-number":["61702045 and 62072046"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Discovery project","award":["DP200100020"],"award-info":[{"award-number":["DP200100020"]}]},{"name":"Fonds National de la Recherche (FNR), Luxembourg, under project CHARACTERIZE","award":["C17\/IS\/11693861"],"award-info":[{"award-number":["C17\/IS\/11693861"]}]},{"name":"SPARTA project"},{"name":"Australian Research Council (ARC) under a Laureate Fellowship","award":["FL190100035"],"award-info":[{"award-number":["FL190100035"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Softw. Eng. Methodol."],"published-print":{"date-parts":[[2021,7,31]]},"abstract":"<jats:p>Malware detection at scale in the Android realm is often carried out using machine learning techniques. State-of-the-art approaches such as DREBIN and MaMaDroid are reported to yield high detection rates when assessed against well-known datasets. 
Unfortunately, such datasets may include a large portion of duplicated samples, which may bias recorded experimental results and insights. In this article, we perform extensive experiments to measure the performance gap that occurs when datasets are de-duplicated. Our experimental results reveal that duplication in published datasets has a limited impact on supervised malware classification models. This observation contrasts with the finding of Allamanis on the general case of machine learning bias for big code. Our experiments, however, show that sample duplication more substantially affects unsupervised learning models (e.g., malware family clustering). Nevertheless, we argue that our fellow researchers and practitioners should always take sample duplication into consideration when performing machine-learning-based (via either supervised or unsupervised learning) Android malware detections, no matter how significant the impact might be.<\/jats:p>","DOI":"10.1145\/3446905","type":"journal-article","created":{"date-parts":[[2021,5,8]],"date-time":"2021-05-08T11:40:33Z","timestamp":1620474033000},"page":"1-38","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":54,"title":["On the Impact of Sample Duplication in Machine-Learning-Based Android Malware Detection"],"prefix":"10.1145","volume":"30","author":[{"given":"Yanjie","family":"Zhao","sequence":"first","affiliation":[{"name":"Monash University, Australia, Clayton, VIC"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2990-1614","authenticated-orcid":false,"given":"Li","family":"Li","sequence":"additional","affiliation":[{"name":"Monash University, Australia, Clayton, VIC"}]},{"given":"Haoyu","family":"Wang","sequence":"additional","affiliation":[{"name":"Beijing University of Posts and Telecommunications, Beijing, PRC, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5224-9970","authenticated-orcid":false,"given":"Haipeng","family":"Cai","sequence":"additional","affiliation":[{"name":"Washington State University, United States, WA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7270-9869","authenticated-orcid":false,"given":"Tegawend\u00e9 F.","family":"Bissyand\u00e9","sequence":"additional","affiliation":[{"name":"University of Luxembourg, Luxembourg"}]},{"given":"Jacques","family":"Klein","sequence":"additional","affiliation":[{"name":"University of Luxembourg, Luxembourg"}]},{"given":"John","family":"Grundy","sequence":"additional","affiliation":[{"name":"Monash University, Australia, Clayton, VIC"}]}],"member":"320","published-online":{"date-parts":[[2021,5,8]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Wikipedia contributors. 2020. Sequential minimal optimization. https:\/\/en.wikipedia.org\/wiki\/Sequential_minimal_optimization"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-04283-1_6"},{"key":"e_1_2_1_3_1","volume-title":"2013 IEEE International Conference on Green Computing and Communications and IEEE Internet of Things and IEEE Cyber, Physical and Social Computing. IEEE, 663--669","author":"Mohammed","unstructured":"Mohammed S. Alam and Son T. Vuong. 2013. Random forest classification for detecting Android malware. In 2013 IEEE International Conference on Green Computing and Communications and IEEE Internet of Things and IEEE Cyber, Physical and Social Computing. 
IEEE, 663--669."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3359591.3359735"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10664-014-9352-6"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2901739.2903508"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/MALWARE.2015.7413693"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.14722\/ndss.2014.23247"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2666356.2594299"},{"key":"e_1_2_1_10_1","first-page":"228","article-title":"Permission-based android malware detection","volume":"2","author":"Aung Zarni","year":"2013","unstructured":"Zarni Aung and Win Zaw. 2013. Permission-based android malware detection. International Journal of Scientific & Technology Research 2, 3 (2013), 228--234.","journal-title":"International Journal of Scientific & Technology Research"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.5555\/2818754.2818808"},{"key":"e_1_2_1_12_1","volume-title":"Proceedings of the Network and Distributed System Security Symposium (NDSS'09)","volume":"9","author":"Bayer Ulrich","year":"2009","unstructured":"Ulrich Bayer, Paolo Milani Comparetti, Clemens Hlauschek, Christopher Kruegel, and Engin Kirda. 2009. Scalable, behavior-based malware clustering. In Proceedings of the Network and Distributed System Security Symposium (NDSS'09), Vol. 9. 8--11."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1504\/IJESDF.2007.016865"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDMW.2016.0046"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.5220\/0005537800270038"},{"key":"e_1_2_1_16_1","volume-title":"Software engineering for fairness: A case study with hyperparameter optimization. arXiv preprint arXiv:1905.05786","author":"Chakraborty Joymallya","year":"2019","unstructured":"Joymallya Chakraborty, Tianpei Xia, Fahmid M. 
Fahid, and Tim Menzies. 2019. Software engineering for fairness: A case study with hyperparameter optimization. arXiv preprint arXiv:1905.05786 (2019)."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2017.2739145"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2556464.2556467"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3236024.3236045"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2018.2806891"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE.2019.00085"},{"key":"e_1_2_1_22_1","volume-title":"Can we trust your explanations? Sanity checks for interpreters in Android malware analysis. arXiv preprint arXiv:2008.05895","author":"Fan Ming","year":"2020","unstructured":"Ming Fan, Wenying Wei, Xiaofei Xie, Yang Liu, Xiaohong Guan, and Ting Liu. 2020. Can we trust your explanations? Sanity checks for interpreters in Android malware analysis. arXiv preprint arXiv:2008.05895 (2020)."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACT.2010.33"},{"key":"e_1_2_1_24_1","volume-title":"Tuning for software analytics: Is it really necessary? Information and Software Technology 76","author":"Fu Wei","year":"2016","unstructured":"Wei Fu, Tim Menzies, and Xipeng Shen. 2016. Tuning for software analytics: Is it really necessary? Information and Software Technology 76 (2016), 135--146."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/SANER.2019.8668010"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/TR.2019.2956690"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3368089.3409745"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3162625"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.5555\/2819009.2819151"},{"key":"e_1_2_1_30_1","volume-title":"Mutantx-s: Scalable malware clustering based on static features. 
In Presented as Part of the 2013 {USENIX} Annual Technical Conference ({USENIX} {ATC}\u201913). 187--198.","author":"Hu Xin","year":"2013","unstructured":"Xin Hu, Kang G. Shin, Sandeep Bhatkar, and Kent Griffin. 2013. Mutantx-s: Scalable malware clustering based on static features. In Presented as Part of the 2013 {USENIX} Annual Technical Conference ({USENIX} {ATC}\u201913). 187--198."},{"key":"e_1_2_1_31_1","volume-title":"Dating with scambots: Understanding the ecosystem of fraudulent dating applications","author":"Hu Yangyu","year":"2019","unstructured":"Yangyu Hu, Haoyu Wang, Yajin Zhou, Yao Guo, Li Li, Bingxuan Luo, and Fangren Xu. 2019. Dating with scambots: Understanding the ecosystem of fraudulent dating applications. IEEE Transactions on Dependable and Secure Computing (TDSC) (2019)."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSR.2017.57"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11416-018-0316-z"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICC.2014.6883436"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.diin.2018.01.007"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/TR.2018.2865733"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2958927"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/TII.2017.2789219"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSME.2017.49"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/QRS.2015.36"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE.2015.48"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISSRE.2018.00031"},{"key":"e_1_2_1_43_1","volume-title":"Rebooting research on detecting repackaged Android apps: Literature review and benchmark","author":"Li Li","year":"2019","unstructured":"Li Li, Tegawend\u00e9 F. Bissyand\u00e9, and Jacques Klein. 2019. 
Rebooting research on detecting repackaged Android apps: Literature review and benchmark. IEEE Transactions on Software Engineering (TSE) (2019)."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/2931037.2931044"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.infsof.2017.04.001"},{"key":"e_1_2_1_46_1","volume-title":"Collecting millions of Android apps and their metadata for the research community. arXiv preprint arXiv:1709.05281","author":"Li Li","year":"2017","unstructured":"Li Li, Jun Gao, M\u00e9d\u00e9ric Hurier, Pingfan Kong, Tegawend\u00e9 F. Bissyand\u00e9, Alexandre Bartel, Jacques Klein, and Yves Le Traon. 2017. AndroZoo++: Collecting millions of Android apps and their metadata for the research community. arXiv preprint arXiv:1709.05281 (2017)."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11390-017-1786-z"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2017.2656460"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASE.2019.00017"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3366423.3380242"},{"key":"e_1_2_1_51_1","volume-title":"Gordon Ross, and Gianluca Stringhini.","author":"Mariconti Enrico","year":"2017","unstructured":"Enrico Mariconti, Lucky Onwuzurike, Panagiotis Andriotis, Emiliano De Cristofaro, Gordon Ross, and Gianluca Stringhini. 2017. MaMaDroid: Detecting Android malware by building Markov chains of behavioral models. 
In Network and Distributed Systems Security Symposium (NDSS\u201917)."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3029806.3029823"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3374664.3375746"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN.2016.7727817"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/2837614.2837661"},{"key":"e_1_2_1_56_1","unstructured":"Xiaorui Pan, Xueqiang Wang, Yue Duan, XiaoFeng Wang, and Heng Yin. 2017. Dark hazard: Learning-based large-scale discovery of hidden sensitive operations in Android apps. In NDSS."},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICTAI.2013.53"},{"key":"e_1_2_1_58_1","volume-title":"28th USENIX Security Symposium (USENIX Security\u201919)","author":"Pendlebury Feargus","year":"2019","unstructured":"Feargus Pendlebury, Fabio Pierazzi, Roberto Jordaney, Johannes Kinder, and Lorenzo Cavallaro. 2019. TESSERACT: Eliminating experimental bias in malware classification across space and time. In 28th USENIX Security Symposium (USENIX Security\u201919). USENIX Association, Santa Clara, CA, 729--746. https:\/\/www.usenix.org\/conference\/usenixsecurity19\/presentation\/pendlebury."},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/2420950.2420999"},{"key":"e_1_2_1_60_1","volume-title":"Gaussian mixture models. Encyclopedia of Biometrics 741","author":"Reynolds Douglas A.","year":"2009","unstructured":"Douglas A. Reynolds. 2009. Gaussian mixture models. Encyclopedia of Biometrics 741 (2009), 659--663."},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/21.97458"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-33018-6_30"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-45719-2_11"},{"key":"e_1_2_1_64_1","volume-title":"Taming reflection: An essential step towards whole-program analysis of Android apps. 
ACM Transactions on Software Engineering and Methodology (TOSEM)","author":"Sun Xiaoyu","year":"2020","unstructured":"Xiaoyu Sun, Li Li, Tegawend\u00e9 F. Bissyand\u00e9, Jacques Klein, Damien Octeau, and John Grundy. 2020. Taming reflection: An essential step towards whole-program analysis of Android apps. ACM Transactions on Software Engineering and Methodology (TOSEM) (2020)."},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2018.2876537"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2016.2584050"},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSR.2019.00067"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-60876-1_12"},{"key":"e_1_2_1_69_1","volume-title":"Proceedings of the 12th IEEE International Workshop on Program Comprehension","author":"Wen Zhihua","year":"2004","unstructured":"Zhihua Wen and Vassilios Tzerpos. 2004. An effectiveness measure for software clustering algorithms. In Proceedings of the 12th IEEE International Workshop on Program Comprehension, 2004. IEEE, 194--203."},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.infsof.2016.03.004"},{"key":"e_1_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-02450-5_11"},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1145\/3180155.3180223"},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.infsof.2017.04.007"},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.camwa.2011.07.005"},{"key":"e_1_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1145\/2619239.2631434"},{"key":"e_1_2_1_76_1","volume-title":"2019 International Symposium on Theoretical Aspects of Software Engineering (TASE\u201919)","author":"Zhiwu Xu","year":"2019","unstructured":"Xu Zhiwu, Kerong Ren, and Fu Song. 2019. Android malware family classification and characterization using CFG and DFG. 
In 2019 International Symposium on Theoretical Aspects of Software Engineering (TASE\u201919). IEEE, 49--56."},{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1109\/SP.2012.16"}],"container-title":["ACM Transactions on Software Engineering and Methodology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3446905","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3446905","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:28:26Z","timestamp":1750195706000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3446905"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,5,8]]},"references-count":77,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2021,7,31]]}},"alternative-id":["10.1145\/3446905"],"URL":"https:\/\/doi.org\/10.1145\/3446905","relation":{},"ISSN":["1049-331X","1557-7392"],"issn-type":[{"value":"1049-331X","type":"print"},{"value":"1557-7392","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,5,8]]},"assertion":[{"value":"2020-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-01-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-05-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}