{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,12]],"date-time":"2026-05-12T03:42:20Z","timestamp":1778557340630,"version":"3.51.4"},"reference-count":32,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2018,7,27]],"date-time":"2018-07-27T00:00:00Z","timestamp":1532649600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000006","name":"Office of Naval Research","doi-asserted-by":"publisher","award":["N00014-15-1-2742"],"award-info":[{"award-number":["N00014-15-1-2742"]}],"id":[{"id":"10.13039\/100000006","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100005740","name":"Universidad Nacional del Sur","doi-asserted-by":"publisher","award":["PGI 24\/ZN34"],"award-info":[{"award-number":["PGI 24\/ZN34"]}],"id":[{"id":"10.13039\/501100005740","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100010663","name":"H2020 European Research Council","doi-asserted-by":"publisher","award":["690974"],"award-info":[{"award-number":["690974"]}],"id":[{"id":"10.13039\/100010663","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100002923","name":"Consejo Nacional de Investigaciones Cient\u00edficas y T\u00e9cnicas","doi-asserted-by":"publisher","award":["n\/a"],"award-info":[{"award-number":["n\/a"]}],"id":[{"id":"10.13039\/501100002923","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>In traditional databases, the entity resolution problem (which is also known as deduplication) refers to the task of mapping multiple manifestations of virtual objects to their corresponding real-world entities. When addressing this problem, in both theory and practice, it is widely assumed that such sets of virtual objects appear as the result of clerical errors, transliterations, missing or updated attributes, abbreviations, and so forth. In this paper, we address this problem under the assumption that this situation is caused by malicious actors operating in domains in which they do not wish to be identified, such as hacker forums and markets in which the participants are motivated to remain semi-anonymous (though they wish to keep their true identities secret, they find it useful for customers to identify their products and services). We are therefore in the presence of a different, and even more challenging, problem that we refer to as adversarial deduplication. In this paper, we study this problem via examples that arise from real-world data on malicious hacker forums and markets arising from collaborations with a cyber threat intelligence company focusing on understanding this kind of behavior. We argue that it is very difficult\u2014if not impossible\u2014to find ground truth data on which to build solutions to this problem, and develop a set of preliminary experiments based on training machine learning classifiers that leverage text analysis to detect potential cases of duplicate entities. Our results are encouraging as a first step towards building tools that human analysts can use to enhance their capabilities towards fighting cyber threats.<\/jats:p>","DOI":"10.3390\/info9080189","type":"journal-article","created":{"date-parts":[[2018,7,27]],"date-time":"2018-07-27T12:20:03Z","timestamp":1532694003000},"page":"189","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["First Steps towards Data-Driven Adversarial Deduplication"],"prefix":"10.3390","volume":"9","author":[{"given":"Jose N.","family":"Paredes","sequence":"first","affiliation":[{"name":"Department of Computer Science and Engineering, Universidad Nacional del Sur (UNS), 8000 Bahia Blanca, Argentina"},{"name":"Institute for Computer Science and Engineering (CONICET\u2013UNS), 8000 Bahia Blanca, Argentina"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3185-4992","authenticated-orcid":false,"given":"Gerardo I.","family":"Simari","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Universidad Nacional del Sur (UNS), 8000 Bahia Blanca, Argentina"},{"name":"Institute for Computer Science and Engineering (CONICET\u2013UNS), 8000 Bahia Blanca, Argentina"},{"name":"School of Computing, Informatics, and Decision Systems Engineering (CIDSE), Arizona State University, Tempe, AZ 85281, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Maria Vanina","family":"Martinez","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Universidad de Buenos Aires (UBA), C1428EGA Ciudad Autonoma de Buenos Aires, Argentina"},{"name":"Institute for Computer Science Research (CONICET\u2013UBA), C1428EGA Ciudad Autonoma de Buenos Aires, Argentina"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Marcelo A.","family":"Falappa","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Universidad Nacional del Sur (UNS), 8000 Bahia Blanca, Argentina"},{"name":"Institute for Computer Science and Engineering (CONICET\u2013UNS), 8000 Bahia Blanca, Argentina"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2018,7,27]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TKDE.2007.250581","article-title":"Duplicate Record Detection: A Survey","volume":"19","author":"Elmagarmid","year":"2007","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/1456650.1456651","article-title":"Data Fusion","volume":"41","author":"Bleiholder","year":"2009","journal-title":"ACM Comput. Surv."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Nunes, E., Diab, A., Gunn, A.T., Marin, E., Mishra, V., Paliath, V., Robertson, J., Shakarian, J., Thart, A., and Shakarian, P. (arXiv, 2016). Darknet and deepnet mining for proactive cybersecurity threat intelligence, arXiv.","DOI":"10.1109\/ISI.2016.7745435"},{"key":"ref_4","unstructured":"NIST (2018, July 24). National Vulnerability Database, Available online: https:\/\/nvd.nist.gov\/."},{"key":"ref_5","unstructured":"CVE (2018, July 24). Common Vulnerabilities and Exposures: The Standard for Information Security Vulnerability Names. Available online: http:\/\/cve.mitre.org\/."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Shakarian, J., Gunn, A.T., and Shakarian, P. (2016). Exploring Malicious Hacker Forums. Cyber Deception, Building the Scientific Foundation, Springer.","DOI":"10.1007\/978-3-319-32699-3_11"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"2018","DOI":"10.14778\/2367502.2367564","article-title":"Entity Resolution: Theory, Practice and Open Challenges","volume":"5","author":"Getoor","year":"2012","journal-title":"Proc. VLDB Endow."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1145\/1217299.1217304","article-title":"Collective Entity Resolution in Relational Data","volume":"1","author":"Bhattacharya","year":"2007","journal-title":"ACM Trans. Knowl. Discov. Data"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Whang, S.E., Menestrina, D., Koutrika, G., Theobald, M., and Garcia-Molina, H. (July, January 29). Entity Resolution with Iterative Blocking. Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, Providence, RI, USA.","DOI":"10.1145\/1559845.1559870"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"621","DOI":"10.1613\/jair.2290","article-title":"Query-time entity resolution","volume":"30","author":"Bhattacharya","year":"2007","journal-title":"J. Artif. Intell. Res."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"118","DOI":"10.1016\/j.ijar.2017.01.003","article-title":"ERBlox: Combining matching dependencies with machine learning for entity resolution","volume":"83","author":"Bahmani","year":"2017","journal-title":"Int. J. Approx. Reason."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Fan, W. (2008, January 9\u201312). Dependencies Revisited for Improving Data Quality. Proceedings of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Vancouver, BC, Canada.","DOI":"10.1145\/1376916.1376940"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"407","DOI":"10.14778\/1687627.1687674","article-title":"Reasoning About Record Matching Rules","volume":"2","author":"Fan","year":"2009","journal-title":"Proc. VLDB Endow."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"441","DOI":"10.1007\/s00224-012-9402-7","article-title":"Data Cleaning and Query Answering with Matching Dependencies and Matching Functions","volume":"52","author":"Bertossi","year":"2013","journal-title":"Theory Comput. Syst."},{"key":"ref_15","unstructured":"Rao, J.R., and Rohatgi, P. (2000, January 14\u201317). Can pseudonymity really guarantee privacy?. Proceedings of the 9th USENIX Security Symposium, Denver, CO, USA."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Novak, J., Raghavan, P., and Tomkins, A. (2004, January 17\u201322). Anti-aliasing on the web. Proceedings of the 13th International Conference on World Wide Web, Manhattan, NY, USA.","DOI":"10.1145\/988672.988678"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"12","DOI":"10.1145\/2382448.2382450","article-title":"Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity","volume":"15","author":"Brennan","year":"2012","journal-title":"ACM Trans. Inf. Syst. Secur."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Swain, S., Mishra, G., and Sindhu, C. (2017, January 20\u201322). Recent approaches on authorship attribution techniques: An overview. Proceedings of the 2017 International Conference of Electronics, Communication and Aerospace Technology, Tamil Nadu, India.","DOI":"10.1109\/ICECA.2017.8203599"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1145\/1344411.1344413","article-title":"Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace","volume":"26","author":"Abbasi","year":"2008","journal-title":"ACM Trans. Inf. Syst."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Narayanan, A., Paskov, H., Gong, N.Z., Bethencourt, J., Stefanov, E., Shin, E.C.R., and Song, D. (2012, January 20\u201323). On the feasibility of internet-scale author identification. Proceedings of the IEEE Symposium on Security and Privacy, San Francisco, CA, USA.","DOI":"10.1109\/SP.2012.46"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Johansson, F., Kaati, L., and Shrestha, A. (2013, January 25\u201328). Detecting multiple aliases in social media. Proceedings of the 2013 IEEE\/ACM International Conference on Advances in Social Networks Analysis and Mining, Niagara Falls, ON, Canada.","DOI":"10.1145\/2492517.2500261"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"22","DOI":"10.5769\/J200901002","article-title":"Classification of instant messaging communications for forensics analysis","volume":"1","author":"Orebaugh","year":"2009","journal-title":"Int. J. Forensic Comput. Sci."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1109\/TIFS.2016.2603960","article-title":"Authorship Attribution for Social Media Forensics","volume":"12","author":"Rocha","year":"2017","journal-title":"IEEE Trans. Inf. Forensics Secur."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"1311","DOI":"10.1109\/TIFS.2014.2332820","article-title":"Multiple account identity deception detection in social media using nonverbal behavior","volume":"9","author":"Tsikerdekis","year":"2014","journal-title":"IEEE Trans. Inf. Forensics Secur."},{"key":"ref_25","unstructured":"Ho, T.N., and Ng, W.K. (December, January 29). Application of Stylometry to DarkWeb Forum User Identification. Proceedings of the International Conference on Information and Communications Security, Singapore."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Zheng, X., Lai, Y.M., Chow, K.P., Hui, L.C., and Yiu, S.M. (2011, January 14\u201316). Sockpuppet detection in online discussion forums. Proceedings of the Seventh International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Dalian, China.","DOI":"10.1109\/IIHMSP.2011.69"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Kumar, S., Cheng, J., Leskovec, J., and Subrahmanian, V. (2017, January 3\u20137). An army of me: Sockpuppets in online discussion communities. Proceedings of the 26th International Conference on World Wide Web, Perth, Australia.","DOI":"10.1145\/3038912.3052677"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"124","DOI":"10.1016\/j.knosys.2018.03.002","article-title":"SocksCatch: Automatic detection and grouping of sockpuppets in social media","volume":"149","author":"Yamak","year":"2018","journal-title":"Knowl.-Based Syst."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Spitters, M., Klaver, F., Koot, G., and van Staalduinen, M. (2015, January 7\u20139). Authorship analysis on dark marketplace forums. Proceedings of the European Intelligence and Security Informatics Conference, Manchester, UK.","DOI":"10.1109\/EISIC.2015.47"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Marin, E., Diab, A., and Shakarian, P. (2016, January 27\u201330). Product offerings in malicious hacker markets. Proceedings of the IEEE Intelligence and Security Informatics 2016 Conference, Tucson, Arizona, USA.","DOI":"10.1109\/ISI.2016.7745465"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Nunes, E., Shakarian, P., and Simari, G.I. (2018, January 15\u201317). At-risk system identification via analysis of discussions on the darkweb. Proceedings of the APWG Symposium on Electronic Crime Research, San Diego, CA, USA.","DOI":"10.1109\/ECRIME.2018.8376211"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Tavabi, N., Goyal, P., Almukaynizi, M., Shakarian, P., and Lerman, K. (2018, January 2\u20137). DarkEmbed: Exploit Prediction with Neural Language Models. Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.11428"}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/9\/8\/189\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T15:14:40Z","timestamp":1760195680000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/9\/8\/189"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,7,27]]},"references-count":32,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2018,8]]}},"alternative-id":["info9080189"],"URL":"https:\/\/doi.org\/10.3390\/info9080189","relation":{"has-preprint":[{"id-type":"doi","id":"10.20944\/preprints201806.0425.v2","asserted-by":"object"},{"id-type":"doi","id":"10.20944\/preprints201806.0425.v1","asserted-by":"object"}]},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,7,27]]}}}