{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,28]],"date-time":"2026-03-28T17:29:49Z","timestamp":1774718989707,"version":"3.50.1"},"reference-count":54,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,2,28]],"date-time":"2023-02-28T00:00:00Z","timestamp":1677542400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Science Foundation Graduate Research Fellowship","award":["DGE1745016 and DGE2140739"],"award-info":[{"award-number":["DGE1745016 and DGE2140739"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Knowl. Discov. Data"],"published-print":{"date-parts":[[2023,2,28]]},"abstract":"<jats:p>Given a million escort advertisements, how can we spot near-duplicates? Such micro-clusters of ads are usually signals of human trafficking (HT). How can we summarize them to convince law enforcement to act? 
Spotting micro-clusters of near-duplicate documents is useful in multiple, additional settings, including spam-bot detection in Twitter ads, plagiarism, and more.<\/jats:p>\n          <jats:p>\n            We present\n            <jats:sc>InfoShield<\/jats:sc>\n            , which makes the following contributions:\n            <jats:italic>practical<\/jats:italic>\n            , being scalable and effective on real data;\n            <jats:italic>parameter-free and principled<\/jats:italic>\n            , requiring no user-defined parameters;\n            <jats:italic>interpretable<\/jats:italic>\n            , finding a document to be the cluster representative, highlighting all the common phrases, and automatically detecting \u201cslots\u201d (i.e., phrases that differ in every document); and\n            <jats:italic>generalizable<\/jats:italic>\n            , beating or matching domain-specific methods in Twitter bot detection and HT detection, respectively, as well as being language independent. Interpretability is particularly important for the anti-HT domain, where law enforcement must visually inspect ads.\n          <\/jats:p>\n          <jats:p>\n            Our experiments on real data show that\n            <jats:sc>InfoShield<\/jats:sc>\n            correctly identifies Twitter bots with an F1 score over 90% and detects HT ads with 84% precision. Moreover, it is scalable, requiring about 8 hours for 4 million documents on a stock laptop. 
Our incremental version,\n            <jats:sc>DeltaShield<\/jats:sc>\n            , allows for fast, incremental updates, with minor loss of accuracy.\n          <\/jats:p>","DOI":"10.1145\/3563040","type":"journal-article","created":{"date-parts":[[2023,2,7]],"date-time":"2023-02-07T13:24:59Z","timestamp":1675776299000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["DeltaShield: Information Theory for Human-Trafficking Detection"],"prefix":"10.1145","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9646-9190","authenticated-orcid":false,"given":"Catalina","family":"Vajiac","sequence":"first","affiliation":[{"name":"Carnegie Mellon University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6271-8558","authenticated-orcid":false,"given":"Meng-Chieh","family":"Lee","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7954-514X","authenticated-orcid":false,"given":"Aayushi","family":"Kulshrestha","sequence":"additional","affiliation":[{"name":"McGill University &amp; Mila"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8588-1211","authenticated-orcid":false,"given":"Sacha","family":"Levy","sequence":"additional","affiliation":[{"name":"McGill University &amp; Mila"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3344-2361","authenticated-orcid":false,"given":"Namyong","family":"Park","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6225-6621","authenticated-orcid":false,"given":"Andreas","family":"Olligschlaeger","sequence":"additional","affiliation":[{"name":"Marinus 
Analytics"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9112-4751","authenticated-orcid":false,"given":"Cara","family":"Jones","sequence":"additional","affiliation":[{"name":"Marinus Analytics"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2348-0353","authenticated-orcid":false,"given":"Reihaneh","family":"Rabbany","sequence":"additional","affiliation":[{"name":"McGill University &amp; Mila"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2996-9790","authenticated-orcid":false,"given":"Christos","family":"Faloutsos","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,3,30]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.comcom.2013.04.004"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1186\/s13388-017-0029-8"},{"key":"e_1_3_2_4_2","doi-asserted-by":"crossref","first-page":"49","DOI":"10.1145\/304182.304187","volume-title":"Proceedings of SIGMOD","author":"Ankerst Mihael","year":"1999","unstructured":"Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and J\u00f6rg Sander. 1999. OPTICS: Ordering points to identify the clustering structure. In Proceedings of SIGMOD. 49\u201360."},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1016\/0022-2836(87)90316-0"},{"key":"e_1_3_2_6_2","volume-title":"Proceedings of HLT-NAACL","author":"Barzilay Regina","year":"2003","unstructured":"Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. 
In Proceedings of HLT-NAACL."},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00051"},{"key":"e_1_3_2_8_2","first-page":"398","volume-title":"Proceedings of SIGMOD.","author":"Brin Sergey","year":"1995","unstructured":"Sergey Brin, James Davis, and H\u00e9ctor Garc\u00eda-Molina. 1995. Copy detection mechanisms for digital documents. In Proceedings of SIGMOD.398\u2013409."},{"key":"e_1_3_2_9_2","first-page":"237","volume-title":"Proceedings of SIGMOD","author":"Brinkhoff Thomas","year":"1993","unstructured":"Thomas Brinkhoff, Hans-Peter Kriegel, and Bernhard Seeger. 1993. Efficient processing of spatial joins using R-trees. In Proceedings of SIGMOD. 237\u2013246."},{"key":"e_1_3_2_10_2","volume-title":"Proceedings of KDD","author":"Chakrabarti Deepayan","year":"2004","unstructured":"Deepayan Chakrabarti, Spiros Papadimitriou, Dharmendra S. Modha, and Christos Faloutsos. 2004. Fully automatic cross-associations. In Proceedings of KDD."},{"key":"e_1_3_2_11_2","volume-title":"An Introduction to Optimization","author":"Chong Edwin K. P.","year":"2004","unstructured":"Edwin K. P. Chong and Stanislaw H. Zak. 2004. An Introduction to Optimization. John Wiley & Sons."},{"key":"e_1_3_2_12_2","first-page":"963","volume-title":"Proceedings of WWW.","author":"Cresci Stefano","year":"2017","unstructured":"Stefano Cresci, Roberto Di Pietro, Marinella Petrocchi, Angelo Spognardi, and Maurizio Tesconi. 2017. The paradigm-shift of social spambots: Evidence, theories, and tools for the arms race. In Proceedings of WWW.963\u2013972."},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/MIS.2016.29"},{"key":"e_1_3_2_14_2","first-page":"273","volume-title":"Proceedings of WWW.","author":"Davis Clayton Allen","year":"2016","unstructured":"Clayton Allen Davis, Onur Varol, Emilio Ferrara, Alessandro Flammini, and Filippo Menczer. 2016. BotOrNot: A system to evaluate social bots. 
In Proceedings of WWW.273\u2013274."},{"key":"e_1_3_2_15_2","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume":"1810","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv abs\/1810.04805 (2019).","journal-title":"arXiv"},{"key":"e_1_3_2_16_2","first-page":"497","volume-title":"Proceedings of SIAM DM","author":"Ding Chris H. Q.","year":"2004","unstructured":"Chris H. Q. Ding and Xiaofeng He. 2004. Principal component analysis and effective k-means clustering. In Proceedings of SIAM DM. 497\u2013501."},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.5120\/15038-3384"},{"key":"e_1_3_2_18_2","doi-asserted-by":"crossref","unstructured":"Saeideh Shahrokh Esfahani Michael J. Cafarella Maziyar Baran Pouyan Gregory J. DeAngelo Elena Eneva and Andy E. Fano. 2019. Context-specific language modeling for human trafficking detection from online advertisements. In Proceedings of ACL . 1180\u20131184.","DOI":"10.18653\/v1\/P19-1114"},{"key":"e_1_3_2_19_2","first-page":"226","volume-title":"Proceedings of KDD","author":"Ester Martin","year":"1996","unstructured":"Martin Ester, Hans-Peter Kriegel, J\u00f6rg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of KDD. 226\u2013231."},{"key":"e_1_3_2_20_2","doi-asserted-by":"crossref","unstructured":"Maria Giatsoglou Despoina Chatzakou Neil Shah Alex Beutel Christos Faloutsos and Athena Vakali. 2015. ND-sync: Detecting synchronized fraud activities. In Proceedings of PAKDD . 201\u2013214.","DOI":"10.1007\/978-3-319-18032-8_16"},{"key":"e_1_3_2_21_2","volume-title":"How We Respond to Inauthentic Behavior on Our Platforms: Policy Update","author":"Gleicher Nathaniel","year":"2019","unstructured":"Nathaniel Gleicher. 2019. 
How We Respond to Inauthentic Behavior on Our Platforms: Policy Update. Retrieved February 14, 2023 from https:\/\/about.fb.com\/news\/2019\/10\/inauthentic-behavior-policy-update\/."},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.7551\/mitpress\/4643.001.0001"},{"key":"e_1_3_2_23_2","first-page":"47","volume-title":"Proceedings of SIGMOD","author":"Guttman A.","year":"1984","unstructured":"A. Guttman. 1984. R-tree: A dynamic index structure for spatial searching. In Proceedings of SIGMOD. 47\u201357."},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1080\/00437956.1954.11659520"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1007\/BF01908075"},{"key":"e_1_3_2_26_2","unstructured":"International Labour Office. 2012. ILO Global Estimate of Forced Labour. Retrieved February 14 2023 from http:\/\/www.ilo.org\/wcmsp5\/groups\/public\/---ed_norm\/---declaration\/documents\/publication\/wcms_182004.pdf."},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1108\/00220410410560573"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.25103\/jestr.095.02"},{"key":"e_1_3_2_29_2","article-title":"FlagIt: A system for minimally supervised human trafficking indicator mining","volume":"1712","author":"Kejriwal Mayank","year":"2017","unstructured":"Mayank Kejriwal, Jiayuan Ding, Runqi Shao, Anoop Kumar, and Pedro A. Szekely. 2017. FlagIt: A system for minimally supervised human trafficking indicator mining. CoRR abs\/1712.03086 (2017).","journal-title":"CoRR"},{"key":"e_1_3_2_30_2","volume-title":"Proceedings of KDD","author":"Keogh Eamonn","year":"2004","unstructured":"Eamonn Keogh, Stefano Lonardi, and Chotirat Ann Ratanamahatana. 2004. Towards parameter-free data mining. In Proceedings of KDD."},{"key":"e_1_3_2_31_2","volume-title":"Detection of Organized Activity in Online Escort Advertisements","author":"Kulshrestha Aayushi","year":"2021","unstructured":"Aayushi Kulshrestha. 2021. 
Detection of Organized Activity in Online Escort Advertisements. McGill University (Canada)."},{"key":"e_1_3_2_32_2","first-page":"II-1188\u2013II-1196","volume-title":"Proceedings of ICML.","author":"Le Quoc","year":"2014","unstructured":"Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of ICML.II-1188\u2013II-1196."},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/18.3.452"},{"key":"e_1_3_2_34_2","unstructured":"Meng-Chieh Lee Catalina Vajiac Aayushi Kulshrestha Sacha Levy Namyong Park Cara Jones Reihaneh Rabbany and Christos Faloutsos. 2021. InfoShield: Generalizable information-theoretic human-trafficking detection. In Proceedings of ICDE . IEEE Los Alamitos CA."},{"key":"e_1_3_2_35_2","first-page":"3111","volume-title":"Proceedings of IEEE Big Data","author":"Li L.","year":"2018","unstructured":"L. Li, O. Simek, A. Lai, M. Daggett, C. K. Dagli, and C. Jones. 2018. Detection and characterization of human trafficking networks using unsupervised scalable text template matching. In Proceedings of IEEE Big Data. 3111\u20133120."},{"key":"e_1_3_2_36_2","first-page":"209","article-title":"Spatial joins using seeded trees","author":"Lo Ming-Ling","year":"1994","unstructured":"Ming-Ling Lo and Chinya V. Ravishankar. 1994. Spatial joins using seeded trees. In Proceedings of SIGMOD.209\u2013220.","journal-title":"Proceedings of SIGMOD."},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.4230\/OASIcs.SLATE.2014.143"},{"key":"e_1_3_2_38_2","first-page":"193","volume-title":"Proceedings of SIGMOD","author":"Matsubara Yasuko","year":"2014","unstructured":"Yasuko Matsubara, Yasushi Sakurai, and Christos Faloutsos. 2014. AutoPlait: Automatic mining of co-evolving time sequences. In Proceedings of SIGMOD. 
193\u2013204."},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.21105\/joss.00205"},{"key":"e_1_3_2_40_2","volume-title":"Proceedings of EDBT","author":"Mehta M.","year":"1996","unstructured":"M. Mehta, R. Agrawal, and J. Rissanen. 1996. SLIQ: A fast scalable classifier for data mining. In Proceedings of EDBT."},{"key":"e_1_3_2_41_2","volume-title":"Proceedings of ICLR","author":"Mikolov Tomas","year":"2013","unstructured":"Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of ICLR."},{"key":"e_1_3_2_42_2","unstructured":"Ann Wagner. n.d. Human Trafficking & Online Prostitution Advertising. Retrieved February 14 2023 from XXX."},{"key":"e_1_3_2_43_2","unstructured":"Thorn. 2015. A Report on the Use of Technology to Recruit Groom and Sell Domestic Minor Sex Trafficking Victims. Retrieved February 14 2023 from http:\/\/www.thorn.org\/wp-content\/uploads\/2015\/02\/Survivor_Survey_r5.pdf."},{"key":"e_1_3_2_44_2","doi-asserted-by":"crossref","unstructured":"Chirag Nagpal Kyle Miller Benedikt Boecking and Artur Dubrawski. 2017. An entity resolution approach to isolate instances of human trafficking online. In Proceedings of NUT@EMNLP . 77\u201384.","DOI":"10.18653\/v1\/W17-4411"},{"key":"e_1_3_2_45_2","doi-asserted-by":"crossref","unstructured":"Rebecca S. Portnoff Danny Yuxing Huang Periwinkle Doerfler Sadia Afroz and Damon McCoy. 2017. Backpage and bitcoin: Uncovering human traffickers. In Proceedings of KDD .","DOI":"10.1145\/3097983.3098082"},{"key":"e_1_3_2_46_2","doi-asserted-by":"crossref","unstructured":"Reihaneh Rabbany David Bayani and Artur Dubrawski. 2018. Active search of connections for case building and combating human trafficking. 
In Proceedings of KDD .","DOI":"10.1145\/3219819.3220103"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1016\/0005-1098(78)90005-5"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1214\/aos\/1176346150"},{"key":"e_1_3_2_49_2","unstructured":"Andrew Rosenberg and Julia Hirschberg. 2007. V-Measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of EMNLP-CoNLL . 410\u2013420. https:\/\/www.aclweb.org\/anthology\/D07-1043."},{"key":"e_1_3_2_50_2","first-page":"1069","volume-title":"Proceedings of ICDM","author":"Shah Neil","year":"2017","unstructured":"Neil Shah, Hemank Lamba, Alex Beutel, and Christos Faloutsos. 2017. The many faces of link fraud. In Proceedings of ICDM. IEEE, Los Alamitos, CA, 1069\u20131074."},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1145\/3305260"},{"key":"e_1_3_2_52_2","first-page":"747","volume-title":"Proceedings of COLING\/ACL","author":"Shen Siwei","year":"2006","unstructured":"Siwei Shen, Dragomir R. Radev, Agam Patel, and G\u00fcne\u015f Erkan. 2006. Adding syntax to dynamic programming for aligning comparable texts for the generation of paraphrases. In Proceedings of COLING\/ACL. 747\u2013754."},{"key":"e_1_3_2_53_2","first-page":"1547","volume-title":"Proceedings of ACL","author":"Tong Edmund","year":"2017","unstructured":"Edmund Tong, Amir Zadeh, Cara Jones, and Louis-Philippe Morency. 2017. Combating human trafficking with multimodal deep models. In Proceedings of ACL. 
1547\u20131556."},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2013.2267732"},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1145\/3395046"}],"container-title":["ACM Transactions on Knowledge Discovery from Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3563040","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3563040","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:38:09Z","timestamp":1750178289000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3563040"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,28]]},"references-count":54,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,2,28]]}},"alternative-id":["10.1145\/3563040"],"URL":"https:\/\/doi.org\/10.1145\/3563040","relation":{},"ISSN":["1556-4681","1556-472X"],"issn-type":[{"value":"1556-4681","type":"print"},{"value":"1556-472X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,2,28]]},"assertion":[{"value":"2021-09-03","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-05-08","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-03-30","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}