{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,17]],"date-time":"2025-10-17T14:12:59Z","timestamp":1760710379029,"version":"3.41.0"},"reference-count":62,"publisher":"Association for Computing Machinery (ACM)","issue":"6","license":[{"start":{"date-parts":[[2020,10,3]],"date-time":"2020-10-03T00:00:00Z","timestamp":1601683200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Knowl. Discov. Data"],"published-print":{"date-parts":[[2020,12,31]]},"abstract":"<jats:p>Outlier detection in text data collections has become significant due to the need of finding anomalies in the myriad of text data sources. High feature dimensionality, together with the larger size of these document collections, presents a need for developing accurate outlier detection methods with high efficiency. Traditional outlier detection methods face several challenges including data sparseness, distance concentration, and the presence of a larger number of sub-groups when dealing with text data. In this article, we propose to address these issues by developing novel concepts such as presenting documents with the rare document frequency, finding ranking-based neighborhood for similarity computation, and identifying sub-dense local neighborhoods in high dimensions. To improve the proposed primary method based on rare document frequency, we present several novel ensemble approaches using the ranking concept to reduce the false identifications while finding the higher number of true outliers. 
Extensive empirical analysis shows that the proposed method and its ensemble variations improve the quality of outlier detection in document repositories as well as they are found scalable compared to the relevant benchmarking methods.<\/jats:p>","DOI":"10.1145\/3399712","type":"journal-article","created":{"date-parts":[[2020,10,3]],"date-time":"2020-10-03T10:10:58Z","timestamp":1601719858000},"page":"1-30","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":10,"title":["Efficient Outlier Detection in Text Corpus Using Rare Frequency and Ranking"],"prefix":"10.1145","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5720-7737","authenticated-orcid":false,"given":"Wathsala Anupama","family":"Mohotti","sequence":"first","affiliation":[{"name":"Queensland University of Technology, Australia"}]},{"given":"Richi","family":"Nayak","sequence":"additional","affiliation":[{"name":"Queensland University of Technology, Australia"}]}],"member":"320","published-online":{"date-parts":[[2020,10,3]]},"reference":[{"volume-title":"Data Mining","author":"Aggarwal Charu C.","key":"e_1_2_1_1_1","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-319-14142-8"},{"volume":"30","volume-title":"Proceedings of the ACM SIGMOD International Conference on Management of Data.","author":"Charu","key":"e_1_2_1_2_1"},{"volume-title":"Aggarwal and ChengXiang Zhai","year":"2012","author":"Charu","key":"e_1_2_1_3_1"},{"volume-title":"Proceedings of the 10th IEEE Symposium on Computers and Communications (ISCC\u201905)","author":"Agyemang Malik","key":"e_1_2_1_4_1"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/304182.304187"},{"volume-title":"Park","year":"2012","author":"Aouf Mazin","key":"e_1_2_1_6_1"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1458082.1458157"},{"volume-title":"Proceedings of International Conference on Machine Learning.","year":"1999","author":"Baker L. 
Douglas","key":"e_1_2_1_8_1"},{"key":"e_1_2_1_9_1","volume-title":"Proceedings of the 19th Conference on Australasian Database -","volume":"75","author":"Bennett Graham","year":"2008"},{"key":"e_1_2_1_10_1","first-page":"2273","article-title":"Outlier detection for robust multi-dimensional scaling","volume":"49","author":"Blouvshtein Leonid","year":"2018","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/342009.335388"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1177\/0272989X0002000410"},{"volume-title":"Information Retrieval and the Vector Space Model","author":"Cercone Nick","key":"e_1_2_1_13_1"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2019.01.002"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0079449"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10479-008-0371-9"},{"key":"e_1_2_1_17_1","unstructured":"Elasticsearch. 2019. Similarity Module. Retrieved from https:\/\/www.elastic.co\/guide\/en\/elasticsearch\/reference\/master\/index-modules-similarity.html#index-modules-similarity."},
{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611972733.5"},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining.","volume":"96","author":"Ester Martin","year":"1996"},{"volume-title":"Proceedings of the 19th International Conference on Digital Audio Effects.","year":"2016","author":"Flexer Arthur","key":"e_1_2_1_20_1"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10791-011-9173-9"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPR.2004.1334558"},{"volume-title":"Identification of Outliers","author":"Hawkins Douglas M.","key":"e_1_2_1_23_1"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0167-8655(03)00003-5"},{"key":"e_1_2_1_25_1","volume-title":"Ldv Forum","volume":"20","author":"Hotho Andreas","year":"2005"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2015.10.014"},{"key":"e_1_2_1_27_1","unstructured":"IBM. 2017. Big Data and Analytics Hub. Retrieved from https:\/\/www.ibmbigdatahub.com\/blog\/what-text-analytics."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1002\/env.628"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1016\/0020-0271(71)90051-9"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/T-C.1973.223640"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611974973.55"},{"volume-title":"ARAML: A stable adversarial training framework for text generation. Arxiv Preprint Arxiv:1908.07195","year":"2019","author":"Ke Pei","key":"e_1_2_1_32_1"},{"volume-title":"Proceedings of the International Conference on Very Large Data Bases. 
392--403","author":"Edwin","key":"e_1_2_1_33_1"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-39712-7_8"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.3390\/info10040150"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-01307-2_86"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/1401890.1401946"},{"volume-title":"Proceedings of the International Conference on Machine Learning. 1188--1196","year":"2014","author":"Le Quoc","key":"e_1_2_1_38_1"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3308558.3313625"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSMC.2017.2718220"},{"volume-title":"Proceedings of the 32nd AAAI Conference on Artificial Intelligence.","year":"2018","author":"Liu Linqing","key":"e_1_2_1_41_1"},{"key":"e_1_2_1_42_1","first-page":"1517","article-title":"Generative adversarial active learning for unsupervised outlier detection","volume":"32","author":"Liu Yezheng","year":"2020","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33016826"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICTAI.2018.00066"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-93040-4_35"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"e_1_2_1_47_1","first-page":"2487","article-title":"Hubs in space: Popular nearest neighbors in high-dimensional data","author":"Radovanovi\u0107 Milo\u0161","year":"2010","journal-title":"Journal of Machine Learning Research 
11"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2014.2365790"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/342009.335437"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/1390334.1390409"},{"volume-title":"Term-weighting approaches in automatic text retrieval. Information Processing & Management 24, 5","year":"1988","author":"Salton Gerard","key":"e_1_2_1_51_1"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-18123-3_2"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-26187-4_16"},{"volume-title":"Open event extraction from online text using a generative adversarial network. Arxiv Preprint Arxiv:1908.09246","year":"2019","author":"Wang Rui","key":"e_1_2_1_54_1"},{"volume-title":"Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 18--25","author":"Michael Wong S. K.","key":"e_1_2_1_55_1"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-63579-8_47"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2018.06.013"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/1076034.1076120"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/1367497.1367550"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.5555\/3225639.3225805"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/BDCloud.2015.51"},{"volume-title":"Data Clustering","author":"Zimek Arthur","key":"e_1_2_1_62_1"}],"container-title":["ACM Transactions on Knowledge Discovery from 
Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3399712","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3399712","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:38:13Z","timestamp":1750199893000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3399712"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,10,3]]},"references-count":62,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2020,12,31]]}},"alternative-id":["10.1145\/3399712"],"URL":"https:\/\/doi.org\/10.1145\/3399712","relation":{},"ISSN":["1556-4681","1556-472X"],"issn-type":[{"type":"print","value":"1556-4681"},{"type":"electronic","value":"1556-472X"}],"subject":[],"published":{"date-parts":[[2020,10,3]]},"assertion":[{"value":"2018-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-05-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-10-03","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}