{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,25]],"date-time":"2026-04-25T15:39:37Z","timestamp":1777131577066,"version":"3.51.4"},"reference-count":97,"publisher":"Cambridge University Press (CUP)","issue":"5","license":[{"start":{"date-parts":[[2020,3,4]],"date-time":"2020-03-04T00:00:00Z","timestamp":1583280000000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":["cambridge.org"],"crossmark-restriction":true},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2020,9]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Anomaly detection can be seen as an unsupervised learning task in which a predictive model created on historical data is used to detect outlying instances in new data. This work addresses possibly promising but relatively uncommon application of anomaly detection to text data. Two English-language and one Polish-language Internet discussion forums devoted to psychoactive substances received from home-grown plants, such as hashish or marijuana, serve as text sources that are both realistic and possibly interesting on their own, due to potential associations with drug-related crime. The utility of two different vector text representations is examined: the simple bag of words representation and a more refined Global Vectors (GloVe) representation, which is an example of the increasingly popular word embedding approach. They are both combined with two unsupervised anomaly detection methods, based on one-class support vector machines (SVM) and based on dissimilarity to <jats:italic>k<\/jats:italic>-medoids clusters. The GloVe representation is found definitely more useful for anomaly detection, permitting better detection quality and ameliorating the curse of dimensionality issues with text clustering. 
The cluster dissimilarity approach combined with this representation outperforms one-class SVM with respect to detection quality and appears a more promising approach to anomaly detection in text data.<\/jats:p>","DOI":"10.1017\/s1351324920000066","type":"journal-article","created":{"date-parts":[[2020,3,4]],"date-time":"2020-03-04T13:38:29Z","timestamp":1583329109000},"page":"551-578","update-policy":"https:\/\/doi.org\/10.1017\/policypage","source":"Crossref","is-referenced-by-count":11,"title":["Unsupervised modeling anomaly detection in discussion forums posts using global vectors for text representation"],"prefix":"10.1017","volume":"26","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8049-7410","authenticated-orcid":false,"given":"Pawe\u0142","family":"Cichosz","sequence":"first","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2020,3,4]]},"reference":[{"key":"S1351324920000066_ref96","unstructured":"Yessenalina, A. and Cardie, C. (2011). Compositional matrix-space models for sentiment analysis. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP-2011). Stroudsburg, PA: Association for Computational Linguistics."},{"key":"S1351324920000066_ref95","unstructured":"Yang, Y. and Pedersen, J. (1997). A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML-97). San Francisco, CA: Morgan Kaufmann."},{"key":"S1351324920000066_ref93","doi-asserted-by":"publisher","DOI":"10.1080\/01969722.2014.874828"},{"key":"S1351324920000066_ref88","doi-asserted-by":"publisher","DOI":"10.5121\/ijcsit.2011.3201"},{"key":"S1351324920000066_ref87","doi-asserted-by":"publisher","DOI":"10.1016\/0377-0427(87)90125-7"},{"key":"S1351324920000066_ref85","unstructured":"Rios, G. and Zha, H. (2004). Exploring support vector machines and random forests for spam detection. 
In Proceedings of the First International Conference on Email and Anti Spam (CEAS-2004)."},{"key":"S1351324920000066_ref83","doi-asserted-by":"publisher","DOI":"10.1007\/BF00116251"},{"key":"S1351324920000066_ref82","volume-title":"Advances in Large Margin Classifiers","author":"Platt","year":"2000"},{"key":"S1351324920000066_ref81","volume-title":"Advances in Kernel Methods: Support Vector Learning","author":"Platt","year":"1998"},{"key":"S1351324920000066_ref80","doi-asserted-by":"publisher","DOI":"10.1016\/j.sigpro.2013.12.026"},{"key":"S1351324920000066_ref77","unstructured":"Ooms, J. (2018). hunspell: Morphological analysis and spell checker for R. R package version 3.0."},{"key":"S1351324920000066_ref76","unstructured":"M\u00fcnz, H. , Li, S. and Carle, G. (2007). Traffic anomaly detection using k-means clustering. In Proceedings of the Fourth GI\/ITG-Workshop MMBnet."},{"key":"S1351324920000066_ref71","unstructured":"McCallum, A. and Nigam, K. (1998). A comparison of event models for naive Bayes text classification. In Proceedings of the AAAI\/ICML-98 Workshop on Learning for Text Categorization. Menlo Park, CA: AAAI Press."},{"key":"S1351324920000066_ref70","unstructured":"Meyer, D. , Dimitriadou, E. , Hornik, K. , Weingessel, A. and Leisch, F. (2018). e1071: Misc functions of the department of statistics, probability theory group (formerly: E1071), TU Wien. R package version 1.7-0."},{"key":"S1351324920000066_ref69","unstructured":"Meyer, D. and Buchta, C. (2018). proxy: Distance and similarity measures. R package version 0.4-22."},{"key":"S1351324920000066_ref67","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511809071"},{"key":"S1351324920000066_ref66","first-page":"139","article-title":"One-class SVMs for document classification","volume":"2","author":"Manevitz","year":"2002","journal-title":"Journal of Machine Learning Research"},{"key":"S1351324920000066_ref64","unstructured":"Maechler, M. , Rousseeuw, P. , Struyf, A. , Hubert, M. 
and Hornik, K. (2018). cluster: Cluster analysis basics and extensions. R package version 2.0.7-1."},{"key":"S1351324920000066_ref63","unstructured":"MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley, CA: University of California Press."},{"key":"S1351324920000066_ref61","unstructured":"Lloyd, S.P. (1957). Least Squares Quantization in PCM. Technical report Bell Laboratories. Reprinted in 1982 in IEEE Transactions on Information Theory, 28:128\u2013137."},{"key":"S1351324920000066_ref90","doi-asserted-by":"publisher","DOI":"10.1145\/505282.505283"},{"key":"S1351324920000066_ref60","doi-asserted-by":"publisher","DOI":"10.1145\/2133360.2133363"},{"key":"S1351324920000066_ref75","first-page":"551","article-title":"Latent semantic indexing for patent documents","volume":"15","author":"Moldovan","year":"2005","journal-title":"International Journal of Applied Mathematics and Computer Science"},{"key":"S1351324920000066_ref59","first-page":"18","article-title":"Classification and regression by randomForest","volume":"2","author":"Liaw","year":"2002","journal-title":"R News"},{"key":"S1351324920000066_ref58","doi-asserted-by":"publisher","DOI":"10.1016\/j.dss.2009.09.003"},{"key":"S1351324920000066_ref54","unstructured":"Kumaraswamy, R. , Wazalwar, A. and Khot, K. (2015). Anomaly detection in text: The value of domain knowledge. In Proceedings of the Twenty-Eighth International Florida Artificial Intelligence Research Society Conference (FLAIRS-2015). Menlo Park, CA: AAAI Press."},{"key":"S1351324920000066_ref52","unstructured":"Kramer, S. (2010). Anomaly detection in extremist web forums using a dynamical systems approach. In Proceedings of the ACM SIGKDD Workshop on Intelligence and Security Informatics (ISI-KDD-2010). New York, NY: ACM Press."},{"key":"S1351324920000066_ref72","unstructured":"Mikolov, T. , Chen, K. 
, Corrado, G.S. and Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781."},{"key":"S1351324920000066_ref51","doi-asserted-by":"publisher","DOI":"10.1016\/j.ins.2006.12.005"},{"key":"S1351324920000066_ref50","doi-asserted-by":"publisher","DOI":"10.1002\/9780470316801"},{"key":"S1351324920000066_ref97","doi-asserted-by":"publisher","DOI":"10.1002\/sam.11161"},{"key":"S1351324920000066_ref49","unstructured":"Kannan, R. , Woo, H. , Aggarwal, C.C. , and Park, H. (2017). Outlier detection for text data. In Proceedings of the 2017 SIAM International Conference on Data Mining. Philadelphia, PA: SIAM."},{"key":"S1351324920000066_ref46","doi-asserted-by":"publisher","DOI":"10.5120\/ijais12-450391"},{"key":"S1351324920000066_ref45","volume-title":"Algorithms for Clustering Data","author":"Jain","year":"1988"},{"key":"S1351324920000066_ref43","doi-asserted-by":"publisher","DOI":"10.1027\/1864-1105\/a000062"},{"key":"S1351324920000066_ref42","doi-asserted-by":"publisher","DOI":"10.1016\/S0167-8655(03)00003-5"},{"key":"S1351324920000066_ref92","unstructured":"Syarif, I. , Prugel-Bennett, A. and Wills, G. (2012). Unsupervised clustering approach for network anomaly detection. In Proceedings of the Fourth International Conference on Networked Digital Technologies (NDT-2012). Heidelberg, Germany: Springer."},{"key":"S1351324920000066_ref41","unstructured":"Hassan, S. , Mihalcea, R. and Banea, C. (2007). Random-walk term weighting for improved text classification. In Proceedings of the First IEEE International Conference on Semantic Computing (ICSC-2007). 
Los Alamitos, CA: IEEE Computer Society."},{"key":"S1351324920000066_ref44","doi-asserted-by":"publisher","DOI":"10.1561\/1500000062"},{"key":"S1351324920000066_ref39","doi-asserted-by":"publisher","DOI":"10.1002\/9780470503065"},{"key":"S1351324920000066_ref38","doi-asserted-by":"publisher","DOI":"10.1023\/A:1012801612483"},{"key":"S1351324920000066_ref37","unstructured":"Guthrie, D. , Guthrie, L. , Allison, B. and Wilks, Y. (2007). Unsupervised anomaly detection. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI-2007). San Francisco, CA: Morgan Kaufmann."},{"key":"S1351324920000066_ref35","unstructured":"Goldstein, M. and Uchida, S. (2014). Behavior analysis using unsupervised anomaly detection. In Proceedings of the Tenth Joint Workshop on Machine Perception and Robotics (MPR-2014)."},{"key":"S1351324920000066_ref34","unstructured":"Goldberg, Y. and Levy, O. (2014). word2vec explained: Deriving Mikolov et al.\u2019s negative sampling word-embedding method. arXiv preprint arXiv:1402.3722."},{"key":"S1351324920000066_ref33","doi-asserted-by":"publisher","DOI":"10.1109\/ICMSS.2009.5302396"},{"key":"S1351324920000066_ref32","first-page":"1289","article-title":"An extensive empirical study of feature selection measures for text classification","volume":"3","author":"Forman","year":"2003","journal-title":"Journal of Machine Learning Research"},{"key":"S1351324920000066_ref30","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2005.10.010"},{"key":"S1351324920000066_ref65","doi-asserted-by":"publisher","DOI":"10.3390\/a5040469"},{"key":"S1351324920000066_ref53","first-page":"551","article-title":"Latent semantic indexing using eigenvalue analysis for efficient information retrieval","volume":"16","author":"Kumar","year":"2006","journal-title":"International Journal of Applied Mathematics and Computer Science"},{"key":"S1351324920000066_ref5","unstructured":"Amer, M. and Goldstein, M. (2012). 
Nearest-neighbor and clustering based anomaly detection algorithms for RapidMiner. In Proceedings of the Third RapidMiner Community Meeting and Conference (RCOMM-2012). D\u00fcren, Germany: Shaker."},{"key":"S1351324920000066_ref36","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0152173"},{"key":"S1351324920000066_ref55","unstructured":"Lau, J.H. and Baldwin, T. (2016). An empirical evaluation of doc2vec with practical insights into document embedding generation. In Proceedings of the First Workshop on Representation Learning for NLP. Stroudsburg, PA: Association for Computational Linguistics."},{"key":"S1351324920000066_ref26","doi-asserted-by":"publisher","DOI":"10.1002\/aris.1440380105"},{"key":"S1351324920000066_ref56","unstructured":"Le, Q.V. and Mikolov, T. (2014). Distributed representations of sentences and documents. In Proceedings of the Thirty-First International Conference on Machine Learning (ICML-2014). JMLR Workshop and Conference Proceedings."},{"key":"S1351324920000066_ref86","unstructured":"Rousseau, F. , Kiagias, E. and Vazirgiannis, M. (2015). Text categorization as a graph classification problem. In Proceedings of the Fifty-Third Annual Meeting of the Association for Computational Linguistics and the Sixth International Joint Conference on Natural Language Processing (ACL-IJCNLP-2015). Stroudsburg, PA: Association for Computational Linguistics."},{"key":"S1351324920000066_ref16","volume-title":"Classification and Regression Trees","author":"Breiman","year":"1984"},{"key":"S1351324920000066_ref47","unstructured":"Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the Tenth European Conference on Machine Learning (ECML-98). Berlin, Germany: Springer."},{"key":"S1351324920000066_ref78","unstructured":"Pande, A. and Ahuja, V. (2017). WEAC: Word embeddings for anomaly classification from event logs. 
In Proceedings of the 2017 IEEE International Conference on Big Data. Los Alamitos, CA: IEEE Computer Society."},{"key":"S1351324920000066_ref19","unstructured":"Chen, T. , Tang, L.-A. , Sun, Y. , Chen, Z. and Zhang, K. (2016). Entity embedding-based anomaly detection for heterogeneous categorical events. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-2016). Menlo Park, CA: AAAI Press."},{"key":"S1351324920000066_ref74","doi-asserted-by":"publisher","DOI":"10.1111\/j.1551-6709.2010.01106.x"},{"key":"S1351324920000066_ref6","volume-title":"Cluster Analysis for Applications","author":"Anderberg","year":"1973"},{"key":"S1351324920000066_ref68","doi-asserted-by":"publisher","DOI":"10.1007\/BF02504837"},{"key":"S1351324920000066_ref9","unstructured":"Bakarov, A. (2018). A survey of word embeddings evaluation methods. arXiv preprint arXiv:1801.09536."},{"key":"S1351324920000066_ref21","doi-asserted-by":"publisher","DOI":"10.1007\/BF00994018"},{"key":"S1351324920000066_ref17","unstructured":"Breunig, M.M. , Kriegel, H.-P. , Ng, R.T. and Sander, J. (2000). LOF: Identifying density-based local outliers. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 
New York: ACM Press."},{"key":"S1351324920000066_ref29","doi-asserted-by":"publisher","DOI":"10.1002\/9780470977811"},{"key":"S1351324920000066_ref25","first-page":"2121","article-title":"Adaptive subgradient methods for online learning and stochastic optimization","volume":"12","author":"Duchi","year":"2011","journal-title":"Journal of Machine Learning Research"},{"key":"S1351324920000066_ref48","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4615-0907-3"},{"key":"S1351324920000066_ref15","doi-asserted-by":"publisher","DOI":"10.1023\/A:1010933404324"},{"key":"S1351324920000066_ref14","first-page":"993","article-title":"Latent Dirichlet allocation","volume":"3","author":"Blei","year":"2003","journal-title":"Journal of Machine Learning Research"},{"key":"S1351324920000066_ref94","unstructured":"Xue, D. and Li, F. (2015). Research of text categorization model based on random forests. In 2015 IEEE International Conference on Computational Intelligence and Communication Technology (CICT-2015). Los Alamitos, CA: IEEE Computer Society."},{"key":"S1351324920000066_ref8","unstructured":"Auslander, B. , Gupta, K.M. and Aha, D.W. (2011). A comparative evaluation of anomaly detection algorithms for maritime video surveillance. In Proceedings of SPIE 8019: Sensors, and Command, Control, Communications, and Intelligence, (C3I) Technologies for Homeland Security and Homeland Defense X. Bellingham, WA: SPIE."},{"key":"S1351324920000066_ref28","volume-title":"Signal Detection Theory and ROC Analysis","author":"Egan","year":"1975"},{"key":"S1351324920000066_ref40","volume-title":"Clustering Algorithms","author":"Hartigan","year":"1975"},{"key":"S1351324920000066_ref62","unstructured":"Lui, A.K.-F. , Li, S.C. and Choy, S.O. (2007). An evaluation of automatic text categorization in online discussion analysis. In Proceedings of the Seventh IEEE International Conference on Advanced Learning Technologies (ICALT-2007). 
Los Alamitos, CA: IEEE Computer Society."},{"key":"S1351324920000066_ref18","doi-asserted-by":"publisher","DOI":"10.1145\/1541880.1541882"},{"key":"S1351324920000066_ref89","doi-asserted-by":"publisher","DOI":"10.1162\/089976601750264965"},{"key":"S1351324920000066_ref4","unstructured":"Allahyari, M. , Pouriyeh, S. , Assefi, M. , Safaei, S. , Trippe, E.D. , Gutierrez, J.B. and Kochut, K. (2017). A brief survey of text mining: Classification, clustering and extraction techniques. arXiv preprint arXiv:1707.02919."},{"key":"S1351324920000066_ref22","unstructured":"Dai, H. (2013). Anomaly Detection on Social Data . Ph.D. Thesis, Singapore Management University."},{"key":"S1351324920000066_ref79","unstructured":"Pennington, J. , Socher, R. and Manning, C.D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP-2014). Stroudsburg, PA: Association for Computational Linguistics."},{"key":"S1351324920000066_ref73","unstructured":"Mikolov, T. , Le, Q.V. and Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168."},{"key":"S1351324920000066_ref84","first-page":"227","article-title":"Text mining: Approaches and applications","volume":"38","author":"Radovanovi\u0107","year":"2008","journal-title":"Novi Sad Journal of Mathematics"},{"key":"S1351324920000066_ref1","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4614-3223-4"},{"key":"S1351324920000066_ref2","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4614-3223-4"},{"key":"S1351324920000066_ref57","doi-asserted-by":"publisher","DOI":"10.4304\/jsw.7.5.1045-1051"},{"key":"S1351324920000066_ref3","first-page":"310","article-title":"An effective clustering-based approach for outlier detection","volume":"28","author":"Al-Zoubi","year":"2009","journal-title":"European Journal of Scientific Research"},{"key":"S1351324920000066_ref10","unstructured":"Bakarov, A. 
, Yadrintsev, V. and Sochenkov, I. (2018). Anomaly detection for short texts: Identifying whether your chatbot should switch from goal-oriented conversation to chit-chatting. In Proceedings of the International Conference on Digital Transformation and Global Society (DTGS-2018). Cham, Switzerland: Springer."},{"key":"S1351324920000066_ref27","unstructured":"Dumais, S.T. , Platt, J.C. , Heckerman, D. and Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In Proceedings of the Seventh International Conference on Information and Knowledge Management (CIKM-98). New York: ACM Press."},{"key":"S1351324920000066_ref31","doi-asserted-by":"publisher","DOI":"10.1109\/ICSSSM.2008.4598504"},{"key":"S1351324920000066_ref7","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2012.07.021"},{"key":"S1351324920000066_ref11","first-page":"7","article-title":"Document clustering using k-means and k-medoids","volume":"1","author":"Balabantaray","year":"2013","journal-title":"International Journal of Knowledge Based Computer System"},{"key":"S1351324920000066_ref91","unstructured":"Selivanov, D. and Quing, W. (2018). text2vec: Modern text mining framework for R. R package version 0.5.1."},{"key":"S1351324920000066_ref12","unstructured":"Bertero, C. , Roy, M. , Sauvanaud, C. and Tr\u00e9dan, G. (2017). Experience report: Log mining using natural language processing and application to anomaly detection. In Proceedings of the Twenty-Eighth International Symposium on Software Reliability Engineering (ISSRE-2017). 
Los Alamitos, CA: IEEE Computer Society."},{"key":"S1351324920000066_ref13","doi-asserted-by":"publisher","DOI":"10.1109\/3477.678624"},{"key":"S1351324920000066_ref20","doi-asserted-by":"publisher","DOI":"10.1002\/9781118950951"},{"key":"S1351324920000066_ref23","first-page":"25","article-title":"Ensembles of classifiers for parallel categorization of large number of text documents expressing opinions","volume":"12","author":"Da\u0159ena","year":"2017","journal-title":"Journal of Applied Economic Sciences"},{"key":"S1351324920000066_ref24","doi-asserted-by":"publisher","DOI":"10.1023\/A:1007612920971"}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324920000066","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2020,8,10]],"date-time":"2020-08-10T13:33:26Z","timestamp":1597066406000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324920000066\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,3,4]]},"references-count":97,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2020,9]]}},"alternative-id":["S1351324920000066"],"URL":"https:\/\/doi.org\/10.1017\/s1351324920000066","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,3,4]]},"assertion":[{"value":"\u00a9 Cambridge University Press 2020","name":"copyright","label":"Copyright","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}}]}}