{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,30]],"date-time":"2026-04-30T01:10:33Z","timestamp":1777511433039,"version":"3.51.4"},"reference-count":475,"publisher":"Cambridge University Press (CUP)","issue":"3","license":[{"start":{"date-parts":[[2022,6,13]],"date-time":"2022-06-13T00:00:00Z","timestamp":1655078400000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":["cambridge.org"],"crossmark-restriction":true},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2023,5]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Text preprocessing is not only an essential step to prepare the corpus for modeling but also a key area that directly affects the natural language processing (NLP) application results. For instance, precise tokenization increases the accuracy of part-of-speech (POS) tagging, and retaining multiword expressions improves reasoning and machine translation. The text corpus needs to be appropriately preprocessed before it is ready to serve as the input to computer models. The preprocessing requirements depend on both the nature of the corpus and the NLP application itself, that is, what researchers would like to achieve from analyzing the data. Conventional text preprocessing practices generally suffice, but there exist situations where the text preprocessing needs to be customized for better analysis results. Hence, we discuss the pros and cons of several common text preprocessing methods: removing formatting, tokenization, text normalization, handling punctuation, removing stopwords, stemming and lemmatization, n-gramming, and identifying multiword expressions. Then, we provide examples of text datasets which require special preprocessing and how previous researchers handled the challenge. 
We expect this article to be a starting guideline on how to select and fine-tune text preprocessing methods.<\/jats:p>","DOI":"10.1017\/s1351324922000213","type":"journal-article","created":{"date-parts":[[2022,6,13]],"date-time":"2022-06-13T19:29:56Z","timestamp":1655148596000},"page":"509-553","update-policy":"https:\/\/doi.org\/10.1017\/policypage","source":"Crossref","is-referenced-by-count":175,"title":["Comparison of text preprocessing methods"],"prefix":"10.1017","volume":"29","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6835-3668","authenticated-orcid":false,"given":"Christine P.","family":"Chai","sequence":"first","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2022,6,13]]},"reference":[{"key":"S1351324922000213_ref352","unstructured":"Polignano, M. , Basile, P. , De Gemmis, M. , Semeraro, G. and Basile, V. (2019). Alberto: Italian BERT language understanding model for NLP challenging tasks based on Tweets. In 6th Italian Conference on Computational Linguistics, CLiC-it 2019, vol. 2481. CEUR Workshop Proceedings, pp. 1\u20136."},{"key":"S1351324922000213_ref107","unstructured":"Clough, P. (2001). A Perl program for sentence splitting using rules. 
Technical report, University of Sheffield, Sheffield, United Kingdom."},{"key":"S1351324922000213_ref163","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btl616"},{"key":"S1351324922000213_ref31","doi-asserted-by":"publisher","DOI":"10.1109\/ICSC.2009.101"},{"key":"S1351324922000213_ref155","doi-asserted-by":"publisher","DOI":"10.1111\/j.1467-842X.2011.00628.x"},{"key":"S1351324922000213_ref308","doi-asserted-by":"publisher","DOI":"10.1145\/2184512.2184530"},{"key":"S1351324922000213_ref353","first-page":"113","volume-title":"International Conference on Intelligent Text Processing and Computational Linguistics","author":"Poria","year":"2014"},{"key":"S1351324922000213_ref390","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4842-4354-1"},{"key":"S1351324922000213_ref460","doi-asserted-by":"publisher","DOI":"10.1109\/ICDIM.2011.6093315"},{"key":"S1351324922000213_ref294","doi-asserted-by":"publisher","DOI":"10.1093\/acrefore\/9780199384655.013.611"},{"key":"S1351324922000213_ref204","doi-asserted-by":"publisher","DOI":"10.1007\/978-0-387-39940-9_421"},{"key":"S1351324922000213_ref75","doi-asserted-by":"publisher","DOI":"10.1016\/j.jocs.2010.12.007"},{"key":"S1351324922000213_ref389","doi-asserted-by":"crossref","unstructured":"Sarica, S. and Luo, J. (2020). Stopwords in technical language processing. arXiv preprint arXiv:2006.02633.","DOI":"10.1371\/journal.pone.0254937"},{"key":"S1351324922000213_ref438","doi-asserted-by":"publisher","DOI":"10.1109\/ASONAM.2014.6921616"},{"key":"S1351324922000213_ref143","unstructured":"Fan, A. , Doshi-Velez, F. and Miratrix, L. (2017). Promoting domain-specific terms in topic models with informative priors. 
arXiv preprint arXiv:1701.03227."},{"key":"S1351324922000213_ref103","volume-title":"Deep Structure, Surface Structure, and Semantic Interpretation","author":"Chomsky","year":"1969"},{"key":"S1351324922000213_ref281","first-page":"22","article-title":"Development of a stemming algorithm","volume":"11","author":"Lovins","year":"1968","journal-title":"Mechanical Translation and Computational Linguistics"},{"key":"S1351324922000213_ref435","first-page":"274","article-title":"Feature-based sentiment analysis approach for product reviews","volume":"9","author":"Wang","year":"2014","journal-title":"Journal of Software"},{"key":"S1351324922000213_ref387","doi-asserted-by":"publisher","DOI":"10.1007\/s10044-017-0674-z"},{"key":"S1351324922000213_ref45","doi-asserted-by":"publisher","DOI":"10.3115\/1613715.1613848"},{"key":"S1351324922000213_ref326","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-1108"},{"key":"S1351324922000213_ref218","first-page":"1930","article-title":"A comparative study of stemming algorithms","volume":"2","author":"Jivani","year":"2011","journal-title":"International Journal of Computer Technology and Applications (IJCTA)"},{"key":"S1351324922000213_ref307","unstructured":"Mitrofan, M. and Tufi\u015f, D. (2018). Bioro: The biomedical corpus for the Romanian language. 
In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)."},{"key":"S1351324922000213_ref48","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/E17-4002"},{"key":"S1351324922000213_ref66","doi-asserted-by":"publisher","DOI":"10.1016\/j.jbi.2016.09.009"},{"key":"S1351324922000213_ref217","doi-asserted-by":"publisher","DOI":"10.1109\/ICSME.2018.00053"},{"key":"S1351324922000213_ref431","first-page":"7","article-title":"Preprocessing techniques for text mining: An overview","volume":"5","author":"Vijayarani","year":"2015","journal-title":"International Journal of Computer Science and Communication Networks"},{"key":"S1351324922000213_ref97","article-title":"Word distinctivity \u2013 Quantifying improvement of topic modeling results from n-gramming","author":"Chai","year":"2022","journal-title":"REVSTAT\u2013Statistical Journal"},{"key":"S1351324922000213_ref366","first-page":"9","article-title":"Detecting emotion from text and emoticon","volume":"17","author":"Rahman","year":"2017","journal-title":"London Journal of Research in Computer Science and Technology"},{"key":"S1351324922000213_ref13","doi-asserted-by":"publisher","DOI":"10.1155\/2016\/4248026"},{"key":"S1351324922000213_ref455","doi-asserted-by":"publisher","DOI":"10.1145\/2884781.2884862"},{"key":"S1351324922000213_ref373","unstructured":"Richardson, L. (2020). Beautiful Soup 4.9.1. Python library. Available from: https:\/\/www.crummy.com\/software\/BeautifulSoup\/."},{"key":"S1351324922000213_ref375","unstructured":"Rinker, T.W. (2018a). lexicon: Lexicon Data. $\\mathsf{R}$ package version 1.2.1. Available from: http:\/\/github.com\/trinker\/lexicon."},{"key":"S1351324922000213_ref239","doi-asserted-by":"publisher","DOI":"10.1162\/coli.2006.32.4.485"},{"key":"S1351324922000213_ref400","doi-asserted-by":"publisher","DOI":"10.5121\/csit.2014.4910"},{"key":"S1351324922000213_ref305","unstructured":"Mikolov, T. , Chen, K. , Corrado, G. and Dean, J. (2013). 
Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781."},{"key":"S1351324922000213_ref264","unstructured":"Lextek International n.d. Onix text retrieval toolkit \u2013 API reference. Available from: https:\/\/www.lextek.com\/manuals\/onix\/ [last accessed August 2019]."},{"key":"S1351324922000213_ref26","doi-asserted-by":"publisher","DOI":"10.3115\/1614038.1614045"},{"key":"S1351324922000213_ref88","doi-asserted-by":"publisher","DOI":"10.1145\/2733373.2806236"},{"key":"S1351324922000213_ref289","doi-asserted-by":"publisher","DOI":"10.1002\/9781405198431.wbeal0755"},{"key":"S1351324922000213_ref424","volume-title":"Eats, Shoots and Leaves: Why, Commas Really do make a Difference!","author":"Truss","year":"2006"},{"key":"S1351324922000213_ref409","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-96292-4"},{"key":"S1351324922000213_ref428","unstructured":"Verspoor, C.M. , Joslyn, C. and Papcun, G.J. (2003). The gene ontology as a source of lexical semantic knowledge for a biological natural language processing application. In SIGIR Workshop on Text Analysis and Search for Bioinformatics, pp. 51\u201356."},{"key":"S1351324922000213_ref12","doi-asserted-by":"publisher","DOI":"10.1109\/ICACCI.2017.8125961"},{"key":"S1351324922000213_ref236","doi-asserted-by":"publisher","DOI":"10.1109\/ICWR49608.2020.9122275"},{"key":"S1351324922000213_ref158","unstructured":"Foster, J. , Cetinoglu, O. , Wagner, J. , Le Roux, J. , Hogan, S. , Nivre, J. , Hogan, D. , and Van Genabith, J. 2011. # hardtoparse: POS tagging and parsing the Twitterverse. In Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence. 
AAAI stands for Association for the Advancement of Artificial Intelligence."},{"key":"S1351324922000213_ref429","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btp195"},{"key":"S1351324922000213_ref74","volume-title":"Deep Learning for Natural Language Processing: Solve your Natural Language Processing Problems with Smart Deep Neural Networks","author":"Bokka","year":"2019"},{"key":"S1351324922000213_ref102","first-page":"1","article-title":"Experimental explorations on short text topic mining between LDA and NMF based schemes. Knowledge-Based Systems","volume":"163","author":"Chen","year":"2019","journal-title":"LDA stands for latent Dirichlet allocation, and NMF stands for non-negative matrix factorization."},{"key":"S1351324922000213_ref98","unstructured":"Chang, J.P. , Chiam, C. , Fu, L. , Wang, A.Z. , Zhang, J. and Danescu-Niculescu-Mizil, C. (2020). Convokit: A toolkit for the analysis of conversations. arXiv preprint arXiv:2005.04246."},{"key":"S1351324922000213_ref361","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1078"},{"key":"S1351324922000213_ref301","first-page":"1","article-title":"Text mining infrastructure in","volume":"25","author":"Meyer","year":"2008","journal-title":"Journal of Statistical Software"},{"key":"S1351324922000213_ref278","first-page":"274","volume-title":"European Conference on Advances in Databases and Information Systems","author":"Loginova","year":"2018"},{"key":"S1351324922000213_ref72","first-page":"993","article-title":"Latent Dirichlet allocation","volume":"3","author":"Blei","year":"2003","journal-title":"Journal of Machine Learning Research"},{"key":"S1351324922000213_ref364","doi-asserted-by":"publisher","DOI":"10.1145\/3209280.3229085"},{"key":"S1351324922000213_ref430","doi-asserted-by":"publisher","DOI":"10.1145\/1753326.1753486"},{"key":"S1351324922000213_ref260","unstructured":"Le, H.P. and Ho, T.V. (2008). A maximum entropy approach to sentence boundary detection of Vietnamese texts. 
In IEEE International Conference on Research, Innovation and Vision for the Future-RIVF 2008."},{"key":"S1351324922000213_ref197","unstructured":"Henry, T. (2016). Quick ngram processing script. Available from: https:\/\/github.com\/trh3\/NGramProcessing."},{"key":"S1351324922000213_ref396","unstructured":"Shaikh, A. , More, D. , Puttoo, R. , Shrivastav, S. and Shinde, S. (2019). A survey paper on chatbots. International Research Journal of Engineering and Technology (IRJET) 6(4), 1786\u20131789."},{"key":"S1351324922000213_ref94","unstructured":"Chai, C.P. (2017). Statistical Issues in Quantifying Text Mining Performance. PhD Dissertation, Duke University, Durham NC, USA."},{"key":"S1351324922000213_ref179","unstructured":"Grefenstette, G. and Tapanainen, P. (1994). What is a word, what is a sentence? Problems of tokenisation. Technical report, Rank Xerox Research Centre, Grenoble Laboratory, Meylan, France."},{"key":"S1351324922000213_ref40","doi-asserted-by":"publisher","DOI":"10.1080\/10862967809547290"},{"key":"S1351324922000213_ref343","unstructured":"Peitz, S. , Freitag, M. , Mauser, A. and Ney, H. (2011). Modeling punctuation prediction as machine translation. In International Workshop on Spoken Language Translation (IWSLT)."},{"key":"S1351324922000213_ref142","unstructured":"Elming, J. , Johannsen, A. , Klerke, S. , Lapponi, E. , Alonso, H.M. and S\u00f8gaard, A. (2013). Down-stream effects of tree-to-dependency conversions. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 
617\u2013626."},{"key":"S1351324922000213_ref245","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-25631-8_56"},{"key":"S1351324922000213_ref398","volume-title":"Text Mining with","author":"Silge","year":"2017"},{"key":"S1351324922000213_ref149","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2014.10.019"},{"key":"S1351324922000213_ref113","doi-asserted-by":"publisher","DOI":"10.1145\/3453692"},{"key":"S1351324922000213_ref201","doi-asserted-by":"publisher","DOI":"10.3115\/992628.992724"},{"key":"S1351324922000213_ref137","unstructured":"Dunning, T. (1994). Statistical identification of language. In Computing Research Laboratory Technical Memo MCCS 94-273. New Mexico State University."},{"key":"S1351324922000213_ref195","doi-asserted-by":"crossref","unstructured":"Hedderich, M.A. , Lange, L. , Adel, H. , Str\u00f6tgen, J. and Klakow, D. (2020). A survey on recent approaches for natural language processing in low-resource scenarios. arXiv preprint arXiv:2010.12309.","DOI":"10.18653\/v1\/2021.naacl-main.201"},{"key":"S1351324922000213_ref243","doi-asserted-by":"publisher","DOI":"10.1145\/1031171.1031285"},{"key":"S1351324922000213_ref365","first-page":"3","article-title":"Data cleaning: Problems and current approaches","volume":"23","author":"Rahm","year":"2000","journal-title":"Bulletin of the IEEE Computer Society Technical Committee on Data Engineering"},{"key":"S1351324922000213_ref342","first-page":"2825","article-title":"Scikit-learn: Machine learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"Journal of Machine Learning Research"},{"key":"S1351324922000213_ref91","doi-asserted-by":"publisher","DOI":"10.1109\/MIS.2017.4531228"},{"key":"S1351324922000213_ref315","doi-asserted-by":"publisher","DOI":"10.21105\/joss.00655"},{"key":"S1351324922000213_ref447","unstructured":"Wijffels, J. (2021). 
udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the \u2018UDPipe\u2019 \u2018NLP\u2019 Toolkit. $\\mathsf{R}$ package version 0.8.8. Available from: https:\/\/cran.r-project.org\/web\/packages\/udpipe\/udpipe.pdf."},{"key":"S1351324922000213_ref71","unstructured":"Blei, D.M. and Lafferty, J.D. (2009). Visualizing topics with multi-word expressions. arXiv preprint arXiv:0907.1013."},{"key":"S1351324922000213_ref89","unstructured":"Callan, J. , Hoy, M. , Yoo, C. and Zhao, L. (2009). The ClueWeb09 dataset. Available from: http:\/\/lemurproject.org\/clueweb09\/."},{"key":"S1351324922000213_ref214","unstructured":"Ji, Z. , Wei, Q. and Xu, H. (2020). BERT-based ranking for biomedical entity normalization. AMIA Summits on Translational Science Proceedings, 2020, 269\u2013277. AMIA stands for American Medical Informatics Association."},{"key":"S1351324922000213_ref135","doi-asserted-by":"crossref","unstructured":"Du, J. , Yu, P. and Zong, C. (2018). Towards computing technologies on machine parsing of English and Chinese garden path sentences. In Proceedings of the Future Technologies Conference. Springer, pp. 806\u2013827.","DOI":"10.1007\/978-3-030-02686-8_60"},{"key":"S1351324922000213_ref283","doi-asserted-by":"publisher","DOI":"10.1002\/asi.5090110403"},{"key":"S1351324922000213_ref65","doi-asserted-by":"crossref","unstructured":"Bernardy, J.-P. and Chatzikyriakidis, S. (2019). What kind of natural language inference are NLP systems learning: Is this enough? In ICAART (2), pp. 919\u2013931. 
ICAART stands for International Conference on Agents and Artificial Intelligence.","DOI":"10.5220\/0007683509190931"},{"key":"S1351324922000213_ref109","first-page":"1433","article-title":"P-hacking lexical richness through definitions of \u201ctype\u201d and \u201ctoken\u201d","volume":"264","author":"Cohen","year":"2019","journal-title":"Studies in Health Technology and Informatics"},{"key":"S1351324922000213_ref221","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-4003"},{"key":"S1351324922000213_ref183","doi-asserted-by":"publisher","DOI":"10.1075\/term.00013.gro"},{"key":"S1351324922000213_ref226","unstructured":"Kaewphan, S. , Mehryary, F. , Hakala, K. , Salakoski, T. and Ginter, F. (2017). TurkuNLP entry for interactive Bio-ID assignment. In Proceedings of the BioCreative VI Workshop, pp. 32\u201335."},{"key":"S1351324922000213_ref21","doi-asserted-by":"publisher","DOI":"10.3390\/app11031090"},{"key":"S1351324922000213_ref246","unstructured":"Kozakou, E. (2017). Word Adaptions in the Language of Twitter. Master\u2019s Thesis, Leiden University, Leiden, Netherlands."},{"key":"S1351324922000213_ref405","doi-asserted-by":"publisher","DOI":"10.3115\/1073445.1073475"},{"key":"S1351324922000213_ref468","doi-asserted-by":"publisher","DOI":"10.1038\/s41597-019-0055-0"},{"key":"S1351324922000213_ref276","doi-asserted-by":"publisher","DOI":"10.1145\/2441776.2441918"},{"key":"S1351324922000213_ref4","doi-asserted-by":"crossref","unstructured":"Abraham, A. , Dutta, P. , Mandal, J.K. , Bhattacharya, A. and Dutta, S. (2018). Emerging Technologies in Data Mining and Information Security: Proceedings of IEMIS 2018, Volume 2, vol. 813. Springer. IEMIS Stands for International Conference on Emerging Technologies in Data Mining and Information Security.","DOI":"10.1007\/978-981-13-1951-8"},{"key":"S1351324922000213_ref380","doi-asserted-by":"publisher","DOI":"10.1088\/1757-899X\/874\/1\/012017"},{"key":"S1351324922000213_ref25","unstructured":"Angiani, G. , Ferrari, L. 
, Fontanini, T. , Fornacciari, P. , Iotti, E. , Magliani, F. and Manicardi, S. (2016). A comparison between preprocessing techniques for sentiment analysis in Twitter. In Proceedings of the 2nd International Workshop on Knowledge Discovery on the WEB (KDWeb)."},{"key":"S1351324922000213_ref466","doi-asserted-by":"crossref","unstructured":"Zhang, L. and Komachi, M. (2018). Neural machine translation of logographic languages using sub-character level information. arXiv preprint arXiv:1809.02694.","DOI":"10.18653\/v1\/W18-6303"},{"key":"S1351324922000213_ref357","first-page":"2577","article-title":"Survey on text transformation using Bi-LSTM in natural language processing with text data","volume":"12","author":"Preethi","year":"2021","journal-title":"Turkish Journal of Computer and Mathematics Education (TURCOMAT)"},{"key":"S1351324922000213_ref10","unstructured":"Agi\u0107, \u017d. , Merkler, D. and Berovi\u0107, D. (2013). Parsing Croatian and Serbian by using Croatian dependency treebanks. In Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages, pp. 22\u201333."},{"key":"S1351324922000213_ref265","doi-asserted-by":"publisher","DOI":"10.1145\/3091108"},{"key":"S1351324922000213_ref335","first-page":"384","article-title":"In-depth evaluation of Romanian natural language processing pipelines","volume":"24","author":"Pais","year":"2021","journal-title":"Romanian Journal of Information Science and Technology"},{"key":"S1351324922000213_ref177","unstructured":"Grabar, N. , Zweigenbaum, P. , Soualmia, L. and Darmoni, S. (2003). Matching controlled vocabulary words. In Studies in Health Technology and Informatics, pp. 445\u2013450."},{"key":"S1351324922000213_ref286","unstructured":"Lusetti, M. , Ruzsics, T. , G\u00f6hring, A. , Samard\u017ei\u0107, T. and Stark, E. (2018). Encoder-decoder methods for text normalization. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018). 
Association for Computational Linguistics, pp. 18\u201328."},{"key":"S1351324922000213_ref406","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2007.05.012"},{"key":"S1351324922000213_ref171","doi-asserted-by":"publisher","DOI":"10.1145\/1143844.1143889"},{"key":"S1351324922000213_ref454","doi-asserted-by":"publisher","DOI":"10.1007\/s10278-019-00296-y"},{"key":"S1351324922000213_ref473","unstructured":"Zupon, A. , Crew, E. and Ritchie, S. (2021). Text normalization for low-resource languages of Africa. arXiv preprint arXiv:2103.15845."},{"key":"S1351324922000213_ref453","doi-asserted-by":"publisher","DOI":"10.1002\/(SICI)1097-4571(199605)47:5<357::AID-ASI3>3.0.CO;2-V"},{"key":"S1351324922000213_ref303","unstructured":"Microsoft (2019). Extract n-gram features from text. Available from: https:\/\/docs.microsoft.com\/en-us\/azure\/machine-learning\/studio-module-reference\/extract-n-gram-features-from-text."},{"key":"S1351324922000213_ref332","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1077"},{"key":"S1351324922000213_ref457","doi-asserted-by":"publisher","DOI":"10.1109\/NLPKE.2007.4368029"},{"key":"S1351324922000213_ref382","doi-asserted-by":"crossref","unstructured":"Sadvilkar, N. and Neumann, M. (2020). PySBD: Pragmatic sentence boundary disambiguation. arXiv preprint arXiv:2010.09657.","DOI":"10.18653\/v1\/2020.nlposs-1.15"},{"key":"S1351324922000213_ref394","unstructured":"\u015eeker, G.A. and Eryi\u011fit, G. (2012). Initial explorations on using CRFs for Turkish named entity recognition. In Proceedings of COLING 2012, pp. 2459\u20132474. CRFs stand for conditional random fields, and COLING stands for International Conference on Computational Linguistics."},{"key":"S1351324922000213_ref30","doi-asserted-by":"publisher","DOI":"10.1145\/2976767.2976769"},{"key":"S1351324922000213_ref202","doi-asserted-by":"publisher","DOI":"10.1075\/la.191.04cab"},{"key":"S1351324922000213_ref60","unstructured":"Benoit, K. , Muhr, D. and Watanabe, K. (2017). 
stopwords: Multilingual Stopword Lists. $\\mathsf{R}$ package version 0.9.0. Available from: https:\/\/CRAN.R-project.org\/package=stopwords."},{"key":"S1351324922000213_ref6","doi-asserted-by":"crossref","unstructured":"\u00c1cs, J. , K\u00e1d\u00e1r, \u00c1. and Kornai, A. (2021). Subword pooling makes a difference. arXiv preprint arXiv:2102.10864.","DOI":"10.18653\/v1\/2021.eacl-main.194"},{"key":"S1351324922000213_ref96","doi-asserted-by":"publisher","DOI":"10.1080\/09332480.2020.1726112"},{"key":"S1351324922000213_ref52","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2013.6638968"},{"key":"S1351324922000213_ref34","unstructured":"Atwood, J. (2008). Stack Overflow Podcast Episode 32. Available from: https:\/\/stackoverflow.blog\/2008\/12\/04\/podcast-32\/."},{"key":"S1351324922000213_ref79","doi-asserted-by":"publisher","DOI":"10.3115\/1699571.1699573"},{"key":"S1351324922000213_ref230","first-page":"7","article-title":"Preprocessing techniques for text mining","volume":"5","author":"Kannan","year":"2014","journal-title":"International Journal of Computer Science and Communication Networks"},{"key":"S1351324922000213_ref229","first-page":"123","volume-title":"Workshop of the Cross-Language Evaluation Forum for European Languages","author":"Kamps","year":"2004"},{"key":"S1351324922000213_ref192","doi-asserted-by":"publisher","DOI":"10.3115\/1621431.1621436"},{"key":"S1351324922000213_ref345","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2011.5947570"},{"key":"S1351324922000213_ref254","unstructured":"Lafferty, J.D. and Blei, D.M. (2006). Correlated topic models. In Advances in Neural Information Processing Systems, pp. 147\u2013154."},{"key":"S1351324922000213_ref306","unstructured":"Miller, J.-A. (2014). Language: Much ado about what? In Lacan and the Subject of Language, pp. 
21\u201335."},{"key":"S1351324922000213_ref368","first-page":"76","article-title":"A survey of stemming algorithms for information retrieval","volume":"17","author":"Rajput","year":"2015","journal-title":"International Organization of Scientific Research \u2013 Journal of Computer Engineering"},{"key":"S1351324922000213_ref381","doi-asserted-by":"publisher","DOI":"10.1037\/0033-2909.88.2.413"},{"key":"S1351324922000213_ref426","doi-asserted-by":"publisher","DOI":"10.1093\/jos\/ffx017"},{"key":"S1351324922000213_ref451","doi-asserted-by":"crossref","unstructured":"Xue, J. , Chen, J. , Hu, R. , Chen, C. , Zheng, C. and Zhu, T. (2020). Twitter discussions and concerns about COVID-19 pandemic: Twitter data analysis using a machine learning approach. arXiv preprint arXiv: 2005.12830.","DOI":"10.2196\/20550"},{"key":"S1351324922000213_ref146","doi-asserted-by":"publisher","DOI":"10.2200\/S00999ED3V01Y202003HLT046"},{"key":"S1351324922000213_ref15","first-page":"1","article-title":"Automatic learning of Arabic text categorization","volume":"2","author":"Al-Molegi","year":"2015","journal-title":"International Journal of Digital Contents and Applications"},{"key":"S1351324922000213_ref16","doi-asserted-by":"publisher","DOI":"10.26615\/978-954-452-072-4_007"},{"key":"S1351324922000213_ref1","doi-asserted-by":"publisher","DOI":"10.2196\/19016"},{"key":"S1351324922000213_ref321","unstructured":"Nayel, H.A. , Shashirekha, H. , Shindo, H. and Matsumoto, Y. (2019). Improving multi-word entity recognition for biomedical texts. arXiv preprint arXiv:1908.05691."},{"key":"S1351324922000213_ref150","first-page":"135","article-title":"How NLP can improve question answering","volume":"29","author":"Ferret","year":"2002","journal-title":"Knowledge Organization"},{"key":"S1351324922000213_ref125","doi-asserted-by":"publisher","DOI":"10.1017\/pan.2017.44"},{"key":"S1351324922000213_ref92","unstructured":"Campbell, W.M. , Li, L. , Dagli, C. , Acevedo-Aviles, J. , Geyer, K. , Campbell, J.P. 
and Priebe, C. (2016). Cross-domain entity resolution in social media. arXiv preprint arXiv:1608.01386."},{"key":"S1351324922000213_ref87","unstructured":"Cabot, C. , Soualmia, L.F. , Dahamna, B. and Darmoni, S.J. (2016). SIBM at CLEF eHealth Evaluation Lab 2016: Extracting concepts in French medical texts with ECMT and CIMIND. In CLEF (Working Notes), pp. 47\u201360. CLEF stands for Conference and Labs of the Evaluation Forum."},{"key":"S1351324922000213_ref169","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00032"},{"key":"S1351324922000213_ref374","unstructured":"Rikters, M. and Bojar, O. (2017). Paying attention to multi-word expressions in neural machine translation. arXiv preprint arXiv:1710.06313."},{"key":"S1351324922000213_ref416","volume-title":"Python Natural Language Processing","author":"Thanaki","year":"2017"},{"key":"S1351324922000213_ref474","unstructured":"Zweigenbaum, P. and Grabar, N. (2002a). Accenting unknown words: Application to the French version of the MeSH. In Workshop NLP in Biomedical Applications, EFMI, Cyprus, 69\u00e1. EFMI stands for European Federation for Medical Informatics."},{"key":"S1351324922000213_ref273","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9781139084789"},{"key":"S1351324922000213_ref275","doi-asserted-by":"publisher","DOI":"10.1007\/978-0-387-39940-9"},{"key":"S1351324922000213_ref242","doi-asserted-by":"publisher","DOI":"10.5121\/ijaia.2012.3208"},{"key":"S1351324922000213_ref302","unstructured":"Miah, M. (2009). Improved k-NN algorithm for text classification. In Proceedings of the 2009 International Conference on Data Mining (DMIN). Citeseer, pp. 
434\u2013440."},{"key":"S1351324922000213_ref259","doi-asserted-by":"publisher","DOI":"10.1108\/14684520710841829"},{"key":"S1351324922000213_ref472","doi-asserted-by":"publisher","DOI":"10.1002\/asi.23186"},{"key":"S1351324922000213_ref317","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-34614-0_5"},{"key":"S1351324922000213_ref449","doi-asserted-by":"crossref","unstructured":"Wong, D.F. , Chao, L.S. and Zeng, X. (2014). iSentenizer- $\\mu$ : Multilingual sentence boundary detection model. The Scientific World Journal 2014, 1\u201310.","DOI":"10.1155\/2014\/196574"},{"key":"S1351324922000213_ref101","doi-asserted-by":"publisher","DOI":"10.1109\/ICALIP.2016.7846525"},{"key":"S1351324922000213_ref176","volume-title":"Regular Expressions Cookbook","author":"Goyvaerts","year":"2012"},{"key":"S1351324922000213_ref128","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W15-2605"},{"key":"S1351324922000213_ref147","doi-asserted-by":"publisher","DOI":"10.1109\/SIET.2017.8304154"},{"key":"S1351324922000213_ref70","unstructured":"Blanco, E. and Moldovan, D. (2011). Some issues on detecting negation from text. In Proceedings of the Twenty-Fourth International Florida Artificial Intelligence Research Society (FLAIRS) Conference."},{"key":"S1351324922000213_ref316","unstructured":"Munro, R. (2015). Language at ACL this year. ACL stands for the annual conference of the Association for Computational Linguistics. Available from: http:\/\/www.junglelightspeed.com\/languages-at-acl-this-year\/."},{"key":"S1351324922000213_ref2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N16-3003"},{"key":"S1351324922000213_ref249","doi-asserted-by":"publisher","DOI":"10.26615\/issn.2603-2821.2021_013"},{"key":"S1351324922000213_ref172","unstructured":"Goldberg, Y. and Orwant, J. (2013). A dataset of syntactic-ngrams over time from a very large corpus of English books. Technical report, Google Research."},{"key":"S1351324922000213_ref324","unstructured":"Ng, D. , Bansal, M. 
and Curran, J.R. (2015). Web-scale surface and syntactic n-gram features for dependency parsing. arXiv preprint arXiv:1502.07038."},{"key":"S1351324922000213_ref397","doi-asserted-by":"publisher","DOI":"10.1007\/s10791-018-9340-3"},{"key":"S1351324922000213_ref123","article-title":"VecText: Converting documents to vectors","volume":"46","author":"Da\u0159ena","year":"2019","journal-title":"IAENG stands for International Association of Engineers."},{"key":"S1351324922000213_ref81","unstructured":"Briscoe, T. (1996). The syntax and semantics of punctuation and its use in interpretation. In Proceedings of the Association for Computational Linguistics Workshop on Punctuation, pp. 1\u20137."},{"key":"S1351324922000213_ref433","doi-asserted-by":"publisher","DOI":"10.29007\/gln9"},{"key":"S1351324922000213_ref465","doi-asserted-by":"publisher","DOI":"10.1016\/j.autcon.2016.08.027"},{"key":"S1351324922000213_ref77","doi-asserted-by":"crossref","unstructured":"Bollmann, M. (2019). A large-scale comparison of historical text normalization systems. arXiv preprint arXiv:1904.02036.","DOI":"10.18653\/v1\/N19-1389"},{"key":"S1351324922000213_ref323","doi-asserted-by":"publisher","DOI":"10.1080\/00107510500052444"},{"key":"S1351324922000213_ref410","doi-asserted-by":"publisher","DOI":"10.1111\/j.1468-0394.2010.00575.x"},{"key":"S1351324922000213_ref459","doi-asserted-by":"publisher","DOI":"10.1109\/IFITA.2010.73"},{"key":"S1351324922000213_ref444","volume-title":"Neural Representations of Natural Language","volume":"783","author":"White","year":"2018"},{"key":"S1351324922000213_ref117","unstructured":"Councill, I. , McDonald, R. and Velikovich, L. (2010). What\u2019s great and what\u2019s not: Learning to classify the scope of negation for improved sentiment analysis. In Proceedings of the Workshop on Negation and Speculation in Natural Language Processing, pp. 
51\u201359."},{"key":"S1351324922000213_ref319","first-page":"607","article-title":"Predicting mental illness using social media posts and comments","volume":"11","author":"Nasir","year":"2020","journal-title":"International Journal of Advanced Computer Science and Applications"},{"key":"S1351324922000213_ref184","unstructured":"Groza, T. and Verspoor, K. (2014). Automated generation of test suites for error analysis of concept recognition systems. In Proceedings of the Australasian Language Technology Association Workshop 2014, pp. 23\u201331."},{"key":"S1351324922000213_ref162","unstructured":"Frigau, L. , Wu, Q. and Banks, D. (2021). Optimizing the JSM program. Journal of the American Statistical Association 00(0), 1\u201310. JSM stands for Joint Statistical Meetings."},{"key":"S1351324922000213_ref203","doi-asserted-by":"publisher","DOI":"10.1007\/11861461_10"},{"key":"S1351324922000213_ref187","unstructured":"Habert, B. , Adda, G. , Adda-Decker, M. , de Mar\u00ebuil, P.B. , Ferrari, S. , Ferret, O. , Illouz, G. and Paroubek, P. (1998). Towards tokenization evaluation. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC), pp. 427\u2013431."},{"key":"S1351324922000213_ref213","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P14-1002"},{"key":"S1351324922000213_ref290","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511809071"},{"key":"S1351324922000213_ref126","unstructured":"Deoras, A. , Mikolov, T. and Church, K. (2011). A fast re-scoring strategy to capture long-distance dependencies. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 1116\u20131127."},{"key":"S1351324922000213_ref463","unstructured":"Zhang, C. , Baldwin, T. , Ho, H. , Kimelfeld, B. and Li, Y. (2013). Adaptive parser-centric text normalization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 
1159\u20131168."},{"key":"S1351324922000213_ref90","doi-asserted-by":"crossref","unstructured":"Camacho-Collados, J. and Pilehvar, M.T. (2018). On the role of text preprocessing in neural network architectures: An evaluation study on text categorization and sentiment analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 40\u201346. EMNLP stands for Conference on Empirical Methods in Natural Language Processing.","DOI":"10.18653\/v1\/W18-5406"},{"key":"S1351324922000213_ref404","doi-asserted-by":"publisher","DOI":"10.1002\/sam.11197"},{"key":"S1351324922000213_ref256","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-48941-4_5"},{"key":"S1351324922000213_ref206","doi-asserted-by":"publisher","DOI":"10.1145\/1935826.1935857"},{"key":"S1351324922000213_ref372","doi-asserted-by":"publisher","DOI":"10.3115\/974557.974561"},{"key":"S1351324922000213_ref220","unstructured":"Jones, R. (2006). Internet Slang Dictionary. Durham NC, USA: Lulu.com."},{"key":"S1351324922000213_ref386","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-96292-4_18"},{"key":"S1351324922000213_ref119","doi-asserted-by":"publisher","DOI":"10.1177\/0165551504042805"},{"key":"S1351324922000213_ref296","unstructured":"McAuliffe, J.D. and Blei, D.M. (2008). Supervised topic models. In Advances in Neural Information Processing Systems, pp. 121\u2013128."},{"key":"S1351324922000213_ref255","unstructured":"Lahiri, S. and Mihalcea, R. (2013). Using n-gram and word network features for native language identification. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 251\u2013259."},{"key":"S1351324922000213_ref320","first-page":"2319","article-title":"Survey on pre-processing techniques for text mining","author":"Nayak","year":"2016","journal-title":"International Journal of Engineering and Computer Science"},{"key":"S1351324922000213_ref334","unstructured":"Paice, C. 
and Hooper, R. (2005). Lancaster stemmer. Available from: http:\/\/www.comp.lancs.ac.uk\/computing\/research\/stemming\/[last accessed August 2019]."},{"key":"S1351324922000213_ref325","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00140"},{"key":"S1351324922000213_ref268","doi-asserted-by":"publisher","DOI":"10.21236\/ADA477571"},{"key":"S1351324922000213_ref340","doi-asserted-by":"publisher","DOI":"10.5120\/19418-0910"},{"key":"S1351324922000213_ref350","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1326"},{"key":"S1351324922000213_ref232","doi-asserted-by":"publisher","DOI":"10.1201\/9780367808495-7"},{"key":"S1351324922000213_ref379","unstructured":"Rosenthal, S. and McKeown, K. (2013). Columbia NLP: Sentiment detection of subjective phrases in social media. In Second Joint Conference on Lexical and Computational Semantics (SEM): Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), vol. 2, pp. 478\u2013482."},{"key":"S1351324922000213_ref355","doi-asserted-by":"publisher","DOI":"10.1016\/j.jbi.2017.05.019"},{"key":"S1351324922000213_ref244","first-page":"453","volume-title":"International Conference on Applications of Natural Language to Information Systems","author":"Koto","year":"2015"},{"key":"S1351324922000213_ref420","doi-asserted-by":"publisher","DOI":"10.1002\/9781119004752"},{"key":"S1351324922000213_ref73","doi-asserted-by":"crossref","unstructured":"Bodapati, S. , Yun, H. and Al-Onaizan, Y. (2019). Robustness to capitalization errors in named entity recognition. 
arXiv preprint arXiv:1911.05241.","DOI":"10.18653\/v1\/D19-5531"},{"key":"S1351324922000213_ref36","doi-asserted-by":"publisher","DOI":"10.1109\/ICCRD.2011.5764181"},{"key":"S1351324922000213_ref14","first-page":"44","article-title":"Efficient algorithms for preprocessing and stemming of tweets in a sentiment analysis system","volume":"9","author":"Al-Khafaji","year":"2017","journal-title":"International Organization of Scientific Research \u2013 Journal of Computer Engineering"},{"key":"S1351324922000213_ref314","unstructured":"Mubarak, H. (2017). Build fast and accurate lemmatization for Arabic. arXiv preprint arXiv:1710.06700."},{"key":"S1351324922000213_ref458","doi-asserted-by":"publisher","DOI":"10.1007\/s10772-018-9521-x"},{"key":"S1351324922000213_ref250","unstructured":"Kutuzov, A. , Fares, M. , Oepen, S. and Velldal, E. (2017). Word vectors, reuse, and replicability: Towards a community repository of large-text resources. In Proceedings of the 58th Conference on Simulation and Modelling, Link\u00f6ping, Sweden. Link\u00f6ping University Electronic Press, pp. 271\u2013276."},{"key":"S1351324922000213_ref370","doi-asserted-by":"publisher","DOI":"10.1080\/19312458.2018.1555798"},{"key":"S1351324922000213_ref371","unstructured":"\u0158eh\u016f\u0159ek, R. and Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45\u201350. LREC stands for Conference on Language Resources and Evaluation."},{"key":"S1351324922000213_ref134","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1128"},{"key":"S1351324922000213_ref464","doi-asserted-by":"publisher","DOI":"10.1109\/WCRE.2008.37"},{"key":"S1351324922000213_ref419","unstructured":"Torres-Moreno, J.-M. (2012). Beyond stemming and lemmatization: Ultra-stemming to improve automatic text summarization. 
arXiv preprint arXiv:1209.3126."},{"key":"S1351324922000213_ref240","doi-asserted-by":"crossref","unstructured":"Kitaev, N. , Cao, S. and Klein, D. (2019). Multilingual constituency parsing with self-attention and pre-training. arXiv preprint arXiv:1812.11760.","DOI":"10.18653\/v1\/P19-1340"},{"key":"S1351324922000213_ref185","first-page":"1222","article-title":"A closer look at skip-gram modelling","volume":"6","author":"Guthrie","year":"2006","journal-title":"In LREC (International Conference on Language Resources and Evaluation)"},{"key":"S1351324922000213_ref434","doi-asserted-by":"publisher","DOI":"10.1145\/1143844.1143967"},{"key":"S1351324922000213_ref24","doi-asserted-by":"publisher","DOI":"10.4324\/9780203167502"},{"key":"S1351324922000213_ref39","doi-asserted-by":"publisher","DOI":"10.1093\/oxfordhb\/9780199591428.001.0001"},{"key":"S1351324922000213_ref363","unstructured":"Raff, E. , Fleming, W. , Zak, R. , Anderson, H. , Finlayson, B. , Nicholas, C. and McLean, M. (2019). Kilograms: Very large n-grams for malware classification. arXiv preprint arXiv:1908.00200."},{"key":"S1351324922000213_ref234","unstructured":"Kaufmann, M. and Kalita, J. (2010). Syntactic normalization of Twitter messages. In International Conference on Natural Language Processing, Kharagpur, India."},{"key":"S1351324922000213_ref130","doi-asserted-by":"crossref","unstructured":"Dodge, J. , Gururangan, S. , Card, D. , Schwartz, R. and Smith, N.A. (2019). Show your work: Improved reporting of experimental results. arXiv preprint arXiv:1909.03004.","DOI":"10.18653\/v1\/D19-1224"},{"key":"S1351324922000213_ref194","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pbio.1002106"},{"key":"S1351324922000213_ref471","unstructured":"Zipf, G.K. (1949). Human behavior and the principle of least effort."},{"key":"S1351324922000213_ref17","doi-asserted-by":"publisher","DOI":"10.3389\/frai.2020.00042"},{"key":"S1351324922000213_ref110","doi-asserted-by":"crossref","unstructured":"Cohen, K.B. 
, Ogren, P.V. , Fox, L. and Hunter, L. (2005). Empirical data on corpus design and usage in biomedical natural language processing. In American Medical Informatics Association (AMIA) Annual Symposium Proceedings, vol. 2005, p. 156.","DOI":"10.3115\/1641484.1641490"},{"key":"S1351324922000213_ref78","unstructured":"Bouchet-Valat, M. (2019). SnowballC: Snowball Stemmers Based on the C \u2018libstemmer\u2019 UTF-8 Library. $\\mathsf{R}$ package version 0.6.0. Available from: https:\/\/CRAN.R-project.org\/package=SnowballC."},{"key":"S1351324922000213_ref145","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/E17-2010"},{"key":"S1351324922000213_ref269","doi-asserted-by":"publisher","DOI":"10.3115\/1073445.1073465"},{"key":"S1351324922000213_ref8","doi-asserted-by":"publisher","DOI":"10.1177\/1847979019890771"},{"key":"S1351324922000213_ref148","unstructured":"Feinerer, I. and Hornik, K. (2018). tm: Text Mining Package. $\\mathsf{R}$ package version 0.7.6. Available from: https:\/\/CRAN.R-project.org\/package=tm."},{"key":"S1351324922000213_ref207","doi-asserted-by":"crossref","unstructured":"Hutto, C. and Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the International AAAI Conference on Web and Social Media, vol. 8, pp. 216\u2013225. AAAI stands for Association for the Advancement of Artificial Intelligence.","DOI":"10.1609\/icwsm.v8i1.14550"},{"key":"S1351324922000213_ref411","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.383"},{"key":"S1351324922000213_ref441","first-page":"365","article-title":"N-grams based feature selection and text representation for Chinese text classification","volume":"2","author":"Wei","year":"2009","journal-title":"International Journal of Computational Intelligence Systems"},{"key":"S1351324922000213_ref267","unstructured":"Li, L. , Shao, Y. , Song, D. , Qiu, X. and Huang, X. (2020). 
Generating adversarial examples in Chinese texts using sentence-pieces. arXiv preprint arXiv:2012.14769."},{"key":"S1351324922000213_ref295","unstructured":"Matusov, E. , Leusch, G. , Bender, O. and Ney, H. (2005). Evaluating machine translation output with automatic sentence segmentation. In International Workshop on Spoken Language Translation (IWSLT)."},{"key":"S1351324922000213_ref180","doi-asserted-by":"publisher","DOI":"10.1075\/ijcl.15.4.04gri"},{"key":"S1351324922000213_ref442","doi-asserted-by":"publisher","DOI":"10.1080\/19312458.2017.1387238"},{"key":"S1351324922000213_ref188","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0232525"},{"key":"S1351324922000213_ref417","unstructured":"Thione, G. and van den Berg, M. (2007). Systems and methods for structural indexing of natural language text. Google Patents. US Patent Application No. 11\/405,385."},{"key":"S1351324922000213_ref258","doi-asserted-by":"publisher","DOI":"10.1080\/0907676X.1996.9961277"},{"key":"S1351324922000213_ref182","unstructured":"Grobelnik, M. and Mladenic, D. (2004). Text-mining tutorial. Technical report, Jo\u017eef Stefan Institute (JSI), Slovenia."},{"key":"S1351324922000213_ref336","unstructured":"Pak, A. and Paroubek, P. (2010). Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC), pp. 1320\u20131326."},{"key":"S1351324922000213_ref333","unstructured":"Olde, B.A. , Hoeffner, J. , Chipman, P. , Graesser, A.C. and Research Group, Tutoring (1999). A connectionist model for part of speech tagging. In Florida Artificial Intelligence Research Society (FLAIRS) Conference, pp. 
172\u2013176."},{"key":"S1351324922000213_ref462","doi-asserted-by":"publisher","DOI":"10.1109\/APSIPA.2017.8282279"},{"key":"S1351324922000213_ref346","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"S1351324922000213_ref413","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-20244-5_33"},{"key":"S1351324922000213_ref467","unstructured":"Zhang, X. and LeCun, Y. (2017). Which encoding is the best for text classification in Chinese, English, Japanese and Korean? arXiv preprint arXiv:1708.02657."},{"key":"S1351324922000213_ref85","unstructured":"Buck, C. , Heafield, K. and Van Ooyen, B. (2014). N-gram counts and language models from the common crawl. In LREC (International Conference on Language Resources and Evaluation), pp. 3579\u20133584."},{"key":"S1351324922000213_ref7","doi-asserted-by":"publisher","DOI":"10.1186\/s40537-019-0254-8"},{"key":"S1351324922000213_ref337","doi-asserted-by":"publisher","DOI":"10.3115\/1220175.1220190"},{"key":"S1351324922000213_ref29","unstructured":"Armengol Estap\u00e9, J. (2021). A Pipeline for Large Raw Text Preprocessing and Model Training of Language Models at Scale. Master\u2019s Thesis, Universitat Polit\u00e8cnica de Catalunya, Barcelona, Spain."},{"key":"S1351324922000213_ref32","first-page":"295","volume-title":"International Conference on Applications of Natural Language to Information Systems","author":"Ashok","year":"2019"},{"key":"S1351324922000213_ref284","doi-asserted-by":"publisher","DOI":"10.1111\/faf.12399"},{"key":"S1351324922000213_ref322","unstructured":"N\u00e9v\u00e9ol, A. , Robert, A. , Anderson, R. , Cohen, K.B. , Grouin, C. , Lavergne, T. , Rey, G. , Rondet, C. and Zweigenbaum, P. (2017). CLEF eHealth 2017 multilingual information extraction task overview: ICD10 coding of death certificates in English and French. In CLEF (Working Notes). 
CLEF stands for Conference and Labs of the Evaluation Forum."},{"key":"S1351324922000213_ref62","doi-asserted-by":"publisher","DOI":"10.1145\/2452376.2452389"},{"key":"S1351324922000213_ref190","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4842-3925-4_3"},{"key":"S1351324922000213_ref42","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/N15-1045"},{"key":"S1351324922000213_ref54","doi-asserted-by":"publisher","DOI":"10.1145\/1229179.1229183"},{"key":"S1351324922000213_ref223","unstructured":"Jurafsky, D. and Martin, J.H. (2008). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd Edn., Fort Collins, CO, USA: Prentice Hall Series in Artificial Intelligence."},{"key":"S1351324922000213_ref131","unstructured":"Domingo, M. , Garc\u00eda-Mart\u00ednez, M. , Helle, A. , Casacuberta, F. and Herranz, M. (2018). How much does tokenization affect neural machine translation? arXiv preprint arXiv:1812.08621."},{"key":"S1351324922000213_ref99","volume-title":"Python Social Media Analytics","author":"Chatterjee","year":"2017"},{"key":"S1351324922000213_ref53","unstructured":"Beaver, I. (2019). pycontractions 2.0.1. Python library. Available from: https:\/\/pypi.org\/project\/pycontractions\/."},{"key":"S1351324922000213_ref35","unstructured":"Au, T.C. (2014). Topics in Computational Advertising. PhD Dissertation, Duke University, Durham NC, USA."},{"key":"S1351324922000213_ref153","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9781107053779"},{"key":"S1351324922000213_ref69","doi-asserted-by":"publisher","DOI":"10.1016\/j.wpi.2007.02.002"},{"key":"S1351324922000213_ref104","unstructured":"Chua, M. , Van Esch, D. , Coccaro, N. , Cho, E. , Bhandari, S. and Jia, L. (2018). Text normalization infrastructure that scales to hundreds of language varieties. 
In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)."},{"key":"S1351324922000213_ref112","unstructured":"Cohen, K.B. , Tanabe, L. , Kinoshita, S. and Hunter, L. (2004). A resource for constructing customized test suites for molecular biology entity identification systems. In HLT-NAACL 2004 Workshop: Linking Biological Literature, Ontologies and Databases, pp. 1\u20138. HLT-NAACL stands for Human Language Technologies \u2013 North American Chapter of the Association for Computational Linguistics."},{"key":"S1351324922000213_ref448","unstructured":"Wilson, K.S. (2009). Database search control. US Patent Application No. 12\/031,701."},{"key":"S1351324922000213_ref61","doi-asserted-by":"publisher","DOI":"10.21105\/joss.00774"},{"key":"S1351324922000213_ref80","unstructured":"Brants, T. , Popat, A.C. , Xu, P. , Och, F.J. and Dean, J. (2007). Large language models in machine translation. Technical report, Google Research."},{"key":"S1351324922000213_ref401","first-page":"35","article-title":"Modern information retrieval: A brief overview","volume":"24","author":"Singhal","year":"2001","journal-title":"Bulletin of the IEEE Computer Society Technical Committee on Data Engineering"},{"key":"S1351324922000213_ref247","doi-asserted-by":"crossref","unstructured":"Kudo, T. and Richardson, J. (2018). Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.","DOI":"10.18653\/v1\/D18-2012"},{"key":"S1351324922000213_ref76","unstructured":"Bollmann, M. (2013). POS tagging for historical texts with sparse training data. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pp. 
11\u201318."},{"key":"S1351324922000213_ref422","doi-asserted-by":"publisher","DOI":"10.1145\/1277741.1277917"},{"key":"S1351324922000213_ref84","first-page":"467","article-title":"Class-based n-gram models of natural language","volume":"18","author":"Brown","year":"1992","journal-title":"Computational Linguistics"},{"key":"S1351324922000213_ref312","doi-asserted-by":"publisher","DOI":"10.1186\/gb-2008-9-s2-s3"},{"key":"S1351324922000213_ref219","doi-asserted-by":"publisher","DOI":"10.3115\/991886.991960"},{"key":"S1351324922000213_ref263","doi-asserted-by":"publisher","DOI":"10.1109\/STARTUP.2016.7583963"},{"key":"S1351324922000213_ref423","volume-title":"Eats, Shoots and Leaves: The Zero Tolerance Approach to Punctuation","author":"Truss","year":"2004"},{"key":"S1351324922000213_ref274","doi-asserted-by":"publisher","DOI":"10.1186\/2041-1480-3-3"},{"key":"S1351324922000213_ref11","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-99344-7"},{"key":"S1351324922000213_ref198","doi-asserted-by":"publisher","DOI":"10.1023\/A:1009761603038"},{"key":"S1351324922000213_ref38","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2015.02.031"},{"key":"S1351324922000213_ref354","doi-asserted-by":"publisher","DOI":"10.1108\/eb046814"},{"key":"S1351324922000213_ref299","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-4817"},{"key":"S1351324922000213_ref358","volume-title":"Computational Linguistics: Applications","volume":"458","author":"Przepi\u00f3rkowski","year":"2012"},{"key":"S1351324922000213_ref105","doi-asserted-by":"publisher","DOI":"10.1109\/WI.2018.00008"},{"key":"S1351324922000213_ref310","doi-asserted-by":"publisher","DOI":"10.2197\/ipsjjip.29.490"},{"key":"S1351324922000213_ref140","unstructured":"Ek, A. , Bernardy, J.-P. and Chatzikyriakidis, S. (2020). How does punctuation affect neural models in natural language inference. In Proceedings of the Probability and Meaning Conference (PaM 2020), pp. 
109\u2013116."},{"key":"S1351324922000213_ref175","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-18818-8_11"},{"key":"S1351324922000213_ref384","volume-title":"Multiword Expressions: Insights from a Multi-Lingual Perspective","author":"Sailer","year":"2018"},{"key":"S1351324922000213_ref44","unstructured":"Bardoel, T. (2012). Comparing n-gram frequency distributions. Technical report, Tilburg center for Cognition and Communication (TiCC), Tilburg University, Tilburg, Netherlands."},{"key":"S1351324922000213_ref421","unstructured":"Tran, D. and Sharma, D. (2005). Markov models for written language identification. In Proceedings of the 12th International Conference on Neural Information Processing, pp. 67\u201370."},{"key":"S1351324922000213_ref369","doi-asserted-by":"crossref","first-page":"15","DOI":"10.5120\/ijca2016911462","article-title":"Stop-word removal algorithm and its implementation for Sanskrit language","volume":"150","author":"Raulji","year":"2016","journal-title":"International Journal of Computer Applications"},{"key":"S1351324922000213_ref9","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4614-3223-4"},{"key":"S1351324922000213_ref205","doi-asserted-by":"publisher","DOI":"10.1145\/2559168"},{"key":"S1351324922000213_ref136","doi-asserted-by":"publisher","DOI":"10.3115\/1273073.1273096"},{"key":"S1351324922000213_ref200","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1158"},{"key":"S1351324922000213_ref235","article-title":"A systematic review on stopword removal algorithms","volume":"4","author":"Kaur","year":"2018","journal-title":"International Journal on Future Revolution in Computer Science and Communication 
Engineering"},{"key":"S1351324922000213_ref279","doi-asserted-by":"publisher","DOI":"10.1515\/9783110682564-003"},{"key":"S1351324922000213_ref445","doi-asserted-by":"publisher","DOI":"10.1145\/1277741.1277787"},{"key":"S1351324922000213_ref86","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-69805-2_30"},{"key":"S1351324922000213_ref391","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/E17-2069"},{"key":"S1351324922000213_ref22","doi-asserted-by":"publisher","DOI":"10.1109\/RWEEK.2015.7287440"},{"key":"S1351324922000213_ref95","doi-asserted-by":"publisher","DOI":"10.29115\/SP-2018-0035"},{"key":"S1351324922000213_ref291","unstructured":"Mansurov, B. and Mansurov, A. (2021). Uzbek Cyrillic-Latin-Cyrillic machine transliteration. arXiv preprint arXiv:2101.05162."},{"key":"S1351324922000213_ref415","unstructured":"Temnikova, I. , Nikolova, I. , Baumgartner, W.A. Jr , Angelova, G. and Cohen, K.B. (2013). Closure properties of Bulgarian clinical text. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP), pp. 667\u2013675."},{"key":"S1351324922000213_ref233","doi-asserted-by":"publisher","DOI":"10.1007\/978-981-15-6876-3_31"},{"key":"S1351324922000213_ref154","doi-asserted-by":"publisher","DOI":"10.1111\/rssa.12424"},{"key":"S1351324922000213_ref181","doi-asserted-by":"publisher","DOI":"10.1109\/MIS.2015.68"},{"key":"S1351324922000213_ref248","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4842-4267-4"},{"key":"S1351324922000213_ref293","first-page":"145","article-title":"Multi-word expressions between syntax and the lexicon: The case of Italian verb-particle constructions","volume":"18","author":"Masini","year":"2005","journal-title":"SKY stands for Suomen kielitieteellinen yhdistys, from the Linguistic Association of Finland."},{"key":"S1351324922000213_ref164","unstructured":"Gabernet, A.R. and Limburn, J. (2017). Breaking the 80\/20 rule: How data catalogs transform data scientists\u2019 productivity. 
Available from: https:\/\/www.ibm.com\/cloud\/blog\/ibm-data-catalog-data-scientists-productivity."},{"key":"S1351324922000213_ref395","doi-asserted-by":"publisher","DOI":"10.1007\/978-94-007-0134-2"},{"key":"S1351324922000213_ref443","doi-asserted-by":"publisher","DOI":"10.3115\/1220575.1220681"},{"key":"S1351324922000213_ref152","unstructured":"Finlayson, M. and Kulkarni, N. (2011). Detecting multi-word expressions improves word sense disambiguation. In Proceedings of the Workshop on Multiword Expressions: From Parsing and Generation to the Real World, pp. 20\u201324."},{"key":"S1351324922000213_ref55","unstructured":"Bekkerman, R. and Allan, J. (2004). Using bigrams in text categorization. Technical report, IR-408, Center of Intelligent Information Retrieval, University of Massachusetts at Amherst, Amherst MA, United States."},{"key":"S1351324922000213_ref252","doi-asserted-by":"publisher","DOI":"10.1002\/9781119282105"},{"key":"S1351324922000213_ref313","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2013.01.019"},{"key":"S1351324922000213_ref344","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1110"},{"key":"S1351324922000213_ref47","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-00958-7_69"},{"key":"S1351324922000213_ref23","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-95663-3_4"},{"key":"S1351324922000213_ref209","doi-asserted-by":"publisher","DOI":"10.1201\/9781420085938"},{"key":"S1351324922000213_ref63","doi-asserted-by":"publisher","DOI":"10.14232\/actacyb.23.3.2018.5"},{"key":"S1351324922000213_ref151","doi-asserted-by":"publisher","DOI":"10.3390\/info12020052"},{"key":"S1351324922000213_ref253","doi-asserted-by":"publisher","DOI":"10.1007\/s00426-021-01540-3"},{"key":"S1351324922000213_ref359","first-page":"4","article-title":"Brute-force sentence pattern extortion from harmful messages for cyberbullying detection","volume":"20","author":"Ptaszynski","year":"2019","journal-title":"Journal of the Association for Information 
Systems"},{"key":"S1351324922000213_ref327","unstructured":"Nivre, J. (2005). Dependency grammar and dependency parsing. MSI Report 5133(1959), 1\u201332. MSI report is from V\u00e4xj\u00f6 University, V\u00e4xj\u00f6, Sweden."},{"key":"S1351324922000213_ref377","doi-asserted-by":"publisher","DOI":"10.1111\/ajps.12103"},{"key":"S1351324922000213_ref212","volume-title":"The Lexicon: An Introduction","author":"Je\u017eek","year":"2016"},{"key":"S1351324922000213_ref475","doi-asserted-by":"publisher","DOI":"10.1016\/S1386-5056(02)00056-4"},{"key":"S1351324922000213_ref231","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-84628-754-1"},{"key":"S1351324922000213_ref227","doi-asserted-by":"publisher","DOI":"10.1016\/j.cognition.2004.01.002"},{"key":"S1351324922000213_ref191","unstructured":"Hansen, C. , Hansen, C. , Simonsen, J.G. and Lioma, C. (2018). The Copenhagen team participation in the check-worthiness task of the competition of automatic identification and verification of claims in political debates of the CLEF-2018 CheckThat! lab. In CLEF (Working Notes). CLEF stands for Conference and Labs of the Evaluation Forum."},{"key":"S1351324922000213_ref108","doi-asserted-by":"publisher","DOI":"10.3115\/1118149.1118152"},{"key":"S1351324922000213_ref362","unstructured":"Qudar, M. and Mago, V. (2020). A survey on language models. Technical report, Lakehead University, Thunder Bay, Ontario, Canada."},{"key":"S1351324922000213_ref318","unstructured":"Narala, S. , Rani, B.P. and Ramakrishna, K. (2017). Telugu text categorization using language models. Global Journal of Computer Science and Technology 16(4), 9\u201313."},{"key":"S1351324922000213_ref127","unstructured":"Devlin, J. , Chang, M.-W. , Lee, K. and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805."},{"key":"S1351324922000213_ref330","unstructured":"Nugent, R. (2020). 
Instead of just teaching data science, let\u2019s understand how and why people do it. Symposium on Data Science and Statistics. Abstract available from: https:\/\/ww2.amstat.org\/meetings\/sdss\/2020\/onlineprogram\/AbstractDetails.cfm?AbstractID=308230."},{"key":"S1351324922000213_ref210","doi-asserted-by":"publisher","DOI":"10.1017\/S0269888914000277"},{"key":"S1351324922000213_ref376","doi-asserted-by":"crossref","unstructured":"Rinker, T.W. (2018b). textclean: Text Cleaning Tools. $\\mathsf{R}$ package version 0.9.3. Available from: https:\/\/github.com\/trinker\/textclean.","DOI":"10.32614\/CRAN.package.textclean"},{"key":"S1351324922000213_ref114","doi-asserted-by":"publisher","DOI":"10.1162\/COLI_a_00302"},{"key":"S1351324922000213_ref266","doi-asserted-by":"publisher","DOI":"10.1145\/2911451.2911499"},{"key":"S1351324922000213_ref349","unstructured":"Petic, M. and G\u00eefu, D. (2014). Transliteration and alignment of parallel texts from Cyrillic to Latin. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC\u201914), pp. 1819\u20131823."},{"key":"S1351324922000213_ref436","doi-asserted-by":"publisher","DOI":"10.1145\/3444370.3444557"},{"key":"S1351324922000213_ref440","doi-asserted-by":"publisher","DOI":"10.3115\/992424.992434"},{"key":"S1351324922000213_ref41","first-page":"267","article-title":"Multiword expressions","volume":"2","author":"Baldwin","year":"2010","journal-title":"Handbook of Natural Language Processing"},{"key":"S1351324922000213_ref120","unstructured":"CrowdFlower (2017). 2017 Data Science Report. 
Available from: https:\/\/visit.figure-eight.com\/rs\/416-ZBE-142\/images\/CrowdFlower_DataScienceReport.pdf."},{"key":"S1351324922000213_ref452","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W15-4320"},{"key":"S1351324922000213_ref129","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-72347-1"},{"key":"S1351324922000213_ref469","doi-asserted-by":"publisher","DOI":"10.1016\/j.ijmedinf.2019.06.020"},{"key":"S1351324922000213_ref118","doi-asserted-by":"publisher","DOI":"10.1007\/978-0-387-73742-3"},{"key":"S1351324922000213_ref122","doi-asserted-by":"publisher","DOI":"10.1109\/T4E.2011.21"},{"key":"S1351324922000213_ref309","unstructured":"Moon, S. and Okazaki, N. (2020). Jamo pair encoding: Subcharacter representation-based extreme Korean vocabulary compression for efficient subword tokenization. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 3490\u20133497."},{"key":"S1351324922000213_ref393","unstructured":"Schuur, Y. (2020). Normalization for Dutch for Improved POS Tagging. Master\u2019s Thesis, University of Groningen, Groningen, Netherlands."},{"key":"S1351324922000213_ref456","unstructured":"Yeh, K.C. (2003). Bilingual sentence alignment based on punctuation marks. In Proceedings of the ROCLING 2003 Student Workshop, pp. 303\u2013312. 
ROCLING stands for Conference on Computational Linguistics and Speech Processing."},{"key":"S1351324922000213_ref331","volume-title":"Center for the Study of Language (CSLI)","volume":"18","author":"Nunberg","year":"1990"},{"key":"S1351324922000213_ref57","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00041"},{"key":"S1351324922000213_ref132","doi-asserted-by":"publisher","DOI":"10.1023\/A:1023553801115"},{"key":"S1351324922000213_ref237","first-page":"350","article-title":"An interpretation of lemmatization and stemming in natural language processing","volume":"22","author":"Khyani","year":"2020","journal-title":"Journal of University of Shanghai for Science and Technology"},{"key":"S1351324922000213_ref251","doi-asserted-by":"crossref","unstructured":"Kwak, H. , Lee, C. , Park, H. and Moon, S. (2010). What is Twitter, a social network or a news media? In Proceedings of the Nineteenth International Conference on World Wide Web. ACM, pp. 591\u2013600.","DOI":"10.1145\/1772690.1772751"},{"key":"S1351324922000213_ref414","doi-asserted-by":"publisher","DOI":"10.4018\/978-1-59904-373-9.ch001"},{"key":"S1351324922000213_ref288","doi-asserted-by":"publisher","DOI":"10.3233\/IDA-150390"},{"key":"S1351324922000213_ref3","doi-asserted-by":"publisher","DOI":"10.1109\/KBEI.2017.8325018"},{"key":"S1351324922000213_ref427","doi-asserted-by":"publisher","DOI":"10.1201\/9780429326813"},{"key":"S1351324922000213_ref59","first-page":"1137","article-title":"A neural probabilistic language model","volume":"3","author":"Bengio","year":"2003","journal-title":"Journal of Machine Learning Research"},{"key":"S1351324922000213_ref115","doi-asserted-by":"publisher","DOI":"10.1093\/database\/bay066"},{"key":"S1351324922000213_ref68","volume-title":"Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit","author":"Bird","year":"2009"},{"key":"S1351324922000213_ref170","unstructured":"Ghosh, S. , Johansson, R. , Riccardi, G. and Tonelli, S. (2011). 
Shallow discourse parsing with conditional random fields. In Proceedings of 5th International Joint Conference on Natural Language Processing, pp. 1071\u20131079."},{"key":"S1351324922000213_ref347","volume-title":"Python 3 Text Processing with NLTK 3 Cookbook","author":"Perkins","year":"2014"},{"key":"S1351324922000213_ref348","doi-asserted-by":"crossref","unstructured":"Peters, M.E. , Neumann, M. , Iyyer, M. , Gardner, M. , Clark, C. , Lee, K. and Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.","DOI":"10.18653\/v1\/N18-1202"},{"key":"S1351324922000213_ref160","doi-asserted-by":"publisher","DOI":"10.1145\/378881.378888"},{"key":"S1351324922000213_ref56","doi-asserted-by":"publisher","DOI":"10.33011\/lilt.v6i.1239"},{"key":"S1351324922000213_ref385","unstructured":"Samad, M.D. , Khounviengxay, N.D. and Witherow, M.A. (2020). Effect of text processing steps on Twitter sentiment classification using word embedding. arXiv preprint arXiv:2007.13027."},{"key":"S1351324922000213_ref111","unstructured":"Cohen, K.B. , Roeder, C. , Baumgartner, W. A. Jr , Hunter, L.E. and Verspoor, K. (2010). Test suite design for ontology concept recognition systems, pp. 441\u2013446."},{"key":"S1351324922000213_ref388","first-page":"e26752","article-title":"The New York Times annotated corpus","volume":"6","author":"Sandhaus","year":"2008","journal-title":"Linguistic Data Consortium, Philadelphia"},{"key":"S1351324922000213_ref329","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1275"},{"key":"S1351324922000213_ref83","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W15-0705"},{"key":"S1351324922000213_ref432","doi-asserted-by":"publisher","DOI":"10.1016\/j.csl.2015.09.004"},{"key":"S1351324922000213_ref144","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-1047"},{"key":"S1351324922000213_ref167","doi-asserted-by":"publisher","DOI":"10.3386\/w23276"},{"key":"S1351324922000213_ref408","unstructured":"Sweeney, K. (2020). 
Unsupervised Machine Learning for Conference Scheduling: A Natural Language Processing Approach based on Latent Dirichlet Allocation. Master\u2019s Thesis, NHH Norwegian School of Economics, Bergen, Norway."},{"key":"S1351324922000213_ref272","doi-asserted-by":"publisher","DOI":"10.1142\/11116"},{"key":"S1351324922000213_ref225","first-page":"22","article-title":"An evaluation of preprocessing techniques for text classification","volume":"16","author":"Kadhim","year":"2018","journal-title":"International Journal of Computer Science and Information Security (IJCSIS)"},{"key":"S1351324922000213_ref407","doi-asserted-by":"publisher","DOI":"10.1002\/asi.21630"},{"key":"S1351324922000213_ref199","doi-asserted-by":"crossref","unstructured":"Hickman, L. , Thapa, S. , Tay, L. , Cao, M. and Srinivasan, P. (2020). Text preprocessing for text mining in organizational research: Review and recommendations. In Organizational Research Methods, pp. 1\u201358.","DOI":"10.1177\/1094428120971683"},{"key":"S1351324922000213_ref18","doi-asserted-by":"publisher","DOI":"10.1109\/WI-IAT.2015.90"},{"key":"S1351324922000213_ref193","doi-asserted-by":"publisher","DOI":"10.1016\/j.ijinfomgt.2013.01.001"},{"key":"S1351324922000213_ref378","unstructured":"Robinson, D. (2018). gutenbergr: Download and Process Public Domain Works from Project Gutenberg. $\\mathsf{R}$ package version 0.1.4. Available from: https:\/\/CRAN.R-project.org\/package=gutenbergr."},{"key":"S1351324922000213_ref446","doi-asserted-by":"publisher","DOI":"10.1016\/j.jksuci.2020.05.006"},{"key":"S1351324922000213_ref93","doi-asserted-by":"publisher","DOI":"10.1145\/1316874.1316894"},{"key":"S1351324922000213_ref418","doi-asserted-by":"publisher","DOI":"10.1016\/j.csl.2014.10.002"},{"key":"S1351324922000213_ref277","unstructured":"Lo, R.T.-W. , He, B. and Ounis, I. (2005). Automatically building a stopword list for an information retrieval system. 
In Journal on Digital Information Management: Special Issue on the 5th Dutch-Belgian Information Retrieval Workshop (DIR), vol. 5, pp. 17\u201324."},{"key":"S1351324922000213_ref20","unstructured":"Allahyari, M. , Pouriyeh, S. , Assefi, M. , Safaei, S. , Trippe, E.D. , Gutierrez, J.B. and Kochut, K. (2017). A brief survey of text mining: Classification, clustering and extraction techniques. arXiv preprint arXiv:1707.02919."},{"key":"S1351324922000213_ref439","doi-asserted-by":"publisher","DOI":"10.1016\/j.jbi.2018.09.008"},{"key":"S1351324922000213_ref139","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-67008-9_31"},{"key":"S1351324922000213_ref238","doi-asserted-by":"publisher","DOI":"10.4324\/9780429024221"},{"key":"S1351324922000213_ref241","doi-asserted-by":"publisher","DOI":"10.3115\/1626394.1626412"},{"key":"S1351324922000213_ref282","doi-asserted-by":"publisher","DOI":"10.1515\/9783110211429"},{"key":"S1351324922000213_ref46","doi-asserted-by":"publisher","DOI":"10.1186\/1471-2105-12-S3-S1"},{"key":"S1351324922000213_ref222","doi-asserted-by":"crossref","unstructured":"Joulin, A. , Grave, E. , Bojanowski, P. and Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.","DOI":"10.18653\/v1\/E17-2068"},{"key":"S1351324922000213_ref33","doi-asserted-by":"publisher","DOI":"10.3115\/1654576.1654588"},{"key":"S1351324922000213_ref5","unstructured":"Acree, B.D. (2016). Deep Learning and Ideological Rhetoric. PhD Dissertation, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA."},{"key":"S1351324922000213_ref360","doi-asserted-by":"publisher","DOI":"10.1109\/ICoICT.2018.8528762"},{"key":"S1351324922000213_ref304","unstructured":"Mielke, S.S. (2016). Language diversity in ACL 2004\u20132016. ACL stands for the annual meeting of the Association for Computational Linguistics. 
Available from: https:\/\/sjmielke.com\/acl-language-diversity.htm."},{"key":"S1351324922000213_ref178","doi-asserted-by":"crossref","unstructured":"Grana, J. , Alonso, M.A. and Vilares, M. (2002). A common solution for tokenization and part-of-speech tagging. In International Conference on Text, Speech and Dialogue, vol. 2448. Springer, pp. 3\u201310.","DOI":"10.1007\/3-540-46154-X_1"},{"key":"S1351324922000213_ref280","doi-asserted-by":"crossref","unstructured":"Lourentzou, I. , Manghnani, K. and Zhai, C. (2019). Adapting sequence to sequence models for text normalization in social media. In Proceedings of the International AAAI Conference on Web and Social Media, vol. 13, pp. 335\u2013345. AAAI stands for Association for the Advancement of Artificial Intelligence.","DOI":"10.1609\/icwsm.v13i01.3234"},{"key":"S1351324922000213_ref351","unstructured":"Poddar, L. (2016). Multilingual multiword expressions. arXiv preprint arXiv:1612.00246."},{"key":"S1351324922000213_ref211","unstructured":"Jean-Baptiste, E. (1916). Gammes st\u00e9nographiques. Technical report, Institut St\u00e9nographique de France, Paris."},{"key":"S1351324922000213_ref67","unstructured":"Bi, Y. (2016). Scheduling Optimization with LDA and Greedy Algorithm. Master\u2019s Thesis, Duke University, Durham NC, United States. LDA stands for latent Dirichlet allocation."},{"key":"S1351324922000213_ref412","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-3323"},{"key":"S1351324922000213_ref159","volume-title":"Improving Survey Questions: Design and Evaluation","volume":"38","author":"Fowler","year":"1995"},{"key":"S1351324922000213_ref100","doi-asserted-by":"publisher","DOI":"10.29042\/2018-3764-3768"},{"key":"S1351324922000213_ref165","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1213"},{"key":"S1351324922000213_ref168","doi-asserted-by":"publisher","DOI":"10.1038\/s42256-019-0112-6"},{"key":"S1351324922000213_ref297","unstructured":"McNamee, P. and Mayfield, J. (2007). 
N-gram morphemes for retrieval. In CLEF (Working Notes). CLEF stands for the Cross-Language Evaluation Forum workshop."},{"key":"S1351324922000213_ref64","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1126"},{"key":"S1351324922000213_ref49","doi-asserted-by":"publisher","DOI":"10.1007\/s00521-020-05211-z"},{"key":"S1351324922000213_ref161","first-page":"114","article-title":"Rating the rating scales","volume":"9","author":"Friedman","year":"1999","journal-title":"Journal of Marketing Management"},{"key":"S1351324922000213_ref403","doi-asserted-by":"crossref","unstructured":"S\u00f8gaard, A. , de Lhoneux, M. and Augenstein, I. (2018). Nightmare at test time: How punctuation prevents parsers from generalizing. arXiv preprint arXiv:1809.00070.","DOI":"10.18653\/v1\/W18-5404"},{"key":"S1351324922000213_ref106","doi-asserted-by":"publisher","DOI":"10.1016\/j.sbspro.2011.10.577"},{"key":"S1351324922000213_ref19","doi-asserted-by":"publisher","DOI":"10.7763\/IJIET.2012.V2.149"},{"key":"S1351324922000213_ref270","doi-asserted-by":"crossref","unstructured":"Lin, J. , Nogueira, R. and Yates, A. (2020). Pretrained transformers for text ranking: BERT and beyond. 
arXiv preprint arXiv:2010.06467.","DOI":"10.2200\/S01123ED1V01Y202108HLT053"},{"key":"S1351324922000213_ref392","volume-title":"Questions and Answers in Attitude Surveys: Experiments on Question Form, Wording, and Context","author":"Schuman","year":"1996"},{"key":"S1351324922000213_ref399","volume-title":"Collins English Dictionary","author":"Sinclair","year":"2018"},{"key":"S1351324922000213_ref166","first-page":"1","article-title":"Big data preprocessing: Methods and prospects","volume":"1","author":"Garc\u00eda","year":"2016","journal-title":"Big Data Analytics"},{"key":"S1351324922000213_ref224","article-title":"A study on NLP applications and ambiguity problems","volume":"96","author":"Jusoh","year":"2018","journal-title":"Journal of Theoretical and Applied Information Technology"},{"key":"S1351324922000213_ref367","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9781139058452"},{"key":"S1351324922000213_ref215","doi-asserted-by":"publisher","DOI":"10.1007\/s10791-007-9027-7"},{"key":"S1351324922000213_ref157","unstructured":"Fosler-Lussier, E. (1998). Markov models and hidden Markov models: A brief tutorial. International Computer Science Institute."},{"key":"S1351324922000213_ref138","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-0404"},{"key":"S1351324922000213_ref208","doi-asserted-by":"publisher","DOI":"10.1201\/9781003093459"},{"key":"S1351324922000213_ref50","doi-asserted-by":"publisher","DOI":"10.1007\/s00146-014-0549-4"},{"key":"S1351324922000213_ref189","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2013.05.005"},{"key":"S1351324922000213_ref341","unstructured":"Patro, G.K. , Chakraborty, A. , Ganguly, N. and Gummadi, K.P. (2020). On fair virtual conference scheduling: Achieving equitable participant and speaker satisfaction. arXiv preprint arXiv:2010.14624."},{"key":"S1351324922000213_ref338","doi-asserted-by":"publisher","DOI":"10.1145\/3351095.3372843"},{"key":"S1351324922000213_ref339","unstructured":"Pareti, S. , O\u2019Keefe, T. 
, Konstas, I. , Curran, J.R. and Koprinska, I. (2013). Automatically detecting and attributing indirect quotations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 989\u2013999."},{"key":"S1351324922000213_ref43","first-page":"615","volume-title":"International Conference on Intelligent Computing","author":"Bao","year":"2014"},{"key":"S1351324922000213_ref37","doi-asserted-by":"publisher","DOI":"10.1109\/INISTA.2011.5946149"},{"key":"S1351324922000213_ref271","doi-asserted-by":"publisher","DOI":"10.3115\/1075096.1075116"},{"key":"S1351324922000213_ref285","unstructured":"Luo, J. , Tinsley, J. and Lepage, Y. (2013). Exploiting parallel corpus for handling out-of-vocabulary words. In Proceedings of the 27th Pacific Asia Conference on Language, Information, and Computation (PACLIC 27), pp. 399\u2013408."},{"key":"S1351324922000213_ref186","doi-asserted-by":"publisher","DOI":"10.1007\/s10462-009-9135-4"},{"key":"S1351324922000213_ref257","unstructured":"Lambert, P. and Banchs, R.E. (2006). Grouping multi-word expressions according to part-of-speech in statistical machine translation. In Proceedings of the Workshop on Multi-Word-Expressions in a Multilingual Context."},{"key":"S1351324922000213_ref450","doi-asserted-by":"publisher","DOI":"10.1155\/2020\/1958149"},{"key":"S1351324922000213_ref133","unstructured":"Dressel, F. (2016). Distribution of n-grams in English text corpus. Available from: http:\/\/rpubs.com\/fdd\/187848."},{"key":"S1351324922000213_ref124","first-page":"238","article-title":"Automatic keyword extraction from any text document using N-gram rigid collocation","volume":"3","author":"Das","year":"2013","journal-title":"International Journal of Soft Computing and Engineering (IJSCE)"},{"key":"S1351324922000213_ref292","doi-asserted-by":"publisher","DOI":"10.7551\/mitpress\/6754.001.0001"},{"key":"S1351324922000213_ref356","unstructured":"Potts, C. (n.d). Sentiment symposium tutorial: Stemming. 
Available from: http:\/\/sentiment.christopherpotts.net\/stemming.html [last accessed August 2019]."},{"key":"S1351324922000213_ref82","doi-asserted-by":"publisher","DOI":"10.1016\/S0169-7552(97)00031-7"},{"key":"S1351324922000213_ref51","unstructured":"Battenberg, E. (2012). Ratings prediction using linear regression on text reviews. Technical report, Berkeley Institute of Design (BID), University of California \u2013 Berkeley, Berkeley CA, United States."},{"key":"S1351324922000213_ref28","doi-asserted-by":"publisher","DOI":"10.5220\/0005194303530360"},{"key":"S1351324922000213_ref141","unstructured":"El-Khair, I.A. (2017). Effects of stop words elimination for Arabic information retrieval: A comparative study. arXiv preprint arXiv:1702.01925."},{"key":"S1351324922000213_ref311","first-page":"1","article-title":"Recent advances in processing negation","author":"Morante","year":"2021","journal-title":"Natural Language Engineering"},{"key":"S1351324922000213_ref425","unstructured":"Tsarfaty, R. , Seddah, D. , Goldberg, Y. , K\u00fcbler, S. , Candito, M. , Foster, J. , Versley, Y. , Rehbein, I. and Tounsi, L. (2010). Statistical parsing of morphologically rich languages (SPMRL): What, how and whither. In Proceedings of the NAACL-HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages. Association for Computational Linguistics, pp. 1\u201312. NAACL-HLT stands for the North American Chapter of the Association for Computational Linguistics: Human Language Technologies."},{"key":"S1351324922000213_ref383","first-page":"1","volume-title":"International Conference on Intelligent Text Processing and Computational Linguistics","author":"Sag","year":"2002"},{"key":"S1351324922000213_ref174","unstructured":"Goodman, E.L. , Zimmerman, C. and Hudson, C. (2020). Packet2vec: Utilizing word2vec for feature extraction in packet data. arXiv preprint arXiv:2004.14477."},{"key":"S1351324922000213_ref300","unstructured":"Metzler, H. , Baginski, H. 
, Niederkrotenthaler, T. and Garcia, D. (2021). Detecting potentially harmful and protective suicide-related content on Twitter: A machine learning approach. arXiv preprint arXiv:2112.04796."},{"key":"S1351324922000213_ref216","doi-asserted-by":"publisher","DOI":"10.1109\/BIBM.2015.7359756"},{"key":"S1351324922000213_ref261","doi-asserted-by":"crossref","first-page":"1234","DOI":"10.1093\/bioinformatics\/btz682","article-title":"BioBERT: A pre-trained biomedical language representation model for biomedical text mining","volume":"36","author":"Lee","year":"2020","journal-title":"Bioinformatics"},{"key":"S1351324922000213_ref437","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2007.86"},{"key":"S1351324922000213_ref328","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W18-2502"},{"key":"S1351324922000213_ref27","doi-asserted-by":"publisher","DOI":"10.1109\/ICIC54025.2021.9632884"},{"key":"S1351324922000213_ref116","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0129031"},{"key":"S1351324922000213_ref58","volume-title":"Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning","author":"Bengfort","year":"2018"},{"key":"S1351324922000213_ref121","volume-title":"The Penguin Dictionary of Language","author":"Crystal","year":"1999"},{"key":"S1351324922000213_ref287","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-78646-7_22"},{"key":"S1351324922000213_ref470","doi-asserted-by":"publisher","DOI":"10.1016\/j.eng.2019.12.014"},{"key":"S1351324922000213_ref196","doi-asserted-by":"publisher","DOI":"10.1109\/HICSS.2014.231"},{"key":"S1351324922000213_ref298","doi-asserted-by":"crossref","unstructured":"Mendoza, M. , Poblete, B. and Castillo, C. (2010). Twitter under crisis: Can we trust what we RT? In Proceedings of the First Workshop on Social Media Analytics. ACM, pp. 
71\u201379.","DOI":"10.1145\/1964858.1964869"},{"key":"S1351324922000213_ref173","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-00810-9_15"},{"key":"S1351324922000213_ref262","doi-asserted-by":"publisher","DOI":"10.1146\/annurev-statistics-060116-054104"},{"key":"S1351324922000213_ref402","doi-asserted-by":"publisher","DOI":"10.1007\/s10936-011-9171-5"},{"key":"S1351324922000213_ref461","doi-asserted-by":"publisher","DOI":"10.3390\/app10217640"},{"key":"S1351324922000213_ref156","unstructured":"Forst, M. and Kaplan, R.M. (2006). The importance of precise tokenizing for deep grammars. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC\u201906)."},{"key":"S1351324922000213_ref228","doi-asserted-by":"crossref","unstructured":"Kalra, V. and Aggarwal, R. (2018). Importance of text data preprocessing & implementation in RapidMiner. In Proceedings of the First International Conference on Information Technology and Knowledge Management (ICITKM), vol. 14, pp. 
71\u201375.","DOI":"10.15439\/2017KM46"}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324922000213","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,26]],"date-time":"2024-09-26T21:50:26Z","timestamp":1727387426000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324922000213\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,6,13]]},"references-count":475,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2023,5]]}},"alternative-id":["S1351324922000213"],"URL":"https:\/\/doi.org\/10.1017\/s1351324922000213","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,6,13]]},"assertion":[{"value":"\u00a9 The Author(s), 2022. Published by Cambridge University Press","name":"copyright","label":"Copyright","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}}]}}