{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,5,13]],"date-time":"2025-05-13T22:00:05Z","timestamp":1747173605805,"version":"3.40.5"},"reference-count":76,"publisher":"Cambridge University Press (CUP)","issue":"5","license":[{"start":{"date-parts":[[2019,9,9]],"date-time":"2019-09-09T00:00:00Z","timestamp":1567987200000},"content-version":"unspecified","delay-in-days":8,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2019,9]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Part-of-speech (PoS) tagging of non-standard language with models developed for standard language is known to suffer from a significant decrease in accuracy. Two methods are typically used to improve it: word normalisation, which decreases the out-of-vocabulary rate of the PoS tagger, and domain adaptation where the tagger is made aware of the non-standard language variation, either through supervision via non-standard data being added to the tagger\u2019s training set, or via distributional information calculated from raw texts. This paper investigates the two approaches, normalisation and domain adaptation, on carefully constructed data sets encompassing historical and user-generated Slovene texts, in particular focusing on the amount of labour necessary to produce the manually annotated data sets for each approach and comparing the resulting PoS accuracy. We give quantitative as well as qualitative analyses of the tagger performance in various settings, showing that on our data set closed and open class words exhibit significantly different behaviours, and that even small inconsistencies in the PoS tags in the data have an impact on the accuracy. 
We also show that to improve tagging accuracy, it is best to concentrate on obtaining manually annotated normalisation training data for short annotation campaigns, while manually producing in-domain training sets for PoS tagging is better when a more substantial annotation campaign can be undertaken. Finally, unsupervised adaptation via Brown clustering is similarly useful regardless of the size of the training data available, but improvements tend to be bigger when adaptation is performed via in-domain tagging data.<\/jats:p>","DOI":"10.1017\/s1351324919000366","type":"journal-article","created":{"date-parts":[[2019,9,9]],"date-time":"2019-09-09T11:13:56Z","timestamp":1568027636000},"page":"651-674","source":"Crossref","is-referenced-by-count":3,"title":["How to tag non-standard language: Normalisation versus domain adaptation for Slovene historical and user-generated texts"],"prefix":"10.1017","volume":"25","author":[{"given":"Katja","family":"Zupan","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7169-9152","authenticated-orcid":false,"given":"Nikola","family":"Ljube\u0161i\u0107","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1560-4099","authenticated-orcid":false,"given":"Toma\u017e","family":"Erjavec","sequence":"additional","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2019,9,9]]},"reference":[{"key":"S1351324919000366_ref6","unstructured":"Bollmann, M. , Dipper, S. , Krasselt, J. , and Petran, F. (2012). Manual and semi-automatic normalization of historical spelling-case studies from Early New High German. In KONVENS, pp. 
342\u2013350."},{"key":"S1351324919000366_ref26","first-page":"67","article-title":"JANES v0.4 : korpus slovenskih spletnih uporabni\u0161kih vsebin (JANES 04: a corpus of Slovene User Generated Content)","volume":"4","author":"Fi\u0161er","year":"2016","journal-title":"Sloven\u0161\u010dina 2.0"},{"key":"S1351324919000366_ref9","doi-asserted-by":"publisher","DOI":"10.3115\/974147.974178"},{"key":"S1351324919000366_ref17","unstructured":"Eisenstein, J. (2013). What to do about bad language on the Internet. In Proceedings of North American Chapter of the Association for Computational Linguistics (NAACL), pp. 359\u2013369."},{"key":"S1351324919000366_ref61","unstructured":"Rayson, P. , Archer, D. , Baron, A. , Culpeper, J. , and Smith, N. (2007). Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In Proceedings of the Corpus Linguistics Conference: CL 2007. UCREL."},{"key":"S1351324919000366_ref21","first-page":"1","article-title":"The IMP historical Slovene language resources","author":"Erjavec","year":"2015a","journal-title":"Language Resources and Evaluation"},{"key":"S1351324919000366_ref7","unstructured":"Bollmann, M. , Krasselt, J. , and Petran, F. (2012). Manual and semi-automatic normalization of historical spelling - Case studies from Early New High German. In Proceedings of KONVENS 2012 (LThist 2012 Workshop), pp. 342\u2013350."},{"key":"S1351324919000366_ref4","unstructured":"Bennett, P. , Durrell, M. , Scheible, S. , and Whitt, R.J. (2010). Annotating a historical corpus of German: A case study. In Proceedings of the LREC 2010 Workshop on Language Resource and Language Technology: Standards - State of the Art, Emerging Needs, and Future Developments, Paris, pp. 64\u201368."},{"key":"S1351324919000366_ref40","doi-asserted-by":"publisher","DOI":"10.3115\/1073445.1073462"},{"key":"S1351324919000366_ref48","unstructured":"Ljube\u0161i\u0107, N. , Klubi\u010dka, F. , Agi\u0107, \u017d. , and Jazbec, I.-P. (2016). 
New inflectional lexicons and training corpora for improved morphosyntactic annotation of Croatian and Serbian. In Tenth International Conference on Language Resources and Evaluation (LREC 2016)."},{"key":"S1351324919000366_ref37","doi-asserted-by":"crossref","unstructured":"Kim, Y. , Jernite, Y. , Sontag, D. , and Rush, A.M. (2016). Character-aware neural language models. In AAAI, pp. 2741\u20132749.","DOI":"10.1609\/aaai.v30i1.10362"},{"key":"S1351324919000366_ref3","doi-asserted-by":"crossref","first-page":"157","DOI":"10.21248\/jlcl.28.2013.172","article-title":"Optimierung des Stuttgart-T\u00fcbingen-Tagset f\u00fcr die linguistische Annotation von Korpora zur internetbasierten Kommunikation: Ph\u00e4nomene, Herausforderungen, Erweiterungsvorschl\u00e4ge","volume":"28","author":"Bartz","year":"2014","journal-title":"Journal for Language Technology and Computational Linguistics"},{"key":"S1351324919000366_ref54","unstructured":"Owoputi, O. , O\u2019Connor, B. , Dyer, C. , Gimpel, K. , Schneider, N. , and Smith, N.A. (2013). Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of NAACL-HLT, pp. 380\u2013390."},{"key":"S1351324919000366_ref10","first-page":"467","article-title":"Class-based n-gram models of natural language","volume":"18","author":"Brown","year":"1992","journal-title":"Computational Linguistics"},{"key":"S1351324919000366_ref1","unstructured":"Bahdanau, D. , Cho, K. , and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473."},{"key":"S1351324919000366_ref56","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-0605"},{"key":"S1351324919000366_ref39","doi-asserted-by":"publisher","DOI":"10.3115\/1557769.1557821"},{"key":"S1351324919000366_ref33","unstructured":"Han, B. and Baldwin, T. (2011). Lexical normalisation of short text messages: Makn sens a #twitter. 
In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, pp. 368\u2013378. Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from http:\/\/dl.acm.org\/citation.cfm?id=2002472.2002520"},{"key":"S1351324919000366_ref41","unstructured":"Krek, S. , Dobrovoljc, K. , Erjavec, T. , Mo\u017ee, S. , Ledinek, N. , and Holz, N. (2015). Training corpus ssj500k 1.4. Slovenian language resource repository CLARIN.SI. http:\/\/hdl.handle.net\/11356\/1052."},{"key":"S1351324919000366_ref66","unstructured":"Scherrer, Y. and Erjavec, T. (2016b). Modernising historical Slovene words. Natural Language Engineering, FirstView, 1\u201325. Retrieved from http:\/\/journals.cambridge.org\/article_S1351324915000236"},{"key":"S1351324919000366_ref59","unstructured":"Plank, B. , S\u00f8gaard, A. , and Goldberg, Y. (2016). Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. arXiv preprint arXiv:1604.05529."},{"key":"S1351324919000366_ref72","unstructured":"Vilar, D. , Peter, J.-T. , and Ney, H. (2007). Can we translate letters? In Proceedings of the Second Workshop on Statistical Machine Translation, pp. 33\u201339."},{"key":"S1351324919000366_ref20","unstructured":"Erjavec, T. (2014). Digital library and corpus of historical Slovene IMP 1.1. Slovenian language resource repository CLARIN.SI. http:\/\/hdl.handle.net\/11356\/1031"},{"key":"S1351324919000366_ref52","doi-asserted-by":"publisher","DOI":"10.1007\/BF02295996"},{"key":"S1351324919000366_ref5","unstructured":"Bollmann, M. (2013). POS tagging for historical texts with sparse training data. In LAW@ ACL, pp. 11\u201318."},{"key":"S1351324919000366_ref30","unstructured":"Greene, B. and Rubin, G. (1971). Automatic Grammatical Tagging of English. Department of Linguistics, Brown University. 
Retrieved from https:\/\/books.google.si\/books?id=VznTygAACAAJ"},{"key":"S1351324919000366_ref44","unstructured":"Ljube\u0161i\u0107, N. and Erjavec, T. (2016). Corpus vs. lexicon supervision in morphosyntactic tagging: the case of Slovene. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France: European Language Resources Association (ELRA)."},{"key":"S1351324919000366_ref11","doi-asserted-by":"publisher","DOI":"10.3115\/1118693.1118694"},{"key":"S1351324919000366_ref12","unstructured":"De Clercq, O. , Schulz, S. , Desmet, B. , Lefever, E. , and Hoste, V. (2013). Normalization of Dutch user-generated content. In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, pp. 179\u2013188."},{"key":"S1351324919000366_ref15","unstructured":"Dipper, S. (2010). POS-tagging of historical language data: First experiments. In KONVENS, pp. 117\u2013121."},{"key":"S1351324919000366_ref16","unstructured":"Dobrovoljc, K. , Krek, S. , Holozan, P. , Erjavec, T. , and Romih, M. (2015). Morphological lexicon Sloleks 1.2. Slovenian language resource repository CLARIN.SI. http:\/\/hdl.handle.net\/11356\/1039."},{"key":"S1351324919000366_ref18","unstructured":"Erjavec, T. (2011). Automatic linguistic annotation of historical language: ToTrTaLe and XIX century Slovene. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pp. 33\u201338."},{"key":"S1351324919000366_ref19","doi-asserted-by":"publisher","DOI":"10.1007\/s10579-011-9174-8"},{"key":"S1351324919000366_ref23","unstructured":"Erjavec, T. , Fi\u0161er, D. , \u010cibej, J. , Arhar Holdt, \u0160. , Ljube\u0161i\u0107, N. , and Zupan, K. (2017). CMC training corpus Janes-Tag 2.0. Slovenian language resource repository CLARIN.SI. http:\/\/hdl.handle.net\/11356\/1123."},{"key":"S1351324919000366_ref22","unstructured":"Erjavec, T. (2015b). 
Reference corpus of historical Slovene goo300k 1.2. Slovenian language resource repository CLARIN.SI. http:\/\/hdl.handle.net\/11356\/1025."},{"key":"S1351324919000366_ref57","unstructured":"Pettersson, E. , Megyesi, B. , and Tiedemann, J. 2013. An SMT approach to automatic annotation of historical text. In Proceedings of the Workshop on Computational Historical Linguistics at NODALIDA 2013; May 22\u201324; 2013, Oslo, Norway: Nealt Proceedings Series, Vol. 18, pp. 54\u201369."},{"key":"S1351324919000366_ref28","unstructured":"Gimpel, K. , Schneider, N. , O\u2019Connor, B. , Das, D. , Mills, D. , Eisenstein, J. , \u2026 Smith, N.A. (2011). Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers, Vol. 2, pp. 42\u201347."},{"key":"S1351324919000366_ref29","doi-asserted-by":"publisher","DOI":"10.5120\/11638-7118"},{"key":"S1351324919000366_ref31","unstructured":"Gr\u010dar, M. , Krek, S. , and Dobrovoljc, K. (2012). Obeliks: statisti\u010dni oblikoskladenjski ozna\u010devalnik in lematizator za slovenski jezik (obeliks: a statistical morphosyntactic tagger and lemmatiser for Slovene). In Zbornik Osme konference Jezikovne tehnologije, Ljubljana, Slovenia."},{"key":"S1351324919000366_ref34","doi-asserted-by":"crossref","first-page":"65","DOI":"10.21248\/jlcl.26.2011.147","article-title":"From old texts to modern spellings: An experiment in automatic normalisation","volume":"26","author":"Hendrickx","year":"2011","journal-title":"JLCL"},{"key":"S1351324919000366_ref35","first-page":"166","article-title":"Effectiveness of domain adaptation approaches for social media POS tagging","author":"Horsmann","year":"2015","journal-title":"CLiC it"},{"key":"S1351324919000366_ref36","unstructured":"Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In EMNLP, Vol. 3, p. 
413."},{"key":"S1351324919000366_ref38","unstructured":"Koehn, P. (2017). Neural machine translation. CoRR, abs\/1709.07809. Retrieved from http:\/\/arxiv.org\/abs\/1709.07809"},{"key":"S1351324919000366_ref13","unstructured":"Derczynski, L. , Chester, S. , and B\u00f8gh, K.S. (2015). Tune your Brown clustering, please. In International Conference Recent Advances in Natural Language Processing, RANLP, Vol. 2015, pp. 110\u2013117."},{"key":"S1351324919000366_ref42","doi-asserted-by":"publisher","DOI":"10.1002\/9781119145554"},{"key":"S1351324919000366_ref45","unstructured":"Ljube\u0161i\u0107, N. , Erjavec, T. , and Fi\u0161er, D. (2016). Corpus-based diacritic restoration for South Slavic languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France: European Language Resources Association (ELRA)."},{"key":"S1351324919000366_ref32","doi-asserted-by":"publisher","DOI":"10.3115\/1557769.1557830"},{"key":"S1351324919000366_ref46","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-23538-2_50"},{"key":"S1351324919000366_ref47","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1410"},{"key":"S1351324919000366_ref49","unstructured":"Ljube\u0161i\u0107, N. , Zupan, K. , Fi\u0161er, D. , and Erjavec, T. (2016). Normalising Slovene data: historical texts vs. user-generated content. Bochumer Linguistische Arbeitsberichte, 146\u2013155."},{"key":"S1351324919000366_ref51","unstructured":"Matthews, D. (2007). Machine transliteration of proper names. Master\u2019s Thesis, University of Edinburgh, Edinburgh."},{"key":"S1351324919000366_ref25","unstructured":"Etxeberria, I. , Alegria, I. , Uria, L. , and Hulden, M. (2016). Evaluating the Noisy Channel Model for the Normalization of Historical Texts: Basque, Spanish and Slovene. 
In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portoro\u017e, Slovenia: European Language Resources Association (ELRA)."},{"key":"S1351324919000366_ref53","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-71496-5_5"},{"key":"S1351324919000366_ref55","unstructured":"Pettersson, E. , Megyesi, B. , and Nivre, J. (2013). Normalisation of historical text using context-sensitive weighted Levenshtein distance and compound splitting. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013); May 22\u201324, 2013; Oslo University; Norway. nealt Proceedings Series, Vol. 16, pp. 163\u2013179."},{"volume-title":"Synthesis Lectures on Human Language Technologies","year":"2012","author":"Piotrowski","key":"S1351324919000366_ref58"},{"key":"S1351324919000366_ref60","unstructured":"Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Conference on Empirical Methods in Natural Language Processing, pp. 133\u2013142. Retrieved from http:\/\/www.aclweb.org\/anthology\/W96-0213"},{"key":"S1351324919000366_ref62","unstructured":"Ritter, A. , Clark, S. , and Etzioni, O. (2011). Named entity recognition in tweets: an experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1524\u20131534."},{"key":"S1351324919000366_ref63","unstructured":"Scheible, S. , Whitt, R.J. , Durrell, M. , and Bennett, P. (2011). Evaluating an \u2018off-the-shelf\u2019 POS-tagger on Early Modern German text. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pp. 19\u201323."},{"key":"S1351324919000366_ref64","unstructured":"Scheible, S. , Whitt, R.J. , Durrell, M. , and Bennett, P. (2012). Gatetogermanc: A GATE-based annotation pipeline for historical German. In LREC, pp. 
3611\u20133617."},{"key":"S1351324919000366_ref65","doi-asserted-by":"publisher","DOI":"10.1017\/S1351324915000236"},{"key":"S1351324919000366_ref50","unstructured":"Lusetti, M. , Ruzsics, T. , G\u00f6hring, A. , Samard\u017ei\u0107, T. , and Stark, E. (2018). Encoder-decoder methods for text normalization. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018) (pp. 18\u201328). Santa Fe, New Mexico, USA: Association for Computational Linguistics. Retrieved from https:\/\/www.aclweb.org\/anthology\/W18-3902"},{"key":"S1351324919000366_ref67","first-page":"248","article-title":"Automatic normalisation of the Swiss German ArchiMob corpus using character-level machine translation","author":"Scherrer","year":"2016","journal-title":"Bochumer Linguistische Arbeitsberichte"},{"key":"S1351324919000366_ref68","unstructured":"Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK."},{"key":"S1351324919000366_ref2","unstructured":"Baron, A. , and Rayson, P. (2008). Vard2: A tool for dealing with spelling variation in historical corpora. In Postgraduate Conference in Corpus Linguistics."},{"key":"S1351324919000366_ref69","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P14-2043"},{"key":"S1351324919000366_ref74","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/N15-1069"},{"key":"S1351324919000366_ref73","unstructured":"Yang, Y. and Eisenstein, J. (2014). Unsupervised domain adaptation with feature embeddings. arXiv preprint arXiv:1412.4385."},{"key":"S1351324919000366_ref75","unstructured":"Yang, Y. and Eisenstein, J. (2016). Part-of-speech tagging for historical English. arXiv preprint arXiv:1603.03144."},{"key":"S1351324919000366_ref76","unstructured":"Zampieri, M. , Malmasi, S. , Nakov, P. , Ali, A. , Shon, S. , Glass, J. , \u2026 Jain, M. (2018). 
Language identification and morphosyntactic tagging: The second VarDial evaluation campaign. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pp. 1\u201317, Santa Fe, New Mexico, USA: Association for Computational Linguistics. Retrieved from https:\/\/www.aclweb.org\/anthology\/W18-3901"},{"key":"S1351324919000366_ref8","first-page":"191","article-title":"An efficient memory-based morphosyntactic tagger and parser for Dutch","volume":"7","author":"Bosch, Van Den","year":"2007","journal-title":"LOT Occasional Series"},{"key":"S1351324919000366_ref70","unstructured":"TEI Consortium (2017). TEI P5: Guidelines for electronic text encoding and interchange. TEI Consortium. Retrieved from http:\/\/www.tei-c.org\/Guidelines\/P5\/"},{"key":"S1351324919000366_ref24","doi-asserted-by":"publisher","DOI":"10.1017\/S1351324918000505"},{"key":"S1351324919000366_ref27","unstructured":"Foster, J. , \u00c7etinoglu, \u00d6. , Wagner, J. , Le Roux, J. , \u2026 Van Genabith, J. (2011). # hardtoparse: POS tagging and parsing the twitterverse. In AAAI 2011 Workshop on Analyzing Microtext, pp. 20\u201325."},{"key":"S1351324919000366_ref71","first-page":"53","article-title":"The CLIN27 shared task : Translating historical text to contemporary language for improving automatic linguistic annotation","volume":"7","author":"Tjong Kim Sang","year":"2017","journal-title":"Computational Linguistics in the Netherlands Journal"},{"key":"S1351324919000366_ref14","unstructured":"Derczynski, L. , Ritter, A. , Clark, S. , and Bontcheva, K. (2013). Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In RANLP, pp. 198\u2013206."},{"key":"S1351324919000366_ref43","unstructured":"Ling, W. , Trancoso, I. , Dyer, C. , and Black, A.W. (2015). Character-based neural machine translation. 
arXiv preprint arXiv:1511.04586."}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324919000366","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,9,20]],"date-time":"2023-09-20T01:15:30Z","timestamp":1695172530000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324919000366\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,9]]},"references-count":76,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2019,9]]}},"alternative-id":["S1351324919000366"],"URL":"https:\/\/doi.org\/10.1017\/s1351324919000366","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"type":"print","value":"1351-3249"},{"type":"electronic","value":"1469-8110"}],"subject":[],"published":{"date-parts":[[2019,9]]}}}