{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,3,8]],"date-time":"2024-03-08T02:01:18Z","timestamp":1709863278444},"reference-count":73,"publisher":"Cambridge University Press (CUP)","issue":"5","license":[{"start":{"date-parts":[[2019,7,31]],"date-time":"2019-07-31T00:00:00Z","timestamp":1564531200000},"content-version":"unspecified","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2019,9]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>This article describes an unsupervised language model (LM) adaptation approach that can be used to enhance the performance of language identification methods. The approach is applied to a current version of the HeLI language identification method, which is now called HeLI 2.0. We describe the HeLI 2.0 method in detail. The resulting system is evaluated using the datasets from the German dialect identification and Indo-Aryan language identification shared tasks of the VarDial workshops 2017 and 2018. The new approach with LM adaptation provides considerably higher F1-scores than the basic HeLI or HeLI 2.0 methods or the other systems which participated in the shared tasks. 
The results indicate that unsupervised LM adaptation should be considered as an option in all language identification tasks, especially in those where encountering out-of-domain data is likely.<\/jats:p>","DOI":"10.1017\/s135132491900038x","type":"journal-article","created":{"date-parts":[[2019,7,31]],"date-time":"2019-07-31T02:34:01Z","timestamp":1564540441000},"page":"561-583","source":"Crossref","is-referenced-by-count":8,"title":["Language model adaptation for language and dialect identification of text"],"prefix":"10.1017","volume":"25","author":[{"given":"T.","family":"Jauhiainen","sequence":"first","affiliation":[]},{"given":"K.","family":"Lind\u00e9n","sequence":"additional","affiliation":[]},{"given":"H.","family":"Jauhiainen","sequence":"additional","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2019,7,31]]},"reference":[{"key":"S135132491900038X_ref71","unstructured":"Zavaliagkos G. and Colthurst T. (1998). Utilizing untranscribed training data to improve performance. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, pp. 301\u2013305."},{"key":"S135132491900038X_ref70","unstructured":"Zampieri M. , Tan L. , Ljube\u0161i\u0107 N. , Tiedemann J. and Nakov P. (2015). Overview of the DSL shared task 2015. In Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial), Hissar, Bulgaria, pp. 1\u20139."},{"key":"S135132491900038X_ref65","unstructured":"Wu N. , DeMattos E. , So K.H. , Chen P.-z. and \u00c7\u00f6ltekin \u00c7. (2019). Language discrimination and transfer learning for similar languages: experiments with feature combinations and adaptation. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Minneapolis, USA."},{"key":"S135132491900038X_ref64","unstructured":"Vatanen T. , V\u00e4yrynen J.J. and Virpioja S. (2010). Language identification of short text segments with N-gram models. 
In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), Valletta, Malta, pp. 3423\u20133430."},{"key":"S135132491900038X_ref63","doi-asserted-by":"publisher","DOI":"10.1006\/brln.2001.2556"},{"key":"S135132491900038X_ref62","unstructured":"Tiedemann J. and Ljube\u0161i\u0107 N. (2012). Efficient discrimination between closely related languages. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), Mumbai, India, pp. 2619\u20132634."},{"key":"S135132491900038X_ref61","unstructured":"Tan L. , Zampieri M. , Ljube\u0161i\u0107 N. and Tiedemann J. (2014). Merging comparable data sources for the discrimination of similar languages: the DSL corpus collection. In Proceedings of the 7th Workshop on Building and Using Comparable Corpora, Reykjavik, Iceland, pp. 11\u201315."},{"key":"S135132491900038X_ref57","unstructured":"Samard\u017ei\u0107 T. , Scherrer Y. and Glaser E. (2016). ArchiMob\u2013a corpus of spoken Swiss German. In Proceedings of the Language Resources and Evaluation (LREC), Portoroz, Slovenia, pp. 4061\u20134066."},{"key":"S135132491900038X_ref54","first-page":"37","article-title":"Multiple discriminant analysis in linguistic problems","volume":"4","author":"Mustonen","year":"1965","journal-title":"Statistical Methods in Linguistics"},{"key":"S135132491900038X_ref53","doi-asserted-by":"publisher","DOI":"10.1080\/09296170500500694"},{"key":"S135132491900038X_ref52","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1219"},{"key":"S135132491900038X_ref51","unstructured":"Malmasi S. , Zampieri M. , Ljube\u0161i\u0107 N. , Nakov P. , Ali A. and Tiedemann J. (2016). Discriminating between similar languages and Arabic dialect identification: a report on the third DSL shared task. 
In Proceedings of the 3rd Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (VarDial), Osaka, Japan."},{"key":"S135132491900038X_ref49","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-2076"},{"key":"S135132491900038X_ref46","unstructured":"Kruengkrai C. , Sornlertlamvanich V. and Isahara H. (2006). Language, script, and encoding identification with string kernel classifiers. In Proceedings of the 1st International Conference on Knowledge, Information and Creativity Support Systems (KICSS 2006), Ayutthaya, Thailand."},{"key":"S135132491900038X_ref44","unstructured":"Kestemont M. , Tschuggnall M. , Stamatatos E. , Daelemans W. , Specht G. , Stein B. and Potthast M. (2018). Overview of the author identification task at PAN-2018. In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France."},{"key":"S135132491900038X_ref42","unstructured":"Jauhiainen T. , Lui M. , Zampieri M. , Baldwin T. and Lind\u00e9n K. (2018). Automatic language identification in texts: a survey. arXiv preprint arXiv:1804.08186."},{"key":"S135132491900038X_ref39","unstructured":"Jauhiainen T. , Lind\u00e9n K. and Jauhiainen H. (2016). HeLI, a word-based backoff method for language identification. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan, pp. 153\u2013162."},{"key":"S135132491900038X_ref37","unstructured":"Jauhiainen T. , Jauhiainen H. and Lind\u00e9n K. (2019). Discriminating between Mandarin Chinese and Swiss-German varieties using adaptive language models. 
In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Minneapolis, USA."},{"key":"S135132491900038X_ref58","first-page":"1151","volume-title":"Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP 2010)","author":"Scherrer","year":"2010"},{"key":"S135132491900038X_ref4","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1223"},{"key":"S135132491900038X_ref50","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1220"},{"key":"S135132491900038X_ref56","unstructured":"Priya R. , Ojha A.Kr. and Jha G.N. (2018). Automatic language identification system for Hindi and Magahi. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan."},{"key":"S135132491900038X_ref69","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-5307"},{"key":"S135132491900038X_ref12","doi-asserted-by":"publisher","DOI":"10.1023\/A:1010933404324"},{"key":"S135132491900038X_ref1","unstructured":"Ali M. (2018a). Character level convolutional neural network for German dialect identification. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Santa Fe, NM, pp. 172\u2013177."},{"key":"S135132491900038X_ref17","unstructured":"Ciobanu A.M. , Malmasi S. and Dinu L.P. (2018a). German dialect identification using classifier ensembles. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Santa Fe, NM, pp. 288\u2013294."},{"key":"S135132491900038X_ref5","unstructured":"Barbaresi A. (2018). Computationally efficient discrimination between language varieties with large feature vectors and regularized classifiers. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Santa Fe, NM, pp. 
164\u2013171."},{"key":"S135132491900038X_ref30","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1135"},{"key":"S135132491900038X_ref31","volume-title":"Tekstin kielen automaattinen tunnistaminen","author":"Jauhiainen","year":"2010"},{"key":"S135132491900038X_ref43","doi-asserted-by":"publisher","DOI":"10.3115\/112405.112464"},{"key":"S135132491900038X_ref8","unstructured":"Bergsma S. , McNamee P. , Bagdouri M. , Fink C. and Wilson T. (2012). Language identification for creating language-specific Twitter collections. In Proceedings of the Second Workshop on Language in Social Media (LSM 2012), Montr\u00e9al, Canada, pp. 65\u201374."},{"key":"S135132491900038X_ref10","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1214"},{"key":"S135132491900038X_ref26","first-page":"108","volume-title":"Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL 2015)","author":"Hollenstein","year":"2015"},{"key":"S135132491900038X_ref55","doi-asserted-by":"publisher","DOI":"10.1007\/s10115-016-0997-x"},{"key":"S135132491900038X_ref25","doi-asserted-by":"publisher","DOI":"10.3390\/a11040039"},{"key":"S135132491900038X_ref2","unstructured":"Ali M. (2018b). Character level convolutional neural network for Indo-Aryan language identification. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Santa Fe, NM, pp. 283\u2013287."},{"key":"S135132491900038X_ref38","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-18111-0_48"},{"key":"S135132491900038X_ref68","unstructured":"Zampieri M. , Malmasi S. , Scherrer Y. , Samard\u017ei\u0107 T. , Tyers F. , Silfverberg M. , Klyueva N. , Pan T.-L. , Huang C.-R. , Ionescu R.T. , Butnaru A. and Jauhiainen T. (2019). A report on the third VarDial evaluation campaign. 
In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Minneapolis, USA."},{"key":"S135132491900038X_ref20","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1218"},{"key":"S135132491900038X_ref66","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1201"},{"key":"S135132491900038X_ref48","volume-title":"Ethnologue: Languages of the World","author":"Lewis","year":"2013"},{"key":"S135132491900038X_ref45","unstructured":"Kone Foundation (2012). The language programme 2012-2016."},{"key":"S135132491900038X_ref33","unstructured":"Jauhiainen T. , Jauhiainen H. and Lind\u00e9n K. (2015b). Discriminating similar languages with token-based backoff. In Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial), Hissar, Bulgaria, pp. 44\u201351."},{"key":"S135132491900038X_ref11","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-4408"},{"key":"S135132491900038X_ref7","unstructured":"Benites F. , von D\u00e4niken P. and Cieliebak M. (2019). TwistBytes - identification of Cuneiform languages and German dialects at VarDial 2019. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Minneapolis, USA."},{"key":"S135132491900038X_ref16","first-page":"270","article-title":"Language model adaptation and confidence measure for robust language identification","volume":"1","author":"Chen","year":"2005","journal-title":"Proceedings of ISCIT 2005"},{"key":"S135132491900038X_ref29","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1225"},{"key":"S135132491900038X_ref60","doi-asserted-by":"publisher","DOI":"10.1109\/ICCCNT.2013.6726777"},{"key":"S135132491900038X_ref72","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-73110-8_58"},{"key":"S135132491900038X_ref3","doi-asserted-by":"crossref","unstructured":"Bacchiani M. and Roark B. (2003). Unsupervised language model adaptation. 
In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP 2003), pp. 224\u2013227.","DOI":"10.1109\/ICASSP.2003.1198758"},{"key":"S135132491900038X_ref6","unstructured":"Benites F. , Grubenmann R. , von D\u00e4niken P. , von Gr\u00fcnigen D. , Deriu J. and Cieliebak M. (2018). Twist Bytes - German dialect identification with data mining optimization. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Santa Fe, NM, pp. 218\u2013227."},{"key":"S135132491900038X_ref14","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-40585-3_60"},{"key":"S135132491900038X_ref9","unstructured":"Bernier-Colborne G. , Goutte C. and L\u00e9ger S. (2019). Improving Cuneiform language identification with BERT. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Minneapolis, USA."},{"key":"S135132491900038X_ref18","unstructured":"Ciobanu A.M. , Zampieri M. , Malmasi S. , Pal S. and Dinu L.P. (2018b). Discriminating between Indo-Aryan languages using SVM ensembles. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Santa Fe, NM, pp. 178\u2013184."},{"key":"S135132491900038X_ref13","doi-asserted-by":"publisher","DOI":"10.1016\/j.diin.2012.05.004"},{"key":"S135132491900038X_ref73","unstructured":"Zlatkova D. , Kopev D. , Mitov K. , Atanasov A. , Hardalov M. , Koychev I. and Nakov P. (2018). An ensemble-rich multi-aspect approach for robust style change detection - notebook for PAN at CLEF-2018. In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France."},{"key":"S135132491900038X_ref15","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1069"},{"key":"S135132491900038X_ref19","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1221"},{"key":"S135132491900038X_ref59","unstructured":"Sibun P. and Reynar J.C. (1996). Language identification: examining the issues. 
In Proceedings of the 5th Annual Symposium on Document Analysis and Information Retrieval (SDAIR-96), Las Vegas, USA, pp. 125\u2013135."},{"key":"S135132491900038X_ref22","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1213"},{"key":"S135132491900038X_ref28","first-page":"327","article-title":"Text based language identification system for Indian languages following Devanagiri script","volume":"3","author":"Indhuja","year":"2014","journal-title":"International Journal of Engineering Research and Technology"},{"key":"S135132491900038X_ref41","unstructured":"Jauhiainen T. , Lind\u00e9n K. and Jauhiainen H. (2017b). Evaluation of language identification methods using 285 languages. In Proceedings of the 21st Nordic Conference on Computational Linguistics (NoDaLiDa 2017), Gothenburg, Sweden, pp. 183\u2013191."},{"key":"S135132491900038X_ref21","unstructured":"\u00c7\u00f6ltekin \u00c7. , Rama T. and Blaschke V. (2018). T\u00fcbingen-Oslo team at the VarDial 2018 evaluation campaign: an analysis of n-gram features in language variety identification. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Santa Fe, NM, pp. 55\u201365."},{"key":"S135132491900038X_ref23","unstructured":"Gupta D. , Dhakad G. , Gupta J. and Singh A.K. (2018). IIT (BHU) system, for Indo-Aryan language identification (ILI) at VarDial 2018. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Santa Fe, NM, pp. 185\u2013190."},{"key":"S135132491900038X_ref47","unstructured":"Kumar R. , Lahiri B. , Alok D. , Ojha A.Kr. , Jain M. , Basit A. and Dawar Y. (2018). Automatic identification of closely-related Indian languages: resources and experiments. 
In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC), Miyazaki, Japan."},{"key":"S135132491900038X_ref40","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1212"},{"key":"S135132491900038X_ref24","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1211"},{"key":"S135132491900038X_ref27","volume-title":"Wiley Series in Probability and Statistics","author":"Hosmer","year":"2013"},{"key":"S135132491900038X_ref32","doi-asserted-by":"publisher","DOI":"10.7557\/5.3471"},{"key":"S135132491900038X_ref34","unstructured":"Jauhiainen T. , Jauhiainen H. and Lind\u00e9n K. (2018a). HeLI-based experiments in Swiss German dialect identification. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Santa Fe, NM, pp. 254\u2013262."},{"key":"S135132491900038X_ref35","unstructured":"Jauhiainen T. , Jauhiainen H. and Lind\u00e9n K. (2018b). HeLI-based experiments in discriminating between Dutch and Flemish subtitles. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Santa Fe, NM, pp. 137\u2013144."},{"key":"S135132491900038X_ref67","unstructured":"Zampieri M. , Malmasi S. , Nakov P. , Ali A. , Shon S. , Glass J. , Scherrer Y. , Samard\u017ei\u0107 T. , Ljube\u0161i\u0107 N. , Tiedemann J. , van der Lee C. , Grondelaers S. , Oostdijk N. , van den Bosch A. , Kumar R. , Lahiri B. and Jain M. (2018). Language identification and morphosyntactic tagging: the second VarDial evaluation campaign. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Santa Fe, NM."},{"key":"S135132491900038X_ref36","unstructured":"Jauhiainen T. , Jauhiainen H. and Lind\u00e9n K. (2018c). Iterative language model adaptation for Indo-Aryan language identification. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Santa Fe, NM, pp. 
66\u201375."}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S135132491900038X","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2019,9,24]],"date-time":"2019-09-24T04:20:15Z","timestamp":1569298815000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S135132491900038X\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,7,31]]},"references-count":73,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2019,9]]}},"alternative-id":["S135132491900038X"],"URL":"https:\/\/doi.org\/10.1017\/s135132491900038x","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,7,31]]}}}