{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,9]],"date-time":"2026-02-09T04:55:04Z","timestamp":1770612904390,"version":"3.49.0"},"reference-count":41,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2021,3,19]],"date-time":"2021-03-19T00:00:00Z","timestamp":1616112000000},"content-version":"vor","delay-in-days":2,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,3,17]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features. We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data. 
We also investigate linguistically motivated subword segmentation strategies like Morfessor and Finite-State Transducers (FSTs) and find that these segmentation strategies yield better performance and reduce the impact of a language\u2019s morphology on language modeling.<\/jats:p>","DOI":"10.1162\/tacl_a_00365","type":"journal-article","created":{"date-parts":[[2021,4,21]],"date-time":"2021-04-21T15:18:55Z","timestamp":1619018335000},"page":"261-276","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":13,"title":["Morphology Matters: A Multilingual Language Modeling Analysis"],"prefix":"10.1162","volume":"9","author":[{"given":"Hyunji Hayley","family":"Park","sequence":"first","affiliation":[{"name":"University of Illinois. hpark129@illinois.edu"}]},{"given":"Katherine J.","family":"Zhang","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University. kjzhang@cmu.edu"}]},{"given":"Coleman","family":"Haley","sequence":"additional","affiliation":[{"name":"Johns Hopkins University. chaley7@jhu.edu"}]},{"given":"Kenneth","family":"Steimel","sequence":"additional","affiliation":[{"name":"Indiana University. ksteimel@iu.edu"}]},{"given":"Han","family":"Liu","sequence":"additional","affiliation":[{"name":"University of Chicago. hanliu@uchicago.edu"}]},{"given":"Lane","family":"Schwartz","sequence":"additional","affiliation":[{"name":"University of Illinois. 
lanes@illinois.edu"}]}],"member":"281","published-online":{"date-parts":[[2021,3,17]]},"reference":[{"key":"2021042114534757300_bib1","article-title":"Finite-state transducer-based computational model of Plains Cree morphology","author":"Arppe","year":"2014\u20132019"},{"key":"2021042114534757300_bib2","article-title":"Helsinki finite-state technology resources","author":"Axelson","year":"2015"},{"issue":"1","key":"2021042114534757300_bib3","doi-asserted-by":"crossref","first-page":"289","DOI":"10.1111\/j.2517-6161.1995.tb02031.x","article-title":"Controlling the false discovery rate: A practical and powerful approach to multiple testing","volume":"57","author":"Benjamini","year":"1995","journal-title":"Journal of the Royal Statistical Society: Series B (Methodological)"},{"key":"2021042114534757300_bib4","first-page":"142","article-title":"A comparison between morphological complexity measures: Typological data vs. language corpora","volume-title":"Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)","author":"Bentz","year":"2016"},{"key":"2021042114534757300_bib5","article-title":"Byte pair encoding is suboptimal for language model pretraining","author":"Bostrom","year":"2020","journal-title":"CoRR"},{"key":"2021042114534757300_bib6","article-title":"A freely available morphological analyzer for Turkish","volume-title":"Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC\u201910)","author":"\u00c7\u00f6ltekin","year":"2010"},{"key":"2021042114534757300_bib7","article-title":"A set of open source tools for Turkish natural language processing","volume-title":"Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC\u201914)","author":"\u00c7\u00f6ltekin","year":"2014"},{"key":"2021042114534757300_bib8","first-page":"1","article-title":"A massively parallel corpus: The Bible in 100 
languages","volume":"49","author":"Christodoulopoulos","year":"2014","journal-title":"Language Resources and Evaluation"},{"key":"2021042114534757300_bib9","first-page":"536","article-title":"Are all languages equally hard to language-model?","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)","author":"Cotterell","year":"2018"},{"issue":"2","key":"2021042114534757300_bib10","doi-asserted-by":"crossref","first-page":"94","DOI":"10.1080\/09296171003643098","article-title":"Cutting the Gordian knot: The moving-average type\u2013token ratio (MATTR)","volume":"17","author":"Covington","year":"2010","journal-title":"Journal of Quantitative Linguistics"},{"issue":"1","key":"2021042114534757300_bib11","doi-asserted-by":"crossref","first-page":"3:1","DOI":"10.1145\/1187415.1187418","article-title":"Unsupervised models for morpheme segmentation and morphology learning","volume":"4","author":"Creutz","year":"2007","journal-title":"ACM Transactions on Speech and Language Processing"},{"key":"2021042114534757300_bib12","doi-asserted-by":"crossref","first-page":"2864","DOI":"10.18653\/v1\/D18-1312","article-title":"A framework for understanding the role of morphology in universal dependency parsing","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Dehouck","year":"2018"},{"key":"2021042114534757300_bib13","first-page":"4171","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Devlin","year":"2019"},{"key":"2021042114534757300_bib14","volume-title":"WALS 
Online","author":"Dryer","year":"2013"},{"key":"2021042114534757300_bib15","doi-asserted-by":"crossref","first-page":"316","DOI":"10.18653\/v1\/D18-1029","article-title":"On the relation between linguistic typology and (limitations of) multilingual language modeling","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Gerz","year":"2018"},{"issue":"3","key":"2021042114534757300_bib16","doi-asserted-by":"crossref","first-page":"223","DOI":"10.1080\/09296174.2014.911506","article-title":"Can type-token ratio be used to show morphological complexity of languages?","volume":"21","author":"Kettunen","year":"2014","journal-title":"Journal of Quantitative Linguistics"},{"key":"2021042114534757300_bib17","article-title":"UniMorph 2.0: Universal morphology","volume-title":"Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)","author":"Kirov","year":"2018"},{"key":"2021042114534757300_bib18","first-page":"1","article-title":"Computational challenges for polysynthetic languages","volume-title":"Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages","author":"Klavans","year":"2018"},{"key":"2021042114534757300_bib19","first-page":"79","article-title":"Europarl: A parallel corpus for statistical machine translation","volume-title":"Proceedings of the Tenth Machine Translation Summit","author":"Koehn","year":"2005"},{"key":"2021042114534757300_bib20","doi-asserted-by":"crossref","first-page":"66","DOI":"10.18653\/v1\/P18-1007","article-title":"Subword regularization: Improving neural network translation models with multiple subword candidates","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long 
Papers)","author":"Kudo","year":"2018"},{"key":"2021042114534757300_bib21","doi-asserted-by":"crossref","first-page":"119","DOI":"10.1007\/978-3-642-23138-4_8","article-title":"Indonesian morphology tool (MorphInd): Towards an Indonesian corpus","volume-title":"Systems and Frameworks for Computational Morphology","author":"Larasati","year":"2011"},{"key":"2021042114534757300_bib22","article-title":"RoBERTa: A robustly optimized BERT pretraining approach","author":"Liu","year":"2019","journal-title":"CoRR"},{"key":"2021042114534757300_bib23","first-page":"73","article-title":"Lost in translation: Analysis of information loss during machine translation between polysynthetic and fusional languages","volume-title":"Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages","author":"Mager","year":"2018"},{"key":"2021042114534757300_bib24","first-page":"3158","article-title":"Creating a massively parallel Bible corpus","volume-title":"Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC\u201914)","author":"Mayer","year":"2014"},{"key":"2021042114534757300_bib25","article-title":"An analysis of neural language modeling at multiple scales","author":"Merity","year":"2018","journal-title":"CoRR"},{"key":"2021042114534757300_bib26","article-title":"Language diversity in ACL 2004 - 2016","author":"Mielke","year":"2016"},{"key":"2021042114534757300_bib27","doi-asserted-by":"crossref","first-page":"4975","DOI":"10.18653\/v1\/P19-1491","article-title":"What kind of language is hard to language-model?","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Mielke","year":"2019"},{"key":"2021042114534757300_bib28","doi-asserted-by":"crossref","first-page":"6843","DOI":"10.1609\/aaai.v33i01.33016843","article-title":"Spell once, summon anywhere: A two-level open-vocabulary language 
model","volume":"33","author":"Mielke","year":"2019","journal-title":"Proceedings of the AAAI Conference on Artificial Intelligence"},{"key":"2021042114534757300_bib29","first-page":"313","article-title":"Omorfi \u2014 free and open source morphological lexical database for Finnish","volume-title":"Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)","author":"Pirinen","year":"2015"},{"key":"2021042114534757300_bib30","article-title":"Language models are unsupervised multitask learners","author":"Radford","year":"2019"},{"key":"2021042114534757300_bib31","article-title":"Comparing complexity measures","volume-title":"Computational Approaches to Morphological Complexity","author":"Sagot","year":"2013"},{"key":"2021042114534757300_bib32","first-page":"1","article-title":"SMOR: A German computational morphology covering derivation, composition and inflection","volume-title":"Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC\u201904)","author":"Schmid","year":"2004"},{"key":"2021042114534757300_bib33","article-title":"Neural polysynthetic language modelling","author":"Schwartz","year":"2020","journal-title":"CoRR"},{"key":"2021042114534757300_bib34","doi-asserted-by":"crossref","first-page":"1715","DOI":"10.18653\/v1\/P16-1162","article-title":"Neural machine translation of rare words with subword units","volume-title":"Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Sennrich","year":"2016"},{"key":"2021042114534757300_bib35","unstructured":"Yusuke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, and Setsuo Arikawa. 1999. Byte pair encoding: A text compression scheme that accelerates pattern matching. 
Technical report, Department of Informatics, Kyushu University."},{"issue":"21","key":"2021042114534757300_bib36","first-page":"19","article-title":"The need to report effect size estimates revisited. An overview of some recommended measures of effect size","volume":"1","author":"Tomczak","year":"2014","journal-title":"Trends in Sport Sciences"},{"key":"2021042114534757300_bib37","first-page":"195","article-title":"Dependency annotation of noun incorporation in polysynthetic languages","volume-title":"Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)","author":"Tyers","year":"2020"},{"key":"2021042114534757300_bib38","doi-asserted-by":"crossref","first-page":"2016","DOI":"10.18653\/v1\/P17-1184","article-title":"From characters to words to in between: Do we capture morphology?","volume-title":"Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Vania","year":"2017"},{"key":"2021042114534757300_bib39","article-title":"Analizador morf\u00f3logico de la lengua Quechua basado en software libre Helsinki finite-state transducer (HFST)","author":"Vilca","year":"2012"},{"key":"2021042114534757300_bib40","unstructured":"Sami Virpioja, Peter Smit, Stig-Arne Gr\u00f6nroos, and Mikko Kurimo. 2013. Morfessor 2.0: Python implementation and extensions for Morfessor baseline. 
Technical report, Aalto University; Aalto-yliopisto."},{"key":"2021042114534757300_bib41","first-page":"5753","article-title":"XLNet: Generalized autoregressive pretraining for language understanding","volume-title":"Advances in Neural Information Processing Systems 32","author":"Yang","year":"2019"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00365\/1896757\/tacl_a_00365.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00365\/1896757\/tacl_a_00365.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,28]],"date-time":"2024-08-28T23:29:27Z","timestamp":1724887767000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00365\/98237\/Morphology-Matters-A-Multilingual-Language"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,3,17]]},"references-count":41,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00365","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,3,17]]}}}