{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,19]],"date-time":"2026-03-19T12:22:01Z","timestamp":1773922921744,"version":"3.50.1"},"reference-count":72,"publisher":"Cambridge University Press (CUP)","issue":"4","license":[{"start":{"date-parts":[[2019,7,24]],"date-time":"2019-07-24T00:00:00Z","timestamp":1563926400000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2020,7]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The objective of this work is to set a corpus-driven methodology to quantify automatically diachronic language distance between chronological periods of several languages. We apply a perplexity-based measure to written text representing different historical periods of three languages: European English, European Portuguese, and European Spanish. For this purpose, we have built historical corpora for each period, which have been compiled from different open corpus sources containing texts as close as possible to its original spelling. The results of our experiments show that a diachronic language distance based on perplexity detects the linguistic evolution that had already been explained by the historians of the three languages. It is remarkable to underline that it is an unsupervised multilingual method which only needs a raw corpora organized by periods.<\/jats:p>","DOI":"10.1017\/s1351324919000378","type":"journal-article","created":{"date-parts":[[2019,7,24]],"date-time":"2019-07-24T06:27:48Z","timestamp":1563949668000},"page":"433-454","source":"Crossref","is-referenced-by-count":4,"title":["Measuring diachronic language distance using perplexity: Application to English, Portuguese, and Spanish"],"prefix":"10.1017","volume":"26","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5172-6803","authenticated-orcid":false,"given":"Jos\u00e9 Ramom Pichel","family":"Campos","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5819-2469","authenticated-orcid":false,"given":"Pablo Gamallo","family":"Otero","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0272-1472","authenticated-orcid":false,"given":"I\u00f1aki Alegria","family":"Loinaz","sequence":"additional","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2019,7,24]]},"reference":[{"key":"S1351324919000378_ref70","first-page":"580","article-title":"N-gram language models and POS distribution for the identification of Spanish varieties","volume":"2","author":"Zampieri","year":"2013","journal-title":"Proceedings of TALN"},{"key":"S1351324919000378_ref66","doi-asserted-by":"publisher","DOI":"10.9783\/9781512808957"},{"key":"S1351324919000378_ref67","first-page":"599","article-title":"Cipm\u2013um corpus informatizado do portugu\u00eas medieval","volume":"2","author":"Xavier","year":"1994","journal-title":"Actas do X Encontro da Associa\u00e7\u00e3o Portuguesa de Lingu\u00edstica"},{"key":"S1351324919000378_ref72","first-page":"1","volume":"50","author":"Zubiaga","year":"2015","journal-title":"Tweetlid: a benchmark for tweet language identification"},{"key":"S1351324919000378_ref63","unstructured":"Teyssier, P. (1982). Hist\u00f3ria da l\u00edngua portuguesa."},{"key":"S1351324919000378_ref61","doi-asserted-by":"publisher","DOI":"10.1017\/S0332586517000130"},{"key":"S1351324919000378_ref60","doi-asserted-by":"publisher","DOI":"10.2200\/S00854ED1V01Y201805HLT039"},{"key":"S1351324919000378_ref58","doi-asserted-by":"publisher","DOI":"10.3115\/1626516.1626522"},{"key":"S1351324919000378_ref57","volume-title":"Phylogenetic Inference of the Tibeto-Burman Languages Or on the Usefulness of Lexicostatistics (and \u201cmegalo\u201d-comparison) for the Subgrouping of Tibeto-Burman","author":"Satterthwaite-Phillips","year":"2011"},{"key":"S1351324919000378_ref53","volume-title":"Early English in the Computer Age: Explorations Through the Helsinki Corpus","volume":"11","author":"Rissanen","year":"1993"},{"key":"S1351324919000378_ref55","volume-title":"Hist\u00f3ria da literatura portuguesa","author":"Saraiva","year":"2001"},{"key":"S1351324919000378_ref51","volume-title":"Sequences in Language and Text","author":"Rama","year":"2015"},{"key":"S1351324919000378_ref50","unstructured":"Pichel, J.R. , Gamallo, P. and Alegria, I. (2018). Measuring language distance among historical varieties using perplexity. Application to european portuguese. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pp. 145\u2013155."},{"key":"S1351324919000378_ref41","unstructured":"Malmasi, S. , Zampieri, M. , Ljube\u0161i, N. , Nakov, P. , Ali, A. and Tiedemann, J. (2016). Discriminating between similar languages and Arabic dialect identification: A report on the third DSL Shared Task. In Proceedings of the 3rd Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (VarDial), Osaka, Japan, pp. 1\u201314."},{"key":"S1351324919000378_ref49","doi-asserted-by":"publisher","DOI":"10.1016\/j.physa.2010.02.004"},{"key":"S1351324919000378_ref40","doi-asserted-by":"publisher","DOI":"10.1007\/s11434-013-5711-8"},{"key":"S1351324919000378_ref26","volume-title":"The Story of English: How the English Language Conquered the World","author":"Gooden","year":"2009"},{"key":"S1351324919000378_ref34","doi-asserted-by":"publisher","DOI":"10.1098\/rsos.171504"},{"key":"S1351324919000378_ref43","volume-title":"Hist\u00f3ria de portugal","author":"Mattoso","year":"1994"},{"key":"S1351324919000378_ref45","doi-asserted-by":"publisher","DOI":"10.1353\/lan.2005.0078"},{"key":"S1351324919000378_ref17","volume-title":"Statistical identification of language","author":"Dunning","year":"1994"},{"key":"S1351324919000378_ref48","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0137041"},{"key":"S1351324919000378_ref30","unstructured":"J\u00e1grov\u00e1, K. , Stenger, I. , Marti, R. and Avgustinova, T. (2016). Lexical and orthographic distances between bulgarian, czech, polish, and russian: A comparative analysis of the most frequent nouns. In Language Use and Linguistic Structure: Proceedings of the Olomouc Linguistics Colloquium, pp. 401\u2013416."},{"key":"S1351324919000378_ref16","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511794339"},{"key":"S1351324919000378_ref11","volume-title":"Hist\u00f3ria de Portugal em datas","author":"Capelo","year":"1994"},{"key":"S1351324919000378_ref25","unstructured":"Gonz\u00e1lez, M. (2015). An analysis of twitter corpora and the differences between formal and colloquial tweets. In Proceedings of the Tweet Translation Workshop 2015, pp. 1\u20137."},{"key":"S1351324919000378_ref24","doi-asserted-by":"publisher","DOI":"10.1016\/j.physa.2013.08.075"},{"key":"S1351324919000378_ref4","doi-asserted-by":"publisher","DOI":"10.1075\/dia.30.2.01bar"},{"key":"S1351324919000378_ref35","doi-asserted-by":"publisher","DOI":"10.1007\/11575832_13"},{"key":"S1351324919000378_ref71","first-page":"1","volume-title":"Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)","author":"Zampieri","year":"2018"},{"key":"S1351324919000378_ref64","doi-asserted-by":"publisher","DOI":"10.3366\/E1749503208000075"},{"key":"S1351324919000378_ref12","unstructured":"Cavnar, W.B. , Trenkle, J.M. and John, M. (1994). N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, USA, pp. 161\u2013175. https:\/\/www.bibsonomy.org\/bibtex\/2b2f4de70229df66d0ecb9b2e25844a61\/nosebrain"},{"key":"S1351324919000378_ref2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W16-1208"},{"key":"S1351324919000378_ref22","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1213"},{"key":"S1351324919000378_ref28","doi-asserted-by":"publisher","DOI":"10.21814\/lm.10.1.263"},{"key":"S1351324919000378_ref56","volume-title":"Hist\u00f3ria concisa de Portugal","author":"Saraiva","year":"1978"},{"key":"S1351324919000378_ref59","doi-asserted-by":"publisher","DOI":"10.4324\/9780203435670"},{"key":"S1351324919000378_ref14","unstructured":"Degaetano-Ortlieb, S. , Kermes, H. , Khamis, A. and Teich, E. (2016). An information-theoretic approach to modeling diachronic change in scientific english. Selected Papers from Varieng-From Data to Evidence (d2e)."},{"key":"S1351324919000378_ref9","doi-asserted-by":"publisher","DOI":"10.1515\/9783110305258"},{"key":"S1351324919000378_ref1","volume-title":"Los 1001 a\u00f1os de la lengua espa\u00f1ola","volume":"3","author":"Alatorre","year":"2002"},{"key":"S1351324919000378_ref46","unstructured":"Nerbonne, J. and Heeringa, W. (1997a). Measuring dialect distance phonetically. In Proceedings of the Third Meeting of the ACL Special Interest Group in Computational Phonology, pp. 11\u201318."},{"key":"S1351324919000378_ref7","doi-asserted-by":"publisher","DOI":"10.1093\/llc\/8.4.243"},{"key":"S1351324919000378_ref3","doi-asserted-by":"publisher","DOI":"10.1515\/LITY.2009.009"},{"key":"S1351324919000378_ref5","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.1717729115"},{"key":"S1351324919000378_ref21","doi-asserted-by":"publisher","DOI":"10.1016\/j.physa.2017.05.011"},{"key":"S1351324919000378_ref68","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2007.1078"},{"key":"S1351324919000378_ref15","unstructured":"Degaetano-Ortlieb, S. and Teich, E. (2018). Using relative entropy for detection and analysis of periods of diachronic linguistic change. In Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pp. 22\u201333."},{"key":"S1351324919000378_ref62","first-page":"452","article-title":"Lexicostatistic dating of prehistoric ethnic contacts","volume":"96","author":"Swadesh","year":"1952","journal-title":"Proceedings of the American Philosophical Society"},{"key":"S1351324919000378_ref8","doi-asserted-by":"publisher","DOI":"10.1098\/rsif.2014.0841"},{"key":"S1351324919000378_ref27","doi-asserted-by":"publisher","DOI":"10.1515\/FLIN.2008.331"},{"key":"S1351324919000378_ref18","doi-asserted-by":"publisher","DOI":"10.3115\/1220175.1220210"},{"key":"S1351324919000378_ref20","unstructured":"Gamallo, P. , Alegria, I. , Pichel, J.R. and Agirrezabal, M. (2016). Comparing two basic methods for discriminating between similar languages and varieties. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pp. 170\u2013177."},{"key":"S1351324919000378_ref23","unstructured":"Gamallo, P. , Sotelo, S. and Pichel, J.R. (2014). Comparing ranking-based and naive bayes approaches to language detection on tweets. In Workshop TweetLID: Twitter Language Identification Workshop at SEPLN 2014. Girona, Spain."},{"key":"S1351324919000378_ref19","unstructured":"Galves, C. and Faria, P. (2010). Tycho Brahe parsed corpus of historical Portuguese. http:\/\/www.tycho.iel.unicamp.br\/tycho\/corpus\/en\/index.html"},{"key":"S1351324919000378_ref44","doi-asserted-by":"publisher","DOI":"10.4324\/9781315728056"},{"key":"S1351324919000378_ref36","unstructured":"Kroon, M. , Medvedeva, M. and Plank, B. (2018). When simple n-gram models outperform syntactic approaches: Discriminating between dutch and flemish. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pp. 244\u2013253."},{"key":"S1351324919000378_ref37","first-page":"15","volume-title":"International Conference on Applications of Natural Language to Information Systems","author":"Lai","year":"2018"},{"key":"S1351324919000378_ref42","unstructured":"Mastin, L. (2011). The history of english. Available at https:\/\/www.thehistoryofenglish.com\/history.html (accessed 10 July 2019)."},{"key":"S1351324919000378_ref39","doi-asserted-by":"publisher","DOI":"10.1093\/jole\/lzy006"},{"key":"S1351324919000378_ref10","doi-asserted-by":"publisher","DOI":"10.1524\/stuf.2008.0026"},{"key":"S1351324919000378_ref69","unstructured":"Zampieri, M. (2017). Compiling and processing historical and contemporary portuguese corpora. arXiv preprint arXiv:1710.00803."},{"key":"S1351324919000378_ref6","doi-asserted-by":"publisher","DOI":"10.4324\/9780203994634"},{"key":"S1351324919000378_ref38","unstructured":"Lapesa, R. and Pidal, R.M. (1942). Historia de la lengua espa\u00f1ola."},{"key":"S1351324919000378_ref32","doi-asserted-by":"publisher","DOI":"10.4324\/9780203068915"},{"key":"S1351324919000378_ref52","unstructured":"Rama, T. and Singh, A.K. (2009). From bag of languages to family trees from noisy corpus. In Proceedings of the International Conference RANLP-2009, pp. 355\u2013359."},{"key":"S1351324919000378_ref65","doi-asserted-by":"publisher","DOI":"10.1146\/annurev-linguist-030514-124930"},{"key":"S1351324919000378_ref47","unstructured":"Nerbonne, J. and Heeringa, W. (1997b). Measuring dialect distance phonetically. In Proceedings of the Third Meeting of the ACL Special Interest Group in Computational Phonology (SIGPHON-97), pp. 11\u201318."},{"key":"S1351324919000378_ref31","unstructured":"Juri\u0107, D. (2013). The Historical Development of the English Spelling System. PhD Thesis, Josip Juraj Strossmayer University of Osijek. Faculty of Humanities and Social Sciences."},{"key":"S1351324919000378_ref54","volume-title":"The Short Oxford History of English Literature","author":"Sanders","year":"1994"},{"key":"S1351324919000378_ref33","unstructured":"Kloss, H. (1967). \u201cAbstand languages\u201d and \u201cAusbau languages\u201d. In Anthropological Linguistics, pp. 29\u201341."},{"key":"S1351324919000378_ref13","volume-title":"Linguistic Distance: A Quantitative Measure of the Distance Between English and Other Languages","author":"Chiswick","year":"2004"},{"key":"S1351324919000378_ref29","doi-asserted-by":"publisher","DOI":"10.1016\/j.csl.2018.04.005"}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324919000378","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2020,6,10]],"date-time":"2020-06-10T13:26:34Z","timestamp":1591795594000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324919000378\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,7,24]]},"references-count":72,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2020,7]]}},"alternative-id":["S1351324919000378"],"URL":"https:\/\/doi.org\/10.1017\/s1351324919000378","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,7,24]]}}}