{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,10]],"date-time":"2026-04-10T02:20:10Z","timestamp":1775787610863,"version":"3.50.1"},"reference-count":32,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2019,5,3]],"date-time":"2019-05-03T00:00:00Z","timestamp":1556841600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>Recently, it was demonstrated that generalized entropies of order \u03b1 offer novel and important opportunities to quantify the similarity of symbol sequences where \u03b1 is a free parameter. Varying this parameter makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum. For the analysis of the statistical properties of natural languages, this is especially interesting, because textual data are characterized by Zipf\u2019s law, i.e., there are very few word types that occur very often (e.g., function words expressing grammatical relationships) and many word types with a very low frequency (e.g., content words carrying most of the meaning of a sentence). Here, this approach is systematically and empirically studied by analyzing the lexical dynamics of the German weekly news magazine Der Spiegel (consisting of approximately 365,000 articles and 237,000,000 words that were published between 1947 and 2017). We show that, analogous to most other measures in quantitative linguistics, similarity measures based on generalized entropies depend heavily on the sample size (i.e., text length). We argue that this makes it difficult to quantify lexical dynamics and language change and show that standard sampling approaches do not solve this problem. We discuss the consequences of the results for the statistical analysis of languages.<\/jats:p>","DOI":"10.3390\/e21050464","type":"journal-article","created":{"date-parts":[[2019,5,7]],"date-time":"2019-05-07T03:15:46Z","timestamp":1557198946000},"page":"464","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":12,"title":["Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size"],"prefix":"10.3390","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9630-9680","authenticated-orcid":false,"given":"Alexander","family":"Koplenig","sequence":"first","affiliation":[{"name":"Department of Lexical Studies, Institute for the German language (IDS), 68161 Mannheim, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sascha","family":"Wolfer","sequence":"additional","affiliation":[{"name":"Department of Lexical Studies, Institute for the German language (IDS), 68161 Mannheim, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Carolin","family":"M\u00fcller-Spitzer","sequence":"additional","affiliation":[{"name":"Department of Lexical Studies, Institute for the German language (IDS), 68161 Mannheim, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2019,5,3]]},"reference":[{"key":"ref_1","unstructured":"Manning, C.D., and Sch\u00fctze, H. (1999). Foundations of Statistical Natural Language Processing, MIT Press."},{"key":"ref_2","unstructured":"Jurafsky, D., and Martin, J.H. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Pearson Education (US)."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"20150230","DOI":"10.1098\/rsta.2015.0230","article-title":"What is information?","volume":"374","author":"Adami","year":"2016","journal-title":"Philos. Trans. R. Soc. A"},{"key":"ref_4","unstructured":"Cover, T.M., and Thomas, J.A. (2006). Elements of information theory, Wiley-Interscience. [2nd ed.]."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Bentz, C., Alikaniotis, D., Cysouw, M., and Ferrer-i-Cancho, R. (2017). The Entropy of Words\u2014Learnability and Expressivity across More than 1000 Languages. Entropy, 19.","DOI":"10.20944\/preprints201704.0180.v1"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"145","DOI":"10.1109\/18.61115","article-title":"Divergence measures based on the Shannon entropy","volume":"37","author":"Lin","year":"1991","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"1858","DOI":"10.1109\/TIT.2003.813506","article-title":"A new metric for probability distributions","volume":"49","author":"Endres","year":"2003","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"7682","DOI":"10.1073\/pnas.1115407109","article-title":"Quantitative patterns of stylistic influence in the evolution of literature","volume":"109","author":"Hughes","year":"2012","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"9419","DOI":"10.1073\/pnas.1405984111","article-title":"The civilizing process in London\u2019s Old Bailey","volume":"111","author":"Klingenstein","year":"2014","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"2246","DOI":"10.3390\/e15062246","article-title":"Bootstrap Methods for the Empirical Study of Decision-Making and Information Flows in Social Systems","volume":"15","author":"DeDeo","year":"2013","journal-title":"Entropy"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"20140841","DOI":"10.1098\/rsif.2014.0841","article-title":"Universals versus historical contingencies in lexical evolution","volume":"11","author":"Bochkarev","year":"2014","journal-title":"J. R. Soc. Interface"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"289","DOI":"10.1080\/09296174.2017.1311447","article-title":"A Data-Driven Method to Identify (Correlated) Changes in Chronological Corpora","volume":"24","author":"Koplenig","year":"2017","journal-title":"J. Quant. Linguist."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Pechenick, E.A., Danforth, C.M., and Dodds, P.S. (2015). Characterizing the Google Books corpus: Strong limits to inferences of socio-cultural and linguistic evolution. PLOS ONE.","DOI":"10.1371\/journal.pone.0137041"},{"key":"ref_14","unstructured":"Zipf, G.K. (1935). The Psycho-biology of Language. An Introduction to Dynamic Philology, Houghton Mifflin Company."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1515\/cllt-2014-0049","article-title":"Using the parameters of the Zipf\u2013Mandelbrot law to measure diachronic lexical, syntactical and stylistic changes\u2013a large-scale corpus analysis","volume":"14","author":"Koplenig","year":"2018","journal-title":"Corpus Linguist. Linguist. Theory"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Baayen, R.H. (2001). Word Frequency Distributions, Kluwer Academic Publishers.","DOI":"10.1007\/978-94-010-0844-0"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"323","DOI":"10.1023\/A:1001749303137","article-title":"How Variable May a Constant be? Measures of Lexical Richness in Perspective","volume":"32","author":"Tweedie","year":"1998","journal-title":"Comput. Hum."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"238","DOI":"10.1111\/j.2517-6161.1951.tb00088.x","article-title":"The Interpretation of Interaction in Contingency Tables","volume":"13","author":"Simpson","year":"1951","journal-title":"J. R. Stat. Soc. Series B"},{"key":"ref_19","first-page":"021006","article-title":"Stochastic Model for the Vocabulary Growth in Natural Languages","volume":"3","author":"Gerlach","year":"2013","journal-title":"Phys. Rev. X"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"052311","DOI":"10.1103\/PhysRevA.79.052311","article-title":"Properties of classical and quantum Jensen-Shannon divergence","volume":"79","year":"2009","journal-title":"Phys. Rev. A"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"014002","DOI":"10.1088\/1742-5468\/aa53f5","article-title":"Generalized entropies and the similarity of texts","volume":"2017","author":"Altmann","year":"2017","journal-title":"J. Stat. Mech. Theory Exp."},{"key":"ref_22","first-page":"021009","article-title":"Similarity of Symbol Frequency Distributions with Heavy Tails","volume":"6","author":"Gerlach","year":"2016","journal-title":"Phys. Rev. X"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"481","DOI":"10.1162\/COLI_a_00228","article-title":"Computational Constancy Measures of Texts\u2014Yule\u2019s K and R\u00e9nyi\u2019s Entropy","volume":"41","author":"Aihara","year":"2015","journal-title":"Comput. Linguist."},{"key":"ref_24","unstructured":"R\u00e9nyi, A. (July, January 20). On Measures of Entropy and Information. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, Berkeley, CA, USA."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"1211","DOI":"10.1109\/TSP.2003.810305","article-title":"A generalized divergence measure for robust image registration","volume":"51","author":"He","year":"2003","journal-title":"IEEE Trans. Signal Process."},{"key":"ref_26","unstructured":"Schmid, H. (,  1994). Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK."},{"key":"ref_27","unstructured":"H\u0159eb\u00ed\u010dek, L., and Altmann, G. (1993). Dynamic aspects of text characteristics. Quantitative Text Analysis, WVT Wissenschaftlicher Verlag Trier. Quantitative linguistics."},{"key":"ref_28","unstructured":"Popescu, I.-I., and Altmann, G. (2009). Word Frequency Studies, Mouton de Gruyter. Quantitative linguistics."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1076\/jqul.6.1.1.4148","article-title":"Review Article: On Vocabulary Richness","volume":"6","author":"Wimmer","year":"1999","journal-title":"J. Quant. Linguist."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"176","DOI":"10.1126\/science.1199644","article-title":"Quantitative Analysis of Culture Using Millions of Digitized Books","volume":"331","author":"Michel","year":"2010","journal-title":"Science"},{"key":"ref_31","unstructured":"Lin, Y., Michel, J.-B., Aiden, L.E., Orwant, J., Brockmann, W., and Petrov, S. (2012, January 8\u201314). Syntactic Annotations for the Google Books Ngram Corpus. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju Island, Korea."},{"key":"ref_32","unstructured":"Choukri, K., Cieri, C., Declerck, T., Goggi, S., Hasida, K., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., and Moreno, A. (2018, January 7\u201312). The German Reference Corpus DeReKo: New Developments\u2013New Opportunities. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan."}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/21\/5\/464\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T12:48:56Z","timestamp":1760186936000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/21\/5\/464"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,5,3]]},"references-count":32,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2019,5]]}},"alternative-id":["e21050464"],"URL":"https:\/\/doi.org\/10.3390\/e21050464","relation":{},"ISSN":["1099-4300"],"issn-type":[{"value":"1099-4300","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,5,3]]}}}