{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,28]],"date-time":"2026-02-28T04:22:10Z","timestamp":1772252530457,"version":"3.50.1"},"reference-count":26,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2023,11,10]],"date-time":"2023-11-10T00:00:00Z","timestamp":1699574400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Leibniz Association"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset contains information based on a corpus of 43.2 billion tokens and is divided into 16 parts based on 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.<\/jats:p>","DOI":"10.3390\/data8110170","type":"journal-article","created":{"date-parts":[[2023,11,13]],"date-time":"2023-11-13T02:10:43Z","timestamp":1699841443000},"page":"170","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Introducing DeReKoGram: A Novel Frequency Dataset with Lemma and Part-of-Speech Information for German"],"prefix":"10.3390","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8893-8153","authenticated-orcid":false,"given":"Sascha","family":"Wolfer","sequence":"first","affiliation":[{"name":"Leibniz Institute for the German Language (IDS), 68161 Mannheim, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9630-9680","authenticated-orcid":false,"given":"Alexander","family":"Koplenig","sequence":"additional","affiliation":[{"name":"Leibniz Institute for the German Language (IDS), 68161 Mannheim, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8997-8256","authenticated-orcid":false,"given":"Marc","family":"Kupietz","sequence":"additional","affiliation":[{"name":"Leibniz Institute for the German Language (IDS), 68161 Mannheim, Germany"}]},{"given":"Carolin","family":"M\u00fcller-Spitzer","sequence":"additional","affiliation":[{"name":"Leibniz Institute for the German Language (IDS), 68161 Mannheim, Germany"}]}],"member":"1968","published-online":{"date-parts":[[2023,11,10]]},"reference":[{"key":"ref_1","unstructured":"Calzolari, N., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Hasida, K., Isahara, H., Maegaard, B., Mariani, J., and Mazo, H. (2018, January 7). The German Reference Corpus DeReKo: New Developments\u2014New Opportunities. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan."},{"key":"ref_2","unstructured":"Chapelle, C.A. (2019). The Encyclopedia of Applied Linguistics, Wiley."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"176","DOI":"10.1126\/science.1199644","article-title":"Quantitative Analysis of Culture Using Millions of Digitized Books","volume":"331","author":"Michel","year":"2011","journal-title":"Science"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Pechenick, E.A., Danforth, C.M., and Dodds, P.S. (2015). Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution. PLoS ONE, 10.","DOI":"10.1371\/journal.pone.0137041"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"e2115010118","DOI":"10.1073\/pnas.2115010118","article-title":"Uncontrolled Corpus Composition Drives an Apparent Surge in Cognitive Distortions","volume":"118","author":"Schmidt","year":"2021","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"ref_6","unstructured":"Jurafsky, D., and Martin, J.H. (2023). Speech and Language Processing, [3rd ed.]. Available online: https:\/\/web.stanford.edu\/~jurafsky\/slp3\/."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"862","DOI":"10.1037\/0278-7393.31.5.862","article-title":"Effects of Contextual Predictability and Transitional Probability on Eye Movements During Reading","volume":"31","author":"Frisson","year":"2005","journal-title":"J. Exp. Psychol. Learn. Mem. Cogn."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"262","DOI":"10.1080\/09541440340000213","article-title":"Length, Frequency, and Predictability Effects of Words on Eye Movements in Reading","volume":"16","author":"Kliegl","year":"2004","journal-title":"Eur. J. Cogn. Psychol."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1090","DOI":"10.1016\/j.clinph.2003.12.020","article-title":"Effects of Word Length and Frequency on the Human Event-Related Potential","volume":"115","author":"Hauk","year":"2004","journal-title":"Clin. Neurophysiol."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"128","DOI":"10.1037\/a0040332","article-title":"Distinct ERP Signatures of Word Frequency, Phrase Frequency, and Prototypicality in Speech Production","volume":"43","author":"Hendrix","year":"2017","journal-title":"J. Exp. Psychol. Learn. Mem. Cogn."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"e13090","DOI":"10.1111\/cogs.13090","article-title":"Testing the Relationship between Word Length, Frequency, and Predictability Based on the German Reference Corpus","volume":"46","author":"Koplenig","year":"2022","journal-title":"Cogn. Sci."},{"key":"ref_12","unstructured":"Klosa-K\u00fcckelhaus, A., Engelberg, S., M\u00f6hrs, C., and Storjohann, P. (2022, January 12\u201316). Tokenizing on Scale. Preprocessing Large Text Corpora on the Lexical and Sentence Level. Proceedings of the Dictionaries and Society, Proceedings of the XX EURALEX International Congress, Mannheim, Germany."},{"key":"ref_13","unstructured":"Schmid, H. (1994, January 6\u20138). Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK."},{"key":"ref_14","unstructured":"Brants, T., Popat, A.C., Xu, P., Och, F.J., and Dean, J. (2007, January 28\u201330). Large Language Models in Machine Translation. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Aumasson, J.-P., Meier, W., Phan, R.C.-W., and Henzen, L. (2014). The Hash Function BLAKE, Springer.","DOI":"10.1007\/978-3-662-44757-4"},{"key":"ref_16","unstructured":"Schiller, A., Teufel, S., St\u00f6ckert, C., and Thielen, C. (1999). Guidelines f\u00fcr das Tagging Deutscher Textcorpora mit STTS, Institut f\u00fcr Maschinelle Sprachverarbeitung, Universit\u00e4t Stuttgart."},{"key":"ref_17","unstructured":"Jackson, W. (1953). Communication Theory, Butterworths Scientific Publications."},{"key":"ref_18","unstructured":"Zipf, G.K. (1935). The Psycho-Biology of Language, Houghton, Mifflin."},{"key":"ref_19","unstructured":"Evert, S., and Baroni, M. (2007, January 25\u201327). zipfR: Word Frequency Distributions in R. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Posters and Demonstrations Sessions, Prague, Czech Republic."},{"key":"ref_20","first-page":"455","article-title":"The Effects of Lexical Specialization on the Growth Curve of the Vocabulary","volume":"22","author":"Baayen","year":"1996","journal-title":"Comput. Linguist."},{"key":"ref_21","unstructured":"Bl\u00fchdorn, H., Elstermann, M., and Klosa, A. (2014). Die Erstellung der Basislemmaliste der Neuhochdeutschen Standardsprache aus Mehrfach Linguistisch Annotierten Korpora, Institut f\u00fcr Deutsche Sprache."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"1116","DOI":"10.3389\/fpsyg.2016.01116","article-title":"How Many Words Do We Know? Practical Estimates of Vocabulary Size Dependent on Word Definition, the Degree of Language Input and the Participant\u2019s Age","volume":"7","author":"Brysbaert","year":"2016","journal-title":"Front. Psychol."},{"key":"ref_23","unstructured":"Herdan, G. (1964). Quantitative Linguistics, Butterworths."},{"key":"ref_24","unstructured":"Heaps, H.S. (1978). Information Retrieval, Computational and Theoretical Aspects, Academic Press. Library and Information Science."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"30","DOI":"10.1075\/ijcl.20.1.02mil","article-title":"Evaluating Reliability in Quantitative Vocabulary Studies: The Influence of Corpus Design and Composition","volume":"20","author":"Miller","year":"2015","journal-title":"Int. J. Corpus Linguist."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"447","DOI":"10.1080\/01690969408402127","article-title":"Productivity in Language Production","volume":"9","author":"Baayen","year":"1994","journal-title":"Lang. Cogn. Process."}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/8\/11\/170\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T21:21:13Z","timestamp":1760131273000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/8\/11\/170"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,11,10]]},"references-count":26,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2023,11]]}},"alternative-id":["data8110170"],"URL":"https:\/\/doi.org\/10.3390\/data8110170","relation":{"has-preprint":[{"id-type":"doi","id":"10.21203\/rs.3.rs-3139640\/v1","asserted-by":"object"}]},"ISSN":["2306-5729"],"issn-type":[{"value":"2306-5729","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,11,10]]}}}