{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,21]],"date-time":"2026-05-21T16:35:53Z","timestamp":1779381353813,"version":"3.53.1"},"reference-count":35,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2021,10,25]],"date-time":"2021-10-25T00:00:00Z","timestamp":1635120000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Ministry of Education, Science, and Culture of Mecklenburg-Western Pomerania (Germany)","award":["ESF\/14-BM-A55-0006\/19"],"award-info":[{"award-number":["ESF\/14-BM-A55-0006\/19"]}]},{"DOI":"10.13039\/501100004895","name":"European Social Fund","doi-asserted-by":"publisher","award":["ESF\/14-BM-A55-0006\/19"],"award-info":[{"award-number":["ESF\/14-BM-A55-0006\/19"]}],"id":[{"id":"10.13039\/501100004895","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Currently, the most widespread neural network architecture for training language models is the so-called BERT, which led to improvements in various Natural Language Processing (NLP) tasks. In general, the larger the number of parameters in a BERT model, the better the results obtained in these NLP tasks. Unfortunately, the memory consumption and the training duration drastically increase with the size of these models. In this article, we investigate various training techniques of smaller BERT models: We combine different methods from other BERT variants, such as ALBERT, RoBERTa, and relative positional encoding. In addition, we propose two new fine-tuning modifications leading to better performance: Class-Start-End tagging and a modified form of Linear Chain Conditional Random Fields. 
Furthermore, we introduce Whole-Word Attention, which reduces BERT's memory usage and leads to a small increase in performance compared to classical Multi-Head-Attention. We evaluate these techniques on five public German Named Entity Recognition (NER) tasks, of which two are introduced by this article.<\/jats:p>","DOI":"10.3390\/info12110443","type":"journal-article","created":{"date-parts":[[2021,10,25]],"date-time":"2021-10-25T21:40:21Z","timestamp":1635198021000},"page":"443","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Optimizing Small BERTs Trained for German NER"],"prefix":"10.3390","volume":"12","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3889-6629","authenticated-orcid":false,"given":"Jochen","family":"Z\u00f6llner","sequence":"first","affiliation":[{"name":"Institute of Mathematics, University of Rostock, 18057 Rostock, Germany"},{"name":"PLANET AI GmbH Rostock, 18057 Rostock, Germany"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3856-5878","authenticated-orcid":false,"given":"Konrad","family":"Sperfeld","sequence":"additional","affiliation":[{"name":"Institute of Mathematics, University of Rostock, 18057 Rostock, Germany"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3958-6240","authenticated-orcid":false,"given":"Christoph","family":"Wick","sequence":"additional","affiliation":[{"name":"PLANET AI GmbH Rostock, 18057 Rostock, Germany"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1901-9644","authenticated-orcid":false,"given":"Roger","family":"Labahn","sequence":"additional","affiliation":[{"name":"Institute of Mathematics, University of Rostock, 18057 Rostock, 
Germany"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2021,10,25]]},"reference":[{"key":"ref_1","unstructured":"(2021, October 22). NEISS Project Neuronal Extraction of Information, Structures and Symmetries in Images. Available online: https:\/\/www.neiss.uni-rostock.de\/en\/."},{"key":"ref_2","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, The MIT Press."},{"key":"ref_3","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2019, January 2\u20137). Bert: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA."},{"key":"ref_4","unstructured":"Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. (2020, January 26\u201330). Albert: A lite bert for self-supervised learning of language representations. Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia."},{"key":"ref_5","unstructured":"Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv."},{"key":"ref_6","unstructured":"Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., and Le, Q.V. (2019). Xlnet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, The MIT Press."},{"key":"ref_7","unstructured":"(2021, October 22). Hugging Face. 
Available online: https:\/\/huggingface.co\/."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"3297","DOI":"10.21105\/joss.03297","article-title":"tfaip\u2014A Generic and Powerful Research Framework for Deep Learning based on Tensorflow","volume":"6","author":"Wick","year":"2021","journal-title":"J. Open Source Softw."},{"key":"ref_9","unstructured":"Attardi, G. (2020, February 15). WikiExtractor. Available online: https:\/\/github.com\/attardi\/wikiextractor."},{"key":"ref_10","unstructured":"Hamborg, F., Meuschke, N., Breitinger, C., and Gipp, B. (2017, January 13\u201315). news-please: A Generic News Crawler and Extractor. Proceedings of the 15th International Symposium of Information Science, Berlin, Germany."},{"key":"ref_11","unstructured":"Benikova, D., Biemann, C., Kisselew, M., and Pado, S. (2020, November 10). GermEval 2014 Named Entity Recognition Shared Task: Companion Paper. Available online: http:\/\/nbn-resolving.de\/urn:nbn:de:gbv:hil2-opus-3006."},{"key":"ref_12","unstructured":"Labusch, K., Neudecker, C., and Zellh\u00f6fer, D. (2019, January 8\u201311). BERT for Named Entity Recognition in Contemporary and Historic German. Proceedings of the 15th Conference on Natural Language Processing, Erlangen, Germany."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Chan, B., Schweter, S., and M\u00f6ller, T. (2020). German\u2019s Next Language Model. arXiv.","DOI":"10.18653\/v1\/2020.coling-main.598"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Riedl, M., and Pad\u00f3, S. (2018, January 15\u201320). A Named Entity Recognition Shootout for German. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia. Volume 2: Short Papers.","DOI":"10.18653\/v1\/P18-2020"},{"key":"ref_15","unstructured":"Leitner, E., Rehm, G., and Moreno-Schneider, J. (2020). A Dataset of German Legal Documents for Named Entity Recognition. 
arXiv."},{"key":"ref_16","unstructured":"Hahn, B., Breysach, B., and Pischel, C. (2021, October 22). Hannah Arendt Digital. Kritische Gesamtausgabe. Sechs Essays. Available online: https:\/\/hannah-arendt-edition.net\/3p.html."},{"key":"ref_17","unstructured":"TEI-Consortium (2021, October 22). Guidelines for Electronic Text Encoding and Interchange. Available online: https:\/\/tei-c.org\/."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Buchholz, S., and Marsi, E. (2006, January 8\u20139). CoNLL-X Shared Task on Multilingual Dependency Parsing. Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), New York, NY, USA.","DOI":"10.3115\/1596276.1596305"},{"key":"ref_19","unstructured":"Schrade, M.T. (2021, October 22). DER STURM. Digitale Quellenedition zur Geschichte der internationalen Avantgarde. Available online: https:\/\/sturm-edition.de\/id\/S.0000001."},{"key":"ref_20","unstructured":"Rosendahl, J., Tran, V.A.K., Wang, W., and Ney, H. (2019, January 2\u20133). Analysis of Positional Encodings for Neural Machine Translation. Proceedings of the IWSLT, Hong Kong, China."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Shaw, P., Uszkoreit, J., and Vaswani, A. (2018, January 1\u20136). Self-attention with relative position representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA.","DOI":"10.18653\/v1\/N18-2074"},{"key":"ref_22","unstructured":"Cui, Y., Che, W., Liu, T., Qin, B., Yang, Z., Wang, S., and Hu, G. (2019). Pre-training with whole word masking for chinese bert. arXiv."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Armstrong, S., Church, K., Isabelle, P., Manzi, S., Tzoukermann, E., and Yarowsky, D. (1999). Text Chunking Using Transformation-Based Learning. Natural Language Processing Using Very Large Corpora, Springer. 
Text, Speech and Language Technology.","DOI":"10.1007\/978-94-017-2390-9"},{"key":"ref_24","unstructured":"Sang, E.F., and De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv."},{"key":"ref_25","unstructured":"Nakayama, H. (2021, October 22). Seqeval: A Python Framework for Sequence Labeling Evaluation. Available online: https:\/\/github.com\/chakki-works\/seqeval."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Luoma, J., and Pyysalo, S. (2020). Exploring Cross-sentence Contexts for Named Entity Recognition with BERT. arXiv.","DOI":"10.18653\/v1\/2020.coling-main.78"},{"key":"ref_27","unstructured":"Souza, F., Nogueira, R., and Lotufo, R. (2020). Portuguese Named Entity Recognition using BERT-CRF. arXiv."},{"key":"ref_28","unstructured":"Lafferty, J., McCallum, A., and Pereira, F. (July, January 28). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of the International Conference on Machine Learning (ICML), Williamstown, MA, USA."},{"key":"ref_29","unstructured":"Sutton, C., and McCallum, A. (2021, October 22). An Introduction to Conditional Random Fields. Available online: https:\/\/homepages.inf.ed.ac.uk\/csutton\/publications\/crftutv2.pdf."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Lester, B., Pressel, D., Hemmeter, A., Ray Choudhury, S., and Bangalore, S. (2020, January 16\u201320). Constrained Decoding for Computationally Efficient Named Entity Recognition Taggers. Proceedings of the Findings of the Association for Computational Linguistics: EMNLP, Online.","DOI":"10.18653\/v1\/2020.findings-emnlp.166"},{"key":"ref_31","first-page":"3111","article-title":"Distributed representations of words and phrases and their compositionality","volume":"26","author":"Mikolov","year":"2013","journal-title":"Adv. Neural Inf. Process. 
Syst."},{"key":"ref_32","unstructured":"Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., and Yang, L. (2020). Big bird: Transformers for longer sequences. arXiv."},{"key":"ref_33","unstructured":"Beltagy, I., Peters, M.E., and Cohan, A. (2020). Longformer: The long-document transformer. arXiv."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Acosta, M., Cudr\u00e9-Mauroux, P., Maleshkova, M., Pellegrini, T., Sack, H., and Sure-Vetter, Y. (2019). Fine-Grained Named Entity Recognition in Legal Documents. Semantic Systems. The Power of AI and Knowledge Graphs, Springer International Publishing. Lecture Notes in Computer Science.","DOI":"10.1007\/978-3-030-33220-4"},{"key":"ref_35","unstructured":"Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/12\/11\/443\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T07:22:57Z","timestamp":1760167377000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/12\/11\/443"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,25]]},"references-count":35,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2021,11]]}},"alternative-id":["info12110443"],"URL":"https:\/\/doi.org\/10.3390\/info12110443","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,10,25]]}}}