{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,17]],"date-time":"2026-06-17T16:47:27Z","timestamp":1781714847596,"version":"3.54.5"},"reference-count":41,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2025,8,13]],"date-time":"2025-08-13T00:00:00Z","timestamp":1755043200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Artif. Intell."],"abstract":"<jats:p>Foundational large language models (LLMs) are deployed in multilingual environments across a range of general and narrow task domains. These models generate text token by token, making them slower and more computationally expensive for low-resource languages that are underrepresented in the tokenizer vocabulary. It also makes their usage more costly in such cases, as pricing usually depends on the number of input and output tokens. This study compares multiple tokenizers of pretrained LLMs for the Ukrainian language. It also provides tokenization fertility measurements for current state-of-the-art (SOTA) models, both in terms of general-purpose language and specific domains, as well as results of experiments with a transliteration approach to make tokenization more efficient without information loss. The results provide insights into the current models\u2019 disadvantages and possible problems in terms of Ukrainian language modeling.<\/jats:p>","DOI":"10.3389\/frai.2025.1538165","type":"journal-article","created":{"date-parts":[[2025,8,13]],"date-time":"2025-08-13T06:33:21Z","timestamp":1755066801000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Tokenization efficiency of current foundational large language models for the Ukrainian language"],"prefix":"10.3389","volume":"8","author":[{"given":"Daniil","family":"Maksymenko","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Oleksii","family":"Turuta","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1965","published-online":{"date-parts":[[2025,8,13]]},"reference":[{"key":"ref1","doi-asserted-by":"publisher","first-page":"14219","DOI":"10.48550\/ARXIV.2404.14219","article-title":"Phi-3 technical report: a highly capable language model locally on your phone","volume":"2024","author":"Abdin","year":"2024","journal-title":"arXiv"},{"key":"ref2","author":"Ahia","year":"2023"},{"key":"ref3","author":"Arnett","year":"2024"},{"key":"ref4","doi-asserted-by":"publisher","first-page":"13292","DOI":"10.48550\/ARXIV.2404.13292","article-title":"Evaluating subword tokenization: alien subword composition and OOV generalization challenge","volume":"1","author":"Batsuren","year":"2024","journal-title":"arXiv"},{"key":"ref5","author":"Budzianowski","year":"2019"},{"key":"ref6","doi-asserted-by":"publisher","first-page":"5741","DOI":"10.48550\/ARXIV.2311.05741","article-title":"Efficiently adapting pretrained language models to new languages","volume":"2023","author":"Csaki","year":"2023","journal-title":"arXiv"},{"key":"ref7","doi-asserted-by":"publisher","first-page":"21783","DOI":"10.48550\/ARXIV.2407.21783","article-title":"The llama 3 herd of models","volume":"2024","author":"Dubey","year":"2024","journal-title":"arXiv"},{"key":"ref8","doi-asserted-by":"publisher","first-page":"1131","DOI":"10.1613\/jair.1.12918","article-title":"Neural natural language generation: a survey on Multilinguality, multimodality, controllability and learning","volume":"73","author":"Erdem","year":"2022","journal-title":"J. Artif. Intell. Res."},{"key":"ref9","author":"Gall\u00e9","year":"2019"},{"key":"ref10","doi-asserted-by":"publisher","first-page":"752","DOI":"10.48550\/ARXIV.2312.00752","article-title":"Mamba: linear-time sequence modeling with selective state spaces","volume":"2023","author":"Gu","year":"2023","journal-title":"arXiv"},{"key":"ref11","doi-asserted-by":"publisher","first-page":"11644","DOI":"10.48550\/ARXIV.2306.11644","article-title":"Textbooks are all you need","volume":"2023","author":"Gunasekar","year":"2023","journal-title":"arXiv"},{"key":"ref12","doi-asserted-by":"publisher","first-page":"825","DOI":"10.48550\/ARXIV.2310.06825","article-title":"Mistral 7B","volume":"2023","author":"Jiang","year":"2023","journal-title":"arXiv"},{"key":"ref13","doi-asserted-by":"publisher","first-page":"4088","DOI":"10.48550\/ARXIV.2401.04088","article-title":"Mixtral of experts","volume":"2024","author":"Jiang","year":"2024","journal-title":"arXiv"},{"key":"ref14","doi-asserted-by":"publisher","first-page":"647","DOI":"10.48550\/ARXIV.2404.03647","article-title":"Capabilities of large language models in control engineering: a benchmark study on GPT-4, Claude 3 opus, and Gemini 1.0 ultra","volume":"2024","author":"Kevian","year":"2024","journal-title":"arXiv"},{"key":"ref15","author":"Kudo","year":"2018"},{"key":"ref16","author":"Kudo","year":"2018"},{"key":"ref17","doi-asserted-by":"publisher","first-page":"11878","DOI":"10.48550\/ARXIV.2308.11878","article-title":"Cabrita: closing the gap for foreign languages","volume":"2023","author":"Larcher","year":"2023","journal-title":"arXiv"},{"key":"ref18","author":"Limisiewicz","year":"2023"},{"key":"ref19","author":"Maksymenko","year":"2023"},{"key":"ref20","author":"Maksymenko","year":"2022"},{"key":"ref21","year":"2024"},{"key":"ref22","author":"Marchisio","year":"2023"},{"key":"ref23","year":"2024"},{"key":"ref24","doi-asserted-by":"publisher","first-page":"774","DOI":"10.48550\/ARXIV.2303.08774","article-title":"GPT-4 technical report","volume":"2023","author":"Achiam","year":"2023","journal-title":"arXiv"},{"key":"ref25","doi-asserted-by":"publisher","first-page":"5","DOI":"10.32620\/reks.2023.4.01","article-title":"Ensemble machine learning approaches for fake news classification","volume":"4","author":"Padalko","year":"2023","journal-title":"Radioelectron. Comput. Syst."},{"key":"ref26","author":"Peters","year":"2022"},{"key":"ref27","doi-asserted-by":"publisher","first-page":"15425","DOI":"10.48550\/ARXIV.2305.15425","article-title":"Language model tokenizers introduce unfairness between languages","volume":"2023","author":"Petrov","year":"2023","journal-title":"arXiv"},{"key":"ref28","author":"Rust","year":"2021"},{"key":"ref29","author":"Sachidananda","year":"2021"},{"key":"ref30","author":"Saichyshyna","year":"2023"},{"key":"ref31","author":"Starko","year":"2023"},{"key":"ref32","author":"Su\u00e1rez","year":"2020"},{"key":"ref33","author":"Syvokon","year":"2023"},{"key":"ref34","doi-asserted-by":"publisher","first-page":"11805","DOI":"10.48550\/ARXIV.2312.11805","article-title":"Gemini: a family of highly capable multimodal models","volume":"2023","author":"Team","year":"2023","journal-title":"arXiv"},{"key":"ref35","doi-asserted-by":"publisher","first-page":"295","DOI":"10.48550\/ARXIV.2403.08295","article-title":"Gemma: open models based on Gemini research and technology","volume":"2024","author":"Team","year":"2024","journal-title":"arXiv"},{"key":"ref36","doi-asserted-by":"publisher","first-page":"118","DOI":"10.48550\/ARXIV.2408.00118","article-title":"Gemma 2: improving open language models at a practical size","volume":"2024","author":"Team","year":"2024","journal-title":"arXiv"},{"key":"ref37","year":"2025"},{"key":"ref38","doi-asserted-by":"publisher","first-page":"9288","DOI":"10.48550\/ARXIV.2307.09288","article-title":"Llama 2: open foundation and fine-tuned chat models","volume":"2023","author":"Touvron","year":"2023","journal-title":"arXiv"},{"key":"ref39","year":"2024"},{"key":"ref40","doi-asserted-by":"publisher","first-page":"671","DOI":"10.48550\/ARXIV.2407.10671","article-title":"Qwen2 Technical Report","volume":"2024","author":"Yang","year":"2024","journal-title":"arXiv"},{"key":"ref41","author":"Yenduri","year":"2023"}],"container-title":["Frontiers in Artificial Intelligence"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frai.2025.1538165\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,13]],"date-time":"2025-08-13T06:33:24Z","timestamp":1755066804000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frai.2025.1538165\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,13]]},"references-count":41,"alternative-id":["10.3389\/frai.2025.1538165"],"URL":"https:\/\/doi.org\/10.3389\/frai.2025.1538165","relation":{},"ISSN":["2624-8212"],"issn-type":[{"value":"2624-8212","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,8,13]]},"article-number":"1538165"}}