{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,10]],"date-time":"2026-04-10T02:12:13Z","timestamp":1775787133166,"version":"3.50.1"},"reference-count":30,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2025,5,20]],"date-time":"2025-05-20T00:00:00Z","timestamp":1747699200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Ministry of Science and Higher Education of the Republic of Kazakhstan","award":["BR24993001"],"award-info":[{"award-number":["BR24993001"]}]},{"name":"Creation of a large language model (LLM) to maintain the implementation of Kazakh language and increase the technological progress","award":["BR24993001"],"award-info":[{"award-number":["BR24993001"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BDCC"],"abstract":"<jats:p>Low-resource languages remain underserved by contemporary large language models (LLMs) because they lack sizable corpora, bespoke preprocessing tools, and the computing budgets assumed by mainstream alignment pipelines. Focusing on Kazakh, we present a 1.94B parameter LLaMA-based model that demonstrates how strong, culturally aligned performance can be achieved without massive infrastructure. The contribution is threefold. (i) Data and tokenization\u2014we compile a rigorously cleaned, mixed-domain Kazakh corpus and design a tokenizer that respects the language\u2019s agglutinative morphology, mixed-script usage, and diacritics. (ii) Training recipe\u2014the model is built in two stages: causal language modeling from scratch followed by instruction tuning. Alignment is further refined with Direct Preference Optimization (DPO), extended by contrastive and entropy-based regularization to stabilize training under sparse, noisy preference signals. Two complementary resources support this step: ChatTune-DPO, a crowd-sourced set of human preference pairs, and Pseudo-DPO, an automatically generated alternative that repurposes instruction data to reduce annotation cost. (iii) Evaluation and impact\u2014qualitative and task-specific assessments show that targeted monolingual training and the proposed DPO variant markedly improve factuality, coherence, and cultural fidelity over baseline instruction-only and multilingual counterparts. 
The model and datasets are released under open licenses, offering a reproducible blueprint for extending state-of-the-art language modeling to other under-represented languages and domains.<\/jats:p>","DOI":"10.3390\/bdcc9050137","type":"journal-article","created":{"date-parts":[[2025,5,20]],"date-time":"2025-05-20T08:41:12Z","timestamp":1747730472000},"page":"137","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":9,"title":["The Development of Small-Scale Language Models for Low-Resource Languages, with a Focus on Kazakh and Direct Preference Optimization"],"prefix":"10.3390","volume":"9","author":[{"given":"Nurgali","family":"Kadyrbek","sequence":"first","affiliation":[{"name":"Department of AI & Big Data, Faculty of Information Technologies, Al-Farabi Kazakh National University, Almaty 050040, Kazakhstan"}]},{"given":"Zhanseit","family":"Tuimebayev","sequence":"additional","affiliation":[{"name":"Al-Farabi Kazakh National University, Almaty 050040, Kazakhstan"}]},{"given":"Madina","family":"Mansurova","sequence":"additional","affiliation":[{"name":"Department of AI & Big Data, Faculty of Information Technologies, Al-Farabi Kazakh National University, Almaty 050040, Kazakhstan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8929-1574","authenticated-orcid":false,"given":"V\u00edtor","family":"Viegas","sequence":"additional","affiliation":[{"name":"Instituto de Telecomunica\u00e7\u00f5es, Universidade de Aveiro, 1049-001 Lisbon, Portugal"}]}],"member":"1968","published-online":{"date-parts":[[2025,5,20]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Kozov, V., Ivanova, B., Shoylekova, K., and Andreeva, M. (2024). Analyzing the Impact of a Structured LLM Workshop in Different Education Levels. Appl. Sci., 14.","DOI":"10.3390\/app14146280"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Wang, Q., and Li, H. (2025). On Continually Tracing Origins of LLM-Generated Text and Its Application in Detecting Cheating in Student Coursework. Big Data Cogn. Comput., 9.","DOI":"10.3390\/bdcc9030050"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Huang, D., Yan, C., Li, Q., and Peng, X. (2024). From Large Language Models to Large Multimodal Models: A Literature Review. Appl. Sci., 14.","DOI":"10.3390\/app14125068"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"268","DOI":"10.18355\/XL.2024.17.02.18","article-title":"Axiological Approach as a Factor of University Curriculum Language","volume":"17","author":"Kuznetsova","year":"2024","journal-title":"XLinguae"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Papageorgiou, E., Chronis, C., Varlamis, I., and Himeur, Y. (2024). A Survey on the Use of Large Language Models (LLMs) in Fake News. Future Internet, 16.","DOI":"10.3390\/fi16080298"},{"key":"ref_6","unstructured":"Pelofske, E., Urias, V., and Liebrock, L.M. (2024). Automated Multi-Language to English Machine Translation Using Generative Pre-Trained Transformers. arXiv."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Kamshat, A., Auyeskhan, U., Zarina, N., and Alen, S. (2024, January 3\u20134). Integration AI Techniques in Low-Resource Language: The Case of Kazakh Language. Proceedings of the 2024 IEEE AITU: Digital Generation, Astana, Kazakhstan.","DOI":"10.1109\/IEEECONF61558.2024.10585350"},{"key":"ref_8","unstructured":"Li, Z., Shi, Y., Liu, Z., Yang, F., Liu, N., and Du, M. (2024). 
Quantifying multilingual performance of large language models across languages. arXiv, Available online: https:\/\/arxiv.org\/html\/2404.11553v2."},{"key":"ref_9","unstructured":"(2024, August 18). Karde\u015f-NLU: Transfer to Low-Resource Languages with the Help of a High-Resource Cousin\u2014A Benchmark and Evaluation for Turkic Languages. Available online: https:\/\/aclanthology.org\/2024.eacl-long.100."},{"key":"ref_10","unstructured":"Ataman, D., Derin, M.O., Ivanova, S., K\u00f6ksal, A., S\u00e4lev\u00e4, J., and Zeyrek, D. (2024, September 22). Proceedings of the First Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2024), Available online: https:\/\/aclanthology.org\/2024.sigturk-1."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Ding, B., Qin, C., Zhao, R., Luo, T., Li, X., Chen, G., Xia, W., Hu, J., Luu, A.T., and Joty, S. (2024). Data Augmentation Using Large Language Models: Data Perspectives, Learning Paradigms and Challenges. arXiv.","DOI":"10.18653\/v1\/2024.findings-acl.97"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Jiang, F., Xu, Z., Niu, L., Lin, B.Y., and Poovendran, R. (2024). ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates. arXiv.","DOI":"10.1609\/aaai.v39i26.34945"},{"key":"ref_13","unstructured":"Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C.D., and Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv."},{"key":"ref_14","unstructured":"(2024, August 13). Farabi-Lab\/Kazakh Text for Language Modeling\u2014Normalized Dataset; Hugging Face, 2024. Available online: https:\/\/huggingface.co\/datasets\/farabi-lab\/kaz-text-for-lm-normalized."},{"key":"ref_15","unstructured":"(2024, August 14). Farabi-Lab\/Kazakh Wikipedia Dumps Cleaned Dataset; Hugging Face, 2024. Available online: https:\/\/huggingface.co\/datasets\/farabi-lab\/wiki_kk."},{"key":"ref_16","unstructured":"Nurgali, K. (2025, February 09). llama-1.9B-kaz-instruct Hugging Face, 2025. Available online: https:\/\/huggingface.co\/nur-dev\/llama-1.9B-kaz-instruct."},{"key":"ref_17","unstructured":"(2025, January 19). Farabi Lab\/KazNU-Lib-OCR-for-LM Dataset; Hugging Face: 2024. Available online: https:\/\/huggingface.co\/datasets\/farabi-lab\/kaznu-lib-ocr-for-lm."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1446","DOI":"10.3390\/ai5030069","article-title":"Prompt Engineering for Knowledge Creation: Using Chain-of-Thought to Support Students\u2019 Improvable Ideas","volume":"5","author":"Lee","year":"2024","journal-title":"AI"},{"key":"ref_19","unstructured":"Nurgali, K. (2025, February 09). ChatTune-DPO Hugging Face, 2025. Available online: https:\/\/huggingface.co\/datasets\/farabi-lab\/user-feedback-dpo."},{"key":"ref_20","unstructured":"Nurgali, K. (2025, January 12). Instruct-KZ-RL Hugging Face, 2025. Available online: https:\/\/huggingface.co\/datasets\/nur-dev\/kaz-instruct-rl."},{"key":"ref_21","unstructured":"Luo, J., Luo, X., Chen, X., Xiao, Z., Ju, W., and Zhang, M. (2024). Semi-supervised Fine-tuning for Large Language Models. arXiv."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"1377","DOI":"10.3390\/ai5030066","article-title":"Optimizing Curriculum Vitae Concordance: A Comparative Examination of Classical Machine Learning Algorithms and Large Language Model Architectures","volume":"5","author":"Maree","year":"2024","journal-title":"AI"},{"key":"ref_23","unstructured":"Nurgali, K. (2025, February 10). 
llama-1.9B-kaz; Hugging Face, 2025. Available online: https:\/\/huggingface.co\/nur-dev\/llama-1.9B-kaz."},{"key":"ref_24","unstructured":"Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. Computation and Language 2021. arXiv."},{"key":"ref_25","unstructured":"Feng, D., Qin, B., Huang, C., Zhang, Z., and Lei, W. (2024). Towards Analyzing and Understanding the Limitations of DPO: A Theoretical Perspective. arXiv."},{"key":"ref_26","unstructured":"(2025, February 16). AI Forever, Kazakh mGPT 1.3B, Hugging Face, 2024. Available online: https:\/\/huggingface.co\/ai-forever\/mGPT-1.3B-kazakh."},{"key":"ref_27","unstructured":"(2025, February 16). ISSAI, LLama-3.1-KazLLM-1.0-8B, Hugging Face, 2024. Available online: https:\/\/huggingface.co\/issai\/LLama-3.1-KazLLM-1.0-8B."},{"key":"ref_28","unstructured":"Kadyrbek, N. (2025, April 30). QThink-Task: A Task-Level Benchmark for Evaluating Kazakh Language Models; Hugging Face, 2025. Available online: https:\/\/huggingface.co\/datasets\/nur-dev\/QThink-Task."},{"key":"ref_29","unstructured":"Dam, S.K., Hong, C.S., Qiao, Y., and Zhang, C. (2024). A Complete Survey on LLM-based AI Chatbots. arXiv."},{"key":"ref_30","unstructured":"Kadyrbek, N. (2025, April 30). Raw Text for CLM V1 (Farabi-Lab\/Raw-Text-for-Clm-V1); Hugging Face: 2024. Available online: https:\/\/huggingface.co\/datasets\/farabi-lab\/raw-text-for-clm-v1."}],"container-title":["Big Data and Cognitive Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-2289\/9\/5\/137\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T17:35:45Z","timestamp":1760031345000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-2289\/9\/5\/137"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,20]]},"references-count":30,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2025,5]]}},"alternative-id":["bdcc9050137"],"URL":"https:\/\/doi.org\/10.3390\/bdcc9050137","relation":{},"ISSN":["2504-2289"],"issn-type":[{"value":"2504-2289","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,5,20]]}}}
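The object above is a standard Crossref work record, so it can be re-fetched or kept up to date programmatically. Below is a minimal Python sketch of retrieving it from the public Crossref REST API; the requests dependency and the specific fields printed are illustrative choices, not part of the record itself.

import requests

DOI = "10.3390/bdcc9050137"

# Fetch the work record; the response envelope matches the record above:
# {"status": "ok", "message-type": "work", ..., "message": {...}}.
resp = requests.get(f"https://api.crossref.org/works/{DOI}", timeout=30)
resp.raise_for_status()
work = resp.json()["message"]

print(work["title"][0])                # article title
print(work["container-title"][0])      # "Big Data and Cognitive Computing"
print(work["is-referenced-by-count"])  # citation count at indexing time

# Walk the 30 deposited references; entries may be structured (with a DOI)
# or unstructured free text, so fall back accordingly.
for ref in work.get("reference", []):
    print(ref["key"], ref.get("DOI") or ref.get("unstructured", "")[:60])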
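The abstract describes alignment via Direct Preference Optimization extended with contrastive and entropy-based regularization. The exact regularizers are not given in this record, so the following is only a sketch of the standard DPO objective from Rafailov et al. (ref_13), with a hypothetical entropy bonus standing in for the entropy-based term; the contrastive term is omitted, and all function and parameter names are assumptions for illustration.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             beta=0.1, entropy_coef=0.01, policy_entropy=None):
    """Standard DPO over summed sequence log-probs, one scalar per preference pair."""
    # Implicit rewards: beta-scaled log-ratios of policy vs. frozen reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid(margin): push the chosen response above the rejected one.
    losses = -F.logsigmoid(chosen_rewards - rejected_rewards)
    # Hypothetical entropy bonus (the paper's actual entropy-based regularizer
    # is not specified here): discourage collapse onto low-entropy outputs.
    if policy_entropy is not None:
        losses = losses - entropy_coef * policy_entropy
    return losses.mean()

# Toy usage with per-pair summed log-probabilities:
lp = torch.tensor([-12.3, -15.1])  # policy log p(chosen)
lr = torch.tensor([-14.0, -15.8])  # policy log p(rejected)
rp = torch.tensor([-13.0, -15.5])  # reference log p(chosen)
rr = torch.tensor([-13.5, -15.2])  # reference log p(rejected)
print(dpo_loss(lp, lr, rp, rr))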