{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,24]],"date-time":"2026-02-24T12:18:21Z","timestamp":1771935501713,"version":"3.50.1"},"reference-count":36,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2025,5,2]],"date-time":"2025-05-02T00:00:00Z","timestamp":1746144000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100000780","name":"European Union\u2019s Horizon 2020 Research and Innovation Program","doi-asserted-by":"publisher","award":["956501"],"award-info":[{"award-number":["956501"]}],"id":[{"id":"10.13039\/501100000780","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computers"],"abstract":"<jats:p>This paper introduces a novel framework for addressing domain adaptation challenges in large language models (LLMs), emphasising privacy-preserving synthetic data generation and efficient fine-tuning. The proposed framework employs a multi-stage approach that includes document ingestion, relevance assessment, and automated dataset creation. This process reduces the need for extensive technical expertise while safeguarding data privacy. We evaluate the framework\u2019s performance on domain-specific tasks in fields such as biobanking and public health, demonstrating that models fine-tuned using our method achieve results comparable to larger proprietary models. Crucially, these models maintain their general instruction-following capabilities, even when adapted to specialised domains, as shown through experiments with 7B and 8B parameter LLMs. Key components of the framework include continuous pre-training, supervised fine-tuning (SFT), and reinforcement learning methods such as direct preference optimisation (DPO), which together provide a flexible and configurable solution for deploying LLMs. The framework supports both local models and API-based solutions, making it scalable and accessible. By enabling privacy-preserving, domain-specific adaptation without requiring extensive expertise, this framework represents a significant step forward in the deployment of LLMs for specialised applications. The framework significantly lowers the barrier to domain adaptation for small- and medium-sized enterprises (SMEs), enabling them to utilise the power of LLMs without requiring extensive resources or technical expertise.<\/jats:p>","DOI":"10.3390\/computers14050172","type":"journal-article","created":{"date-parts":[[2025,5,2]],"date-time":"2025-05-02T07:44:58Z","timestamp":1746171898000},"page":"172","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["A Framework for Domain-Specific Dataset Creation and Adaptation of Large Language Models"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6829-0441","authenticated-orcid":false,"given":"George","family":"Balaskas","sequence":"first","affiliation":[{"name":"Institute of Informatics and Telecommunications, NCSR Demokritos, Ag. Paraskevi, 153 41 Athens, Greece"},{"name":"Department of Digital Systems, University of Piraeus, Karaoli ke Dimitriou, 185 34 Pireas, Greece"}]},{"given":"Homer","family":"Papadopoulos","sequence":"additional","affiliation":[{"name":"Institute of Informatics and Telecommunications, NCSR Demokritos, Ag. Paraskevi, 153 41 Athens, Greece"},{"name":"Syndesis Ltd., Ag. Paraskevi, 153 41 Athens, Greece"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4506-6690","authenticated-orcid":false,"given":"Dimitra","family":"Pappa","sequence":"additional","affiliation":[{"name":"Institute of Informatics and Telecommunications, NCSR Demokritos, Ag. Paraskevi, 153 41 Athens, Greece"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0287-3908","authenticated-orcid":false,"given":"Quentin","family":"Loisel","sequence":"additional","affiliation":[{"name":"School of Health and Life Sciences, Glasgow Caledonian University, Cowcaddens Rd., Glasgow G4 0BA, UK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1421-9348","authenticated-orcid":false,"given":"Sebastien","family":"Chastin","sequence":"additional","affiliation":[{"name":"School of Health and Life Sciences, Glasgow Caledonian University, Cowcaddens Rd., Glasgow G4 0BA, UK"},{"name":"Department of Movement and Sports Science, Ghent University, BE-9000 Ghent, Belgium"}]}],"member":"1968","published-online":{"date-parts":[[2025,5,2]]},"reference":[{"key":"ref_1","unstructured":"Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., Zhang, M., Wang, J., Jin, S., and Zhou, E. (2023). The rise and potential of large language model based agents: A survey. arXiv."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Zhu, W., Liu, H., Dong, Q., Xu, J., Huang, S., Kong, L., Chen, J., and Li, L. (2023). Multilingual machine translation with large language models: Empirical results and analysis. arXiv.","DOI":"10.18653\/v1\/2024.findings-naacl.176"},{"key":"ref_3","unstructured":"Meta (2024). Introducing Meta Llama 3: The Most Capable Openly Available LLM to Date, Meta. Technical Report."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Pang, B., Nijkamp, E., Kry\u015bci\u0144ski, W., Savarese, S., Zhou, Y., and Xiong, C. (2022). Long document summarization with top-down and bottom-up inference. arXiv.","DOI":"10.18653\/v1\/2023.findings-eacl.94"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1162\/tacl_a_00530","article-title":"Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering","volume":"11","author":"Siriwardhana","year":"2023","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_6","unstructured":"Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.d.L., Hendricks, L.A., Welbl, J., and Clark, A. (2022). Training compute-optimal large language models. arXiv."},{"key":"ref_7","unstructured":"Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C.C.T., Del Giorno, A., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., and Saarikivi, O. (2023). Textbooks are all you need. arXiv."},{"key":"ref_8","unstructured":"Li, Y., Bubeck, S., Eldan, R., Del Giorno, A., Gunasekar, S., and Lee, Y.T. (2023). Textbooks are all you need ii: Phi-1.5 technical report. arXiv."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Liu, Z., Zhong, A., Li, Y., Yang, L., Ju, C., Wu, Z., Ma, C., Shu, P., Chen, C., and Kim, S. (2023). Tailoring large language models to radiology: A preliminary approach to llm adaptation for a highly specialized domain. International Workshop on Machine Learning in Medical Imaging, Springer.","DOI":"10.1007\/978-3-031-45673-2_46"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Tan, Y., Zhang, Z., Li, M., Pan, F., Duan, H., Huang, Z., Deng, H., Yu, Z., Yang, C., and Shen, G. (2024). MedChatZH: A tuning LLM for traditional Chinese medicine consultations. Comput. Biol. Med., 172.","DOI":"10.1016\/j.compbiomed.2024.108290"},{"key":"ref_11","unstructured":"Cui, J., Li, Z., Yan, Y., Chen, B., and Yuan, L. (2023). Chatlaw: Open-source legal large language model with integrated external knowledge bases. arXiv."},{"key":"ref_12","first-page":"27730","article-title":"Training language models to follow instructions with human feedback","volume":"35","author":"Ouyang","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Ahmad, M.A., Yaramis, I., and Roy, T.D. (2023). Creating trustworthy llms: Dealing with hallucinations in healthcare ai. arXiv.","DOI":"10.20944\/preprints202310.1662.v1"},{"key":"ref_14","unstructured":"Zhang, T., Patil, S.G., Jain, N., Shen, S., Zaharia, M., Stoica, I., and Gonzalez, J.E. (2024). Raft: Adapting language model to domain specific rag. arXiv."},{"key":"ref_15","first-page":"9459","article-title":"Retrieval-augmented generation for knowledge-intensive nlp tasks","volume":"33","author":"Lewis","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., and Funtowicz, M. (2019). Huggingface\u2019s transformers: State-of-the-art natural language processing. arXiv.","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., and Stoica, I. (2023, January 23\u201326). Efficient memory management for large language model serving with pagedattention. Proceedings of the 29th Symposium on Operating Systems Principles, Koblenz, Germany.","DOI":"10.1145\/3600006.3613165"},{"key":"ref_18","unstructured":"Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., and Anadkat, S. (2023). Gpt-4 technical report. arXiv."},{"key":"ref_19","unstructured":"Anthropic PBC (2025, February 01). Anthropic, Claude API. Available online: https:\/\/docs.anthropic.com\/en\/release-notes\/api."},{"key":"ref_20","unstructured":"Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., and Bi, X. (2025). Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv."},{"key":"ref_21","unstructured":"Groq, Inc. (2025, February 01). Groq API. Available online: https:\/\/console.groq.com\/docs\/overview."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"e45059","DOI":"10.2196\/45059","article-title":"Establishing a Health CASCADE\u2013Curated Open-Access database to consolidate knowledge about Co-creation: Novel Artificial intelligence\u2013assisted methodology based on systematic reviews","volume":"25","author":"Agnello","year":"2023","journal-title":"J. Med. Internet Res."},{"key":"ref_23","unstructured":"Luo, Y., Yang, Z., Meng, F., Li, Y., Zhou, J., and Zhang, Y. (2023). An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv."},{"key":"ref_24","unstructured":"Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., von Werra, L., Fourrier, C., and Habib, N. (2023). Zephyr: Direct distillation of lm alignment. arXiv."},{"key":"ref_25","first-page":"53728","article-title":"Direct preference optimization: Your language model is secretly a reward model","volume":"36","author":"Rafailov","year":"2024","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_26","unstructured":"Blecher, L., Cucurull, G., Scialom, T., and Stojnic, R. (2023). Nougat: Neural optical understanding for academic documents. arXiv."},{"key":"ref_27","unstructured":"Jiang, A.Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D.S., Casas, D.d.l., Hanna, E.B., and Bressand, F. (2024). Mixtral of experts. arXiv."},{"key":"ref_28","unstructured":"Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., Casas, D.d.l., Bressand, F., Lengyel, G., Lample, G., and Saulnier, L. (2023). Mistral 7B. arXiv."},{"key":"ref_29","unstructured":"Cui, G., Yuan, L., Ding, N., Yao, G., Zhu, W., Ni, Y., Xie, G., Liu, Z., and Sun, M. (2023). Ultrafeedback: Boosting language models with high-quality feedback. arXiv."},{"key":"ref_30","unstructured":"Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2021). Lora: Low-rank adaptation of large language models. arXiv."},{"key":"ref_31","unstructured":"Lin, C.Y. (2004, January 25\u201326). Rouge: A package for automatic evaluation of summaries. Proceedings of the Workshop on Text Summarization Branches Out, Barcelona, Spain."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. (2002, January 7\u201312). Bleu: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_33","unstructured":"Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., and Artzi, Y. (2019). Bertscore: Evaluating text generation with bert. arXiv."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Zhong, W., Cui, R., Guo, Y., Liang, Y., Lu, S., Wang, Y., Saied, A., Chen, W., and Duan, N. (2023). Agieval: A human-centric benchmark for evaluating foundation models. arXiv.","DOI":"10.18653\/v1\/2024.findings-naacl.149"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A., and Jacobsen, H.A. (2013, January 22\u201327). Bigbench: Towards an industry standard benchmark for big data analytics. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA.","DOI":"10.1145\/2463676.2463712"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Lin, S., Hilton, J., and Evans, O. (2021). Truthfulqa: Measuring how models mimic human falsehoods. arXiv.","DOI":"10.18653\/v1\/2022.acl-long.229"}],"container-title":["Computers"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/5\/172\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T17:26:15Z","timestamp":1760030775000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/5\/172"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,2]]},"references-count":36,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2025,5]]}},"alternative-id":["computers14050172"],"URL":"https:\/\/doi.org\/10.3390\/computers14050172","relation":{},"ISSN":["2073-431X"],"issn-type":[{"value":"2073-431X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,5,2]]}}}