{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,6]],"date-time":"2026-05-06T06:37:40Z","timestamp":1778049460537,"version":"3.51.4"},"reference-count":47,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2024,10,18]],"date-time":"2024-10-18T00:00:00Z","timestamp":1729209600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MAKE"],"abstract":"<jats:p>Artificial Intelligence (AI) has the potential to revolutionise the medical and healthcare sectors. AI and related technologies could significantly address some supply-and-demand challenges in the healthcare system, such as medical AI assistants, chatbots and robots. This paper focuses on tailoring LLMs to medical data utilising a Retrieval-Augmented Generation (RAG) database to evaluate their performance in a computationally resource-constrained environment. Existing studies primarily focus on fine-tuning LLMs on medical data, but this paper combines RAG and fine-tuned models and compares them against base models using RAG or only fine-tuning. Open-source LLMs (Flan-T5-Large, LLaMA-2-7B, and Mistral-7B) are fine-tuned using the medical datasets Meadow-MedQA and MedMCQA. Experiments are reported for response generation and multiple-choice question answering. The latter uses two distinct methodologies: Type A, as standard question answering via direct choice selection; and Type B, as language generation and probability confidence score generation of choices available. Results in the medical domain revealed that Fine-tuning and RAG are crucial for improved performance, and that methodology Type A outperforms Type B.<\/jats:p>","DOI":"10.3390\/make6040116","type":"journal-article","created":{"date-parts":[[2024,10,22]],"date-time":"2024-10-22T07:53:43Z","timestamp":1729583623000},"page":"2355-2374","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":47,"title":["Systematic Analysis of Retrieval-Augmented Generation-Based LLMs for Medical Chatbot Applications"],"prefix":"10.3390","volume":"6","author":[{"ORCID":"https:\/\/orcid.org\/0009-0006-0662-3323","authenticated-orcid":false,"given":"Arunabh","family":"Bora","sequence":"first","affiliation":[{"name":"School of Engineering and Physical Sciences, University of Lincoln, Brayford Pool, Lincoln LN6 7TS, Lincolnshire, UK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1937-9837","authenticated-orcid":false,"given":"Heriberto","family":"Cuay\u00e1huitl","sequence":"additional","affiliation":[{"name":"School of Engineering and Physical Sciences, University of Lincoln, Brayford Pool, Lincoln LN6 7TS, Lincolnshire, UK"}]}],"member":"1968","published-online":{"date-parts":[[2024,10,18]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"189","DOI":"10.1007\/s42979-023-02551-0","article-title":"Robotics in Healthcare: A Survey","volume":"5","year":"2024","journal-title":"SN Comput. Sci."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"44","DOI":"10.1038\/s41591-018-0300-7","article-title":"High-performance medicine: The convergence of human and artificial intelligence","volume":"25","author":"Topol","year":"2019","journal-title":"Nat. Med."},{"key":"ref_3","unstructured":"Toukmaji, C., and Tee, A. (2024, January 25\u201327). Retrieval-Augmented Generation and LLM Agents for Biomimicry Design Solutions. Proceedings of the AAAI Spring Symposium Series (SSS-24), Stanford, CA, USA."},{"key":"ref_4","unstructured":"Zeng, F., Gan, W., Wang, Y., Liu, N., and Yu, P.S. (2023). Large Language Models for Robotics: A Survey. arXiv."},{"key":"ref_5","unstructured":"Vaswani, A. (2017, January 4\u20139). Attention Is All You Need. Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA."},{"key":"ref_6","unstructured":"Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., and Saulnier, L. (2023). Mistral 7B. arXiv."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Ni, J., Qu, C., Lu, J., Dai, Z., \u00c1brego, G.H., Ma, J., Zhao, V.Y., Luan, Y., Hall, K.B., and Chang, M.-W. (2021). Large Dual Encoders Are Generalizable Retrievers. arXiv.","DOI":"10.18653\/v1\/2022.emnlp-main.669"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Reimers, N., and Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv.","DOI":"10.18653\/v1\/D19-1410"},{"key":"ref_9","unstructured":"Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., and Artzi, Y. (2020). BERTScore: Evaluating Text Generation with BERT. arXiv."},{"key":"ref_10","unstructured":"Wolfe, C.R. (2024, June 07). LLaMA-2 from the Ground Up. Available online: https:\/\/cameronrwolfe.substack.com\/p\/llama-2-from-the-ground-up."},{"key":"ref_11","unstructured":"Driess, D., Xia, F., Sajjadi, M.S.M., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., and Yu, T. (2023, January 23\u201329). PaLM-E: An Embodied Multimodal Language Model. Proceedings of the 40th International Conference on Machine Learning (ICML\u201923), Honolulu, HI, USA."},{"key":"ref_12","unstructured":"Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. (2023). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv."},{"key":"ref_13","unstructured":"B\u00e9chard, P., and Ayala, O.M. (2024). Reducing hallucination in structured outputs via Retrieval-Augmented Generation. arXiv."},{"key":"ref_14","unstructured":"Banerjee, S., Agarwal, A., and Singla, S. (2024). LLMs Will Always Hallucinate, and We Need to Live with This. arXiv."},{"key":"ref_15","unstructured":"Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., and Wang, H. (2024). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv."},{"key":"ref_16","unstructured":"Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., K\u00fcttler, H., Lewis, M., Yih, W.-t., and Rockt\u00e4schel, T. (2021). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M.W., and Keutzer, K. (2021). A Survey of Quantization Methods for Efficient Neural Network Inference. arXiv.","DOI":"10.1201\/9781003162810-13"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"e188","DOI":"10.7861\/fhj.2021-0095","article-title":"Artificial intelligence in healthcare: Transforming the practice of medicine","volume":"8","author":"Bajwa","year":"2021","journal-title":"Future Healthc. J."},{"key":"ref_19","unstructured":"Pal, A., Umapathi, L.K., and Sankarasubbu, M. (2022). MedMCQA: A Large-Scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering. arXiv."},{"key":"ref_20","first-page":"1","article-title":"Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing","volume":"3","author":"Gu","year":"2022","journal-title":"ACM Trans. Health Inform."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Bedi, S., Liu, Y., Orr-Ewing, L., Dash, D., Koyejo, S., Callahan, A., Fries, J.A., Wornow, M., Swaminathan, A., and Lehmann, L.S. (2024). A Systematic Review of Testing and Evaluation of Healthcare Applications of Large Language Models (LLMs). medRxiv, 2024.04.15.24305869.","DOI":"10.1101\/2024.04.15.24305869"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Ge, J., Sun, S., Owens, J., Galvez, V., Gologorskaya, O., Lai, J.C., Pletcher, M.J., and Lai, K. (2023). Development of a Liver Disease-Specific Large Language Model Chat Interface Using Retrieval Augmented Generation. medRxiv, 2023.11.10.23298364.","DOI":"10.1101\/2023.11.10.23298364"},{"key":"ref_23","unstructured":"Ramjee, P., Sachdeva, B., Golechha, S., Kulkarni, S., Fulari, G., Murali, K., and Jain, M. (2024). CataractBot: An LLM-Powered Expert-in-the-Loop Chatbot for Cataract Patients. arXiv."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"100943","DOI":"10.1016\/j.patter.2024.100943","article-title":"Can large language models reason about medical questions?","volume":"5","author":"Hother","year":"2024","journal-title":"Patterns"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"44","DOI":"10.1109\/MIC.2020.3037151","article-title":"Chatbots as Conversational Healthcare Services","volume":"25","author":"Baez","year":"2021","journal-title":"IEEE Internet Comput."},{"key":"ref_26","unstructured":"Zhou, H., Liu, F., Gu, B., Zou, X., Huang, J., Wu, J., Li, Y., Chen, S.S., Zhou, P., and Liu, J. (2024). A Survey of Large Language Models in Medicine: Progress, Application, and Challenge. arXiv."},{"key":"ref_27","unstructured":"Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P.J. (2023). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv."},{"key":"ref_28","unstructured":"Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., and Brahma, S. (2022). Scaling Instruction-Finetuned Language Models. arXiv."},{"key":"ref_29","unstructured":"Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., and Bhosale, S. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Gao, Y., Liu, Y., Zhang, H., Li, Z., Zhu, Y., Lin, H., and Yang, M. (2020, January 8\u201313). Estimating GPU Memory Consumption of Deep Learning Models. Proceedings of the ACM, Virtual.","DOI":"10.1145\/3368089.3417050"},{"key":"ref_31","unstructured":"Jeon, H., Kim, Y., and Kim, J.-J. (2024). L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models. arXiv."},{"key":"ref_32","unstructured":"Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized LLMs. arXiv."},{"key":"ref_33","unstructured":"Xu, Y., Xie, L., Gu, X., Chen, X., Chang, H., Zhang, H., Chen, Z., Zhang, X., and Tian, Q. (2023). QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models. arXiv."},{"key":"ref_34","unstructured":"Christophe, C., Kanithi, P.K., Munjal, P., Raha, T., Hayat, N., Rajan, R., Al-Mahrooqi, A., Gupta, A., Salman, M.U., and Gosal, G. (2024). Med42\u2014Evaluating Fine-Tuning Strategies for Medical LLMs: Full-Parameter vs. Parameter-Efficient Approaches. arXiv."},{"key":"ref_35","unstructured":"Han, T., Adams, L.C., Papaioannou, J.-M., Grundmann, P., Oberhauser, T., L\u00f6ser, A., Truhn, D., and Bressem, K.K. (2023). MedAlpaca\u2014An Open-Source Collection of Medical Conversational AI Models and Training Data. arXiv."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Jin, Q., Dhingra, B., Liu, Z., Cohen, W.W., and Lu, X. (2019). PubMedQA: A Dataset for Biomedical Research Question Answering. arXiv.","DOI":"10.18653\/v1\/D19-1259"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Abacha, A.B., and Demner-Fushman, D. (2019). A Question-Entailment Approach to Question Answering. BMC Bioinform., 20.","DOI":"10.1186\/s12859-019-3119-4"},{"key":"ref_38","unstructured":"Hu, T., and Zhou, X.-H. (2024). Unveiling LLM Evaluation Focused on Metrics: Challenges and Solutions. arXiv."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002, January 6\u201312). BLEU: A Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_40","unstructured":"Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out, Association for Computational Linguistics."},{"key":"ref_41","unstructured":"Banerjee, S., and Lavie, A. (2005, January 29). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization, Ann Arbor, MI, USA."},{"key":"ref_42","unstructured":"Zhou, J. (2024). QOG: Question and Options Generation based on Language Model. arXiv."},{"key":"ref_43","unstructured":"Wu, J., Zhu, J., and Qi, Y. (2024). Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation. arXiv."},{"key":"ref_44","unstructured":"Singhal, K., Tu, T., Gottweis, J., Sayres, R., Wulczyn, E., Hou, L., Clark, K., Pfohl, S., Cole-Lewis, H., and Neal, D. (2023). Towards Expert-Level Medical Question Answering with Large Language Models. arXiv."},{"key":"ref_45","unstructured":"Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., and Dong, Z. (2023). A Survey of Large Language Models. arXiv."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Mhatre, A., Warhade, S.R., Pawar, O., Kokate, S., Jain, S., and Emmanuel, M. (2024). Leveraging LLM: Implementing an Advanced AI Chatbot for Healthcare. Int. J. Innov. Sci. Res. Technol., 9.","DOI":"10.38124\/ijisrt\/IJISRT24MAY1964"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"172","DOI":"10.1038\/s41586-023-06291-2","article-title":"Large language models encode clinical knowledge","volume":"620","author":"Singhal","year":"2023","journal-title":"Nature"}],"container-title":["Machine Learning and Knowledge Extraction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-4990\/6\/4\/116\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T16:16:12Z","timestamp":1760112972000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-4990\/6\/4\/116"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10,18]]},"references-count":47,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["make6040116"],"URL":"https:\/\/doi.org\/10.3390\/make6040116","relation":{},"ISSN":["2504-4990"],"issn-type":[{"value":"2504-4990","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,10,18]]}}}