{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,12]],"date-time":"2026-02-12T06:17:44Z","timestamp":1770877064240,"version":"3.50.1"},"reference-count":62,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2026,2,1]],"date-time":"2026-02-01T00:00:00Z","timestamp":1769904000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Given the knowledge-intensive and rapidly expanding nature of medical field, accurately synthesizing and interpreting findings remain a major challenge for clinicians and medical students. Although Large Language Models (LLMs) have advanced automated summarization or generated responses, their deployment is limited by hallucinations, outdated knowledge, and insufficient domain adaptation. Retrieval-Augmented Generation (RAG) addresses these issues by grounding LLMs in external knowledge bases. However, as the document corpus scales, maintaining RAG accuracy becomes increasingly difficult, making retrievers critical for contextual relevance. In this paper, we examined the efficiency of a modular RAG framework with a hybrid retrieval strategy that combines sparse retrieval (BM25) and dense retrieval (MedCPT) to extract the most relevant documents from the corpus, thereby providing contextual grounding for the LLM to improve medical responses. Evaluation was conducted on three benchmark healthcare datasets: PubMedQA, MedMCQA, and MedQA-US, using two LLMs, GPT-4o and BioGPT. Performance was assessed using retrieval metrics (context precision, context recall, F1-score) and generation metrics (BERTScore, RAG Assessment Score). The hybrid retriever achieved 92.14% recall, 74.36% precision, and an F1-score of 82.30%. 
GPT-4o with hybrid retrieval reached 89.4% faithfulness, 82.7% answer relevancy, and an F1BERT of 88.0% on PubMedQA. Results demonstrated that hybrid retrieval within a modular architecture substantially improves retrieval effectiveness and response quality. The proposed work offers a scalable, generalizable solution for high-stakes healthcare applications, supporting flexible retriever integration and robust evaluation to advance transparent QA systems.<\/jats:p>","DOI":"10.3390\/info17020133","type":"journal-article","created":{"date-parts":[[2026,2,2]],"date-time":"2026-02-02T09:00:33Z","timestamp":1770022833000},"page":"133","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Enhancing Medical Question Answering with LLMs via a Hybrid Retrieval-Augmented Generation Framework"],"prefix":"10.3390","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0009-0006-5748-2372","authenticated-orcid":false,"given":"Bushra","family":"Aljohani","sequence":"first","affiliation":[{"name":"Department of Computer Science, College of Computer Science and Engineering, Taibah University, Madinah 42353, Saudi Arabia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2630-5972","authenticated-orcid":false,"given":"Tawfeeq","family":"Alsanoosy","sequence":"additional","affiliation":[{"name":"Department of Computer Science, College of Computer Science and Engineering, Taibah University, Madinah 42353, Saudi Arabia"}]}],"member":"1968","published-online":{"date-parts":[[2026,2,1]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"22","DOI":"10.1007\/s10916-024-02045-3","article-title":"The Breakthrough of Large Language Models Release for Medical Applications: 1-Year Timeline and Perspectives","volume":"48","author":"Cascella","year":"2024","journal-title":"J. Med. 
Syst."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"255","DOI":"10.1002\/hcs2.61","article-title":"Large language models in health care: Development, applications, and challenges","volume":"2","author":"Yang","year":"2023","journal-title":"Health Care Sci."},{"key":"ref_3","unstructured":"Dou, C., Zhang, Y., Chen, Y., Jin, Z., Jiao, W., Zhao, H., and Huang, Y. (2024, January 20\u201325). Detection, Diagnosis, and Explanation: A Benchmark for Chinese Medical Hallucination Evaluation. Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), Torino, Italy."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1801","DOI":"10.1093\/jamia\/ocae202","article-title":"Large language models in biomedicine and health: Current research landscape and future directions","volume":"31","author":"Lu","year":"2024","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Yu, H., Gan, A., Zhang, K., Tong, S., Liu, Q., and Liu, Z. (2024, January 9\u201311). Evaluation of retrieval-augmented generation: A survey. Proceedings of the CCF Conference on Big Data, Qingdao, China.","DOI":"10.1007\/978-981-96-1024-2_8"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Kim, Y., Jeong, H., Chen, S., Li, S.S., Park, C., Lu, M., and Breazeal, C. (2025). Medical hallucinations in foundation models and their impact on healthcare. arXiv.","DOI":"10.1101\/2025.02.28.25323115"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Omar, M., Sorin, V., Collins, J.D., Reich, D., Freeman, R., Gavin, N., and Klang, E. (2025). Large Language Models Are Highly Vulnerable to Adversarial Hallucination Attacks in Clinical Decision Support: A Multi-Model Assurance Analysis. medRxiv.","DOI":"10.1101\/2025.03.18.25324184"},{"key":"ref_8","unstructured":"Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., and Wang, H. (2023). 
Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv."},{"key":"ref_9","unstructured":"Ilin, I. (2025, May 02). Advanced RAG Techniques: An Illustrated Overview. Towards AI. Available online: https:\/\/pub.towardsai.net\/advanced-rag-techniques-an-illustrated-overview-04d193d8fec6."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Fan, W., Ding, Y., Ning, L., Wang, S., Li, H., Yin, D., and Li, Q. (2024, January 25\u201329). A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain.","DOI":"10.1145\/3637528.3671470"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"9","DOI":"10.1007\/s44336-024-00009-2","article-title":"A survey on LLM-based multi-agent systems: Workflow, infrastructure, and challenges","volume":"1","author":"Li","year":"2024","journal-title":"Vicinagearth"},{"key":"ref_12","unstructured":"Raja, M., and Yuvaraajan, E. (2024, January 24\u201326). A rag-based medical assistant especially for infectious diseases. Proceedings of the 2024 International Conference on Inventive Computation Technologies (ICICT), Lalitpur, Nepal."},{"key":"ref_13","unstructured":"Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., K\u00fcttler, H., Lewis, M., Yih, W.-T., and Rockt\u00e4schel, T. (2020, January 6\u201312). Retrieval-augmented generation for knowledge-intensive NLP tasks. Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS 2020), Online. Article 793."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Glass, M., Rossiello, G., Chowdhury, M.F.M., Naik, A.R., Cai, P., and Gliozzo, A. (2022). Re2G: Retrieve, rerank, generate. 
arXiv.","DOI":"10.18653\/v1\/2022.naacl-main.194"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"104662","DOI":"10.1016\/j.jbi.2024.104662","article-title":"Applying generative AI with retrieval augmented generation to summarize and extract key clinical information from electronic health records","volume":"156","author":"Alkhalaf","year":"2024","journal-title":"J. Biomed. Inform."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"i119","DOI":"10.1093\/bioinformatics\/btae238","article-title":"Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models","volume":"40","author":"Jeong","year":"2024","journal-title":"Bioinformatics"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Taleb, I., Navaz, A.N., and Serhani, M.A. (2024). Leveraging Large Language Models for Enhancing Literature-Based Discovery. Big Data Cogn. Comput., 8.","DOI":"10.3390\/bdcc8110146"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Neha, F., Bhati, D., and Shukla, D.K. (2025). Retrieval-Augmented Generation (RAG) in Healthcare: A Comprehensive Review. AI, 6.","DOI":"10.20944\/preprints202508.1022.v1"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Bunnell, D.J., Bondy, M.J., Fromtling, L.M., Ludeman, E., and Gourab, K. (2025). Bridging AI and Healthcare: A Scoping Review of Retrieval-Augmented Generation\u2014Ethics, Bias, Transparency, Improvements, and Applications. medRxiv.","DOI":"10.1101\/2025.04.01.25325033"},{"key":"ref_20","unstructured":"Garg, M., Wang, L.C., Ghanchi, B., Dumpala, S., Kakde, S., and Chen, Y.C. (2025). Biomedical Literature Q&A System Using Retrieval-Augmented Generation (RAG). arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Huang, J., Wang, M., Cui, Y., Liu, J., Chen, L., Wang, T., and Wu, J. (2024). 
Layered Query Retrieval: An Adaptive Framework for Retrieval-Augmented Generation in Complex Question Answering for Large Language Models. Appl. Sci., 14.","DOI":"10.3390\/app142311014"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Xu, K., Zhang, K., Li, J., Huang, W., and Wang, Y. (2024). CRP-RAG: A Retrieval-Augmented Generation Framework for Supporting Complex Logical Reasoning and Knowledge Planning. Electronics, 14.","DOI":"10.20944\/preprints202411.1648.v1"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Bornea, A.L., Ayed, F., Domenico, A.D., Piovesan, N., and Maatouk, A. (2024). Telco-RAG: Navigating the challenges of retrieval-augmented language models for telecommunications. arXiv.","DOI":"10.1109\/GLOBECOM52923.2024.10901158"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Lozano, A., Fleming, S.L., Chiang, C.C., and Shah, N. (2024, January 3\u20137). Clinfo.ai: An open-source retrieval-augmented large language model system for answering medical questions using scientific literature. Proceedings of the Pacific Symposium on Biocomputing 2024, Honolulu, HI, USA.","DOI":"10.1142\/9789811286421_0002"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"btad651","DOI":"10.1093\/bioinformatics\/btad651","article-title":"MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval","volume":"39","author":"Jin","year":"2023","journal-title":"Bioinformatics"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"453","DOI":"10.1162\/tacl_a_00276","article-title":"Natural questions: A benchmark for question answering research","volume":"7","author":"Kwiatkowski","year":"2019","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Joshi, M., Choi, E., Weld, D.S., and Zettlemoyer, L. (2017). Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. 
arXiv.","DOI":"10.18653\/v1\/P17-1147"},{"key":"ref_28","unstructured":"Bajaj, P., Campos, D., Craswell, N., Deng, L., Gao, J., Liu, X., Majumder, R., McNamara, A., Mitra, B., and Nguyen, T. (2016). Ms marco: A human generated machine reading comprehension dataset. arXiv."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"1264","DOI":"10.1177\/19322968241253568","article-title":"Building Trustworthy Generative Artificial Intelligence for Diabetes Care and Limb Preservation: A Medical Knowledge Extraction Case","volume":"19","author":"Mashatian","year":"2024","journal-title":"J. Diabetes Sci. Technol."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"599","DOI":"10.1177\/0145721707305880","article-title":"National standards for diabetes self-management education","volume":"33","author":"Funnell","year":"2007","journal-title":"Diabetes Educ."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1162\/tacl_a_00530","article-title":"Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question Answering","volume":"11","author":"Siriwardhana","year":"2023","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Zhou, Q., Liu, C., Duan, Y., Sun, K., Li, Y., Kan, H., and Hu, J. (2024). GastroBot: A Chinese gastrointestinal disease chatbot based on the retrieval-augmented generation. Front. Med., 11.","DOI":"10.3389\/fmed.2024.1392555"},{"key":"ref_33","unstructured":"Guinet, G., Omidvar-Tehrani, B., Deoras, A., and Callot, L. (2024). Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation. arXiv."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"102938","DOI":"10.1016\/j.artmed.2024.102938","article-title":"MedExpQA: Multilingual benchmarking of Large Language Models for Medical Question Answering","volume":"155","author":"Alonso","year":"2024","journal-title":"Artif. Intell. 
Med."},{"key":"ref_35","unstructured":"Gao, Y., Xiong, Y., Wang, M., and Wang, H. (2024). Modular rag: Transforming rag systems into lego-like reconfigurable frameworks. arXiv."},{"key":"ref_36","first-page":"776","article-title":"Retrieval-Augmented Generation Approach: Document Question Answering using Large Language Model","volume":"15","author":"Muludi","year":"2024","journal-title":"Int. J. Adv. Comput. Sci. Appl."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"28191","DOI":"10.1007\/s00521-025-11666-9","article-title":"A survey on retrieval-augmentation generation (RAG) models for healthcare applications","volume":"37","author":"Saad","year":"2025","journal-title":"Neural Comput. Appl."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"e80557","DOI":"10.2196\/80557","article-title":"Improving Large Language Model Applications in the Medical and Nursing Domains with Retrieval-Augmented Generation: Scoping Review","volume":"27","author":"Miao","year":"2025","journal-title":"J. Med. Internet Res."},{"key":"ref_39","unstructured":"Gupta, S., Ranjan, R., and Singh, S.N. (2024). A comprehensive survey of retrieval-augmented generation (rag): Evolution, current landscape and future directions. arXiv."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Amugongo, L.M., Mascheroni, P., Brooks, S., Doering, S., and Seidel, J. (2025). Retrieval augmented generation for large language models in healthcare: A systematic review. PLoS Digit. Health, 4.","DOI":"10.1371\/journal.pdig.0000877"},{"key":"ref_41","unstructured":"Jin, M., Yu, Q., Shu, D., Zhang, C., Fan, L., Hua, W., and Zhang, Y. (2024). Health-LLM: Personalized Retrieval-Augmented Disease Prediction System. 
arXiv."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"2","DOI":"10.1038\/s44401-024-00004-1","article-title":"Retrieval-augmented generation for generative artificial intelligence in health care","volume":"2","author":"Yang","year":"2025","journal-title":"npj Health Syst."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Sawarkar, K., Mangal, A., and Solanki, S.R. (2024, January 7\u20139). Blended rag: Improving rag (retriever-augmented generation) accuracy with semantic search and hybrid query-based retrievers. Proceedings of the 2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), San Jose, CA, USA.","DOI":"10.1109\/MIPR62202.2024.00031"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Jin, Q., Dhingra, B., Liu, Z., Cohen, W.W., Ai, G., and Lu, X. (2019, January 3\u20137). PubMedQA: A Dataset for Biomedical Research Question Answering. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China.","DOI":"10.18653\/v1\/D19-1259"},{"key":"ref_45","unstructured":"Pal, A., Umapathi, L.K., and Sankarasubbu, M. (2022, January 7\u20138). Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. Proceedings of the Conference on Health, Inference, and Learning, Virtual."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Jin, D., Pan, E., Oufattole, N., Weng, W., Fang, H., and Szolovits, P. (2021). What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Appl. Sci., 11.","DOI":"10.20944\/preprints202105.0498.v1"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"333","DOI":"10.1561\/1500000019","article-title":"The probabilistic relevance framework: BM25 and beyond","volume":"3","author":"Robertson","year":"2009","journal-title":"Found. Trends Inf. 
Retr."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"bbac409","DOI":"10.1093\/bib\/bbac409","article-title":"BioGPT: Generative pre-trained transformer for biomedical text generation and mining","volume":"23","author":"Luo","year":"2022","journal-title":"Brief Bioinf."},{"key":"ref_49","unstructured":"Hosseini, P., Sin, J.M., Ren, B., Thomas, B.G., Nouri, E., Farahanchi, A., and Hassanpour, S. (2024). A benchmark for long-form medical question answering. arXiv."},{"key":"ref_50","unstructured":"Tang, X., Shao, D., Sohn, J., Chen, J., Zhang, J., Xiang, J., Wu, F., Zhao, Y., Wu, C., and Shi, W. (2025). Medagentsbench: Benchmarking thinking models and agent frameworks for complex medical reasoning. arXiv."},{"key":"ref_51","unstructured":"Ammar, A., Koubaa, A., Nacar, O., and Boulila, W. (2025). Optimizing Retrieval-Augmented Generation: Analysis of Hyperparameter Impact on Performance and Efficiency. arXiv."},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Es, S., James, J., Espinosa-Anke, L., and Schockaert, S. (2024, January 17\u201322). RAGAS: Automated Evaluation of Retrieval Augmented Generation. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, St. Julian\u2019s, Malta.","DOI":"10.18653\/v1\/2024.eacl-demo.16"},{"key":"ref_53","unstructured":"Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., and Artzi, Y. (2019). Bertscore: Evaluating text generation with bert. arXiv."},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"209","DOI":"10.1080\/01621459.1927.10502953","article-title":"Probable inference, the law of succession, and statistical inference","volume":"22","author":"Wilson","year":"1927","journal-title":"J. Am. Stat. 
Assoc."},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3571730","article-title":"Survey of hallucination in natural language generation","volume":"55","author":"Ji","year":"2023","journal-title":"ACM Comput. Surv."},{"key":"ref_56","first-page":"72888","article-title":"Repetition in repetition out: Towards understanding neural text degeneration from the data perspective","volume":"36","author":"Li","year":"2023","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Xiong, G., Jin, Q., Lu, Z., and Zhang, A. (2024, January 11\u201316). Benchmarking retrieval-augmented generation for medicine. Proceedings of the Findings of the Association for Computational Linguistics ACL 2024, Bangkok, Thailand.","DOI":"10.18653\/v1\/2024.findings-acl.372"},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Xie, Q., Schenck, E.J., Yang, H.S., Chen, Y., Peng, Y., and Wang, F. (2023). Faithful AI in medicine: A systematic review with large language models and beyond. medRxiv.","DOI":"10.21203\/rs.3.rs-3661764\/v1"},{"key":"ref_59","doi-asserted-by":"crossref","first-page":"1964","DOI":"10.1093\/jamia\/ocae131","article-title":"Reasoning with large language models for medical question answering","volume":"31","author":"Lucas","year":"2024","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"ref_60","doi-asserted-by":"crossref","first-page":"295","DOI":"10.1038\/s41746-024-01283-6","article-title":"Evaluation and mitigation of cognitive biases in medical language models","volume":"7","author":"Schmidgall","year":"2024","journal-title":"npj Digit. Med."},{"key":"ref_61","unstructured":"Guo, C., Pleiss, G., Sun, Y., and Weinberger, K.Q. (2017, January 6\u201311). On calibration of modern neural networks. Proceedings of the International Conference on Machine Learning, Sydney, Australia."},{"key":"ref_62","unstructured":"Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. (2019). 
The curious case of neural text degeneration. arXiv."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/17\/2\/133\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,12]],"date-time":"2026-02-12T05:24:55Z","timestamp":1770873895000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/17\/2\/133"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,1]]},"references-count":62,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2026,2]]}},"alternative-id":["info17020133"],"URL":"https:\/\/doi.org\/10.3390\/info17020133","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,1]]}}}