{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,5]],"date-time":"2026-03-05T02:25:15Z","timestamp":1772677515156,"version":"3.50.1"},"reference-count":49,"publisher":"MDPI AG","issue":"12","license":[{"start":{"date-parts":[[2025,11,23]],"date-time":"2025-11-23T00:00:00Z","timestamp":1763856000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"JST","award":["JPMJPR23P7"],"award-info":[{"award-number":["JPMJPR23P7"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BDCC"],"abstract":"<jats:p>High-performing medical Large Language Models (LLMs) typically require extensive fine-tuning with substantial computational resources, limiting accessibility for resource-constrained healthcare institutions. This study introduces a confidence-driven multi-model framework that leverages model diversity to enhance medical question answering without fine-tuning. Our framework employs a two-stage architecture: a confidence detection module assesses the primary model\u2019s certainty, and an adaptive routing mechanism directs low-confidence queries to Helper models with complementary knowledge for collaborative reasoning. We evaluate our approach using Qwen3-30B-A3B-Instruct, Phi-4 14B, and Gemma 2 12B across three medical benchmarks: MedQA, MedMCQA, and PubMedQA. Results demonstrate that our framework achieves competitive performance, with particularly strong results in PubMedQA (0.95) and MedMCQA (0.78). Ablation studies confirm that confidence-aware routing combined with multi-model collaboration substantially outperforms single-model approaches and uniform reasoning strategies. 
This work establishes that strategic model collaboration offers a practical, computationally efficient pathway to improve medical AI systems, with significant implications for democratizing access to advanced medical AI in resource-limited settings.<\/jats:p>","DOI":"10.3390\/bdcc9120299","type":"journal-article","created":{"date-parts":[[2025,11,24]],"date-time":"2025-11-24T10:24:17Z","timestamp":1763979857000},"page":"299","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["CURE: Confidence-Driven Unified Reasoning Ensemble Framework for Medical Question Answering"],"prefix":"10.3390","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-8807-020X","authenticated-orcid":false,"given":"Ziad","family":"Elshaer","sequence":"first","affiliation":[{"name":"School of Information Technology and Computer Science, Nile University, Giza 12588, Egypt"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6571-9807","authenticated-orcid":false,"given":"Essam A.","family":"Rashed","sequence":"additional","affiliation":[{"name":"Graduate School of Information Science, University of Hyogo, Kobe 650-0047, Japan"},{"name":"Advanced Medical Engineering Research Institute, University of Hyogo, Himeji 670-0836, Japan"}]}],"member":"1968","published-online":{"date-parts":[[2025,11,23]]},"reference":[{"key":"ref_1","first-page":"1877","article-title":"Language models are few-shot learners","volume":"33","author":"Brown","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_2","unstructured":"Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozi\u00e8re, B., Goyal, N., Hambro, E., and Azhar, F. (2023). Llama: Open and efficient foundation language models. 
arXiv."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"24","DOI":"10.1038\/s41591-018-0316-z","article-title":"A guide to deep learning in healthcare","volume":"25","author":"Esteva","year":"2019","journal-title":"Nat. Med."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"18","DOI":"10.1038\/s41746-018-0029-1","article-title":"Scalable and accurate deep learning with electronic health records","volume":"1","author":"Rajkomar","year":"2018","journal-title":"npj Digit. Med."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1930","DOI":"10.1038\/s41591-023-02448-8","article-title":"Large language models in medicine","volume":"29","author":"Thirunavukarasu","year":"2023","journal-title":"Nat. Med."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Jin, D., Pan, E., Oufattole, N., Weng, W.H., Fang, H., and Szolovits, P. (2021). What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams. Appl. Sci., 11.","DOI":"10.20944\/preprints202105.0498.v1"},{"key":"ref_7","unstructured":"Pal, A., Umapathi, L.K., and Sankarasubbu, M. (2022, January 7\u20138). MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering. Proceedings of the Conference on Health, Inference, and Learning, Virtual."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Jin, Q., Dhingra, B., Liu, Z., Cohen, W.W., and Lu, X. (2019). PubMedQA: A dataset for biomedical research question answering. 
arXiv.","DOI":"10.18653\/v1\/D19-1259"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"172","DOI":"10.1038\/s41586-023-06291-2","article-title":"Large language models encode clinical knowledge","volume":"620","author":"Singhal","year":"2023","journal-title":"Nature"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"24","DOI":"10.1038\/s44387-025-00021-x","article-title":"Healthcare agent: Eliciting the power of large language models for medical consultation","volume":"1","author":"Ren","year":"2025","journal-title":"npj Artif. Intell."},{"key":"ref_11","unstructured":"Chen, Z., Hernandez, M., Sait, A., Borges, G.T., Ceballos, R., Agrawal, A., Jaiswal, A., Wang, Z., Ding, R., and Agarwal, A. (2023). Meditron-70b: Scaling medical pretraining for large language models. arXiv."},{"key":"ref_12","first-page":"e40895","article-title":"Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge","volume":"15","author":"Li","year":"2023","journal-title":"Cureus"},{"key":"ref_13","unstructured":"Xiong, H., Wang, S., Zhu, Y., Zhao, Z., Liu, Y., Wang, Q., and Shen, D. (2024). DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task. arXiv."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Zhang, H., Chen, J., Jiang, F., Yu, F., Chen, Z., Chen, G., Li, J., Wu, X., Zhang, Z., and Xiao, Q. (2023, January 6\u201310). HuatuoGPT, Towards Taming Language Model to Be a Doctor. Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore.","DOI":"10.18653\/v1\/2023.findings-emnlp.725"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Jiang, D., Ren, X., and Lin, B.Y. (2023). LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion. arXiv.","DOI":"10.18653\/v1\/2023.acl-long.792"},{"key":"ref_16","unstructured":"Wang, J., Wang, J., Athiwaratkun, B., Zhang, C., and Zou, J. (2024). 
Mixture-of-Agents Enhances Large Language Model Capabilities. arXiv."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Yu, H., Yu, C., Wang, Z., Zou, D., and Qin, H. (2024, January 26\u201328). Enhancing healthcare through large language models: A study on medical question answering. Proceedings of the 2024 IEEE 6th International Conference on Power, Intelligent Computing and Systems (ICPICS), Shenyang, China.","DOI":"10.1109\/ICPICS62053.2024.10797141"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"102938","DOI":"10.1016\/j.artmed.2024.102938","article-title":"MedExpQA: Multilingual benchmarking of Large Language Models for Medical Question Answering","volume":"155","author":"Alonso","year":"2024","journal-title":"Artif. Intell. Med."},{"key":"ref_19","unstructured":"Shnitzer, T., Ou, A., Silva, M., Soule, K., Sun, Y., Solomon, J., Thompson, N., and Venkatachalam, M. (2023). Large Language Model Routing with Benchmark Datasets. arXiv."},{"key":"ref_20","unstructured":"Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Sharan, N., Parmar, N., and Zhou, D. (2022). Self-consistency improves chain of thought reasoning in language models. arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"9076","DOI":"10.1038\/s41467-025-64142-2","article-title":"LINS: A general medical Q&A framework for enhancing the quality and credibility of LLM-generated responses","volume":"16","author":"Wang","year":"2025","journal-title":"Nat. Commun."},{"key":"ref_22","unstructured":"Wu, J., Xie, K., Gu, B., Kr\u00fcger, N., Lin, K.J., and Yang, J. (2025). Why Chain of Thought Fails in Clinical Text Understanding. arXiv."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Nagar, A., Schlegel, V., Nguyen, T.-T., Li, H., Wu, Y., Binici, K., and Winkler, S. (2024). LLMs Are Not Zero-Shot Reasoners for Biomedical Information Extraction. arXiv.","DOI":"10.18653\/v1\/2025.insights-1.11"},{"key":"ref_24","unstructured":"Chen, Q., and Liu, D. 
(2023). Dynamic Strategy Chain: Dynamic Zero-Shot CoT for Long Mental Health Support Generation. arXiv."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Cheng, X., Pan, C., Zhao, M., Li, D., Liu, F., Zhang, X., Zhang, X., and Liu, Y. (2025). Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot. arXiv.","DOI":"10.18653\/v1\/2025.findings-emnlp.729"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Wan, X., Sun, R., Dai, H., Arik, S., and Pfister, T. (2023, January 9\u201314). Better Zero-Shot Reasoning with Self-Adaptive Prompting. Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada.","DOI":"10.18653\/v1\/2023.findings-acl.216"},{"key":"ref_27","first-page":"24824","article-title":"Chain-of-thought prompting elicits reasoning in large language models","volume":"35","author":"Wei","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_28","first-page":"22199","article-title":"Large language models are zero-shot reasoners","volume":"35","author":"Kojima","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"12013","DOI":"10.1007\/s00521-025-11162-0","article-title":"Mitigating exposure bias in large language model distillation: An imitation learning approach","volume":"37","author":"Pozzi","year":"2025","journal-title":"Neural Comput. Appl."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"110996","DOI":"10.1016\/j.knosys.2023.110996","article-title":"A contrastive framework for enhancing Knowledge Graph Question Answering: Alleviating exposure bias","volume":"280","author":"Du","year":"2023","journal-title":"Knowl.-Based Syst."},{"key":"ref_31","unstructured":"Schmidgall, S., Harris, C., Essien, I., Olshvang, D., Rahman, T., Kim, J.W., Ziaei, R., Eshraghian, J., Abadir, P., and Chellappa, R. (2024). Addressing cognitive bias in medical language models. 
arXiv."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Wang, H., Li, Y., Zhao, W., Zhu, H., Zhang, J., and Wu, X. (2025). GSF-LLM: Graph-Enhanced Spatio-Temporal Fusion-Based Large Language Model for Traffic Prediction. Sensors, 25.","DOI":"10.3390\/s25216698"},{"key":"ref_33","unstructured":"Nori, H., King, N., McKinney, S.M., Carignan, D., and Horvitz, E. (2023). Capabilities of gpt-4 on medical challenge problems. arXiv."},{"key":"ref_34","unstructured":"Qwen Team (2025, October 05). Qwen3-30B-A3B-Instruct-2507-FP8. Available online: https:\/\/huggingface.co\/Qwen\/Qwen3-30B-A3B-Instruct-2507-FP8."},{"key":"ref_35","unstructured":"Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R.J., Javaheripi, M., and Kauffmann, P. (2024). Phi-4 Technical Report. arXiv."},{"key":"ref_36","unstructured":"Gemma Team, and Google DeepMind (2024). Gemma 2: Improving Open Language Models at a Practical Size. arXiv."},{"key":"ref_37","unstructured":"Singhal, K., Hassankhani, A., Singhal, A., Narayanan, D., Newman, M., Chen, M., Chen, X., He, X., Zhang, Y., and Wang, X. (2023). Towards Expert-Level Medical Question Answering with Large Language Models. arXiv."},{"key":"ref_38","first-page":"101449","article-title":"Can large language models reason about medical questions?","volume":"5","author":"Hother","year":"2024","journal-title":"Cell Rep. Med."},{"key":"ref_39","unstructured":"Google DeepMind Team (2025). MedGemma Technical Report. arXiv."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Yang, H., Chen, H., Guo, H., Chen, Y., Lin, C., Hu, S., Hu, J., Wu, X., and Wang, X. (2025). LLM-MedQA: Enhancing Medical Question Answering through Case Studies in Large Language Models. 
arXiv.","DOI":"10.1109\/IJCNN64981.2025.11228647"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"e70080","DOI":"10.2196\/70080","article-title":"Large language model synergy for ensemble learning in medical question answering: Design and evaluation study","volume":"27","author":"Yang","year":"2025","journal-title":"J. Med. Internet Res."},{"key":"ref_42","unstructured":"Tang, X., Shao, D., Sohn, J., Chen, J., Zhang, J., Xiang, J., Wu, F., Zhao, Y., Wu, C., and Shi, W. (2025). MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning. arXiv."},{"key":"ref_43","unstructured":"BioLlama Team (2025, November 20). OpenBioLLM: Scaling Open-Source Medical LLMs. Available online: https:\/\/huggingface.co\/aaditya\/OpenBioLLM-Llama3-70B."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Xie, Q., Chen, Q., Chen, A., Peng, C., Hu, Y., Lin, F., Peng, X., Huang, J., Zhang, J., and Keloth, V. (2024). MeLLaMA: Foundation Large Language Models for Medical Applications. arXiv.","DOI":"10.21203\/rs.3.rs-4240043\/v1"},{"key":"ref_45","unstructured":"Zuo, Y., Qu, S., Li, Y., Chen, Z., Zhu, X., Hua, E., Zhang, K., Ding, N., and Zhou, B. (2025). MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding. arXiv."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"943","DOI":"10.1038\/s41591-024-03423-7","article-title":"Toward expert-level medical question answering with large language models","volume":"31","author":"Singhal","year":"2025","journal-title":"Nat. Med."},{"key":"ref_47","unstructured":"Cabello, L., Martin-Turrero, C., Akujuobi, U., S\u00f8gaard, A., and Bobed, C. (2024). Meg: Medical knowledge-augmented large language models for question answering. 
arXiv."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"110614","DOI":"10.1016\/j.compbiomed.2025.110614","article-title":"A comparative evaluation of chain-of-thought-based prompt engineering techniques for medical question answering","volume":"196","author":"Jeon","year":"2025","journal-title":"Comput. Biol. Med."},{"key":"ref_49","unstructured":"Srinivasan, V., Jatav, V., Chandrababu, A., and Sharma, G. (2025). On the Performance of an Explainable Language Model on PubMedQA. arXiv."}],"container-title":["Big Data and Cognitive Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-2289\/9\/12\/299\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,26]],"date-time":"2025-11-26T05:18:33Z","timestamp":1764134313000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-2289\/9\/12\/299"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,23]]},"references-count":49,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["bdcc9120299"],"URL":"https:\/\/doi.org\/10.3390\/bdcc9120299","relation":{},"ISSN":["2504-2289"],"issn-type":[{"value":"2504-2289","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,23]]}}}