{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,25]],"date-time":"2026-02-25T17:10:03Z","timestamp":1772039403111,"version":"3.50.1"},"reference-count":63,"publisher":"Association for Computing Machinery (ACM)","issue":"FSE","funder":[{"name":"National Natural Science Foundation of China","award":["62206318, 62032025"],"award-info":[{"award-number":["62206318, 62032025"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Softw. Eng."],"published-print":{"date-parts":[[2025,6,19]]},"abstract":"<jats:p>Question answering (QA) is a fundamental task of large language models (LLMs), which requires LLMs to automatically answer human-posed questions in natural language. However, LLMs are known to distort facts and make non-factual statements (i.e., hallucinations) when dealing with QA tasks, which may affect the deployment of LLMs in real-life situations. In this work, we propose DrHall, a framework for detecting and reducing the factual hallucinations of black-box LLMs with metamorphic testing (MT). We believe that hallucinated answers are unstable. Therefore, when LLMs hallucinate, they are more likely to produce different answers if we use metamorphic testing to make LLMs re-execute the same task with different execution paths, which motivates the design of DrHall. The effectiveness of DrHall is evaluated empirically on three datasets, including a self-built dataset of natural language questions: FactHalluQA, as well as two datasets of programming questions: Refactory and LeetCode. The evaluation results confirm that DrHall can consistently outperform the state-of-the-art baselines, obtaining an average F1 score of over 0.856 for hallucination detection. For hallucination correction, DrHall can also outperform the state-of-the-art baselines, with an average hallucination correction rate of over 53%. 
We hope that our work can enhance the reliability of LLMs and provide new insights for the research of LLM hallucination mitigation.<\/jats:p>","DOI":"10.1145\/3715784","type":"journal-article","created":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T15:15:34Z","timestamp":1750346134000},"page":"1432-1453","source":"Crossref","is-referenced-by-count":4,"title":["Detecting and Reducing the Factual Hallucinations of Large Language Models with Metamorphic Testing"],"prefix":"10.1145","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7262-6219","authenticated-orcid":false,"given":"Weibin","family":"Wu","sequence":"first","affiliation":[{"name":"Sun Yat-sen University, Zhuhai, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-6800-889X","authenticated-orcid":false,"given":"Yuhang","family":"Cao","sequence":"additional","affiliation":[{"name":"Sun Yat-sen University, Zhuhai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4788-3985","authenticated-orcid":false,"given":"Ning","family":"Yi","sequence":"additional","affiliation":[{"name":"Sun Yat-sen University, Zhuhai, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-0013-1654","authenticated-orcid":false,"given":"Rongyi","family":"Ou","sequence":"additional","affiliation":[{"name":"Sun Yat-sen University, Zhuhai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7878-4330","authenticated-orcid":false,"given":"Zibin","family":"Zheng","sequence":"additional","affiliation":[{"name":"Sun Yat-sen University, Zhuhai, China"}]}],"member":"320","published-online":{"date-parts":[[2025,6,19]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Dario Amodei Chris Olah Jacob Steinhardt Paul Christiano John Schulman and Dan Man\u00e9. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565."},{"key":"e_1_2_1_2_1","unstructured":"Apple Inc.. 2023. Apple Siri. https:\/\/www.apple.com\/siri\/"},{"key":"e_1_2_1_3_1","unstructured":"Baidu Inc.. 2023. Baidu Xiaodu. https:\/\/dumall.baidu.com\/global\/"},{"key":"e_1_2_1_4_1","volume-title":"Annual Conference of the Spanish Society for Natural Language Processing. https:\/\/api.semanticscholar.org\/CorpusID:264491948","author":"L\u00f3pez-Riob\u00f3o Botana I\u00f1igo","year":"2023","unstructured":"I\u00f1igo L\u00f3pez-Riob\u00f3o Botana, Dana Gallent-Iglesias, and Sonia Gonz\u00e1lez-V\u00e1zquez. 2023. QUA4I: Question answering for the Industry 4.0 domain. An application of intelligent virtual assistants. In Annual Conference of the Spanish Society for Natural Language Processing. https:\/\/api.semanticscholar.org\/CorpusID:264491948"},{"key":"e_1_2_1_5_1","volume-title":"Language models are few-shot learners. Advances in neural information processing systems, 33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, and Amanda Askell. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33 (2020), 1877\u20131901."},{"key":"e_1_2_1_6_1","volume-title":"International Conference on Learning Representations, https:\/\/openreview.net\/forum?id=ccxD4mtkTU","author":"Chen Canyu","year":"2024","unstructured":"Canyu Chen and Kai Shu. 2024. Can LLM-generated misinformation be detected? 
International Conference on Learning Representations, https:\/\/openreview.net\/forum?id=ccxD4mtkTU"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASE51524.2021.9678670"},{"key":"e_1_2_1_9_1","unstructured":"Tsong Y Chen Shing C Cheung and Shiu Ming Yiu. 2020. Metamorphic testing: a new approach for generating next test cases. arXiv preprint arXiv:2002.12543."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3143561"},{"key":"e_1_2_1_11_1","unstructured":"I Chern Steffi Chern Shiqi Chen Weizhe Yuan Kehua Feng Chunting Zhou Junxian He Graham Neubig and Pengfei Liu. 2023. FacTool: Factuality detection in generative AI\u2013a tool augmented framework for multi-task and multi-domain scenarios. arXiv preprint arXiv:2307.13528."},{"key":"e_1_2_1_12_1","unstructured":"Christopher Clark Kenton Lee Ming-Wei Chang Tom Kwiatkowski Michael Collins and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes\/no questions. arXiv preprint arXiv:1905.10044."},{"key":"e_1_2_1_13_1","first-page":"58613","article-title":"Blurred-Dilated method for adversarial attacks","volume":"36","author":"Deng Yang","year":"2023","unstructured":"Yang Deng, Weibin Wu, Jianping Zhang, and Zibin Zheng. 2023. Blurred-Dilated method for adversarial attacks. In Advances in Neural Information Processing Systems. 36, 58613\u201358624.","journal-title":"Advances in Neural Information Processing Systems."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1"},{"key":"e_1_2_1_15_1","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing. 7413\u20137417","author":"Gondala Sashank","year":"2021","unstructured":"Sashank Gondala, Lyan Verwimp, Ernest Pusateri, Manos Tsagkias, and Christophe Van Gysel. 2021. Error-driven pruning of language models for virtual assistants. In IEEE International Conference on Acoustics, Speech and Signal Processing. 7413\u20137417."},{"key":"e_1_2_1_16_1","unstructured":"Aaron Grattafiori Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten and Alex Vaughan. 2024. The Llama 3 herd of models. arXiv e-prints arXiv\u20132407."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00454"},{"key":"e_1_2_1_18_1","unstructured":"Dan Hendrycks Nicholas Carlini John Schulman and Jacob Steinhardt. 2021. Unsolved problems in ML safety. arXiv preprint arXiv:2109.13916."},{"key":"e_1_2_1_19_1","volume-title":"Natural language question answering: The view from here. natural language engineering, 7, 4","author":"Hirschman Lynette","year":"2001","unstructured":"Lynette Hirschman and Robert Gaizauskas. 2001. Natural language question answering: The view from here. 
natural language engineering, 7, 4 (2001), 275\u2013300."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASE.2019.00044"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3703155"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3660818"},{"key":"e_1_2_1_24_1","volume-title":"Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, and Lucile Saulnier.","author":"Jiang Albert Q","year":"2023","unstructured":"Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, and Lucile Saulnier. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825."},{"key":"e_1_2_1_25_1","unstructured":"Saurav Kadavath Tom Conerly Amanda Askell Tom Henighan Dawn Drain Ethan Perez Nicholas Schiefer Zac Hatfield-Dodds Nova DasSarma and Eli Tran-Johnson. 2022. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221."},{"key":"e_1_2_1_26_1","unstructured":"Byron Kaye. 2023. Australian mayor readies world\u2019s first defamation lawsuit over ChatGPT content. https:\/\/www.reuters.com\/technology\/australian-mayor-readies-worlds-first-defamation-lawsuit-over-chatgpt-content-2023-04-05\/"},{"key":"e_1_2_1_27_1","first-page":"22199","article-title":"Large language models are zero-shot reasoners","volume":"35","author":"Kojima Takeshi","year":"2022","unstructured":"Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35 (2022), 22199\u201322213.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2013.46"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1"},{"key":"e_1_2_1_30_1","doi-asserted-by":"crossref","unstructured":"Junyu Luo Cao Xiao and Fenglong Ma. 2023. Zero-resource hallucination prevention for large language models. arXiv preprint arXiv:2309.02654.","DOI":"10.18653\/v1\/2024.findings-emnlp.204"},{"key":"e_1_2_1_31_1","volume-title":"Tsong Yueh Chen, and Hai L Vu","author":"Luu Quang-Hung","year":"2022","unstructured":"Quang-Hung Luu, Huai Liu, Tsong Yueh Chen, and Hai L Vu. 2022. A sequential metamorphic testing framework for understanding automated driving systems. arXiv preprint arXiv:2206.03075."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1"},{"key":"e_1_2_1_33_1","volume-title":"Yee Whye Teh, and Tom Rainforth","author":"Miao Ning","year":"2023","unstructured":"Ning Miao, Yee Whye Teh, and Tom Rainforth. 2023. Selfcheck: Using LLMs to zero-shot check their own step-by-step reasoning. arXiv preprint arXiv:2308.00436."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1"},{"key":"e_1_2_1_35_1","unstructured":"Ministry of Education of China. 2019. China\u2019s national college entrance examination. https:\/\/gaokao.neea.edu.cn\/html1\/category\/1509\/6212-1.htm"},{"key":"e_1_2_1_36_1","unstructured":"Niels M\u00fcndler Jingxuan He Slobodan Jenko and Martin Vechev. 2023. Self-contradictory hallucinations of large language models: Evaluation detection and mitigation. arXiv preprint arXiv:2305.15852."},{"key":"e_1_2_1_37_1","unstructured":"OpenAI Inc.. 2023. Introducing ChatGPT. 
https:\/\/openai.com\/blog\/chatgpt"},{"key":"e_1_2_1_38_1","unstructured":"Hadas Orgad Michael Toker Zorik Gekhman Roi Reichart Idan Szpektor Hadas Kotek and Yonatan Belinkov. 2024. LLMs know more than they show: On the intrinsic representation of LLM hallucinations. arXiv preprint arXiv:2410.02707."},{"key":"e_1_2_1_39_1","first-page":"27730","article-title":"Training language models to follow instructions with human feedback","volume":"35","author":"Ouyang Long","year":"2022","unstructured":"Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, and Alex Ray. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35 (2022), 27730\u201327744.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.3390\/electronics12143170"},{"key":"e_1_2_1_41_1","volume-title":"Salted: A framework for salient long-tail translation error detection. arXiv preprint arXiv:2205.09988.","author":"Raunak Vikas","year":"2022","unstructured":"Vikas Raunak, Matt Post, and Arul Menezes. 2022. Salted: A framework for salient long-tail translation error detection. arXiv preprint arXiv:2205.09988."},{"key":"e_1_2_1_42_1","volume-title":"Quantifying language models","author":"Sclar Melanie","unstructured":"Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2023. Quantifying language models\u2019 sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324."},{"key":"e_1_2_1_43_1","unstructured":"Mrinank Sharma Meg Tong Tomasz Korbak David Duvenaud Amanda Askell Samuel R Bowman Newton Cheng Esin Durmus Zac Hatfield-Dodds and Scott R Johnston. 2023. Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3551349.3556953"},{"key":"e_1_2_1_45_1","doi-asserted-by":"crossref","unstructured":"Derek Tam Anisha Mascarenhas Shiyue Zhang Sarah Kwan Mohit Bansal and Colin Raffel. 2022. Evaluating the factual consistency of large language models through summarization. arXiv preprint arXiv:2211.08412.","DOI":"10.18653\/v1\/2023.findings-acl.322"},{"key":"e_1_2_1_46_1","unstructured":"Yiming Tan Dehai Min Yu Li Wenbo Li Nan Hu Yongrui Chen and Guilin Qi. 2023. Evaluation of ChatGPT as a question answering system for answering complex questions. arXiv preprint arXiv:2303.07992."},{"key":"e_1_2_1_47_1","doi-asserted-by":"crossref","unstructured":"Yuchen Tian Weixiang Yan Qian Yang Qian Chen Wen Wang Ziyang Luo and Lei Ma. 2024. CodeHalu: Code hallucinations in LLMs driven by execution-based verification. arXiv preprint arXiv:2405.00253.","DOI":"10.1609\/aaai.v39i24.34717"},{"key":"e_1_2_1_48_1","unstructured":"Neeraj Varshney Wenlin Yao Hongming Zhang Jianshu Chen and Dong Yu. 2023. A stitch in time saves nine: Detecting and mitigating hallucinations of LLMs by validating low-confidence generation. arXiv preprint arXiv:2307.03987."},{"key":"e_1_2_1_49_1","unstructured":"Cunxiang Wang Xiaoze Liu Yuanhao Yue Xiangru Tang Tianhang Zhang Cheng Jiayang Yunzhi Yao Wenyang Gao Xuming Hu and Zehan Qi. 2023. Survey on factuality in large language models: Knowledge retrieval and domain-specificity. 
arXiv preprint arXiv:2310.07521."},{"key":"e_1_2_1_50_1","volume-title":"IEEE\/ACM International Conference on Automated Software Engineering. 1053\u20131065","author":"Wang Shuai","year":"2020","unstructured":"Shuai Wang and Zhendong Su. 2020. Metamorphic object insertion for testing object detection systems. In IEEE\/ACM International Conference on Automated Software Engineering. 1053\u20131065."},{"key":"e_1_2_1_51_1","volume-title":"Aakanksha Chowdhery, and Denny Zhou.","author":"Wang Xuezhi","year":"2022","unstructured":"Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","unstructured":"Yuxia Wang Minghan Wang Hasan Iqbal Georgi Georgiev Jiahui Geng and Preslav Nakov. 2024. OpenFactCheck: A unified framework for factuality evaluation of LLMs. Empirical Methods in Natural Language Processing: System Demonstrations 219\u2013229. https:\/\/doi.org\/10.18653\/v1\/2024.emnlp-demo.23 10.18653\/v1\/2024.emnlp-demo.23","DOI":"10.18653\/v1"},{"key":"e_1_2_1_53_1","first-page":"24824","article-title":"Chain-of-thought prompting elicits reasoning in large language models","volume":"35","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35 (2022), 24824\u201324837.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02324"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE48619.2023.00054"},{"key":"e_1_2_1_56_1","unstructured":"Caiming Xiong Victor Zhong and Richard Socher. 2016. Dynamic coattention networks for question answering. arXiv preprint arXiv:1611.01604."},{"key":"e_1_2_1_57_1","volume-title":"International Conference on Learning Representations, https:\/\/openreview.net\/forum?id=gjeQKFxFpZ","author":"Xiong Miao","year":"2024","unstructured":"Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. 2024. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. International Conference on Learning Representations, https:\/\/openreview.net\/forum?id=gjeQKFxFpZ"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2023"},{"key":"e_1_2_1_60_1","first-page":"44502","article-title":"Felm: Benchmarking factuality evaluation of large language models","volume":"36","author":"Zhao Yiran","year":"2023","unstructured":"Yiran Zhao, Jinghan Zhang, I Chern, Siyang Gao, Pengfei Liu, and Junxian He. 2023. Felm: Benchmarking factuality evaluation of large language models. Advances in Neural Information Processing Systems, 36 (2023), 44502\u201344523.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_61_1","unstructured":"Shen Zheng Jie Huang and Kevin Chen-Chuan Chang. 2023. Why does ChatGPT fall short in answering questions faithfully? arXiv preprint arXiv:2304.10513."},{"key":"e_1_2_1_62_1","unstructured":"Andy Zhou Kai Yan Michal Shlapentokh-Rothman Haohan Wang and Yu-Xiong Wang. 2023. 
Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406."},{"key":"e_1_2_1_63_1","volume-title":"Xu Chen, Yankai Lin, Ji-Rong Wen, and Jiawei Han.","author":"Zhou Kun","year":"2023","unstructured":"Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, and Jiawei Han. 2023. Don\u2019t make your LLM an evaluation benchmark cheater. arXiv preprint arXiv:2311.01964."}],"container-title":["Proceedings of the ACM on Software Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3715784","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T15:20:40Z","timestamp":1750346440000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3715784"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,19]]},"references-count":63,"journal-issue":{"issue":"FSE","published-print":{"date-parts":[[2025,6,19]]}},"alternative-id":["10.1145\/3715784"],"URL":"https:\/\/doi.org\/10.1145\/3715784","relation":{},"ISSN":["2994-970X"],"issn-type":[{"value":"2994-970X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,6,19]]}}}
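
The abstract above describes the core idea: hallucinated answers are unstable, so making an LLM re-execute the same task along different execution paths and comparing the answers can both expose a likely hallucination and suggest a corrected answer. The Python sketch below illustrates that idea only; it is a minimal consistency check written from the abstract, not the paper's actual DrHall pipeline. The `metamorphic_variants` rewrites, the caller-supplied `ask` function, the agreement `threshold`, and the exact-string answer comparison are all hypothetical stand-ins (the record does not specify DrHall's metamorphic relations or answer-matching logic).

# Illustrative sketch only, NOT the authors' DrHall implementation.
# Everything below is a hypothetical stand-in built from the abstract's
# description: unstable answers across re-executions signal hallucination.
from collections import Counter
from typing import Callable, List, Tuple

def metamorphic_variants(question: str) -> List[str]:
    # Hypothetical meaning-preserving rewrites; each should steer the model
    # down a different execution path while asking for the same fact.
    return [
        question,
        "Think step by step, then state only the final answer: " + question,
        "Answer in one short sentence: " + question,
    ]

def check_and_correct(question: str,
                      ask: Callable[[str], str],
                      threshold: float = 0.5) -> Tuple[bool, str]:
    """Flag a likely hallucination when re-executions of the same task
    disagree (answer instability), and return the majority answer as a
    tentative correction."""
    answers = [ask(v).strip().lower() for v in metamorphic_variants(question)]
    majority, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    # Exact-string matching is a crude proxy; a real system would need a
    # semantic comparison of answers, which this sketch does not attempt.
    is_hallucination = agreement < threshold
    return is_hallucination, majority

# Toy usage with a stubbed model; real use would wrap an LLM API call.
if __name__ == "__main__":
    canned = iter(["paris", "paris", "lyon"])
    flagged, best = check_and_correct("What is the capital of France?",
                                      lambda prompt: next(canned))
    print(flagged, best)  # False paris -- answers mostly agree

The design choice mirrors the abstract's claim: when the model knows the fact, different execution paths should converge on the same answer, so low agreement is treated as a hallucination signal and the majority answer doubles as the correction candidate.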