{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,3]],"date-time":"2026-03-03T13:00:03Z","timestamp":1772542803233,"version":"3.50.1"},"reference-count":44,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2025,7,14]],"date-time":"2025-07-14T00:00:00Z","timestamp":1752451200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"EK\u00d6P-24 University Excellence Scholarship Program of the Ministry for Culture and Innovation"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Software"],"abstract":"<jats:p>Benchmark results for large language models often show inconsistencies across different studies. This paper investigates the challenges of reproducing these results in automatic bugfixing using LLMs, on the HumanEvalFix benchmark. To determine the cause of the differing results in the literature, we attempted to reproduce a subset of them by evaluating 12 models in the DeepSeekCoder, CodeGemma, CodeLlama, and WizardCoder model families, in different sizes and tunings. A total of 35 unique results were reported for these models across studies, of which we successfully reproduced 12. We identified several relevant factors that influenced the results. The base models can be confused with their instruction-tuned variants, making their results better than expected. Incorrect prompt templates or generation length can decrease benchmark performance, as well as using 4-bit quantization. Using sampling instead of greedy decoding can increase the variance, especially with higher temperature values. 
We found that precision and 8-bit quantization have less influence on benchmark results.<\/jats:p>","DOI":"10.3390\/software4030017","type":"journal-article","created":{"date-parts":[[2025,7,14]],"date-time":"2025-07-14T09:56:54Z","timestamp":1752487014000},"page":"17","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Investigating Reproducibility Challenges in LLM Bugfixing on the HumanEvalFix Benchmark"],"prefix":"10.3390","volume":"4","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0458-2576","authenticated-orcid":false,"given":"Bal\u00e1zs","family":"Szalontai","sequence":"first","affiliation":[{"name":"Department of Software Technology, Faculty of Informatics, E\u00f6tv\u00f6s Lor\u00e1nd University, P\u00e1zm\u00e1ny P\u00e9ter s\u00e9t\u00e1ny 1\/C, H-1117 Budapest, Hungary"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-4572-0396","authenticated-orcid":false,"given":"Bal\u00e1zs","family":"M\u00e1rton","sequence":"additional","affiliation":[{"name":"Department of Software Technology, Faculty of Informatics, E\u00f6tv\u00f6s Lor\u00e1nd University, P\u00e1zm\u00e1ny P\u00e9ter s\u00e9t\u00e1ny 1\/C, H-1117 Budapest, Hungary"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3431-0667","authenticated-orcid":false,"given":"Bal\u00e1zs","family":"Pint\u00e9r","sequence":"additional","affiliation":[{"name":"Department of Software Technology, Faculty of Informatics, E\u00f6tv\u00f6s Lor\u00e1nd University, P\u00e1zm\u00e1ny P\u00e9ter s\u00e9t\u00e1ny 1\/C, H-1117 Budapest, Hungary"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9503-9623","authenticated-orcid":false,"given":"Tibor","family":"Gregorics","sequence":"additional","affiliation":[{"name":"Department of Software Technology, Faculty of Informatics, E\u00f6tv\u00f6s Lor\u00e1nd University, P\u00e1zm\u00e1ny P\u00e9ter s\u00e9t\u00e1ny 1\/C, H-1117 Budapest, 
Hungary"}]}],"member":"1968","published-online":{"date-parts":[[2025,7,14]]},"reference":[{"key":"ref_1","unstructured":"Muennighoff, N., Liu, Q., Zebaze, A., Zheng, Q., Hui, B., Zhuo, T.Y., Singh, S., Tang, X., Werra, L.V., and Longpre, S. (2023, January 15). OctoPack: Instruction Tuning Code Large Language Models. Proceedings of the NeurIPS 2023 Workshop on Instruction Tuning and Instruction, New Orleans, LA, USA."},{"key":"ref_2","unstructured":"Rozi\u00e8re, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., Adi, Y., Liu, J., Sauvestre, R., and Remez, T. (2024). Code Llama: Open Foundation Models for Code. arXiv."},{"key":"ref_3","unstructured":"Guo, D., Zhu, Q., Yang, D., Xie, Z., Dong, K., Zhang, W., Chen, G., Bi, X., Wu, Y., and Li, Y.K. (2024). DeepSeek-Coder: When the Large Language Model Meets Programming\u2014The Rise of Code Intelligence. arXiv."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Lin, D., Koppel, J., Chen, A., and Solar-Lezama, A. (2017, January 22\u201327). QuixBugs: A multi-lingual program repair benchmark set based on the quixey challenge. Proceedings of the Companion of the 2017 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity, New York, NY, USA.","DOI":"10.1145\/3135932.3135941"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Widyasari, R., Sim, S.Q., Lok, C., Qi, H., Phan, J., Tay, Q., Tan, C., Wee, F., Tan, J.E., and Yieh, Y. (2020, January 8\u201313). BugsInPy: A database of existing bugs in Python programs to enable controlled testing and debugging studies. Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, New York, NY, USA.","DOI":"10.1145\/3368089.3417943"},{"key":"ref_6","unstructured":"Liu, S., Chai, L., Yang, J., Shi, J., Zhu, H., Wang, L., Jin, K., Zhang, W., Zhu, H., and Guo, S. (2024). 
MdEval: Massively Multilingual Code Debugging. arXiv."},{"key":"ref_7","unstructured":"Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H.P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., and Brockman, G. (2021). Evaluating Large Language Models Trained on Code. arXiv."},{"key":"ref_8","unstructured":"Wang, S., Asilis, J., \u00d6mer, F.A., Bilgin, E.B., Liu, O., and Neiswanger, W. (2025). Tina: Tiny Reasoning Models via LoRA. arXiv."},{"key":"ref_9","unstructured":"Biderman, S., Schoelkopf, H., Sutawika, L., Gao, L., Tow, J., Abbasi, B., Aji, A.F., Ammanamanchi, P.S., Black, S., and Clive, J. (2024). Lessons from the Trenches on Reproducible Evaluation of Language Models. arXiv."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Laskar, M.T.R., Alqahtani, S., Bari, M.S., Rahman, M., Khan, M.A.M., Khan, H., Jahan, I., Bhuiyan, A., Tan, C.W., and Parvez, M.R. (2024, January 12\u201316). A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA.","DOI":"10.18653\/v1\/2024.emnlp-main.764"},{"key":"ref_11","unstructured":"Hochlehnert, A., Bhatnagar, H., Udandarao, V., Albanie, S., Prabhu, A., and Bethge, M. (2025). A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility. arXiv."},{"key":"ref_12","unstructured":"Yuan, J., Li, H., Ding, X., Xie, W., Li, Y.J., Zhao, W., Wan, K., Shi, J., Hu, X., and Liu, Z. (2025). Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning. arXiv."},{"key":"ref_13","unstructured":"Team, C., Zhao, H., Hui, J., Howland, J., Nguyen, N., Zuo, S., Hu, A., Choquette-Choo, C.A., Shen, J., and Kelley, J. (2024). CodeGemma: Open Code Models Based on Gemma. arXiv."},{"key":"ref_14","unstructured":"Luo, Z., Xu, C., Zhao, P., Sun, Q., Geng, X., Hu, W., Tao, C., Ma, J., Lin, Q., and Jiang, D. (2023). 
WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv."},{"key":"ref_15","unstructured":"Ben Allal, L., Muennighoff, N., Kumar Umapathi, L., Lipkin, B., and von Werra, L. (2024, November 05). A Framework for the Evaluation of Code Generation Models. Available online: https:\/\/github.com\/bigcode-project\/bigcode-evaluation-harness."},{"key":"ref_16","unstructured":"Cassano, F., Li, L., Sethi, A., Shinn, N., Brennan-Jones, A., Lozhkov, A., Anderson, C.J., and Guha, A. (2024, January 7\u20139). Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions. Proceedings of the Conference on Language Modelling (COLM), Philadelphia, PA, USA."},{"key":"ref_17","unstructured":"Al-Onaizan, Y., Bansal, M., and Chen, Y.N. Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing."},{"key":"ref_18","unstructured":"Dehghan, M., Wu, J.J., Fard, F.H., and Ouni, A. (2024). MergeRepair: An Exploratory Study on Merging Task-Specific Adapters in Code LLMs for Automated Program Repair. arXiv."},{"key":"ref_19","unstructured":"Campos, V. (2024). Bug Detection and Localization using Pre-trained Code Language Models. INFORMATIK 2024, Gesellschaft f\u00fcr Informatik e.V."},{"key":"ref_20","unstructured":"Jiang, Y., He, Q., Zhuang, X., and Wu, Z. (2024). Code Comparison Tuning for Code Large Language Models. arXiv."},{"key":"ref_21","unstructured":"Jiang, H., Liu, Q., Li, R., Ye, S., and Wang, S. (2024). CursorCore: Assist Programming through Aligning Anything. arXiv."},{"key":"ref_22","unstructured":"Lozhkov, A., Li, R., Allal, L.B., Cassano, F., Lamy-Poirier, J., Tazi, N., Tang, A., Pykhtar, D., Liu, J., and Wei, Y. (2024). StarCoder 2 and The Stack v2: The Next Generation. 
arXiv."},{"key":"ref_23","unstructured":"Mishra, M., Stallone, M., Zhang, G., Shen, Y., Prasad, A., Soria, A.M., Merler, M., Selvam, P., Surendran, S., and Singh, S. (2024). Granite Code Models: A Family of Open Foundation Models for Code Intelligence. arXiv."},{"key":"ref_24","unstructured":"Moon, S., Chae, H., Song, Y., Kwon, T., Kang, D., iunn Ong, K.T., won Hwang, S., and Yeo, J. (2024). Coffee: Boost Your Code LLMs by Fixing Bugs with Feedback. arXiv."},{"key":"ref_25","unstructured":"Rambow, O., Wanner, L., Apidianaki, M., Al-Khalifa, H., Eugenio, B.D., Schockaert, S., Darwish, K., and Agarwal, A. (2025, January 19\u201324). Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code. Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, Abu Dhabi, United Arab Emirates."},{"key":"ref_26","unstructured":"Shi, Y., Wang, S., Wan, C., and Gu, X. (2024). From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging. arXiv."},{"key":"ref_27","unstructured":"Singhal, M., Aggarwal, T., Awasthi, A., Natarajan, N., and Kanade, A. (2024). NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness. arXiv."},{"key":"ref_28","unstructured":"Wang, X., Li, B., Song, Y., Xu, F.F., Tang, X., Zhuge, M., Pan, J., Song, Y., Li, B., and Singh, J. (2024). OpenHands: An Open Platform for AI Software Developers as Generalist Agents. arXiv."},{"key":"ref_29","first-page":"50528","article-title":"SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering","volume":"Volume 37","author":"Globerson","year":"2024","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"ref_30","unstructured":"Ku, L.W., Martins, A., and Srikumar, V. (2024, January 11\u201316). WaveCoder: Widespread And Versatile Enhancement For Code Large Language Models By Instruction Tuning. 
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand."},{"key":"ref_31","unstructured":"Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., and Fan, A. (2024). The Llama 3 Herd of Models. arXiv."},{"key":"ref_32","unstructured":"BigScience Workshop, Scao, T.L., Fan, A., Akiki, C., Pavlick, E., Ili\u0107, S., Hesslow, D., Castagn\u00e9, R., Luccioni, A.S., and Yvon, G. (2023). BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv."},{"key":"ref_33","unstructured":"Li, R., Allal, L.B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., and Chim, J. (2023). StarCoder: May the source be with you!. arXiv."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Zheng, Q., Xia, X., Zou, X., Dong, Y., Wang, S., Xue, Y., Shen, L., Wang, Z., Wang, A., and Li, Y. (2023, January 6\u201310). CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, New York, NY, USA.","DOI":"10.1145\/3580305.3599790"},{"key":"ref_35","unstructured":"Bouamor, H., Pino, J., and Bali, K. CodeT5+: Open Code Large Language Models for Code Understanding and Generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing."},{"key":"ref_36","unstructured":"Chiruzzo, L., Ritter, A., and Wang, L. CodeRAG-Bench: Can Retrieval Augment Code Generation?. Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025."},{"key":"ref_37","unstructured":"Al-Onaizan, Y., Bansal, M., and Chen, Y.N. On Leakage of Code Generation Evaluation Datasets. Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024."},{"key":"ref_38","unstructured":"Wei, Y., Wang, Z., Liu, J., Ding, Y., and Zhang, L. 
(2024, January 21\u201327). Magicoder: Empowering code generation with OSS-INSTRUCT. Proceedings of the 41st International Conference on Machine Learning, JMLR.org, Vienna, Austria."},{"key":"ref_39","unstructured":"Ku, L.W., Martins, A., and Srikumar, V. OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement. Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024."},{"key":"ref_40","unstructured":"Lei, B., Li, Y., and Chen, Q. (2024). AutoCoder: Enhancing Code Large Language Model with AIEV-INSTRUCT. arXiv."},{"key":"ref_41","unstructured":"Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., and Lu, K. (2024). Qwen2.5-Coder Technical Report. arXiv."},{"key":"ref_42","unstructured":"Yu, Z., Zhao, Y., Cohan, A., and Zhang, X.P. (2024). HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation. arXiv."},{"key":"ref_43","unstructured":"Miao, Y., Gao, B., Quan, S., Lin, J., Zan, D., Liu, J., Yang, J., Liu, T., and Deng, Z. (2024). Aligning CodeLLMs with Direct Preference Optimization. arXiv."},{"key":"ref_44","unstructured":"Dou, S., Jia, H., Wu, S., Zheng, H., Zhou, W., Wu, M., Chai, M., Fan, J., Huang, C., and Tao, Y. (2024). What\u2019s Wrong with Your Code Generated by Large Language Models? An Extensive Study. 
arXiv."}],"container-title":["Software"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2674-113X\/4\/3\/17\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:09:32Z","timestamp":1760033372000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2674-113X\/4\/3\/17"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,14]]},"references-count":44,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2025,9]]}},"alternative-id":["software4030017"],"URL":"https:\/\/doi.org\/10.3390\/software4030017","relation":{},"ISSN":["2674-113X"],"issn-type":[{"value":"2674-113X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,14]]}}}