{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,27]],"date-time":"2026-01-27T21:34:42Z","timestamp":1769549682130,"version":"3.49.0"},"reference-count":39,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2025,3,10]],"date-time":"2025-03-10T00:00:00Z","timestamp":1741564800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Software"],"abstract":"<jats:p>Software testing ensures the quality and reliability of software products, but manual test case creation is labor-intensive. With the rise of Large Language Models (LLMs), there is growing interest in using LLMs for unit test creation. However, effective assessment of LLM-generated test cases is limited by the lack of standardized benchmarks that comprehensively cover diverse programming scenarios. To address both the assessment of an LLM\u2019s test case generation ability and the lack of a dataset for evaluation, we propose the Generated Benchmark from Control-Flow Structure and Variable Usage Composition (GBCV) approach, which systematically generates programs for evaluating LLMs\u2019 test generation capabilities. By leveraging basic control-flow structures and variable usage, GBCV provides a flexible framework to create a spectrum of programs ranging from simple to complex. Because GPT-4o and GPT-3.5-Turbo are publicly accessible models that reflect how regular users interact with LLMs, we use GBCV to assess their performance. Our findings indicate that GPT-4o performs better on composite program structures, while all models effectively detect boundary values in simple conditions but face challenges with arithmetic computations. 
This study highlights the strengths and limitations of LLMs in test generation, provides a benchmark framework, and suggests directions for future improvement.<\/jats:p>","DOI":"10.3390\/software4010005","type":"journal-article","created":{"date-parts":[[2025,3,10]],"date-time":"2025-03-10T08:46:41Z","timestamp":1741596401000},"page":"5","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["A Systematic Approach for Assessing Large Language Models\u2019 Test Case Generation Capability"],"prefix":"10.3390","volume":"4","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2913-4419","authenticated-orcid":false,"given":"Hung-Fu","family":"Chang","sequence":"first","affiliation":[{"name":"R. B. Annis School of Engineering, University of Indianapolis, Indianapolis, IN 46227, USA"}]},{"given":"Mohammad","family":"Shokrolah Shirazi","sequence":"additional","affiliation":[{"name":"E. S. Witchger School of Engineering, Marian University, Indianapolis, IN 46222, USA"}]}],"member":"1968","published-online":{"date-parts":[[2025,3,10]]},"reference":[{"key":"ref_1","unstructured":"Tassey, G. (2002). The Economic Impacts of Inadequate Infrastructure for Software Testing, National Institute of Standards and Technology."},{"key":"ref_2","unstructured":"Grano, G., Scalabrino, S., Gall, H.C., and Oliveto, R. (June, January 27). An Empirical Investigation on the Readability of Manual and Generated Test Cases. Proceedings of the 26th International Conference on Program Comprehension, ICPC, Gothenburg, Sweden."},{"key":"ref_3","unstructured":"Tufano, M., Drain, D., Svyatkovskiy, A., Deng, S.K., and Sundaresan, N. (2020). Unit Test Case Generation with Transformers and Focal Context. 
arXiv, Available online: http:\/\/arxiv.org\/abs\/2009.05617."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"53","DOI":"10.1007\/s10664-023-10390-z","article-title":"Investigating the Readability of Test Code","volume":"29","author":"Winkler","year":"2024","journal-title":"Empir. Softw. Eng."},{"key":"ref_5","unstructured":"Chen, Y., Hu, Z., Zhi, C., Han, J., Deng, S., and Yin, J. (2023). ChatUniTest: A Framework for LLM-Based Test Generation. arXiv, Available online: http:\/\/arxiv.org\/abs\/2305.04764."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Siddiq, M.L., Da Silva Santos, J.C., Tanvir, R.H., Ulfat, N., Al Rifat, F., and Carvalho Lopes, V. (2024, January 18\u201321). Using Large Language Models to Generate JUnit Tests: An Empirical Study. Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, Salerno, Italy.","DOI":"10.1145\/3661167.3661216"},{"key":"ref_7","unstructured":"Daka, E., and Fraser, G. (2014, January 3\u20136). A Survey on Unit Testing Practices and Problems. Proceedings of the International Symposium on Software Reliability Engineering, Naples, Italy."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"1978","DOI":"10.1016\/j.jss.2013.02.061","article-title":"An Orchestrated Survey of Methodologies for Automated Software Test Case Generation","volume":"86","author":"Anand","year":"2013","journal-title":"J. Syst. Softw."},{"key":"ref_9","first-page":"6","article-title":"Systematic Review of Automatic Test Case Generation by UML Diagrams","volume":"1","author":"Kaur","year":"2012","journal-title":"Int. J. Eng. Res. Technol."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"110933","DOI":"10.1016\/j.jss.2021.110933","article-title":"On Introducing Automatic Test Case Generation in Practice: A Success Story and Lessons Learned","volume":"176","author":"Brunetto","year":"2021","journal-title":"J. Syst. 
Softw."},{"key":"ref_11","first-page":"1","article-title":"Structured Chain-of-Thought Prompting for Code Generation","volume":"34","author":"Li","year":"2024","journal-title":"ACM Trans. Softw. Eng. Methodol."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Li, J., Li, Y., Li, G., Jin, Z., Hao, Y., and Hu, X. (2023, January 14\u201320). SKCODER: A Sketch-Based Approach for Automatic Code Generation. Proceedings of the 2023 IEEE\/ACM 45th International Conference on Software Engineering (ICSE), Melbourne, VIC, Australia.","DOI":"10.1109\/ICSE48619.2023.00179"},{"key":"ref_13","unstructured":"Li, J., Zhao, Y., Li, Y., Li, G., and Jin, Z. (2024, November 17). AceCoder: Utilizing Existing Code to Enhance Code Generation. Available online: https:\/\/api.semanticscholar.org\/CorpusID:257901190."},{"key":"ref_14","unstructured":"Dong, Y., Jiang, X., Jin, Z., and Li, G. (2023). Self-Collaboration Code Generation via ChatGPT. arXiv."},{"key":"ref_15","unstructured":"Chen, Y., Zhang, R., Yang, K., Yasunaga, M., Wang, D., Li, Z., Metcalfe, J., Li, I., Yao, Q., and Roman, S. (2018). Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-To-SQL Task. arXiv."},{"key":"ref_16","unstructured":"Li, J., Hui, B., Qu, G., Li, B., Yang, J., Li, B., Wang, B., Qin, B., Cao, R., and Geng, R. (2023). Can LLM Already Serve as a Database Interface? A BIg Bench for Large-Scale Database Grounded Text-To-SQLs. arXiv."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Wang, W., Yang, C., Wang, Z., Huang, Y., Chu, Z., Song, D., Zhang, L., Chen, A.R., and Ma, L. (2024). TESTEVAL: Benchmarking Large Language Models for Test Case Generation. arXiv.","DOI":"10.1145\/3691620.3695529"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Anand, S., and Harrold, M.J. (2011, January 6\u201310). Heap Cloning: Enabling Dynamic Symbolic Execution of Java Programs. 
Proceedings of the 2011 26th IEEE\/ACM International Conference on Automated Software Engineering (ASE 2011), Lawrence, KS, USA.","DOI":"10.1109\/ASE.2011.6100071"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Fraser, G., and Arcuri, A. (2011, January 13\u201314). Evolutionary Generation of Whole Test Suites. Proceedings of the 2011 11th International Conference on Quality Software, Madrid, Spain.","DOI":"10.1109\/QSIC.2011.19"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Fraser, G., and Arcuri, A. (2011, January 5\u20139). EvoSuite. Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering\u2014SIGSOFT\/FSE \u201911, Szeged, Hungary.","DOI":"10.1145\/2025113.2025179"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Almasi, M.M., Hemmati, H., Fraser, G., Arcuri, A., and Benefelds, J. (2017, January 20\u201328). An Industrial Evaluation of Unit Test Generation: Finding Real Faults in a Financial Application. Proceedings of the 2017 IEEE\/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP), Buenos Aires, Argentina.","DOI":"10.1109\/ICSE-SEIP.2017.27"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Panichella, A., Panichella, S., Fraser, G., Sawant, A.A., and Hellendoorn, V.J. (2020). Replication Package of \u201cRevisiting Test Smells in Automatically Generated Tests: Limitations, Pitfalls, and Opportunities\u201d, Zenodo (CERN European Organization for Nuclear Research).","DOI":"10.1109\/ICSME46990.2020.00056"},{"key":"ref_23","unstructured":"Sch\u00e4fer, M., Nadi, S., Eghbali, A., and Tip, F. (2023). An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation. arXiv, Available online: https:\/\/arxiv.org\/abs\/2302.06527."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Yuan, Z., Lou, Y., Liu, M., Ding, S., Wang, K., Chen, Y., and Peng, X. 
(2023). No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation. arXiv.","DOI":"10.1145\/3660783"},{"key":"ref_25","unstructured":"Xie, Z., Chen, Y., Chen, Z., Deng, S., and Yin, J. (2023). ChatUniTest: A ChatGPT-Based Automated Unit Test Generation Tool. arXiv."},{"key":"ref_26","unstructured":"Vikram, V., Lemieux, C., and Padhye, R. (2023). Can Large Language Models Write Good Property-Based Tests? arXiv."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Koziolek, H., Ashiwal, V., Bandyopadhyay, S., and Chandrika, K.R. (2024). Automated Control Logic Test Case Generation Using Large Language Models. arXiv.","DOI":"10.1109\/ETFA61755.2024.10711016"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Plein, L., Ou\u00e9draogo, W.C., Klein, J., and Bissyand\u00e9, T.F. (2024). Automatic Generation of Test Cases Based on Bug Reports: A Feasibility Study with Large Language Models. arXiv.","DOI":"10.1145\/3639478.3643119"},{"key":"ref_29","unstructured":"Wang, C., Pastore, F., G\u00f6knil, A., and Briand, L.C. (2019). Automatic Generation of Acceptance Test Cases from Use Case Specifications: An NLP-Based Approach. arXiv."},{"key":"ref_30","unstructured":"Lan, W., Wang, Z., Chauhan, A., Zhu, H., Li, A., Guo, J., Zhang, S., Hang, C.-W., Lilien, J., and Hu, Y. (2023). UNITE: A Unified Benchmark for Text-To-SQL Evaluation. arXiv."},{"key":"ref_31","unstructured":"Chen, M.I.-C., Tworek, J., Jun, H., Yuan, Q., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., and Brockman, G. (2021). Evaluating Large Language Models Trained on Code. arXiv."},{"key":"ref_32","unstructured":"Liu, J., Xia, C.S., Wang, Y., and Zhang, L. (2024, May 10). Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. 
Available online: https:\/\/openreview.net\/forum?id=1qvx610Cu7."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Yan, W., Liu, H., Wang, Y., Li, Y., Chen, Q., Wang, W., Lin, T., Zhao, W., Zhu, L., and Deng, S. (2023). CodeScope: An Execution-Based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation. arXiv.","DOI":"10.18653\/v1\/2024.acl-long.301"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Watson, A.H., Wallace, D.R., and McCabe, T.J. (1996). Structured Testing: A Testing Methodology Using the Cyclomatic Complexity Metric, NIST Special Publication.","DOI":"10.6028\/NIST.SP.500-235"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"69","DOI":"10.1109\/CJECE.2003.1532511","article-title":"A New Measure of Software Complexity Based on Cognitive Weights","volume":"28","author":"Shao","year":"2003","journal-title":"Can. J. Electr. Comput. Eng."},{"key":"ref_36","first-page":"1","article-title":"A Complexity Measure Based on Cognitive Weights","volume":"1","author":"Misra","year":"2006","journal-title":"Int. J. Theor. Appl. Comput. Sci."},{"key":"ref_37","unstructured":"Wei, J.S., Wang, X., Schuurmans, D., Bosma, M., Chi, E.H., Le, Q.V., and Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3428261","article-title":"On the Unusual Effectiveness of Type-Aware Operator Mutations for Testing SMT Solvers","volume":"4","author":"Winterer","year":"2020","journal-title":"Proc. ACM Program. Lang."},{"key":"ref_39","unstructured":"OpenAI (2023). GPT-4 Technical Report. 
arXiv."}],"container-title":["Software"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2674-113X\/4\/1\/5\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T16:49:58Z","timestamp":1760028598000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2674-113X\/4\/1\/5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,10]]},"references-count":39,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,3]]}},"alternative-id":["software4010005"],"URL":"https:\/\/doi.org\/10.3390\/software4010005","relation":{},"ISSN":["2674-113X"],"issn-type":[{"value":"2674-113X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,3,10]]}}}