{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,23]],"date-time":"2025-10-23T05:47:09Z","timestamp":1761198429152,"version":"build-2065373602"},"reference-count":29,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2025,2,2]],"date-time":"2025-02-02T00:00:00Z","timestamp":1738454400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"OpenAI Researcher Access Program"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Software"],"abstract":"<jats:p>This paper explores a modular pipeline architecture that integrates ChatGPT, a Large Language Model (LLM), to automate the detection and refactoring of data clumps\u2014a prevalent type of code smell that complicates software maintainability. Data clumps refer to clusters of code that are often repeated and should ideally be refactored to improve code quality. The pipeline leverages ChatGPT\u2019s capabilities to understand context and generate structured outputs, making it suitable for addressing complex software refactoring tasks. Through systematic experimentation, our study not only addresses the research questions outlined but also demonstrates that the pipeline can accurately identify data clumps, particularly excelling in cases that require semantic understanding\u2014where localized clumps are embedded within larger codebases. While the solution significantly enhances the refactoring workflow, facilitating the management of distributed clumps across multiple files, it also presents challenges such as occasional compiler errors and high computational costs. Feedback from developers underscores the usefulness of LLMs in software development but also highlights the essential role of human oversight to correct inaccuracies. 
These findings demonstrate the pipeline\u2019s potential to enhance software maintainability, offering a scalable and efficient solution for addressing code smells in real-world projects, and contributing to the broader goal of enhancing software maintainability in large-scale projects.<\/jats:p>","DOI":"10.3390\/software4010003","type":"journal-article","created":{"date-parts":[[2025,2,3]],"date-time":"2025-02-03T08:47:57Z","timestamp":1738572477000},"page":"3","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["The Scalable Detection and Resolution of Data Clumps Using a Modular Pipeline with ChatGPT"],"prefix":"10.3390","volume":"4","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0474-8214","authenticated-orcid":false,"given":"Nils","family":"Baumgartner","sequence":"first","affiliation":[{"name":"Research Group Software Engineering, Institute of Computer Science, Department of Mathematics and Computer Science, University of Osnabr\u00fcck, 49074 Osnabr\u00fcck, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1765-3695","authenticated-orcid":false,"given":"Padma","family":"Iyenghar","sequence":"additional","affiliation":[{"name":"Research Group Software Engineering, Institute of Computer Science, Department of Mathematics and Computer Science, University of Osnabr\u00fcck, 49074 Osnabr\u00fcck, Germany"},{"name":"Innotec GmbH, Hornbergstrasse 45, 70794 Filderstadt, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-0379-6616","authenticated-orcid":false,"given":"Timo","family":"Schoemaker","sequence":"additional","affiliation":[{"name":"Research Group Software Engineering, Institute of Computer Science, Department of Mathematics and Computer Science, University of Osnabr\u00fcck, 49074 Osnabr\u00fcck, 
Germany"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-8225-7261","authenticated-orcid":false,"given":"Elke","family":"Pulverm\u00fcller","sequence":"additional","affiliation":[{"name":"Research Group Software Engineering, Institute of Computer Science, Department of Mathematics and Computer Science, University of Osnabr\u00fcck, 49074 Osnabr\u00fcck, Germany"}]}],"member":"1968","published-online":{"date-parts":[[2025,2,2]]},"reference":[{"key":"ref_1","unstructured":"Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., and Anadkat, S. (2023). GPT-4 Technical Report. arXiv."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"141","DOI":"10.1007\/s10515-012-0114-7","article-title":"A tool environment for quality assurance based on the Eclipse Modeling Framework","volume":"20","author":"Arendt","year":"2013","journal-title":"Autom. Softw. Eng."},{"key":"ref_3","unstructured":"Kryvinska, N., Gregus, M., and Fedushko, S. (2023). Code Smells: A Comprehensive Online Catalog and Taxonomy. Developments in Information and Knowledge Management Systems for Business Applications, Springer."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Baumgartner, N., and Pulverm\u00fcller, E. (2024, January 28\u201329). An Extensive Analysis of Data Clumps in UML Class Diagrams. Proceedings of the 19th International Conference on Evaluation of Novel Approaches to Software Engineering, Angers, France.","DOI":"10.5220\/0012550500003687"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Baumgartner, N., and Pulverm\u00fcller, E. (2024, January 21\u201323). The Lifecycle of Data Clumps: A Longitudinal Case Study in Open-Source Projects. 
Proceedings of the 12th International Conference on Model-Based Software and Systems Engineering (MODELSWARD 2024), Rome, Italy.","DOI":"10.5220\/0012313900003645"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"144","DOI":"10.5220\/0012698000003687","article-title":"Considerations in Prioritizing for Efficiently Refactoring the Data Clumps Model Smell: A Preliminary Study","volume":"Volume 1","author":"Baumgartner","year":"2024","journal-title":"Proceedings of the 19th International Conference on Evaluation of Novel Approaches to Software Engineering"},{"key":"ref_7","unstructured":"Fowler, M. (2018). Refactoring: Improving the Design of Existing Code, Pearson Deutschland."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Zhang, M., Baddoo, N., Wernick, P., and Hall, T. (2008, January 15\u201316). Improving the Precision of Fowler\u2019s Definitions of Bad Smells. Proceedings of the 2008 32nd Annual IEEE Software Engineering Workshop, SEW \u201908, Kassandra, Greece.","DOI":"10.1109\/SEW.2008.26"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"33:1","DOI":"10.1145\/2629648","article-title":"Some Code Smells Have a Significant but Small Effect on Faults","volume":"23","author":"Hall","year":"2014","journal-title":"ACM Trans. Softw. Eng. Methodol."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Murphy-Hill, E., and Black, A.P. (2010, January 25\u201326). An interactive ambient visualization for code smells. Proceedings of the 5th International Symposium on Software Visualization, SOFTVIS \u201910, Salt Lake City, UT, USA.","DOI":"10.1145\/1879211.1879216"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Baumgartner, N., Iyenghar, P., Schoemaker, T., and Pulverm\u00fcller, E. (2024). AI-Driven Refactoring: A Pipeline for Identifying and Correcting Data Clumps in Git Repositories. 
Electronics, 13.","DOI":"10.3390\/electronics13091644"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Kaindl, H., Mannion, M., and Maciaszek, L.A. (2023, January 24\u201325). Live Code Smell Detection of Data Clumps in an Integrated Development Environment. Proceedings of the 18th International Conference on Evaluation of Novel Approaches to Software Engineering, ENASE 2023, Prague, Czech Republic.","DOI":"10.1007\/978-3-031-64182-4"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"110610","DOI":"10.1016\/j.jss.2020.110610","article-title":"Code Smells and Refactoring: A Tertiary Systematic Review of Challenges and Observations","volume":"167","author":"Lacerda","year":"2020","journal-title":"J. Syst. Softw."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"103120","DOI":"10.1016\/j.scico.2024.103120","article-title":"Toward a novel taxonomy to capture code smells caused by refactoring","volume":"236","author":"Alkhomsan","year":"2024","journal-title":"Sci. Comput. Program."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Zhang, M., Baddoo, N., Wernick, P., and Hall, T. (2011, January 21\u201325). Prioritising Refactoring Using Code Bad Smells. Proceedings of the 2011 IEEE Fourth International Conference on Software Testing, Verification and Validation Workshops, Berlin, Germany.","DOI":"10.1109\/ICSTW.2011.69"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"28800","DOI":"10.1109\/ACCESS.2022.3218007","article-title":"Revisiting Scenarios of Using Refactoring Techniques to Improve Software Systems Quality","volume":"11","author":"Almogahed","year":"2023","journal-title":"IEEE Access"},{"key":"ref_17","unstructured":"Noei, S. (2024). Empirical Study of Refactoring Practices: Rhythms, Tactics, and Release-wise Patterns in Software Projects. [Ph.D. 
Thesis, University of Ottawa]."},{"key":"ref_18","first-page":"456","article-title":"Adaptive Architectures in Software Engineering","volume":"12","author":"Smith","year":"2023","journal-title":"J. Softw. Archit."},{"key":"ref_19","unstructured":"Johnson, E., and Lee, M. (2023). Continuous Alignment Between Software Architecture Design and Development in CI\/CD Pipelines. Software Engineering, Springer."},{"key":"ref_20","unstructured":"Frank, E. (2024). The Future of CI\/CD: Leveraging AI for Seamless Deployments. EasyChair Preprint, 14745, Available online: https:\/\/easychair.org\/publications\/preprint\/7HqXG\/open."},{"key":"ref_21","first-page":"290","article-title":"AI-Powered Continuous Deployment: Achieving Zero Downtime and Faster Releases","volume":"12","author":"Kaluvakuri","year":"2023","journal-title":"Int. J. Innov. Eng. Manag. Res."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Nguyen-Duc, A., Abrahamsson, P., and Khomh, F. (2024). ChatGPT Prompt Patterns for Improving Code Quality, Refactoring, Requirements Elicitation, and Software Design. Generative AI for Effective Software Development, Springer Nature.","DOI":"10.1007\/978-3-031-55642-5"},{"key":"ref_23","unstructured":"Cao, J., Li, M., Wen, M., and Cheung, S.-C. (2023). A study on Prompt Design, Advantages and Limitations of ChatGPT for Deep Learning Program Repair. arXiv."},{"key":"ref_24","unstructured":"Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., and Yang, Y. (2023). Self-Refine: Iterative Refinement with Self-Feedback. arXiv."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Alshahwan, N., Harman, M., Harper, I., Marginean, A., Sengupta, S., and Wang, E. (2024). Assured LLM-Based Software Engineering. arXiv.","DOI":"10.1145\/3643661.3643953"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Kotsiantis, S., Verykios, V., and Tzagarakis, M. (2024). 
AI-Assisted Programming Tasks Using Code Embeddings and Transformers. Electronics, 13.","DOI":"10.3390\/electronics13040767"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Barros, M., and Labiche, Y. (2015, January 5\u20137). Search-Based Refactoring: Metrics Are Not Enough. Proceedings of the Search-Based Software Engineering, Bergamo, Italy.","DOI":"10.1007\/978-3-319-22183-0"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"459","DOI":"10.1007\/s11219-019-09477-y","article-title":"Automatic software refactoring: A systematic literature review","volume":"28","author":"Baqais","year":"2020","journal-title":"Softw. Qual. J."},{"key":"ref_29","unstructured":"Data Clump Solver (2025, January 30). Github Source (in CSV Format) for Projects Used in Experiments in This Study. Available online: https:\/\/github.com\/compf\/data_clump_solver\/blob\/main\/src\/eval\/evalAnalyzer\/prData\/project_data.csv."}],"container-title":["Software"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2674-113X\/4\/1\/3\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T16:25:58Z","timestamp":1760027158000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2674-113X\/4\/1\/3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,2]]},"references-count":29,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,3]]}},"alternative-id":["software4010003"],"URL":"https:\/\/doi.org\/10.3390\/software4010003","relation":{},"ISSN":["2674-113X"],"issn-type":[{"type":"electronic","value":"2674-113X"}],"subject":[],"published":{"date-parts":[[2025,2,2]]}}}