{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,1]],"date-time":"2026-06-01T13:25:19Z","timestamp":1780320319780,"version":"3.54.1"},"reference-count":36,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2025,7,2]],"date-time":"2025-07-02T00:00:00Z","timestamp":1751414400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Austrian Federal Ministry of Labour and Economy"},{"name":"SBA Research (SBA-K1 NGC)"},{"name":"Open Access Funding Program"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BDCC"],"abstract":"<jats:p>Ensuring reliability, availability, and security in modern software systems hinges on early fault detection, yet predicting which parts of a codebase are most at risk remains a significant challenge. In this paper, we analyze 2.4 million commits drawn from 33 heterogeneous open-source projects, spanning healthcare, security tools, data processing, and more. By examining each repository per file and per commit, we derive process metrics (e.g., churn, file age, revision frequency) alongside size metrics and entropy-based indicators of how scattered changes are over time. We train and tune a gradient boosting model to classify bug-prone commits under realistic class-imbalance conditions, achieving robust predictive performance across diverse repositories. Moreover, a comprehensive feature-importance analysis shows that files with long lifespans (high age), frequent edits (revision count), and widely scattered changes (entropy metrics) are especially vulnerable to defects. 
These insights can help practitioners and researchers prioritize testing and tailor maintenance strategies, ultimately strengthening software dependability.<\/jats:p>","DOI":"10.3390\/bdcc9070174","type":"journal-article","created":{"date-parts":[[2025,7,2]],"date-time":"2025-07-02T06:10:26Z","timestamp":1751436626000},"page":"174","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Boost-Classifier-Driven Fault Prediction Across Heterogeneous Open-Source Repositories"],"prefix":"10.3390","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0009-0006-7361-6430","authenticated-orcid":false,"given":"Philip","family":"K\u00f6nig","sequence":"first","affiliation":[{"name":"SBA Research gGmbH, Floragasse 7\/5.OG, 1040 Vienna, Austria"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2206-9263","authenticated-orcid":false,"given":"Sebastian","family":"Raubitzek","sequence":"additional","affiliation":[{"name":"SBA Research gGmbH, Floragasse 7\/5.OG, 1040 Vienna, Austria"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-7645-2099","authenticated-orcid":false,"given":"Alexander","family":"Schatten","sequence":"additional","affiliation":[{"name":"Institute of Information Systems Engineering, TU Wien, Favoritenstrasse 9\u201311\/194, 1040 Vienna, Austria"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Dennis","family":"Toth","sequence":"additional","affiliation":[{"name":"SBA Research gGmbH, Floragasse 7\/5.OG, 1040 Vienna, Austria"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Fabian","family":"Obermann","sequence":"additional","affiliation":[{"name":"SBA Research gGmbH, Floragasse 7\/5.OG, 1040 Vienna, 
Austria"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-1337-7504","authenticated-orcid":false,"given":"Caroline","family":"K\u00f6nig","sequence":"additional","affiliation":[{"name":"Christian Doppler Laboratory for Assurance and Transparency in Software Protection, Research Group Security & Privacy, Faculty of Computer Science, University of Vienna, Kolingasse 14\u201316, 1040 Vienna, Austria"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3031-505X","authenticated-orcid":false,"given":"Kevin","family":"Mallinger","sequence":"additional","affiliation":[{"name":"SBA Research gGmbH, Floragasse 7\/5.OG, 1040 Vienna, Austria"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2025,7,2]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1040","DOI":"10.1016\/j.ins.2008.12.001","article-title":"Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem","volume":"179","author":"Catal","year":"2009","journal-title":"Inf. Sci."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"161","DOI":"10.1049\/iet-sen.2017.0148","article-title":"Progress on approaches to software defect prediction","volume":"12","author":"Li","year":"2018","journal-title":"IET Softw."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Nagappan, N., and Ball, T. (2005, January 15\u201321). Use of relative code churn measures to predict system defect density. Proceedings of the 27th International Conference on Software Engineering, ICSE \u201905, New York, NY, USA.","DOI":"10.1145\/1062455.1062514"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"255","DOI":"10.1007\/s10462-017-9563-5","article-title":"A study on software fault prediction techniques","volume":"51","author":"Rathore","year":"2019","journal-title":"Artif. Intell. 
Rev."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1208","DOI":"10.1109\/TSE.2013.11","article-title":"NASA MDP Software Defects Data Sets","volume":"39","author":"Shepperd","year":"2018","journal-title":"IEEE Trans. Softw. Eng."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"D\u2019Ambros, M., Lanza, M., and Robbes, R. (2010, January 2\u20133). An Extensive Comparison of Bug Prediction Approaches. Proceedings of the MSR 2010 (7th IEEE Working Conference on Mining Software Repositories), Cape Town, South Africa.","DOI":"10.1109\/MSR.2010.5463279"},{"key":"ref_7","unstructured":"Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A.V., and Gulin, A. (2018, January 3\u20138). CatBoost: unbiased boosting with categorical features. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada."},{"key":"ref_8","unstructured":"Dorogush, A.V., Ershov, V., and Gulin, A. (2017, January 8). CatBoost: gradient boosting with categorical features support. Proceedings of the Workshop on ML Systems at NeurIPS, Long Beach, CA, USA."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"7839","DOI":"10.1007\/s10462-022-10371-6","article-title":"Data quality issues in software fault prediction: a systematic literature review","volume":"56","author":"Bhandari","year":"2023","journal-title":"Artif. Intell. Rev."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Gunda, S.K. (2024, January 22\u201323). Software Defect Prediction Using Advanced Ensemble Techniques: A Focus on Boosting and Voting Method. 
Proceedings of the 2024 International Conference on Electronic Systems and Intelligent Computing (ICESIC), Chennai, India.","DOI":"10.1109\/ICESIC61777.2024.10846550"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"15","DOI":"10.1016\/j.compeleceng.2018.02.043","article-title":"Empirical analysis of change metrics for software fault prediction","volume":"67","author":"Choudhary","year":"2018","journal-title":"Comput. Electr. Eng."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Aljamaan, H., and Alazba, A. (2020, January 8\u20139). Software defect prediction using tree-based ensembles. Proceedings of the 16th ACM International Conference on Predictive Models and Data Analytics in Software Engineering, PROMISE 2020, New York, NY, USA.","DOI":"10.1145\/3416508.3417114"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"169","DOI":"10.35882\/jeeemi.v6i2.388","article-title":"Optimizing Software Defect Prediction Models: Integrating Hybrid Grey Wolf and Particle Swarm Optimization for Enhanced Feature Selection with Popular Gradient Boosting Algorithm","volume":"6","author":"Akbar","year":"2024","journal-title":"J. Electron. Electromed. Eng. Med. Inform."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"100386","DOI":"10.1016\/j.eij.2023.05.011","article-title":"Reliable prediction of software defects using Shapley interpretable machine learning models","volume":"24","author":"Eshtay","year":"2023","journal-title":"Egypt. Inform. J."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Zahan, M. (2023, January 21\u201323). Prediction of Faults in Embedded Software Using Machine Learning Approaches. 
Proceedings of the 2023 International Conference on Information and Communication Technology for Sustainable Development (ICICT4SD), Dhaka, Bangladesh.","DOI":"10.1109\/ICICT4SD59951.2023.10303419"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Uddin, M.N., Li, B., Mondol, M.N., Rahman, M.M., Mia, M.S., and Mondol, E.L. (2021, January 14\u201316). SDP-ML: An Automated Approach of Software Defect Prediction employing Machine Learning Techniques. Proceedings of the 2021 International Conference on Electronics, Communications and Information Technology (ICECIT), Virtual.","DOI":"10.1109\/ICECIT54077.2021.9641218"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Gupta, A., Sharma, S., Goyal, S., and Rashid, M. (2020, January 17\u201319). Novel XGBoost Tuned Machine Learning Model for Software Bug Prediction. Proceedings of the 2020 International Conference on Intelligent Engineering and Management (ICIEM), London, UK.","DOI":"10.1109\/ICIEM48762.2020.9160152"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"653","DOI":"10.1109\/32.859533","article-title":"Predicting fault incidence using software change history","volume":"26","author":"Graves","year":"2000","journal-title":"IEEE Trans. Softw. Eng."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Moser, R., Pedrycz, W., and Succi, G. (2008, January 10\u201318). A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. Proceedings of the 30th International Conference on Software Engineering, ICSE \u201908, New York, NY, USA.","DOI":"10.1145\/1368088.1368114"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"42","DOI":"10.1145\/69605.2085","article-title":"Software errors and complexity: an empirical investigation","volume":"27","author":"Basili","year":"1984","journal-title":"Commun. ACM"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Hassan, A.E. (2009, January 16\u201324). 
Predicting faults using the complexity of code changes. Proceedings of the 2009 IEEE 31st International Conference on Software Engineering, Vancouver, BC, Canada.","DOI":"10.1109\/ICSE.2009.5070510"},{"key":"ref_22","unstructured":"(2000, January 11\u201314). Identifying reasons for software changes using historic databases. Proceedings of the 2000 International Conference on Software Maintenance, San Jose, CA, USA."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"81","DOI":"10.1007\/BF00116251","article-title":"Induction of Decision Trees","volume":"1","author":"Quinlan","year":"1986","journal-title":"Mach. Learn."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random Forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Mach. Learn."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Chen, T., and Guestrin, C. (2016, January 13\u201317). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD \u201916, New York, NY, USA.","DOI":"10.1145\/2939672.2939785"},{"key":"ref_26","unstructured":"Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.Y. (2017, January 4\u20139). LightGBM: a highly efficient gradient boosting decision tree. Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS\u201917, Red Hook, NY, USA."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Raubitzek, S., Corpaci, L., Hofer, R., and Mallinger, K. (2023). Scaling Exponents of Time Series Data: A Machine Learning Approach. Entropy, 25.","DOI":"10.20944\/preprints202311.0467.v1"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Raubitzek, S., and Mallinger, K. (2023). On the Applicability of Quantum Machine Learning. 
Entropy, 25.","DOI":"10.20944\/preprints202305.0833.v1"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Men\u00e9ndez, H.D., Bello-Orgaz, G., Barnard, P., Bautista, J.R., Farahi, A., Dash, S., Han, D., Fortz, S., and Rodriguez-Fernandez, V. (2025). Estimating Combinatorial t-Way Coverage Based on Matrix Complexity Metrics. Proceedings of the Testing Software and Systems, Naples, Italy, 31 March\u20134 April 2025, Springer.","DOI":"10.1007\/978-3-031-80889-0"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"101429","DOI":"10.1016\/j.cosust.2024.101429","article-title":"Potentials and limitations of complexity research for environmental sciences and modern farming applications","volume":"67","author":"Mallinger","year":"2024","journal-title":"Curr. Opin. Environ. Sustain."},{"key":"ref_31","unstructured":"Snoek, J., Larochelle, H., and Adams, R.P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. arXiv."},{"key":"ref_32","unstructured":"Head, T., Kumar, M., Nahrstaedt, H., Louppe, G., and Shcherbatyi, I. (2025, July 01). Scikit-Optimize\/Scikit-Optimize (v0.9.0). Available online: https:\/\/scikit-optimize.github.io\/stable\/."},{"key":"ref_33","unstructured":"Developers, C. (2025, March 10). Feature Importance Calculation in CatBoost. Available online: https:\/\/catboost.ai\/docs\/en\/features\/feature-importances-calculation."},{"key":"ref_34","first-page":"2825","article-title":"Scikit-learn: Machine Learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J. Mach. Learn. Res."},{"key":"ref_35","unstructured":"Parnas, D. (1994, January 16\u201321). Software aging. 
Proceedings of the 16th International Conference on Software Engineering, Sorrento, Italy."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"34","DOI":"10.1109\/MS.2025.3549628","article-title":"From Code Generation to Software Testing: AI Copilot with Context-Based RAG","volume":"42","author":"Wang","year":"2025","journal-title":"IEEE Softw."}],"container-title":["Big Data and Cognitive Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-2289\/9\/7\/174\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:02:59Z","timestamp":1760032979000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-2289\/9\/7\/174"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,2]]},"references-count":36,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2025,7]]}},"alternative-id":["bdcc9070174"],"URL":"https:\/\/doi.org\/10.3390\/bdcc9070174","relation":{},"ISSN":["2504-2289"],"issn-type":[{"value":"2504-2289","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,2]]}}}