{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,14]],"date-time":"2025-10-14T07:15:13Z","timestamp":1760426113757,"version":"3.37.3"},"reference-count":85,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2022,11,8]],"date-time":"2022-11-08T00:00:00Z","timestamp":1667865600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,11,8]],"date-time":"2022-11-08T00:00:00Z","timestamp":1667865600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100004410","name":"T\u00fcrkiye Bilimsel ve Teknolojik Ara\u015ftirma Kurumu","doi-asserted-by":"publisher","award":["BIDEB 2232 grant 118C255"],"award-info":[{"award-number":["BIDEB 2232 grant 118C255"]}],"id":[{"id":"10.13039\/501100004410","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Empir Software Eng"],"published-print":{"date-parts":[[2023,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n                <jats:title>Context<\/jats:title>\n                <jats:p>Automated classifiers, often based on machine learning (ML), are increasingly used in software engineering (SE) for labelling previously unseen SE data. 
Researchers have proposed automated classifiers that predict if a code chunk is a clone, if a requirement is functional or non-functional, if the outcome of a test case is non-deterministic, etc.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Objective<\/jats:title>\n                <jats:p>The lack of guidelines for applying and reporting classification techniques for SE research leads to studies in which important research steps may be skipped, key findings might not be identified and shared, and the readers may find reported results (e.g., precision or recall above 90%) that are not a credible representation of the performance in operational contexts. The goal of this paper is to advance ML4SE research by proposing rigorous ways of conducting and reporting research.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Results<\/jats:title>\n                <jats:p>We introduce the <jats:italic>ECSER<\/jats:italic> (Evaluating Classifiers in Software Engineering Research) pipeline, which includes a series of steps for conducting and evaluating automated classification research in SE. Then, we conduct two replication studies where we apply <jats:italic>ECSER<\/jats:italic> to recent research in requirements engineering and in software testing.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Conclusions<\/jats:title>\n                <jats:p>In addition to demonstrating the applicability of the pipeline, the replication studies demonstrate <jats:italic>ECSER<\/jats:italic>\u2019s usefulness: not only do we confirm and strengthen some findings identified by the original authors, but we also discover additional ones. 
Some of these findings contradict the original ones.<\/jats:p>\n              <\/jats:sec>","DOI":"10.1007\/s10664-022-10243-1","type":"journal-article","created":{"date-parts":[[2022,11,8]],"date-time":"2022-11-08T16:11:09Z","timestamp":1667923869000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":15,"title":["Evaluating classifiers in SE research: the ECSER pipeline and two replication studies"],"prefix":"10.1007","volume":"28","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1162-8341","authenticated-orcid":false,"given":"Davide","family":"Dell\u2019Anna","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Fatma Ba\u015fak","family":"Aydemir","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Fabiano","family":"Dalpiaz","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2022,11,8]]},"reference":[{"key":"10243_CR1","unstructured":"Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado G S, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Man\u00e9 D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Vi\u00e9gas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X (2015) Tensorflow: large-scale machine learning on heterogeneous systems. https:\/\/www.tensorflow.org\/. Software available from tensorflow.org"},{"issue":"2","key":"10243_CR2","doi-asserted-by":"publisher","first-page":"305","DOI":"10.1162\/089976600300015808","volume":"12","author":"NM Adams","year":"2000","unstructured":"Adams N M, Hand D J (2000) Improving the practice of classifier performance assessment. 
Neural Comput 12(2):305\u2013311","journal-title":"Neural Comput"},{"key":"10243_CR3","doi-asserted-by":"crossref","unstructured":"Agrawal A, Menzies T (2018) Is \u201cbetter data\u201d better than \u201cbetter data miners\u201d?. In: IEEE\/ACM international conference on software engineering, pp 1050\u20131061","DOI":"10.1145\/3180155.3180197"},{"key":"10243_CR4","doi-asserted-by":"publisher","first-page":"2939","DOI":"10.1109\/TSE.2021.3073242","volume":"48","author":"A Agrawal","year":"2021","unstructured":"Agrawal A, Yang X, Agrawal R, Yedida R, Shen X, Menzies T (2021) Simpler hyperparameter optimization for software analytics: why, how, when. IEEE Trans Softw Eng 48:2939\u20132954","journal-title":"IEEE Trans Softw Eng"},{"key":"10243_CR5","first-page":"CMC","volume":"9","author":"A Alonso-Betanzos","year":"2015","unstructured":"Alonso-Betanzos A, Bol\u00f3n-Canedo V, Heyndrickx G R, Kerkhof P L (2015) Exploring guidelines for classification of major heart failure subtypes by using machine learning. Clin Med Insights: Cardiol 9:CMC\u2013s18746","journal-title":"Clin Med Insights: Cardiol"},{"key":"10243_CR6","doi-asserted-by":"crossref","unstructured":"Alshammari A, Morris C, Hilton M, Bell J (2021a) Flakeflagger: predicting flakiness without rerunning tests. In: IEEE\/ACM international conference on software engineering, pp 1572\u20131584","DOI":"10.1109\/ICSE43902.2021.00140"},{"key":"10243_CR7","doi-asserted-by":"publisher","unstructured":"Alshammari A, Morris C, Hilton M, Bell J (2021b) Flaky test dataset to accompany \u201cFlakeFlagger: predicting flakiness without rerunning tests\u201d. https:\/\/doi.org\/10.5281\/zenodo.5014076","DOI":"10.5281\/zenodo.5014076"},{"issue":"1","key":"10243_CR8","first-page":"152","volume":"17","author":"A Benavoli","year":"2016","unstructured":"Benavoli A, Corani G, Mangili F (2016a) Should we really use post-hoc tests based on mean-ranks? 
J Mach Learn Res 17(1):152\u2013161","journal-title":"J Mach Learn Res"},{"issue":"1","key":"10243_CR9","first-page":"2653","volume":"18","author":"A Benavoli","year":"2017","unstructured":"Benavoli A, Corani G, Dem\u0161ar J, Zaffalon M (2017b) Time for a change: A tutorial for comparing multiple classifiers through Bayesian analysis. J Mach Learn Res 18(1):2653\u20132688","journal-title":"J Mach Learn Res"},{"issue":"6","key":"10243_CR10","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s10664-021-09986-0","volume":"26","author":"DM Berry","year":"2021","unstructured":"Berry D M (2021) Empirical evaluation of tools for hairy requirements engineering tasks. Empir Softw Eng 26(6):1\u201377","journal-title":"Empir Softw Eng"},{"key":"10243_CR11","volume-title":"Pattern recognition and machine learning","author":"CM Bishop","year":"2006","unstructured":"Bishop C M (2006) Pattern recognition and machine learning. Springer, New York"},{"key":"10243_CR12","doi-asserted-by":"crossref","unstructured":"Boyd K, Eng K H, Page C D Jr (2013) Area under the precision-recall curve: point estimates and confidence intervals. In: European conference on machine learning and principles and practice of knowledge discovery in databases, LNCS, vol 8190. Springer, pp 451\u2013466","DOI":"10.1007\/978-3-642-40994-3_29"},{"key":"10243_CR13","first-page":"2079","volume":"11","author":"GC Cawley","year":"2010","unstructured":"Cawley G C, Talbot N L (2010) On over-fitting in model selection and subsequent selection bias in performance evaluation. J Mach Learn Res 11:2079\u20132107","journal-title":"J Mach Learn Res"},{"key":"10243_CR14","doi-asserted-by":"crossref","unstructured":"Cleland-Huang J, Settimi R, Zou X, Solc P (2006) The detection and classification of non-functional requirements with application to early aspects. 
In: IEEE International requirements engineering conference, pp 39\u201348","DOI":"10.1109\/RE.2006.65"},{"issue":"2","key":"10243_CR15","doi-asserted-by":"publisher","first-page":"103","DOI":"10.1007\/s00766-007-0045-1","volume":"12","author":"J Cleland-Huang","year":"2007","unstructured":"Cleland-Huang J, Settimi R, Zou X, Solc P (2007) Automated classification of non-functional requirements. Requir Eng 12(2):103\u2013120","journal-title":"Requir Eng"},{"key":"10243_CR16","doi-asserted-by":"crossref","unstructured":"Cleland-Huang J, Czauderna A, Gibiec M, Emenecker J (2010) A machine learning approach for tracing regulatory codes to product specific requirements. In: IEEE\/ACM international conference on software engineering, pp 155\u2013164","DOI":"10.1145\/1806799.1806825"},{"key":"10243_CR17","volume-title":"Explaining psychological statistics","author":"BH Cohen","year":"2008","unstructured":"Cohen B H (2008) Explaining psychological statistics. Wiley, New York"},{"key":"10243_CR18","doi-asserted-by":"crossref","unstructured":"Dalpiaz F, Dell\u2019Anna D, Aydemir F B, \u00c7evikol S (2019) Requirements classification with interpretable machine learning and dependency parsing. In: IEEE International requirements engineering conference, pp 142\u2013152","DOI":"10.1109\/RE.2019.00025"},{"key":"10243_CR19","doi-asserted-by":"publisher","first-page":"246","DOI":"10.1016\/j.jss.2019.07.002","volume":"156","author":"FG de Oliveira Neto","year":"2019","unstructured":"de Oliveira Neto FG, Torkar R, Feldt R, Gren L, Furia CA, Huang Z (2019) Evolution of statistical analysis in empirical software engineering research: Current state and steps forward. J Syst Softw 156:246\u2013267","journal-title":"J Syst Softw"},{"key":"10243_CR20","doi-asserted-by":"publisher","unstructured":"Dell\u2019Anna D, Aydemir FB, Dalpiaz F (2021) Supplementary material for \u201cevaluating classifiers in SE research: the ECSER pipeline and two replication studies\u201d. 
https:\/\/doi.org\/10.5281\/zenodo.6266675","DOI":"10.5281\/zenodo.6266675"},{"key":"10243_CR21","first-page":"1","volume":"7","author":"J Dem\u0161ar","year":"2006","unstructured":"Dem\u0161ar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1\u201330","journal-title":"J Mach Learn Res"},{"key":"10243_CR22","unstructured":"Devlin J, Chang M W, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805"},{"key":"10243_CR23","unstructured":"Dong G, Liu H (2018) Feature engineering for machine learning and data analytics. CRC Press"},{"key":"10243_CR24","doi-asserted-by":"publisher","DOI":"10.1017\/9781108671682","volume-title":"The art of feature engineering: essentials for machine learning","author":"P Duboue","year":"2020","unstructured":"Duboue P (2020) The art of feature engineering: essentials for machine learning. Cambridge University Press, Cambridge"},{"key":"10243_CR25","doi-asserted-by":"publisher","first-page":"e131","DOI":"10.7717\/peerj-cs.131","volume":"3","author":"F Fagerholm","year":"2017","unstructured":"Fagerholm F, Kuhrmann M, M\u00fcnch J (2017) Guidelines for using empirical studies in software engineering education. PeerJ Comput Sci 3:e131","journal-title":"PeerJ Comput Sci"},{"issue":"8","key":"10243_CR26","doi-asserted-by":"publisher","first-page":"861","DOI":"10.1016\/j.patrec.2005.10.010","volume":"27","author":"T Fawcett","year":"2006","unstructured":"Fawcett T (2006) An introduction to ROC analysis. Pattern Recogn Lett 27(8):861\u2013874","journal-title":"Pattern Recogn Lett"},{"key":"10243_CR27","doi-asserted-by":"crossref","unstructured":"Flach P (2012) Machine learning: the art and science of algorithms that make sense of data. 
Cambridge University Press","DOI":"10.1017\/CBO9780511973000"},{"key":"10243_CR28","doi-asserted-by":"publisher","first-page":"135","DOI":"10.1016\/j.infsof.2016.04.017","volume":"76","author":"W Fu","year":"2016","unstructured":"Fu W, Menzies T, Shen X (2016) Tuning for software analytics: is it really necessary? Inf Softw Technol 76:135\u2013146","journal-title":"Inf Softw Technol"},{"key":"10243_CR29","doi-asserted-by":"crossref","unstructured":"Garousi V, Felderer M (2017) Experience-based guidelines for effective and efficient data extraction in systematic reviews in software engineering. In: International conference on evaluation and assessment in software engineering, pp 170\u2013179","DOI":"10.1145\/3084226.3084238"},{"key":"10243_CR30","doi-asserted-by":"publisher","first-page":"101","DOI":"10.1016\/j.infsof.2018.09.006","volume":"106","author":"V Garousi","year":"2019","unstructured":"Garousi V, Felderer M, M\u00e4ntyl\u00e4 M V (2019) Guidelines for including grey literature and conducting multivocal literature reviews in software engineering. Inf Softw Technol 106:101\u2013121","journal-title":"Inf Softw Technol"},{"key":"10243_CR31","doi-asserted-by":"crossref","unstructured":"Ghotra B, McIntosh S, Hassan A E (2015) Revisiting the impact of classification techniques on the performance of defect prediction models. In: IEEE\/ACM International conference on software engineering, pp 789\u2013800","DOI":"10.1109\/ICSE.2015.91"},{"issue":"1\u20133","key":"10243_CR32","doi-asserted-by":"publisher","first-page":"231","DOI":"10.1007\/s10994-006-8958-3","volume":"64","author":"M Goadrich","year":"2006","unstructured":"Goadrich M, Oliphant L, Shavlik J W (2006) Gleaner: creating ensembles of first-order clauses to improve recall-precision curves. Mach Learn 64 (1\u20133):231\u2013261","journal-title":"Mach Learn"},{"key":"10243_CR33","unstructured":"Good P (2013) Permutation tests: a practical guide to resampling methods for testing hypotheses. 
Springer Science & Business Media"},{"issue":"1","key":"10243_CR34","doi-asserted-by":"publisher","first-page":"40","DOI":"10.1038\/s41580-021-00407-0","volume":"23","author":"JG Greener","year":"2022","unstructured":"Greener J G, Kandathil S M, Moffat L, Jones D T (2022) A guide to machine learning for biologists. Nat Rev Mol Cell Biol 23(1):40\u201355","journal-title":"Nat Rev Mol Cell Biol"},{"issue":"1","key":"10243_CR35","doi-asserted-by":"publisher","first-page":"10","DOI":"10.1145\/1656274.1656278","volume":"11","author":"M Hall","year":"2009","unstructured":"Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten I H (2009) The WEKA data mining software: an update. ACM SIGKDD Explor Newsl 11(1):10\u201318","journal-title":"ACM SIGKDD Explor Newsl"},{"issue":"6","key":"10243_CR36","doi-asserted-by":"publisher","first-page":"1276","DOI":"10.1109\/TSE.2011.103","volume":"38","author":"T Hall","year":"2012","unstructured":"Hall T, Beecham S, Bowes D, Gray D, Counsell S (2012) A systematic literature review on fault prediction performance in software engineering. IEEE Trans Softw Eng 38(6):1276\u20131304","journal-title":"IEEE Trans Softw Eng"},{"issue":"48","key":"10243_CR37","doi-asserted-by":"publisher","first-page":"2173","DOI":"10.21105\/joss.02173","volume":"5","author":"S Herbold","year":"2020","unstructured":"Herbold S (2020) Autorank: a Python package for automated ranking of classifiers. J Open Source Softw 5(48):2173","journal-title":"J Open Source Softw"},{"issue":"6","key":"10243_CR38","doi-asserted-by":"publisher","first-page":"5333","DOI":"10.1007\/s10664-020-09885-w","volume":"25","author":"S Herbold","year":"2020","unstructured":"Herbold S, Trautsch A, Trautsch F (2020) On the feasibility of automated prediction of bug and non-bug issues. 
Empir Softw Eng 25(6):5333\u20135369","journal-title":"Empir Softw Eng"},{"key":"10243_CR39","doi-asserted-by":"crossref","unstructured":"Hey T, Keim J, Koziolek A, Tichy W F (2020a) NoRBERT: transfer learning for requirements classification. In: IEEE International requirements engineering conference, pp 169\u2013179","DOI":"10.1109\/RE48521.2020.00028"},{"key":"10243_CR40","doi-asserted-by":"publisher","unstructured":"Hey T, Keim J, Koziolek A, Tichy WF (2020b) Supplementary material of \u201cNoRBERT: transfer learning for requirements classification\u201d. https:\/\/doi.org\/10.5281\/zenodo.3874137","DOI":"10.5281\/zenodo.3874137"},{"key":"10243_CR41","unstructured":"Huff D (1993) How to lie with statistics. WW Norton & Company"},{"key":"10243_CR42","doi-asserted-by":"crossref","unstructured":"Hutchinson B, Smart A, Hanna A, Denton E, Greer C, Kjartansson O, Barnes P, Mitchell M (2021) Towards accountability for machine learning datasets: practices from software engineering and infrastructure. In: ACM Conference on fairness, accountability, and transparency, pp 560\u2013575","DOI":"10.1145\/3442188.3445918"},{"key":"10243_CR43","doi-asserted-by":"crossref","unstructured":"Japkowicz N, Shah M (2011) Evaluating learning algorithms: a classification perspective. Cambridge University Press","DOI":"10.1017\/CBO9780511921803"},{"key":"10243_CR44","first-page":"201","volume-title":"Reporting experiments in software engineering","author":"A Jedlitschka","year":"2008","unstructured":"Jedlitschka A, Ciolkowski M, Pfahl D (2008) Reporting experiments in software engineering. Springer, London, pp 201\u2013228"},{"key":"10243_CR45","unstructured":"Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu TY (2017) LightGBM: a highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems 30"},{"key":"10243_CR46","unstructured":"Kitchenham B (2004) Procedures for performing systematic reviews. Tech Rep. 2004. 
Keele University, Keele"},{"issue":"2","key":"10243_CR47","doi-asserted-by":"publisher","first-page":"579","DOI":"10.1007\/s10664-016-9437-5","volume":"22","author":"B Kitchenham","year":"2017","unstructured":"Kitchenham B, Madeyski L, Budgen D, Keung J, Brereton P, Charters S, Gibbs S, Pohthong A (2017) Robust statistical methods for empirical software engineering. Empir Softw Eng 22(2):579\u2013630","journal-title":"Empir Softw Eng"},{"issue":"2","key":"10243_CR48","doi-asserted-by":"publisher","first-page":"151","DOI":"10.32614\/RJ-2014-031","volume":"6","author":"S Korkmaz","year":"2014","unstructured":"Korkmaz S, Goksuluk D, Zararsiz G (2014) MVN: an r package for assessing multivariate normality. R J 6(2):151\u2013162","journal-title":"R J"},{"issue":"6","key":"10243_CR49","doi-asserted-by":"publisher","first-page":"2852","DOI":"10.1007\/s10664-016-9492-y","volume":"22","author":"M Kuhrmann","year":"2017","unstructured":"Kuhrmann M, Fern\u00e1ndez D M, Daneva M (2017) On the pragmatic design of literature studies in software engineering: an experience-based guideline. Empir Softw Eng 22(6):2852\u20132891","journal-title":"Empir Softw Eng"},{"key":"10243_CR50","doi-asserted-by":"crossref","unstructured":"Kurtanovic Z, Maalej W (2017) Automatically classifying functional and non-functional requirements using supervised machine learning. In: IEEE International requirements engineering conference, pp 490\u2013495","DOI":"10.1109\/RE.2017.82"},{"issue":"8","key":"10243_CR51","doi-asserted-by":"publisher","first-page":"603","DOI":"10.1038\/nmeth.3945","volume":"13","author":"J Lever","year":"2016","unstructured":"Lever J (2016) Classification evaluation: it is important to understand both what a classification metric expresses and what it hides. 
Nat Methods 13(8):603\u2013605","journal-title":"Nat Methods"},{"key":"10243_CR52","doi-asserted-by":"crossref","unstructured":"Li F, Horkoff J, Mylopoulos J, Guizzardi R S S, Guizzardi G, Borgida A, Liu L (2014) Non-functional requirements as qualities, with a spice of ontology. In: IEEE International requirements engineering conference, pp 293\u2013302","DOI":"10.1109\/RE.2014.6912271"},{"issue":"1","key":"10243_CR53","first-page":"1","volume":"31","author":"C Liu","year":"2021","unstructured":"Liu C, Gao C, Xia X, Lo D, Grundy J, Yang X (2021) On the reproducibility and replicability of deep learning in software engineering. ACM Trans Softw Eng Methodol (TOSEM) 31(1):1\u201346","journal-title":"ACM Trans Softw Eng Methodol (TOSEM)"},{"key":"10243_CR54","unstructured":"Lones MA (2021) How to avoid machine learning pitfalls: a guide for academic researchers. CoRR arXiv:2108.02497"},{"issue":"12","key":"10243_CR55","doi-asserted-by":"publisher","first-page":"e323","DOI":"10.2196\/jmir.5870","volume":"18","author":"W Luo","year":"2016","unstructured":"Luo W, Phung D, Tran T, Gupta S, Rana S, Karmakar C, Shilton A, Yearwood J, Dimitrova N, Ho T B et al (2016) Guidelines for developing and reporting machine learning predictive models in biomedical research: a multidisciplinary view. J Med Internet Res 18(12):e323","journal-title":"J Med Internet Res"},{"issue":"1","key":"10243_CR56","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s10664-021-10009-1","volume":"27","author":"A Mahadi","year":"2022","unstructured":"Mahadi A, Ernst N A, Tongay K (2022) Conclusion stability for natural language based mining of design discussions. Empir Softw Eng 27(1):1\u201342","journal-title":"Empir Softw Eng"},{"issue":"3","key":"10243_CR57","doi-asserted-by":"publisher","first-page":"519","DOI":"10.1093\/biomet\/57.3.519","volume":"57","author":"KV Mardia","year":"1970","unstructured":"Mardia K V (1970) Measures of multivariate skewness and kurtosis with applications. 
Biometrika 57(3):519\u2013530","journal-title":"Biometrika"},{"key":"10243_CR58","doi-asserted-by":"crossref","unstructured":"Menzies T (2001) Practical machine learning for software engineering and knowledge engineering. In: Handbook of software engineering and knowledge engineering: volume I: fundamentals. World Scientific, pp 837\u2013862","DOI":"10.1142\/9789812389718_0035"},{"key":"10243_CR59","doi-asserted-by":"publisher","first-page":"35","DOI":"10.1016\/j.infsof.2019.04.005","volume":"112","author":"T Menzies","year":"2019","unstructured":"Menzies T, Shepperd M (2019) \u201cBad smells\u201d in software analytics papers. Inf Softw Technol 112:35\u201347","journal-title":"Inf Softw Technol"},{"issue":"4","key":"10243_CR60","doi-asserted-by":"publisher","first-page":"375","DOI":"10.1007\/s10515-010-0069-5","volume":"17","author":"T Menzies","year":"2010","unstructured":"Menzies T, Milton Z, Turhan B, Cukic B, Jiang Y, Bener A (2010) Defect prediction from static code features: current results, limitations, new approaches. Autom Softw Eng 17(4):375\u2013407","journal-title":"Autom Softw Eng"},{"issue":"3","key":"10243_CR61","doi-asserted-by":"publisher","first-page":"333","DOI":"10.1007\/s00766-018-0292-3","volume":"23","author":"L Montgomery","year":"2018","unstructured":"Montgomery L, Damian D, Bulmer T, Quader S (2018) Customer support ticket escalation prediction using feature engineering. Requir Eng 23(3):333\u2013355","journal-title":"Requir Eng"},{"key":"10243_CR62","doi-asserted-by":"crossref","unstructured":"Moser R, Pedrycz W, Succi G (2008) A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. 
In: IEEE\/ACM International conference on software engineering, pp 181\u2013190","DOI":"10.1145\/1368088.1368114"},{"key":"10243_CR63","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825\u20132830","journal-title":"J Mach Learn Res"},{"key":"10243_CR64","doi-asserted-by":"crossref","unstructured":"Petersen K, Feldt R, Mujtaba S, Mattsson M (2008) Systematic mapping studies in software engineering. In: International conference on evaluation and assessment in software engineering, pp 1\u201310","DOI":"10.14236\/ewic\/EASE2008.8"},{"key":"10243_CR65","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.infsof.2015.03.007","volume":"64","author":"K Petersen","year":"2015","unstructured":"Petersen K, Vakkalanka S, Kuzniarz L (2015) Guidelines for conducting systematic mapping studies in software engineering: an update. Inf Softw Technol 64:1\u201318","journal-title":"Inf Softw Technol"},{"key":"10243_CR66","doi-asserted-by":"crossref","unstructured":"Pinto G, Miranda B, Dissanayake S, d\u2019Amorim M, Treude C, Bertolino A (2020) What is the vocabulary of flaky tests?. In: International conference on mining software repositories, pp 492\u2013502","DOI":"10.1145\/3379597.3387482"},{"issue":"7","key":"10243_CR67","doi-asserted-by":"publisher","first-page":"1414","DOI":"10.1109\/TSE.2019.2924371","volume":"47","author":"G Rajbahadur","year":"2021","unstructured":"Rajbahadur G, Wang S, Kamei Y, Hassan A E (2021) Impact of discretization noise of the dependent variable on machine learning classifiers in software engineering. 
IEEE Trans Softw Eng 47(7):1414\u20131430","journal-title":"IEEE Trans Softw Eng"},{"key":"10243_CR68","unstructured":"Ralph P, Baltes S, Bianculli D, Dittrich Y, Felderer M, Feldt R, Filieri A, Furia C A, Graziotin D, He P, Hoda R, Juristo N, Kitchenham B A, Robbes R, M\u00e9ndez D, Molleri J, Spinellis D, Staron M, Stol K, Tamburri D A, Torchiano M, Treude C, Turhan B, Vegas S (2020) Empirical standards for software engineering research. CoRR arXiv:2010.03525"},{"issue":"3","key":"10243_CR69","doi-asserted-by":"publisher","first-page":"333","DOI":"10.1007\/s10994-011-5256-5","volume":"85","author":"J Read","year":"2011","unstructured":"Read J, Pfahringer B, Holmes G, Frank E (2011) Classifier chains for multi-label classification. Mach Learn 85(3):333\u2013359","journal-title":"Mach Learn"},{"issue":"3","key":"10243_CR70","doi-asserted-by":"publisher","first-page":"317","DOI":"10.1023\/A:1009752403260","volume":"1","author":"SL Salzberg","year":"1997","unstructured":"Salzberg S L (1997) On comparing classifiers: pitfalls to avoid and a recommended approach. Data Min Knowl Disc 1(3):317\u2013328","journal-title":"Data Min Knowl Disc"},{"key":"10243_CR71","doi-asserted-by":"crossref","unstructured":"Sheskin D J (2020) Handbook of parametric and nonparametric statistical procedures. CRC Press","DOI":"10.1201\/9780429186196"},{"key":"10243_CR72","doi-asserted-by":"crossref","unstructured":"Siebert J, Joeckel L, Heidrich J, Nakamichi K, Ohashi K, Namba I, Yamamoto R, Aoyama M (2020) Towards guidelines for assessing qualities of machine learning systems. In: International conference on the quality of information and communications technology. Springer, pp 17\u201331","DOI":"10.1007\/978-3-030-58793-2_2"},{"key":"10243_CR73","unstructured":"Sorower M S (2010) A literature survey on algorithms for multi-label learning. Tech. 
rep., Oregon State University"},{"key":"10243_CR74","doi-asserted-by":"crossref","unstructured":"Stapor K (2017) Evaluating and comparing classifiers: review, some recommendations and limitations. In: International conference on computer recognition systems. Springer, pp 12\u201321","DOI":"10.1007\/978-3-319-59162-9_2"},{"issue":"3","key":"10243_CR75","doi-asserted-by":"publisher","first-page":"279","DOI":"10.4300\/JGME-D-12-00156.1","volume":"4","author":"GM Sullivan","year":"2012","unstructured":"Sullivan G M, Feinn R (2012) Using effect size\u2014or why the P value is not enough. J Grad Med Educ 4(3):279\u2013282","journal-title":"J Grad Med Educ"},{"issue":"7","key":"10243_CR76","doi-asserted-by":"publisher","first-page":"683","DOI":"10.1109\/TSE.2018.2794977","volume":"45","author":"C Tantithamthavorn","year":"2019","unstructured":"Tantithamthavorn C, McIntosh S, Hassan A E, Matsumoto K (2019) The impact of automated parameter optimization on defect prediction models. IEEE Trans Softw Eng 45(7):683\u2013711","journal-title":"IEEE Trans Softw Eng"},{"key":"10243_CR77","doi-asserted-by":"crossref","unstructured":"Tanwani A K, Afridi J, Shafiq M Z, Farooq M (2009) Guidelines to select machine learning scheme for classification of biomedical datasets. In: European conference on evolutionary computation, machine learning and data mining in bioinformatics. Springer, pp 128\u2013139","DOI":"10.1007\/978-3-642-01184-9_12"},{"issue":"1","key":"10243_CR78","first-page":"60","volume":"30","author":"C Tian","year":"2018","unstructured":"Tian C, Manfei X, Justin T, Hongyue W, Xiaohui N (2018) Relationship between omnibus and post-hoc tests: an investigation of performance of the F test in ANOVA. 
Shanghai Arch Psychiatry 30(1):60","journal-title":"Shanghai Arch Psychiatry"},{"key":"10243_CR79","doi-asserted-by":"publisher","first-page":"107245","DOI":"10.1016\/j.patcog.2020.107245","volume":"103","author":"N Tran","year":"2020","unstructured":"Tran N, Schneider J G, Weber I, Qin A (2020) Hyper-parameter optimization in classification: to-do or not-to-do. Pattern Recogn 103:107245","journal-title":"Pattern Recogn"},{"issue":"12","key":"10243_CR80","doi-asserted-by":"publisher","first-page":"4954","DOI":"10.1021\/acs.chemmater.0c01907","volume":"32","author":"AYT Wang","year":"2020","unstructured":"Wang A Y T, Murdock R J, Kauwe S K, Oliynyk A O, Gurlo A, Brgoch J, Persson K A, Sparks T D (2020) Machine learning for materials scientists: an introductory guide toward best practices. Chem Mater 32(12):4954\u20134965","journal-title":"Chem Mater"},{"key":"10243_CR81","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-662-43839-8","volume-title":"Design science methodology for information systems and software engineering","author":"RJ Wieringa","year":"2014","unstructured":"Wieringa R J (2014) Design science methodology for information systems and software engineering. Springer, London"},{"key":"10243_CR82","doi-asserted-by":"crossref","unstructured":"Wohlin C, Runeson P, H\u00f6st M, Ohlsson M C, Regnell B, Wessl\u00e9n A (2012) Experimentation in software engineering. Springer Science & Business Media","DOI":"10.1007\/978-3-642-29044-2"},{"key":"10243_CR83","doi-asserted-by":"crossref","unstructured":"Yao J, Shepperd M (2020) Assessing software defection prediction performance: why using the matthews correlation coefficient matters. 
In: International conference on the evaluation and assessment in software engineering, pp 120\u2013129","DOI":"10.1145\/3383219.3383232"},{"issue":"2","key":"10243_CR84","doi-asserted-by":"publisher","first-page":"87","DOI":"10.1023\/A:1023760326768","volume":"11","author":"D Zhang","year":"2003","unstructured":"Zhang D, Tsai J J (2003) Machine learning and software engineering. Softw Qual J 11(2):87\u2013119","journal-title":"Softw Qual J"},{"key":"10243_CR85","doi-asserted-by":"crossref","unstructured":"Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B (2009) Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In: Joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, pp 91\u2013100","DOI":"10.1145\/1595696.1595713"}],"container-title":["Empirical Software Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10664-022-10243-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10664-022-10243-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10664-022-10243-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,10]],"date-time":"2023-01-10T04:39:54Z","timestamp":1673325594000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10664-022-10243-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,11,8]]},"references-count":85,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,1]]}},"alternative-id":["10243"],"URL":"https:\/\/doi.org\/10.1007\/s10664-022-10243-1","relation":{},"ISSN":["1382-3256","1573-7616"],"issn-t
ype":[{"type":"print","value":"1382-3256"},{"type":"electronic","value":"1573-7616"}],"subject":[],"published":{"date-parts":[[2022,11,8]]},"assertion":[{"value":"20 September 2022","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 November 2022","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors have no conflicts of interest to declare that are relevant to the content of this article.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of Interest"}}],"article-number":"3"}}