{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,25]],"date-time":"2026-02-25T17:12:48Z","timestamp":1772039568242,"version":"3.50.1"},"reference-count":43,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2020,5,27]],"date-time":"2020-05-27T00:00:00Z","timestamp":1590537600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,5,27]],"date-time":"2020-05-27T00:00:00Z","timestamp":1590537600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Empir Software Eng"],"published-print":{"date-parts":[[2020,7]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n<jats:title>Context<\/jats:title>\n<jats:p>Within the field of Mining Software Repositories, there are numerous methods employed to filter datasets in order to avoid analysing low-quality projects. Unfortunately, the existing filtering methods have not kept up with the growth of existing data sources, such as GitHub, and researchers often rely on quick and dirty techniques to curate datasets.<\/jats:p>\n<\/jats:sec><jats:sec>\n<jats:title>Objective<\/jats:title>\n<jats:p>The objective of this study is to develop a method capable of filtering large quantities of software projects in a resource-efficient way.<\/jats:p>\n<\/jats:sec><jats:sec>\n<jats:title>Method<\/jats:title>\n<jats:p>This study follows the Design Science Research (DSR) methodology. The proposed method, PHANTOM, extracts five measures from Git logs. Each measure is transformed into a time-series, which is represented as a feature vector for clustering using the k-means algorithm.<\/jats:p>\n<\/jats:sec><jats:sec>\n<jats:title>Results<\/jats:title>\n<jats:p>Using the ground truth from a previous study, PHANTOM was shown to be able to rediscover the ground truth on the training dataset, and was able to identify \u201cengineered\u201d projects with up to 0.87 Precision and 0.94 Recall on the validation dataset. PHANTOM downloaded and processed the metadata of 1,786,601 GitHub repositories in 21.5 days using a single personal computer, which is over 33% faster than the previous study which used a computer cluster of 200 nodes. The possibility of applying the method outside of the open-source community was investigated by curating 100 repositories owned by two companies.<\/jats:p>\n<\/jats:sec><jats:sec>\n<jats:title>Conclusions<\/jats:title>\n<jats:p>It is possible to use an unsupervised approach to identify engineered projects. PHANTOM was shown to be competitive compared to the existing supervised approaches while reducing the hardware requirements by two orders of magnitude.<\/jats:p>\n<\/jats:sec>","DOI":"10.1007\/s10664-020-09825-8","type":"journal-article","created":{"date-parts":[[2020,5,27]],"date-time":"2020-05-27T07:03:34Z","timestamp":1590563014000},"page":"2897-2929","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":26,"title":["PHANTOM: Curating GitHub for engineered software projects using time-series clustering"],"prefix":"10.1007","volume":"25","author":[{"given":"Peter","family":"Pickerill","sequence":"first","affiliation":[]},{"given":"Heiko Joshua","family":"Jungen","sequence":"additional","affiliation":[]},{"given":"Miros\u0142aw","family":"Ochodek","sequence":"additional","affiliation":[]},{"given":"Micha\u0142","family":"Ma\u0107kowiak","sequence":"additional","affiliation":[]},{"given":"Miroslaw","family":"Staron","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2020,5,27]]},"reference":[{"key":"9825_CR1","doi-asserted-by":"crossref","unstructured":"Casalnuovo C, Devanbu P, Oliveira A, Filkov V, Ray B (2015) Assert use in github projects. In: Proceedings of the 37th international conference on software engineering-volume 1, IEEE Press, pp 755\u2013766","DOI":"10.1109\/ICSE.2015.88"},{"key":"9825_CR2","doi-asserted-by":"publisher","unstructured":"Cito J, Schermann G, Wittern JE, Leitner P, Zumberi S, Gall HC (2017) An Empirical Analysis of the Docker Container Ecosystem on GitHub. In: IEEE international working conference on mining software repositories, pp 323\u2013333. https:\/\/doi.org\/10.1109\/MSR.2017.67","DOI":"10.1109\/MSR.2017.67"},{"key":"9825_CR3","doi-asserted-by":"publisher","unstructured":"Cosentino V, Canovas Izquierdo JL, Cabot J (2017) A systematic mapping study of software development with GitHub. IEEE Access. https:\/\/doi.org\/10.1109\/ACCESS.2017.2682323","DOI":"10.1109\/ACCESS.2017.2682323"},{"key":"9825_CR4","doi-asserted-by":"publisher","first-page":"142","DOI":"10.1016\/j.ins.2013.02.030","volume":"239","author":"H Deng","year":"2013","unstructured":"Deng H, Runger G, Tuv E, Vladimir M (2013) A time series forest for classification and feature extraction. Inf Sci 239:142\u2013153. https:\/\/doi.org\/10.1016\/j.ins.2013.02.030, arXiv:1302.2277v2","journal-title":"Inf Sci"},{"key":"9825_CR5","doi-asserted-by":"crossref","unstructured":"Dyer R, Nguyen H A, Rajan H, Nguyen T N (2013) Boa: A language and infrastructure for analyzing ultra-large-scale software repositories. In: Proceedings of the 2013 international conference on software engineering, pp 422-431. IEEE Press","DOI":"10.1109\/ICSE.2013.6606588"},{"issue":"1","key":"9825_CR6","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/2379776.2379788","volume":"45","author":"P Esling","year":"2012","unstructured":"Esling P, Agon C (2012) Time-series data mining. ACM Comput Surv (CSUR) 45(1):1\u201334. https:\/\/doi.org\/10.1145\/2379776.2379788. http:\/\/dl.acm.org\/citation.cfm?doid=2379776.2379788%5Cn\nhttp:\/\/dl.acm.org\/citation.cfm?id=2379788","journal-title":"ACM Comput Surv (CSUR)"},{"issue":"4","key":"9825_CR7","doi-asserted-by":"publisher","first-page":"1009","DOI":"10.1007\/s10664-013-9245-0","volume":"19","author":"J Eyolfson","year":"2014","unstructured":"Eyolfson J, Tan L, Lam P (2014) Correlations between bugginess and time-based commit characteristics. Empir Softw Eng 19(4):1009\u20131039","journal-title":"Empir Softw Eng"},{"key":"9825_CR8","doi-asserted-by":"publisher","unstructured":"Feldt R, Staron M, Hult E, Liljegren T (2013) Supporting software decision meetings: Heatmaps for visualising test and code measurements. In: Proceedings - 39th Euromicro Conference Series on Software Engineering and Advanced Applications, SEAA. https:\/\/doi.org\/10.1109\/SEAA.2013.61","DOI":"10.1109\/SEAA.2013.61"},{"issue":"12","key":"9825_CR9","doi-asserted-by":"publisher","first-page":"3026","DOI":"10.1109\/TKDE.2014.2316504","volume":"26","author":"BD Fulcher","year":"2014","unstructured":"Fulcher BD, Jones NS (2014) Highly comparative feature-based time-series classification. IEEE Trans Knowl Data Eng 26(12):3026\u20133037. https:\/\/doi.org\/10.1109\/TKDE.2014.2316504, 1401.3531","journal-title":"IEEE Trans Knowl Data Eng"},{"key":"9825_CR10","doi-asserted-by":"crossref","unstructured":"Gabel M, Su Z (2010) A study of the uniqueness of source code. In: Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering - FSE \u201910, pp 147. http:\/\/portal.acm.org\/citation.cfm?doid=1882291.1882315","DOI":"10.1145\/1882291.1882315"},{"key":"9825_CR11","doi-asserted-by":"publisher","first-page":"1538","DOI":"10.1007\/s10664-018-9648-z","volume":"24","author":"M Gharehyazie","year":"2019","unstructured":"Gharehyazie M, Ray B, Keshani M et al (2019) Cross-project code clones in GitHub. Empir Software Eng 24:1538\u20131573. https:\/\/doi.org\/10.1007\/s10664-018-9648-z","journal-title":"Empir Software Eng"},{"key":"9825_CR12","unstructured":"GitHub (2018) Github terms of service - user documentation. https:\/\/help.github.com\/articles\/github-terms-of-service\/#c-acceptable-use"},{"key":"9825_CR13","doi-asserted-by":"publisher","unstructured":"Gonzalez D, Santos JC, Popovich A, Mirakhorli M, Nagappan M (2017) A Large-Scale Study on the Usage of Testing Patterns That Address Maintainability Attributes: Patterns for Ease of Modification, Diagnoses, and Comprehension. In: IEEE International Working Conference on Mining Software Repositories, pp 391\u2013401. https:\/\/doi.org\/10.1109\/MSR.2017.8, 1704.08412","DOI":"10.1109\/MSR.2017.8"},{"key":"9825_CR14","doi-asserted-by":"crossref","unstructured":"Gousios G (2013) The ghtorrent dataset and tool suite. In: Proceedings of the 10th working conference on mining software repositories. http:\/\/dl.acm.org\/citation.cfm?id=2487085.2487132. IEEE Press, Piscataway, MSR \u201913, pp 233\u2013236","DOI":"10.1109\/MSR.2013.6624034"},{"key":"9825_CR15","doi-asserted-by":"crossref","unstructured":"Guo C (2008) Time series clustering based on ICA for stock data analysis pp 1\u20134","DOI":"10.1109\/WiCom.2008.2534"},{"key":"9825_CR16","doi-asserted-by":"crossref","unstructured":"Hebig R, Quang TH, Chaudron MR, Robles G, Fernandez MA (2016) The quest for open source projects that use uml: Mining github. In: Proceedings of the ACM\/IEEE 19th international conference on model driven engineering languages and systems, ACM, pp 173\u2013183","DOI":"10.1145\/2976767.2976778"},{"issue":"1","key":"9825_CR17","doi-asserted-by":"publisher","first-page":"75","DOI":"10.2307\/25148625","volume":"28","author":"AR Hevner","year":"2004","unstructured":"Hevner A R, March S T, Park J, Ram S (2004) Design science in information systems research. MIS Quart 28(1):75\u2013105","journal-title":"MIS Quart"},{"key":"9825_CR18","doi-asserted-by":"crossref","unstructured":"Kalliamvakou E, Damian D, Blincoe K, Singer L, German DM (2015) Open source-style collaborative development practices in commercial projects using github. In: Proceedings of the 37th International Conference on Software Engineering-Volume 1, IEEE Press, pp 574\u2013585","DOI":"10.1109\/ICSE.2015.74"},{"issue":"5","key":"9825_CR19","doi-asserted-by":"publisher","first-page":"2035","DOI":"10.1007\/s10664-015-9393-5","volume":"21","author":"E Kalliamvakou","year":"2016","unstructured":"Kalliamvakou E, Gousios G, Blincoe K, Singer L, German DM, Damian D (2016) An in-depth study of the promises and perils of mining GitHub. Empir Softw Eng 21(5):2035\u20132071. https:\/\/doi.org\/10.1007\/s10664-015-9393-5. 2597073.2597074","journal-title":"Empir Softw Eng"},{"key":"9825_CR20","doi-asserted-by":"crossref","unstructured":"Kolassa C, Riehle D, Salim MA (2013) The empirical commit frequency distribution of open source projects. In: Proceedings of the 9th international symposium on open collaboration, ACM, pp 18","DOI":"10.1145\/2491055.2491073"},{"key":"9825_CR21","doi-asserted-by":"publisher","unstructured":"Macho C, McIntosh S, Pinzger M (2017) Extracting Build Changes with BUILDDIFF. In: IEEE international working conference on mining software repositories, pp 368\u2013378. https:\/\/doi.org\/10.1109\/MSR.2017.65, 1703.08527","DOI":"10.1109\/MSR.2017.65"},{"issue":"6","key":"9825_CR22","doi-asserted-by":"publisher","first-page":"3219","DOI":"10.1007\/s10664-017-9512-6","volume":"22","author":"N Munaiah","year":"2017","unstructured":"Munaiah N, Kroh S, Cabrey C, Nagappan M (2017) Curating GitHub for engineered software projects. Empir Softw Eng 22(6):3219\u20133253. https:\/\/doi.org\/10.1007\/s10664-017-9512-6","journal-title":"Empir Softw Eng"},{"key":"9825_CR23","doi-asserted-by":"publisher","unstructured":"Noten J, Mengerink JG, Serebrenik A (2017) A data set of OCL expressions on GitHub. In: IEEE international working conference on mining software repositories, pp 531\u2013534 . https:\/\/doi.org\/10.1109\/MSR.2017.52","DOI":"10.1109\/MSR.2017.52"},{"key":"9825_CR24","doi-asserted-by":"publisher","unstructured":"Nu\u00f1ez-Varela AS, P\u00e9rez-Gonzalez HG, Mart\u00ednez-Perez FE, Soubervielle-Montalvo C (2017) Source code metrics: A systematic mapping study. Journal of Systems and Software. https:\/\/doi.org\/10.1016\/j.jss.2017.03.044","DOI":"10.1016\/j.jss.2017.03.044"},{"key":"9825_CR25","doi-asserted-by":"crossref","unstructured":"Padhye R, Mani S, Sinha VS (2014) A study of external community contribution to open-source projects on github. In: Proceedings of the 11th working conference on mining software repositories, ACM, pp 332\u2013335","DOI":"10.1145\/2597073.2597113"},{"key":"9825_CR26","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825\u20132830","journal-title":"J Mach Learn Res"},{"key":"9825_CR27","doi-asserted-by":"publisher","unstructured":"Ratanamahatana C, Keogh E (2004) Everything you know about dynamic time warping is wrong. 3rd workshop on mining temporal and sequential data pp 22\u201325 . https:\/\/doi.org\/10.1097\/01.CCM.0000279204.24648.44. http:\/\/spoken-number-recognition.googlecode.com\/svn\/trunk\/docs\/Dynamictimewarping\/DTW_myths.pdf","DOI":"10.1097\/01.CCM.0000279204.24648.44"},{"key":"9825_CR28","doi-asserted-by":"publisher","unstructured":"Rausch T, Hummer W, Leitner P, Schulte S (2017) An Empirical Analysis of Build Failures in the Continuous Integration Workflows of Java-Based Open-Source Software. In: IEEE International Working Conference on Mining Software Repositories, pp 345\u2013355. https:\/\/doi.org\/10.1109\/MSR.2017.54","DOI":"10.1109\/MSR.2017.54"},{"key":"9825_CR29","doi-asserted-by":"crossref","unstructured":"Ray B, Posnett D, Filkov V, Devanbu P (2014) A large scale study of programming languages and code quality in github. In: Proceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering, ACM, pp 155\u2013165","DOI":"10.1145\/2635868.2635922"},{"key":"9825_CR30","doi-asserted-by":"publisher","unstructured":"Robles G, Ho-Quang T, Hebig R, Chaudron M R, Fernandez M A (2017) An extensive dataset of UML models in GitHub. IEEE international working conference on mining software repositories, pp 519\u2013522. https:\/\/doi.org\/10.1109\/MSR.2017.48","DOI":"10.1109\/MSR.2017.48"},{"issue":"10","key":"9825_CR31","doi-asserted-by":"publisher","first-page":"e0205898","DOI":"10.1371\/journal.pone.0205898","volume":"13","author":"PH Russell","year":"2018","unstructured":"Russell P H, Johnson R L, Ananthan S, Harnke B, Carlson N E (2018) A large-scale analysis of bioinformatics code on github. PloS One 13(10):e0205898","journal-title":"PloS One"},{"key":"9825_CR32","doi-asserted-by":"publisher","unstructured":"Sadat M, Bener AB, Miranskyy A (2017) Rediscovery datasets: Connecting duplicate reports. In: IEEE international working conference on mining software repositories, pp 527\u2013530. https:\/\/doi.org\/10.1109\/MSR.2017.50, 1703.06337","DOI":"10.1109\/MSR.2017.50"},{"key":"9825_CR33","doi-asserted-by":"crossref","unstructured":"Sajnani H, Saini V, Ossher J, Lopes CV (2014) Is popularity a measure of quality? an analysis of maven components. In: 2014 IEEE international conference on software maintenance and evolution, IEEE, pp 231\u2013240","DOI":"10.1109\/ICSME.2014.45"},{"key":"9825_CR34","doi-asserted-by":"crossref","unstructured":"Shimagaki J, Kamei Y, McIntosh S, Pursehouse D, Ubayashi N (2016) Why are commits being reverted?: a comparative study of industrial and open source projects. In: 2016 IEEE international conference on software maintenance and evolution, ICSME. IEEE, pp 301\u2013311","DOI":"10.1109\/ICSME.2016.83"},{"key":"9825_CR35","doi-asserted-by":"crossref","unstructured":"Silva D, Tsantalis N, Valente MT (2016) Why we refactor? confessions of github contributors. In: Proceedings of the 2016 24th ACM SIGSOFT international symposium on foundations of software engineering, ACM, pp 858\u2013870","DOI":"10.1145\/2950290.2950305"},{"key":"9825_CR36","doi-asserted-by":"publisher","unstructured":"Staron M, Hansson J, Feldt R, Meding W, Henriksson A, Nilsson S, Ho\u0307glund C (2013a) Measuring and visualizing code stability - A case study at three companies. In: Proceedings - joint conference of the 23rd international workshop on software measurement and the 8th international conference on software process and product measurement, IWSM-MENSURA. https:\/\/doi.org\/10.1109\/IWSM-Mensura.2013.35","DOI":"10.1109\/IWSM-Mensura.2013.35"},{"key":"9825_CR37","doi-asserted-by":"publisher","unstructured":"Staron M, Meding W, Hoglund C, Eriksson P, Nilsson J, Hansson J (2013) Identifying implicit architectural dependencies using measures of source code change waves. In: Proceedings - 39th Euromicro conference series on software engineering and advanced applications, SEAA. https:\/\/doi.org\/10.1109\/SEAA.2013.9","DOI":"10.1109\/SEAA.2013.9"},{"key":"9825_CR38","doi-asserted-by":"crossref","unstructured":"Vasilescu B, Yu Y, Wang H, Devanbu P, Filkov V (2015) Quality and productivity outcomes relating to continuous integration in github. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering, ACM, pp 805\u2013816","DOI":"10.1145\/2786805.2786850"},{"issue":"3","key":"9825_CR39","doi-asserted-by":"publisher","first-page":"335","DOI":"10.1007\/s10618-005-0039-x","volume":"13","author":"X Wang","year":"2006","unstructured":"Wang X, Smith K, Hyndman R (2006) Characteristic-based clustering for time series data. Data Min Knowl Disc 13(3):335\u2013364. 10.1007\/s10618-005-0039-x","journal-title":"Data Min Knowl Disc"},{"key":"9825_CR40","doi-asserted-by":"publisher","unstructured":"Wieringa R (2014) Design science methodology for information systems and software engineering. https:\/\/doi.org\/10.1145\/1810295.1810446. http:\/\/portal.acm.org\/citation.cfm?doid=1810295.1810446","DOI":"10.1145\/1810295.1810446"},{"key":"9825_CR41","doi-asserted-by":"crossref","unstructured":"Yu Y, Wang H, Filkov V, Devanbu P, Vasilescu B (2015) Wait for it: Determinants of pull request evaluation latency on github. In: 2015 IEEE\/ACM 12th working conference on mining software repositories, IEEE, pp 367\u2013371","DOI":"10.1109\/MSR.2015.42"},{"key":"9825_CR42","doi-asserted-by":"crossref","unstructured":"Zhao Y, Serebrenik A, Zhou Y, Filkov V, Vasilescu B (2017) The impact of continuous integration on other software development practices: a large-scale empirical study. In: Proceedings of the 32nd IEEE\/ACM international conference on automated software engineering, IEEE Press, pp 60\u201371","DOI":"10.1109\/ASE.2017.8115619"},{"key":"9825_CR43","doi-asserted-by":"publisher","unstructured":"Zhu C, Li Y, Rubin J, Chechik M (2017) A dataset for dynamic discovery of semantic changes in version controlled software histories. In: IEEE international working conference on mining software repositories, pp 523\u2013526. https:\/\/doi.org\/10.1109\/MSR.2017.49","DOI":"10.1109\/MSR.2017.49"}],"container-title":["Empirical Software Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10664-020-09825-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10664-020-09825-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10664-020-09825-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,5,27]],"date-time":"2021-05-27T00:05:35Z","timestamp":1622073935000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10664-020-09825-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,5,27]]},"references-count":43,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2020,7]]}},"alternative-id":["9825"],"URL":"https:\/\/doi.org\/10.1007\/s10664-020-09825-8","relation":{},"ISSN":["1382-3256","1573-7616"],"issn-type":[{"value":"1382-3256","type":"print"},{"value":"1573-7616","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,5,27]]},"assertion":[{"value":"27 May 2020","order":1,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}