{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,31]],"date-time":"2025-10-31T17:06:18Z","timestamp":1761930378050,"version":"build-2065373602"},"reference-count":48,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2025,2,20]],"date-time":"2025-02-20T00:00:00Z","timestamp":1740009600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>The amount of software engineering data is constantly growing, as more and more developers employ online services to store their code, keep track of bugs, or even discuss issues. The data residing in these services can be mined to address different research challenges; therefore, certain initiatives have been established to encourage sharing research datasets collecting them. In this work, we investigate the effect of such an initiative; we create a directory that includes the papers and the corresponding datasets of the data track of the Mining Software Engineering (MSR) conference. Specifically, our directory includes metadata and citation information for the papers of all data tracks, throughout the last twelve years. We also annotate the datasets according to the data source and further assess their compliance to the FAIR principles. Using our directory, researchers can find useful datasets for their research, or even design methodologies for assessing their quality, especially in the software engineering domain. Moreover, the directory can be used for analyzing the citations of data papers, especially with regard to different data categories, as well as for examining their FAIRness score throughout the years, along with its effect on the usage\/citation of the datasets.<\/jats:p>","DOI":"10.3390\/data10030028","type":"journal-article","created":{"date-parts":[[2025,2,20]],"date-time":"2025-02-20T06:10:21Z","timestamp":1740031821000},"page":"28","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["A Directory of Datasets for Mining Software Repositories"],"prefix":"10.3390","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0520-7225","authenticated-orcid":false,"given":"Themistoklis","family":"Diamantopoulos","sequence":"first","affiliation":[{"name":"Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, 541 24 Thessaloniki, Greece"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0235-6046","authenticated-orcid":false,"given":"Andreas L.","family":"Symeonidis","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, 541 24 Thessaloniki, Greece"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2025,2,20]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"20","DOI":"10.1109\/MS.2005.153","article-title":"Guest Editor\u2019s Introduction: The Promise of Public Software Engineering Data Repositories","volume":"22","author":"Cukic","year":"2005","journal-title":"IEEE Softw."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"3288","DOI":"10.1007\/s10664-020-09834-7","article-title":"Standing on shoulders or feet? An extended study on the usage of the MSR data papers","volume":"25","author":"Kotti","year":"2020","journal-title":"Empir. Softw. Eng."},{"key":"ref_3","unstructured":"Sayyad Shirabad, J., and Menzies, T. (2005). The PROMISE Repository of Software Engineering Databases, School of Information Technology and Engineering, University of Ottawa."},{"key":"ref_4","unstructured":"(2025, February 19). European Organization For Nuclear Research and OpenAIRE. Zenodo. Available online: https:\/\/doi.org\/10.25495\/7GXK-RD71."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"171","DOI":"10.1023\/B:SCIE.0000041647.01086.f4","article-title":"Global knowledge management research: A bibliometric analysis","volume":"61","author":"Gu","year":"2004","journal-title":"Scientometrics"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Robles, G. (2010, January 2\u20133). Replicating MSR: A study of the potential replicability of papers published in the Mining Software Repositories proceedings. Proceedings of the 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010), Cape Town, South Africa.","DOI":"10.1109\/MSR.2010.5463348"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"de Freitas, F.G., and de Souza, J.T. (2011, January 10\u201312). Ten years of search based software engineering: A bibliometric analysis. Proceedings of the Third International Conference on Search Based Software Engineering, Szeged, Hungary.","DOI":"10.1007\/978-3-642-23716-4_5"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Kotti, Z., and Spinellis, D. (2019, January 26\u201327). Standing on shoulders or feet? The usage of the MSR data papers. Proceedings of the 16th International Conference on Mining Software Repositories, Montreal, ON, Canada.","DOI":"10.1109\/MSR.2019.00085"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Zogaan, W., Sharma, P., Mirahkorli, M., and Arnaoudova, V. (2017, January 4\u20138). Datasets from Fifteen Years of Automated Requirements Traceability Research: Current State, Characteristics, and Quality. Proceedings of the 2017 IEEE 25th International Requirements Engineering Conference (RE), Lisbon, Portugal.","DOI":"10.1109\/RE.2017.80"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Liebchen, G.A., and Shepperd, M. (2008, January 12\u201313). Data sets and data quality in software engineering. Proceedings of the 4th International Workshop on Predictor Models in Software Engineering, New York, NY, USA.","DOI":"10.1145\/1370788.1370799"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"160018","DOI":"10.1038\/sdata.2016.18","article-title":"The FAIR Guiding Principles for scientific data management and stewardship","volume":"3","author":"Wilkinson","year":"2016","journal-title":"Sci. Data"},{"key":"ref_12","unstructured":"Sun, C., Emonet, V., and Dumontier, M. (2022, January 13\u201316). A comprehensive comparison of automated FAIRness Evaluation Tools. Proceedings of the Semantic Web Applications and Tools for Health Care and Life Sciences, Rheinisch-Westfaelische Technische Hochschule Aachen * Lehrstuhl Informatik V, Basel, Switzerland."},{"key":"ref_13","unstructured":"International DOI Foundation (2023). The DOI\u00ae Handbook, International DOI Foundation."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Markovtsev, V., and Long, W. (2018, January 28\u201329). Public git archive: A big code dataset for all. Proceedings of the 15th International Conference on Mining Software Repositories, Gothenburg, Sweden.","DOI":"10.1145\/3196398.3196464"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Manasa Venigalla, A.S., and Chimalakonda, S. (2023, January 15\u201316). DocMine: A Software Documentation-Related Dataset of 950 GitHub Repositories. Proceedings of the 2023 IEEE\/ACM 20th International Conference on Mining Software Repositories, Los Alamitos, CA, USA.","DOI":"10.1109\/MSR59073.2023.00062"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Karampatsis, R.M., and Sutton, C. (2020, January 25\u201326). How Often Do Single-Statement Bugs Occur? The ManySStuBs4J Dataset. Proceedings of the 17th International Conference on Mining Software Repositories, Seoul, Republic of Korea.","DOI":"10.1145\/3379597.3387491"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Diamantopoulos, T., Nastos, D.N., and Symeonidis, A. (2023, January 15\u201316). Semantically-enriched Jira Issue Tracking Data. Proceedings of the 2023 IEEE\/ACM 20th International Conference on Mining Software Repositories (MSR), Melbourne, Australia.","DOI":"10.1109\/MSR59073.2023.00039"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Warrick, M., Rosenblatt, S.F., Young, J.G., Casari, A., H\u00e9bert-Dufresne, L., and Bagrow, J. (2022, January 23\u201324). The OCEAN mailing list data set: Network analysis spanning mailing lists and code repositories. Proceedings of the 19th International Conference on Mining Software Repositories, Pittsburgh, PA, USA.","DOI":"10.1145\/3524842.3528479"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Squire, M. (2013, January 18\u201319). Project roles in the apache software foundation: A dataset. Proceedings of the 10th Working Conference on Mining Software Repositories, San Francisco, CA, USA.","DOI":"10.1109\/MSR.2013.6624042"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Spinellis, D. (2015, January 16\u201317). A repository with 44 years of Unix evolution. Proceedings of the 12th Working Conference on Mining Software Repositories, Florence, Italy.","DOI":"10.1109\/MSR.2015.64"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Spinellis, D. (2018, January 28\u201329). Documented unix facilities over 48 years. Proceedings of the 15th International Conference on Mining Software Repositories, Gothenburg, Sweden.","DOI":"10.1145\/3196398.3196476"},{"key":"ref_22","first-page":"993","article-title":"Latent Dirichlet Allocation","volume":"3","author":"Blei","year":"2003","journal-title":"J. Mach. Learn. Res."},{"key":"ref_23","unstructured":"Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013, January 2\u20134). Efficient Estimation of Word Representations in Vector Space. Proceedings of the 1st International Conference on Learning Representations, Scottsdale, AZ, USA."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. (2017, January 3\u20137). Bag of Tricks for Efficient Text Classification. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Valencia, Spain.","DOI":"10.18653\/v1\/E17-2068"},{"key":"ref_25","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Kamp, M., Kreutzer, P., and Philippsen, M. (2019, January 26\u201327). SeSaMe: A data set of semantically similar Java methods. Proceedings of the 16th International Conference on Mining Software Repositories, Montreal, ON, Canada.","DOI":"10.1109\/MSR.2019.00079"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Efstathiou, V., Chatzilenas, C., and Spinellis, D. (2018, January 28\u201329). Word embeddings for the software engineering domain. Proceedings of the 15th International Conference on Mining Software Repositories, New York, NY, USA.","DOI":"10.1145\/3196398.3196448"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Henkel, J., Bird, C., Lahiri, S.K., and Reps, T. (2020, January 25\u201326). A Dataset of Dockerfiles. Proceedings of the 17th International Conference on Mining Software Repositories, Seoul, Republic of Korea.","DOI":"10.1145\/3379597.3387498"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Quaranta, L., Calefato, F., and Lanubile, F. (2021, January 22\u201330). KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle. Proceedings of the 2021 IEEE\/ACM 18th International Conference on Mining Software Repositories (MSR), Madrid, Spain.","DOI":"10.1109\/MSR52588.2021.00072"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Zacchiroli, S. (2022, January 23\u201324). A large-scale dataset of (open source) license text variants. Proceedings of the 19th International Conference on Mining Software Repositories, Pittsburgh, PA, USA.","DOI":"10.1145\/3524842.3528491"},{"key":"ref_31","unstructured":"Rehurek, R., and Sojka, P. (2011). Gensim\u2013Python Framework for Vector Space Modelling, NLP Centre, Faculty of Informatics, Masaryk University."},{"key":"ref_32","unstructured":"R\u00f6der, M., Both, A., and Hinneburg, A. (February, January 31). Exploring the Space of Topic Coherence Measures. Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, Shanghai, China."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"174","DOI":"10.1038\/s41597-019-0184-5","article-title":"Evaluating FAIR maturity through a scalable, automated, community-governed framework","volume":"6","author":"Wilkinson","year":"2019","journal-title":"Sci. Data"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Gaignard, A., Rosnet, T., De Lamotte, F., Lefort, V., and Devignes, M.D. (2023). FAIR-Checker: Supporting digital resource findability and reuse with Knowledge Graphs and Semantic Web standards. J. Biomed. Semant., 14.","DOI":"10.1186\/s13326-023-00289-5"},{"key":"ref_35","first-page":"20","article-title":"From Conceptualization to Implementation: FAIR Assessment of Research Data Objects","volume":"4","author":"Devaraju","year":"2021","journal-title":"Data Sci. J."},{"key":"ref_36","unstructured":"Devaraju, A., Huber, R., Mokrane, M., Herterich, P., Cepinskas, L., de Vries, J., L\u2019Hours, H., Davidson, J., and White, A. (2022). FAIRsFAIR Data Object Assessment Metrics, FAIRsFAIR."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"100370","DOI":"10.1016\/j.patter.2021.100370","article-title":"An automated solution for measuring the progress toward FAIR research data","volume":"2","author":"Devaraju","year":"2021","journal-title":"Patterns"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Diamantopoulos, T., Papamichail, M.D., Karanikiotis, T., Chatzidimitriou, K.C., and Symeonidis, A.L. (2020, January 25\u201326). Employing Contribution and Quality Metrics for Quantifying the Software Development Process. Proceedings of the 17th International Conference on Mining Software Repositories, Seoul, Republic of Korea.","DOI":"10.1145\/3379597.3387490"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Allix, K., Bissyand\u00e9, T.F., Klein, J., and Le Traon, Y. (2016, January 14\u201315). AndroZoo: Collecting millions of Android apps for the research community. Proceedings of the 13th International Conference on Mining Software Repositories, New York, NY, USA.","DOI":"10.1145\/2901739.2903508"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Gousios, G. (2013, January 18\u201319). The GHTorent dataset and tool suite. Proceedings of the 10th Working Conference on Mining Software Repositories, San Francisco, CA, USA.","DOI":"10.1109\/MSR.2013.6624034"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Ehrlinger, L., and W\u00f6\u00df, W. (2022). A Survey of Data Quality Measurement and Monitoring Tools. Front. Big Data, 5.","DOI":"10.3389\/fdata.2022.850611"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Cavanillas, J.M., Curry, E., and Wahlster, W. (2016). Big Data Curation. New Horizons for a Data-Driven Economy: A Roadmap for Usage and Exploitation of Big Data in Europe, Springer International Publishing.","DOI":"10.1007\/978-3-319-21569-3"},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"16","DOI":"10.1145\/1541880.1541883","article-title":"Methodologies for data quality assessment and improvement","volume":"41","author":"Batini","year":"2009","journal-title":"ACM Comput. Surv."},{"key":"ref_44","first-page":"e1191","article-title":"Data curation in the Internet of Things: A decision model approach","volume":"3","year":"2021","journal-title":"Comput. Math. Methods"},{"key":"ref_45","first-page":"19","article-title":"Experience: Quality Benchmarking of Datasets Used in Software Effort Estimation","volume":"11","author":"Bosu","year":"2019","journal-title":"J. Data Inf. Qual."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"145614","DOI":"10.1109\/ACCESS.2019.2945911","article-title":"Two-Stage Topic Extraction Model for Bibliometric Data Analysis Based on Word Embeddings and Clustering","volume":"7","author":"Onan","year":"2019","journal-title":"IEEE Access"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Cachola, I., Lo, K., Cohan, A., and Weld, D. (2020, January 16\u201320). TLDR: Extreme Summarization of Scientific Documents. Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online.","DOI":"10.18653\/v1\/2020.findings-emnlp.428"},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"102126","DOI":"10.1016\/j.ecoinf.2023.102126","article-title":"FAIR degree assessment in agriculture datasets using the F-UJI tool","volume":"76","author":"Petrosyan","year":"2023","journal-title":"Ecol. Inform."}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/10\/3\/28\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T16:38:41Z","timestamp":1760027921000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/10\/3\/28"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,20]]},"references-count":48,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2025,3]]}},"alternative-id":["data10030028"],"URL":"https:\/\/doi.org\/10.3390\/data10030028","relation":{},"ISSN":["2306-5729"],"issn-type":[{"type":"electronic","value":"2306-5729"}],"subject":[],"published":{"date-parts":[[2025,2,20]]}}}