{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,19]],"date-time":"2026-03-19T01:17:20Z","timestamp":1773883040475,"version":"3.50.1"},"reference-count":22,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2024,3,27]],"date-time":"2024-03-27T00:00:00Z","timestamp":1711497600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,3,27]],"date-time":"2024-03-27T00:00:00Z","timestamp":1711497600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Hochschule f\u00fcr angewandte Wissenschaften M\u00fcnchen"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["SN COMPUT. SCI."],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Nowadays, the use of third-party libraries in software is common. At the same time, the number of published libraries continues to increase. An automated classification should help to maintain an overview and identify similar software libraries. This paper investigates if new approaches can be used to classify all software libraries crawled from Apache Maven repositories into defined classes using machine learning. In addition to tags that are not always available or of poor quality, we examine one feature that is always available\u2014the id. Consisting of group-id and artifact-id, the id of an Apache Maven software library contains valuable information that can help in classification. Through a developed preprocessing and an optimized recurrent neural network (RNN), the tokenised ids should allow a classification of most libraries. Furthermore, we present an optimized approach through a hybrid use of id tokens and tags in combination. 
Based on the dataset including 28,600 labeled entries, a comparison of various approaches was carried out. The RNN achieved a balanced accuracy of 71.36% by training on tokenised ids. A model trained on tags achieved a balanced accuracy of 92%. However, the new hybrid approach, which combines tags and ids, optimizes the result to 94.12%. While a classification on tags achieves a better result than the more general id-based approach, the applicability is limited to software libraries that are tagged. The hybrid approach, on the other hand, takes advantage of the classification results based on tags when these are available, but includes valuable information from the always available ids.<\/jats:p>","DOI":"10.1007\/s42979-024-02654-2","type":"journal-article","created":{"date-parts":[[2024,3,27]],"date-time":"2024-03-27T14:01:42Z","timestamp":1711548102000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Towards an Automated Classification of Software 
Libraries"],"prefix":"10.1007","volume":"5","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4860-7464","authenticated-orcid":false,"given":"Maximilian","family":"Auch","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6837-0628","authenticated-orcid":false,"given":"Maximilian","family":"Balluff","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4508-7667","authenticated-orcid":false,"given":"Peter","family":"Mandl","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7278-8595","authenticated-orcid":false,"given":"Christian","family":"Wolff","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2024,3,27]]},"reference":[{"issue":"3","key":"2654_CR1","doi-asserted-by":"publisher","first-page":"2341","DOI":"10.1007\/s10664-019-09754-1","volume":"25","author":"P Salza","year":"2020","unstructured":"Salza P, Palomba F, Di Nucci D, de Lucia A, Ferrucci F. Third-party libraries in mobile apps. Empir Softw Eng. 2020;25(3):2341\u201377. https:\/\/doi.org\/10.1007\/s10664-019-09754-1.","journal-title":"Empir Softw Eng"},{"key":"2654_CR2","doi-asserted-by":"publisher","unstructured":"Thung F, Lo D, Lawall J. Automated library recommendation. In: 2013 20th Working Conference on reverse engineering (WCRE), 2013; pp. 182\u2013191. https:\/\/doi.org\/10.1109\/WCRE.2013.6671293.","DOI":"10.1109\/WCRE.2013.6671293"},{"key":"2654_CR3","doi-asserted-by":"publisher","DOI":"10.1016\/j.jss.2020.110669","volume":"168","author":"M Auch","year":"2020","unstructured":"Auch M, Weber M, Mandl P, Wolff C. Similarity-based analyses on software applications: a systematic literature review. J Syst Soft. 
2020;168:110669","journal-title":"J Syst Soft."},{"key":"2654_CR4","doi-asserted-by":"publisher","unstructured":"Yu H, Xia X, Zhao X, Qiu W. Combining collaborative filtering and topic modeling for more accurate android mobile app library recommendation. In: Mei H, editor. Proceedings of the 9th Asia-Pacific Symposium on Internetware. New York, NY: ACM Digital Library, ACM; 2017. p. 1\u20136. https:\/\/doi.org\/10.1145\/3131704.3131721.","DOI":"10.1145\/3131704.3131721"},{"key":"2654_CR5","doi-asserted-by":"publisher","unstructured":"Escobar-Avila J. Automatic categorization of software libraries using bytecode. In: 2015 IEEE\/ACM 37th IEEE International Conference on software engineering, 2015;2:784\u20136.https:\/\/doi.org\/10.1109\/ICSE.2015.249.","DOI":"10.1109\/ICSE.2015.249"},{"key":"2654_CR6","doi-asserted-by":"publisher","unstructured":"Auch M, Balluff M, Mandl P, Wolff C. Similarity of software libraries: a tag-based classification approach. In: Quix C, editor. DATA 2021; 17\u201328. SCITEPRESS-Science and Technology Publications Lda, Set\u00fabal, Portugal, 2021. https:\/\/doi.org\/10.5220\/0010521600170028.","DOI":"10.5220\/0010521600170028"},{"key":"2654_CR7","doi-asserted-by":"publisher","unstructured":"Vel\u00e1zquez-Rodr\u00edguez C, De Roover C. Mutama: an automated multi-label tagging approach for software libraries on maven. In: 2020 IEEE 20th International Working Conference on source code analysis and manipulation (SCAM), 2020; 254\u2013 258. https:\/\/doi.org\/10.1109\/SCAM51674.2020.00034.","DOI":"10.1109\/SCAM51674.2020.00034"},{"key":"2654_CR8","unstructured":"Sanchez C. Maven\u2014guide to naming conventions 2005. https:\/\/maven.apache.org\/guides\/mini\/guide-naming-conventions.html#guide-to-naming-conventions-on-groupid-artifactid-and-version. Accessed 9 Nov 2021."},{"key":"2654_CR9","unstructured":"Gosling J, Joy B, Steele G, Bracha G, Buckley A, Smith D, Bierman G. The Java\u00ae Language Specification 2021. 
https:\/\/docs.oracle.com\/javase\/specs\/jls\/se17\/html\/index.html. Accessed 9 Nov 2021."},{"key":"2654_CR10","unstructured":"Sutskever I, Vinyals O, Le VQ. Sequence to sequence learning with neural networks. arXiv: org\/pdf\/1409.3215v3."},{"key":"2654_CR11","volume-title":"Deep learning. Adaptive computation and machine learning","author":"I Goodfellow","year":"2016","unstructured":"Goodfellow I, Bengio Y, Courville A. Deep learning. Adaptive computation and machine learning. Cambridge: The MIT Press; 2016."},{"issue":"8","key":"2654_CR12","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S Hochreiter","year":"1997","unstructured":"Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735\u201380. https:\/\/doi.org\/10.1162\/neco.1997.9.8.1735.","journal-title":"Neural Comput"},{"key":"2654_CR13","unstructured":"Cho K, van Merrienboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv: org\/pdf\/1406.1078."},{"issue":"4","key":"2654_CR14","doi-asserted-by":"publisher","first-page":"306","DOI":"10.1002\/gepi.20211","volume":"31","author":"DR Velez","year":"2007","unstructured":"Velez DR, White BC, Motsinger AA, Bush WS, Ritchie MD, Williams SM, Moore JH. A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genet Epidemiol. 2007;31(4):306\u201315. https:\/\/doi.org\/10.1002\/gepi.20211.","journal-title":"Genet Epidemiol"},{"key":"2654_CR15","unstructured":"spaCy . Industrial-strength Natural Language Processing in Python (22.11.2021). https:\/\/spacy.io\/. Accessed 22 Nov 2021."},{"issue":"1","key":"2654_CR16","doi-asserted-by":"publisher","first-page":"91","DOI":"10.1186\/1471-2105-7-91","volume":"7","author":"S Varma","year":"2006","unstructured":"Varma S, Simon R. 
Bias in error estimation when using cross-validation for model selection. BMC Bioinform. 2006;7(1):91. https:\/\/doi.org\/10.1186\/1471-2105-7-91.","journal-title":"BMC Bioinform"},{"issue":"4","key":"2654_CR17","doi-asserted-by":"publisher","first-page":"427","DOI":"10.1016\/j.ipm.2009.03.002","volume":"45","author":"M Sokolova","year":"2009","unstructured":"Sokolova M, Lapalme G. A systematic analysis of performance measures for classification tasks. Inform Process Manag. 2009;45(4):427\u201337. https:\/\/doi.org\/10.1016\/j.ipm.2009.03.002.","journal-title":"Inform Process Manag"},{"key":"2654_CR18","doi-asserted-by":"publisher","unstructured":"Brodersen KH, Ong CS, Stephan KE, Buhmann JM. The balanced accuracy and its posterior distribution. In: 2010 20th International Conference on pattern recognition (ICPR 2010), IEEE, Piscataway, NJ, 2010; 3121\u20133124. https:\/\/doi.org\/10.1109\/ICPR.2010.764.","DOI":"10.1109\/ICPR.2010.764"},{"issue":"200","key":"2654_CR19","doi-asserted-by":"publisher","first-page":"675","DOI":"10.1080\/01621459.1937.10503522","volume":"32","author":"M Friedman","year":"1937","unstructured":"Friedman M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc. 1937;32(200):675\u2013701.","journal-title":"J Am Stat Assoc"},{"key":"2654_CR20","volume-title":"Distribution-free multiple comparisons","author":"PB Nemenyi","year":"1963","unstructured":"Nemenyi PB. Distribution-free multiple comparisons. Princeton University; 1963."},{"key":"2654_CR21","first-page":"1","volume":"7","author":"J Dem\u0161ar","year":"2006","unstructured":"Dem\u0161ar J. Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res. 2006;7:1\u201330.","journal-title":"J Mach Learn Res"},{"key":"2654_CR22","unstructured":"Maven Repository: javax.inject:javax.inject (08.12.2021). 
https:\/\/mvnrepository.com\/artifact\/javax.inject\/javax.inject Accessed 08.12.2021"}],"container-title":["SN Computer Science"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s42979-024-02654-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s42979-024-02654-2\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s42979-024-02654-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,3,27]],"date-time":"2024-03-27T14:06:40Z","timestamp":1711548400000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s42979-024-02654-2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,27]]},"references-count":22,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2024,4]]}},"alternative-id":["2654"],"URL":"https:\/\/doi.org\/10.1007\/s42979-024-02654-2","relation":{},"ISSN":["2661-8907"],"issn-type":[{"value":"2661-8907","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,3,27]]},"assertion":[{"value":"9 February 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 January 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"27 March 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"During his participation in this research project, Maximilian Auch had a part-time employment relationship with Ausy Technologies Germany. 
Maximilian Balluff also had a part-time employment relationship with IT4IPM - IT for Intellectual Property Management GmbH. In addition to these employments, which had no direct influence on the study, the authors declare that they have no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}],"article-number":"339"}}