{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,4]],"date-time":"2026-05-04T00:28:56Z","timestamp":1777854536909,"version":"3.51.4"},"reference-count":30,"publisher":"SAGE Publications","issue":"2","license":[{"start":{"date-parts":[[2021,3,2]],"date-time":"2021-03-02T00:00:00Z","timestamp":1614643200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"funder":[{"DOI":"10.13039\/501100011033","name":"agencia estatal de investigaci\u00f3n","doi-asserted-by":"publisher","award":["TIN2017-84658-C2-1-R"],"award-info":[{"award-number":["TIN2017-84658-C2-1-R"]}],"id":[{"id":"10.13039\/501100011033","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100010198","name":"Ministerio de Econom\u00eda, Industria y Competitividad, Gobierno de Espa\u00f1a","doi-asserted-by":"publisher","award":["TIN2017-84658-C2-1-R"],"award-info":[{"award-number":["TIN2017-84658-C2-1-R"]}],"id":[{"id":"10.13039\/501100010198","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100008530","name":"European Regional Development Fund","doi-asserted-by":"publisher","award":["TIN2017-84658-C2-1-R"],"award-info":[{"award-number":["TIN2017-84658-C2-1-R"]}],"id":[{"id":"10.13039\/501100008530","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100008425","name":"Conseller\u00eda de Cultura, Educaci\u00f3n e Ordenaci\u00f3n Universitaria, Xunta de Galicia","doi-asserted-by":"publisher","award":["ED481B 2017\/018"],"award-info":[{"award-number":["ED481B 2017\/018"]}],"id":[{"id":"10.13039\/501100008425","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["Journal of Information Science"],"published-print":{"date-parts":[[2023,4]]},"abstract":"<jats:p>\n                    Current research has evolved in such a way scientists 
must not only adequately describe the algorithms they introduce and the results of their application, but also ensure that those results can be reproduced and compared with those obtained through other approaches. In this context, public data sets (sometimes shared through repositories) are one of the most important elements for the development of experimental protocols and test benches. This study has analysed a significant number of CS\/ML (\n                    <jats:italic>Computer Science<\/jats:italic>\n                    \/\n                    <jats:italic>Machine Learning<\/jats:italic>\n                    ) research data repositories and data sets and detected some limitations that hamper their utility. In particular, we identify and discuss the following functionalities demanded of repositories: (1) building customised data sets for specific research tasks, (2) facilitating the comparison of different techniques using dissimilar pre-processing methods, (3) ensuring the availability of software applications to reproduce the pre-processing steps without using the repository functionalities and (4) providing protection mechanisms for licensing issues and user rights. To demonstrate these functionalities, we created the STRep (Spam Text Repository) web application, which implements our recommendations adapted to the field of spam text repositories. 
In addition, we launched an instance of STRep in the URL\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/rdata.4spam.group\">https:\/\/rdata.4spam.group<\/jats:ext-link>\n                    to facilitate understanding of this study.\n                  <\/jats:p>","DOI":"10.1177\/0165551521998636","type":"journal-article","created":{"date-parts":[[2021,3,3]],"date-time":"2021-03-03T02:53:15Z","timestamp":1614739995000},"page":"285-301","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":2,"title":["Improvements for research data repositories: The case of text spam"],"prefix":"10.1177","volume":"49","author":[{"given":"Ismael","family":"V\u00e1zquez","sequence":"first","affiliation":[{"name":"Department of Computer Science, University of Vigo, Spain"}]},{"given":"Mar\u00eda","family":"Novo-Lour\u00e9s","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Vigo, Spain; SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Spain"}]},{"given":"Reyes","family":"Pav\u00f3n","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Vigo, Spain; SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Spain"}]},{"given":"Rosal\u00eda","family":"Laza","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Vigo, Spain; SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Spain"}]},{"given":"Jos\u00e9 Ram\u00f3n","family":"M\u00e9ndez","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Vigo, Spain; SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, 
Spain"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6050-373X","authenticated-orcid":false,"given":"David","family":"Ruano-Ord\u00e1s","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Vigo, Spain; SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Spain"}]}],"member":"179","published-online":{"date-parts":[[2021,3,2]]},"reference":[{"key":"e_1_3_3_2_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2017.03.003"},{"key":"e_1_3_3_3_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.radphyschem.2019.108479"},{"key":"e_1_3_3_4_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ifacol.2018.06.173"},{"key":"e_1_3_3_5_2","doi-asserted-by":"publisher","DOI":"10.1162\/99608f92.e38165eb."},{"key":"e_1_3_3_6_2","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0078080"},{"key":"e_1_3_3_7_2","first-page":"25","article-title":"Functional requirements for research data repositories","volume":"8","author":"Kim S","year":"2018","unstructured":"Kim S. Functional requirements for research data repositories. Int J Knowl Cont Dev Technol 2018; 8: 25\u201336.","journal-title":"Int J Knowl Cont Dev Technol"},{"key":"e_1_3_3_8_2","first-page":"18","article-title":"Research data repositories: the what, when, why, and how","volume":"36","author":"Uzwyshyn R","year":"2016","unstructured":"Uzwyshyn R. Research data repositories: the what, when, why, and how. Comput Libr 2016; 36: 18\u201321.","journal-title":"Comput Libr"},{"key":"e_1_3_3_9_2","doi-asserted-by":"publisher","DOI":"10.1038\/sdata.2016.18"},{"key":"e_1_3_3_10_2","unstructured":"COAR Next Generation Repositories Working Group. 
Next generation repositories: behaviours and technical recommendations of the COAR next generation repositories Working Group 2017 https:\/\/www.coar-repositories.org\/files\/NGR-Final-Formatted-Report-cc.pdf"},{"key":"e_1_3_3_11_2","unstructured":"Repository Platforms for Research Data Interest Group of the Research Data Alliance. Matrix of use cases and functional requirements for research data repository platforms 2016 https:\/\/www.rd-alliance.org\/group\/repository-platforms-research-data-ig\/outcomes\/matrix-use-cases-and-functional-requirements"},{"key":"e_1_3_3_12_2","unstructured":"CoreTrustSeal. CoreTrustSeal requirements: periodic review 2019 2020 https:\/\/www.coretrustseal.org\/wp-content\/uploads\/2020\/10\/CoreTrustSeal-2019-Review-Report-v02_00.pdf"},{"key":"e_1_3_3_13_2","unstructured":"Harmsen H Keite C Schmidt C et al. Explanatory notes on the nestor seal for trustworthy digital archives 2013 http:\/\/nbn-resolving.de\/urn:nbn:de:0008-2013100901"},{"key":"e_1_3_3_14_2","unstructured":"The Consultative Committee for Space Data Systems (CCSDS). Audit and certification of trustworthy digital repositories 2011 https:\/\/public.ccsds.org\/pubs\/652x0m1.pdf"},{"key":"e_1_3_3_15_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41597-020-0486-7"},{"key":"e_1_3_3_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/2641190.2641198"},{"key":"e_1_3_3_17_2","first-page":"1","volume-title":"Proceedings of the 2014 IEEE international conference on computational intelligence and computing research","author":"Dudhabaware RS","unstructured":"Dudhabaware RS, Madankar MS. Review on natural language processing tasks for text documents. In: Proceedings of the 2014 IEEE international conference on computational intelligence and computing research, Coimbatore, India, 18\u201320 December 2014, pp. 1\u20135. 
New York: IEEE."},{"key":"e_1_3_3_18_2","doi-asserted-by":"publisher","DOI":"10.1017\/S1351324916000334"},{"key":"e_1_3_3_19_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00325"},{"key":"e_1_3_3_20_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.asoc.2018.12.008"},{"key":"e_1_3_3_21_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cosrev.2018.06.001"},{"key":"e_1_3_3_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2017.2690342"},{"key":"e_1_3_3_23_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2016.05.001"},{"key":"e_1_3_3_24_2","first-page":"19","volume-title":"Proceedings of the 9th ACM SIGMOD workshop on research issues in data mining and knowledge discovery (DMKD\u201904)","author":"Clifton C","unstructured":"Clifton C, Kantarcio\u01e7lu M, Doan A et al. Privacy-preserving data integration and sharing. In: Proceedings of the 9th ACM SIGMOD workshop on research issues in data mining and knowledge discovery (DMKD\u201904), Paris, 13 June 2004, p. 19. New York: ACM Press."},{"key":"e_1_3_3_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2007.142"},{"key":"e_1_3_3_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/342009.335438"},{"key":"e_1_3_3_27_2","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0155036"},{"key":"e_1_3_3_28_2","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0194317"},{"key":"e_1_3_3_29_2","doi-asserted-by":"publisher","DOI":"10.1111\/bjet.12052"},{"key":"e_1_3_3_30_2","doi-asserted-by":"crossref","unstructured":"M\u00e9ndez JR Novo-Lour\u00e9s M Pav\u00f3n R et al. Natural Language Pre-processing Architecture (NLPA) 2019 http:\/\/doi.org\/10.5281\/zenodo.3356590","DOI":"10.1155\/2020\/2390941"},{"key":"e_1_3_3_31_2","unstructured":"Novo-Lour\u00e9s M M\u00e9ndez JR Lage Y et al. 
Big Data Preprocessing for Java (BDP4J) 2019 http:\/\/doi.org\/10.5281\/zenodo.3254754"}],"container-title":["Journal of Information Science"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/0165551521998636","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/0165551521998636","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/0165551521998636","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T23:09:33Z","timestamp":1777504173000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/0165551521998636"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,3,2]]},"references-count":30,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,4]]}},"alternative-id":["10.1177\/0165551521998636"],"URL":"https:\/\/doi.org\/10.1177\/0165551521998636","relation":{},"ISSN":["0165-5515","1741-6485"],"issn-type":[{"value":"0165-5515","type":"print"},{"value":"1741-6485","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,3,2]]}}}