{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T03:18:31Z","timestamp":1740107911000,"version":"3.37.3"},"reference-count":39,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2021,4,23]],"date-time":"2021-04-23T00:00:00Z","timestamp":1619136000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,4,23]],"date-time":"2021-04-23T00:00:00Z","timestamp":1619136000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100007052","name":"Universit\u00e0 degli Studi di Verona","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100007052","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Digit Libr"],"published-print":{"date-parts":[[2021,6]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Metadata are fundamental for the indexing, browsing and retrieval of cultural heritage resources in repositories, digital libraries and catalogues. In order to be effectively exploited, metadata information has to meet some quality standards, typically defined in the collection usage guidelines. As manually checking the quality of metadata in a repository may not be affordable, especially in large collections, in this paper we specifically address the problem of automatically assessing the quality of metadata, focusing in particular on textual descriptions of cultural heritage items. We describe a novel approach based on machine learning that tackles this problem by framing it as a binary text classification task aimed at evaluating the accuracy of textual descriptions. 
We report our assessment of different classifiers using a new dataset that we developed, containing more than 100K descriptions. The dataset was extracted from different collections and domains from the Italian digital library \u201cCultura Italia\u201d and was annotated with accuracy information in terms of compliance with the cataloguing guidelines. The results empirically confirm that our proposed approach can effectively support curators (F1<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\sim $$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mml:mo>\u223c<\/mml:mo><\/mml:math><\/jats:alternatives><\/jats:inline-formula>\u00a00.85) in assessing the quality of the textual descriptions of the records in their collections and provide some insights into how training data, specifically their size and domain, can affect classification performance.<\/jats:p>","DOI":"10.1007\/s00799-021-00302-1","type":"journal-article","created":{"date-parts":[[2021,4,23]],"date-time":"2021-04-23T03:37:56Z","timestamp":1619149076000},"page":"217-231","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Automatically evaluating the quality of textual descriptions in cultural heritage records"],"prefix":"10.1007","volume":"22","author":[{"given":"Matteo","family":"Lorenzini","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9391-3201","authenticated-orcid":false,"given":"Marco","family":"Rospocher","sequence":"additional","affiliation":[]},{"given":"Sara","family":"Tonelli","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,4,23]]},"reference":[{"key":"302_CR1","doi-asserted-by":"publisher","unstructured":"Adankon, M.M., Cheriet, M.: Support Vector Machine, pp. 1303\u20131308. Springer US, Boston, MA (2009). 
https:\/\/doi.org\/10.1007\/978-0-387-73003-5_299","DOI":"10.1007\/978-0-387-73003-5_299"},{"key":"302_CR2","doi-asserted-by":"crossref","unstructured":"Aprosio, A.P., Moretti, G.: Tint 2.0: an all-inclusive suite for NLP in italian. In: Cabrio, E., Mazzei, A., Tamburini, F. (eds.) Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Torino, Italy, December 10\u201312, 2018, CEUR Workshop Proceedings, vol. 2253. CEUR-WS.org (2018). URL http:\/\/ceur-ws.org\/Vol-2253\/paper58.pdf","DOI":"10.4000\/books.aaccademia.3571"},{"issue":"1","key":"302_CR3","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.websem.2008.02.005","volume":"7","author":"C Bizer","year":"2009","unstructured":"Bizer, C., Cyganiak, R.: Quality-driven information filtering using the wiqa policy framework. Web Semant. 7(1), 1\u201310 (2009). https:\/\/doi.org\/10.1016\/j.websem.2008.02.005","journal-title":"Web Semant."},{"key":"302_CR4","doi-asserted-by":"crossref","unstructured":"Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135\u2013146 (2017). https:\/\/www.aclweb.org\/anthology\/Q17-1010\/","DOI":"10.1162\/tacl_a_00051"},{"key":"302_CR5","unstructured":"Bruce, T.R., Hillmann, D.I.: The continuum of metadata quality: defining, expressing, exploiting. ALA editions (2004). https:\/\/ecommons.cornell.edu\/handle\/1813\/7895"},{"key":"302_CR6","doi-asserted-by":"crossref","unstructured":"Chan, L.M., Zeng, M.L.: Metadata interoperability and standardization-a study of methodology part i. D-Lib Mag. 12(6), 1082\u20139873 (2006). https:\/\/dlib.org\/dlib\/june06\/chan\/06chan.html","DOI":"10.1045\/june2006-chan"},{"key":"302_CR7","unstructured":"Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493\u20132537 (2011). 
https:\/\/www.jmlr.org\/papers\/volume12\/collobert11a\/collobert11a.pdf"},{"issue":"3","key":"302_CR8","doi-asserted-by":"publisher","first-page":"273","DOI":"10.1023\/A:1022627411411","volume":"20","author":"C Cortes","year":"1995","unstructured":"Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273\u2013297 (1995). https:\/\/doi.org\/10.1023\/A:1022627411411","journal-title":"Mach. Learn."},{"key":"302_CR9","doi-asserted-by":"crossref","unstructured":"Custard, M., Sumner, T.: Using machine learning to support quality judgments. D Lib Mag. 11 (2005). URL https:\/\/dlib.org\/dlib\/october05\/custard\/10custard.html","DOI":"10.1045\/october2005-custard"},{"key":"302_CR10","unstructured":"Day, M., Guy, M., Powell, A.: Improving the quality of metadata in eprint archives. Ariadne 38 (2004). http:\/\/www.ariadne.ac.uk\/issue\/38\/guy\/"},{"key":"302_CR11","unstructured":"Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 4171\u20134186 (2019). https:\/\/www.aclweb.org\/anthology\/N19-1423.pdf"},{"issue":"3","key":"302_CR12","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3012289","volume":"10","author":"M Dragoni","year":"2017","unstructured":"Dragoni, M., Tonelli, S., Moretti, G.: A knowledge management architecture for digital cultural heritage. J. Comput. Cult. Heritage (JOCCH) 10(3), 1\u201318 (2017)","journal-title":"J. Comput. Cult. Heritage (JOCCH)"},{"key":"302_CR13","doi-asserted-by":"publisher","unstructured":"Foulonneau, M.: Information redundancy across metadata collections. Inf. Process. Manag. 43(3), 740\u2013751 (2007). https:\/\/doi.org\/10.1016\/j.ipm.2006.06.004. http:\/\/www.sciencedirect.com\/science\/article\/pii\/S030645730600094X. 
Special Issue on Heterogeneous and Distributed IR","DOI":"10.1016\/j.ipm.2006.06.004"},{"key":"302_CR14","doi-asserted-by":"crossref","unstructured":"Gavrilis, D., Makri, D.N., Papachristopoulos, L., Angelis, S., Kravvaritis, K., Papatheodorou, C., Constantopoulos, P.: Measuring quality in metadata repositories. In: International Conference on Theory and Practice of Digital Libraries, pp. 56\u201367. Springer (2015). https:\/\/link.springer.com\/chapter\/10.1007\/978-3-319-24592-8_5","DOI":"10.1007\/978-3-319-24592-8_5"},{"key":"302_CR15","doi-asserted-by":"crossref","unstructured":"Ishida, Y., Shimizu, T., Yoshikawa, M.: An analysis and comparison of keyword recommendation methods for scientific data. Int. J. Digital Libraries 1\u201321 (2020). https:\/\/link.springer.com\/article\/10.1007\/s00799-020-00279-3","DOI":"10.1007\/s00799-020-00279-3"},{"key":"302_CR16","doi-asserted-by":"crossref","unstructured":"Jackson, A.S., Han, M.J., Groetsch, K., Mustafoff, M., Cole, T.W.: Dublin core metadata harvested through oai-pmh. J. Library Metadata 8(1), 5\u201321 (2008). https:\/\/www.tandfonline.com\/doi\/abs\/10.1300\/J517v08n01_02","DOI":"10.1300\/J517v08n01_02"},{"key":"302_CR17","doi-asserted-by":"crossref","unstructured":"Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427\u2013431. Association for Computational Linguistics, Valencia, Spain (2017). https:\/\/www.aclweb.org\/anthology\/E17-2068","DOI":"10.18653\/v1\/E17-2068"},{"key":"302_CR18","unstructured":"Kir\u00e1ly, P.: A metadata quality assurance framework (2015). http:\/\/pkiraly.github.io\/metadata-quality-project-plan.pdf. (Research project plan)"},{"key":"302_CR19","doi-asserted-by":"publisher","unstructured":"Kir\u00e1ly, P., B\u00fcchler, M.: Measuring completeness as metadata quality metric in Europeana. 
In: 2018 IEEE International Conference on Big Data (Big Data), pp. 2711\u20132720 (2018). https:\/\/doi.org\/10.1109\/BigData.2018.8622487. https:\/\/ieeexplore.ieee.org\/abstract\/document\/8622487","DOI":"10.1109\/BigData.2018.8622487"},{"key":"302_CR20","doi-asserted-by":"crossref","unstructured":"Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33(1), 159\u2013174 (1977). https:\/\/www.jstor.org\/stable\/2529310","DOI":"10.2307\/2529310"},{"key":"302_CR21","doi-asserted-by":"publisher","unstructured":"Lorenzini, M., Rospocher, M., Tonelli, S.: Annotated dataset to assess the accuracy of the textual description of cultural heritage records. https:\/\/doi.org\/10.6084\/m9.figshare.13359104","DOI":"10.6084\/m9.figshare.13359104"},{"key":"302_CR22","unstructured":"Lorenzini, M., Rospocher, M., Tonelli, S.: Computer assisted curation of digital cultural heritage repositories. In: Digital Humanities Conference 2019 (DH2019) (2019). https:\/\/dev.clariah.nl\/files\/dh2019\/boa\/0807.html"},{"key":"302_CR23","unstructured":"Margaritopoulos, T., Margaritopoulos, M., Mavridis, I., Manitsaris, A.: A conceptual framework for metadata quality assessment. In: International Conference on Dublin Core and Metadata Applications 0(0), 104\u2013113 (2008). URL https:\/\/dcpapers.dublincore.org\/pubs\/article\/view\/923"},{"key":"302_CR24","unstructured":"Mikolov, T., Yih, W.t., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746\u2013751. Association for Computational Linguistics, Atlanta, Georgia (2013). 
https:\/\/www.aclweb.org\/anthology\/N13-1090"},{"key":"302_CR25","doi-asserted-by":"publisher","DOI":"10.1145\/2964909","author":"S Neumaier","year":"2016","unstructured":"Neumaier, S., Umbrich, J., Polleres, A.: Automated quality assessment of metadata across open data portals. J. Data Inf. Quality (2016). https:\/\/doi.org\/10.1145\/2964909","journal-title":"J. Data Inf. Quality"},{"key":"302_CR26","doi-asserted-by":"publisher","unstructured":"Newman, D., Hagedorn, K., Chemudugunta, C., Smyth, P.: Subject metadata enrichment using statistical topic models. In: Proceedings of the 7th ACM\/IEEE-CS Joint Conference on Digital Libraries, JCDL \u201907, pp. 366-375. Association for Computing Machinery, New York, NY, USA (2007). https:\/\/doi.org\/10.1145\/1255175.1255248","DOI":"10.1145\/1255175.1255248"},{"key":"302_CR27","unstructured":"NISO Framework Working Group (with support from the Institute of Museum and Library Services): A framework of guidance for building good digital collections. Baltimore, MD: National Information Standards Organization (NISO) (2007). URL https:\/\/www.niso.org\/sites\/default\/files\/2017-08\/framework3.pdf"},{"issue":"2\u20133","key":"302_CR28","doi-asserted-by":"publisher","first-page":"67","DOI":"10.1007\/s00799-009-0054-4","volume":"10","author":"X Ochoa","year":"2009","unstructured":"Ochoa, X., Duval, E.: Automatic evaluation of metadata quality in digital repositories. Int. J. Digit. Libraries 10(2\u20133), 67\u201391 (2009). https:\/\/doi.org\/10.1007\/s00799-009-0054-4","journal-title":"Int. J. Digit. Libraries"},{"key":"302_CR29","unstructured":"Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825\u20132830 (2011). 
https:\/\/www.jmlr.org\/papers\/volume12\/pedregosa11a\/pedregosa11a.pdf"},{"key":"302_CR30","doi-asserted-by":"crossref","unstructured":"Pelaez, A.R., Alarcon, P.P.: Metadata quality assessment metrics into ocw repositories. In: Proceedings of the 2017 9th International Conference on Education Technology and Computers, pp. 253\u2013257 (2017). https:\/\/dl.acm.org\/doi\/10.1145\/3175536.3175579","DOI":"10.1145\/3175536.3175579"},{"key":"302_CR31","unstructured":"Pennock, M.: Digital curation: a life-cycle approach to managing and preserving usable digital information. Library Arch. 1, 34\u201345 (2007). https:\/\/www.ukoln.ac.uk\/ukoln\/staff\/m.pennock\/publications\/docs\/lib-arch_curation.pdf"},{"key":"302_CR32","unstructured":"Pustejovsky, J., Stubbs, A.: Natural Language annotation for machine learning\u2014a guide to corpus-building for applications. O\u2019Reilly (2012). http:\/\/www.oreilly.de\/catalog\/9781449306663\/index.html"},{"key":"302_CR33","doi-asserted-by":"crossref","unstructured":"Reiche, K.J., H\u00f6fig, E.: Implementation of metadata quality metrics and application on public government data. In: 2013 IEEE 37th Annual Computer Software and Applications Conference Workshops, pp. 236\u2013241 (2013). https:\/\/ieeexplore.ieee.org\/document\/6605795","DOI":"10.1109\/COMPSACW.2013.32"},{"key":"302_CR34","doi-asserted-by":"crossref","unstructured":"Sch\u00f6lkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT press (2001). https:\/\/mitpress.mit.edu\/books\/learning-kernels","DOI":"10.7551\/mitpress\/4175.001.0001"},{"key":"302_CR35","doi-asserted-by":"crossref","unstructured":"Sch\u00f6lkopf, B., Smola, A.J., Williamson, R.C., Bartlett, P.L.: New support vector algorithms. Neural Comput. 12(5), 1207\u20131245 (2000). 
https:\/\/www.mitpressjournals.org\/doi\/abs\/10.1162\/089976600300015565?journalCode=neco","DOI":"10.1162\/089976600300015565"},{"key":"302_CR36","doi-asserted-by":"crossref","unstructured":"Stvilia, B., Gasser, L., Twidale, M.B., Smith, L.C.: A framework for information quality assessment. J. Am. Soc. Inf. Sci. Technol. 58(12), 1720\u20131733 (2007). URL https:\/\/myweb.fsu.edu\/bstvilia\/papers\/stvilia_IQFramework_p.pdf","DOI":"10.1002\/asi.20652"},{"key":"302_CR37","doi-asserted-by":"crossref","unstructured":"Tani, A., Candela, L., Castelli, D.: Dealing with metadata quality: the legacy of digital library efforts. Inf. Process. Manag. 49(6), 1194\u20131205 (2013). https:\/\/www.sciencedirect.com\/science\/article\/abs\/pii\/S0306457313000526","DOI":"10.1016\/j.ipm.2013.05.003"},{"key":"302_CR38","doi-asserted-by":"crossref","unstructured":"Tipping, M.E., Bishop, C.M.: Probabilistic principal component analysis. J. R. Stat. Soc. Ser. B (Statistical Methodology) 61(3), 611\u2013622 (1999). http:\/\/www.jstor.org\/stable\/2680726","DOI":"10.1111\/1467-9868.00196"},{"key":"302_CR39","doi-asserted-by":"publisher","unstructured":"Wand, Y., Wang, R.Y.: Anchoring data quality dimensions in ontological foundations. Commun. ACM 39(11), 86\u201395 (1996). 
https:\/\/doi.org\/10.1145\/240455.240479","DOI":"10.1145\/240455.240479"}],"container-title":["International Journal on Digital Libraries"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00799-021-00302-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00799-021-00302-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00799-021-00302-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,29]],"date-time":"2024-08-29T05:12:53Z","timestamp":1724908373000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00799-021-00302-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,4,23]]},"references-count":39,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2021,6]]}},"alternative-id":["302"],"URL":"https:\/\/doi.org\/10.1007\/s00799-021-00302-1","relation":{},"ISSN":["1432-5012","1432-1300"],"issn-type":[{"type":"print","value":"1432-5012"},{"type":"electronic","value":"1432-1300"}],"subject":[],"published":{"date-parts":[[2021,4,23]]},"assertion":[{"value":"10 July 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 March 2021","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 April 2021","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 April 2021","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}