{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,16]],"date-time":"2025-12-16T12:40:43Z","timestamp":1765888843934,"version":"3.38.0"},"reference-count":64,"publisher":"China Science Publishing & Media Ltd.","issue":"2","license":[{"start":{"date-parts":[[2022,3,7]],"date-time":"2022-03-07T00:00:00Z","timestamp":1646611200000},"content-version":"vor","delay-in-days":65,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,4,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>A key limiting factor in organising and using information from physical specimens curated in natural science collections is making that information computable, with institutional digitization tending to focus more on imaging the specimens themselves than on efficiently capturing computable data about them. Label data are traditionally manually transcribed today with high cost and low throughput, rendering such a task constrained for many collection-holding institutions at current funding levels. We show how computer vision, optical character recognition, handwriting recognition, named entity recognition and language translation technologies can be implemented into canonical workflow component libraries with findable, accessible, interoperable, and reusable (FAIR) characteristics. These libraries are being developed in a cloud-based workflow platform\u2014the \u2018Specimen Data Refinery\u2019 (SDR)\u2014founded on Galaxy workflow engine, Common Workflow Language, Research Object Crates (RO-Crate) and WorkflowHub technologies. The SDR can be applied to specimens\u2019 labels and other artefacts, offering the prospect of greatly accelerated and more accurate data capture in computable form. 
Two kinds of FAIR Digital Objects (FDO) are created by packaging outputs of SDR workflows and workflow components as digital objects with metadata, a persistent identifier, and a specific type definition. The first kind of FDO are computable Digital Specimen (DS) objects that can be consumed\/produced by workflows, and other applications. A single DS is the input data structure submitted to a workflow that is modified by each workflow component in turn to produce a refined DS at the end. The Specimen Data Refinery provides a library of such components that can be used individually, or in series. To cofunction, each library component describes the fields it requires from the DS and the fields it will in turn populate or enrich. The second kind of FDO, RO-Crates gather and archive the diverse set of digital and real-world resources, configurations, and actions (the provenance) contributing to a unit of research work, allowing that work to be faithfully recorded and reproduced. Here we describe the Specimen Data Refinery with its motivating requirements, focusing on what is essential in the creation of canonical workflow component libraries and its conformance with the requirements of an emerging FDO Core Specification being developed by the FDO Forum.<\/jats:p>","DOI":"10.1162\/dint_a_00134","type":"journal-article","created":{"date-parts":[[2022,3,7]],"date-time":"2022-03-07T18:06:58Z","timestamp":1646676418000},"page":"320-341","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":12,"title":["The Specimen Data Refinery: A Canonical Workflow Framework and FAIR Digital Object Approach to Speeding up Digital Mobilisation of Natural History Collections"],"prefix":"10.3724","volume":"4","author":[{"given":"Alex","family":"Hardisty","sequence":"first","affiliation":[{"name":"School of Computer Science and Informatics, Cardiff University, Cardiff CF24 3AA, 
UK"}]},{"given":"Paul","family":"Brack","sequence":"additional","affiliation":[{"name":"The Department of Computer Science, The University of Manchester, Manchester M13 9PL, UK"}]},{"given":"Carole","family":"Goble","sequence":"additional","affiliation":[{"name":"The Department of Computer Science, The University of Manchester, Manchester M13 9PL, UK"}]},{"given":"Laurence","family":"Livermore","sequence":"additional","affiliation":[{"name":"The Natural History Museum, London SW7 5BD, UK"}]},{"given":"Ben","family":"Scott","sequence":"additional","affiliation":[{"name":"The Natural History Museum, London SW7 5BD, UK"}]},{"given":"Quentin","family":"Groom","sequence":"additional","affiliation":[{"name":"Meise Botanic Garden, 1860 Meise, Belgium"}]},{"given":"Stuart","family":"Owen","sequence":"additional","affiliation":[{"name":"The Department of Computer Science, The University of Manchester, Manchester M13 9PL, UK"}]},{"given":"Stian","family":"Soiland-Reyes","sequence":"additional","affiliation":[{"name":"The Department of Computer Science, The University of Manchester, Manchester M13 9PL, UK"},{"name":"Informatics Institute, Faculty of Science, University of Amsterdam, 1090 GH Amsterdam, The Netherlands"}]}],"member":"2026","published-online":{"date-parts":[[2022,4,1]]},"reference":[{"key":"2022042714421804400_ref1","doi-asserted-by":"crossref","first-page":"e57602","DOI":"10.3897\/rio.6.e57602","article-title":"Landscape analysis for the specimen data refinery","volume":"6","author":"Walton","year":"2020","journal-title":"Research Ideas and Outcomes"},{"key":"2022042714421804400_ref2","first-page":"324","volume-title":"Digitization of the New York Botanical Garden herbarium","author":"Thiers","year":"2016"},{"key":"2022042714421804400_ref3","doi-asserted-by":"crossref","first-page":"20170391","DOI":"10.1098\/rstb.2017.0391","article-title":"The history and impact of digitization and digital data mobilization on biodiversity 
research","volume":"374","author":"Nelson","year":"2019","journal-title":"Philosophical Transactions of the Royal Society B: Biological Sciences"},{"key":"2022042714421804400_ref4","doi-asserted-by":"crossref","first-page":"e37896","DOI":"10.3897\/biss.3.37896","article-title":"DiSSCo, iDigBio and the future of global collaboration","volume":"3","author":"Nelson","year":"2019","journal-title":"Biodiversity Information Science and Standards"},{"key":"2022042714421804400_ref5","doi-asserted-by":"crossref","first-page":"e37502","DOI":"10.3897\/biss.3.37502","article-title":"DiSSCo as a new regional model for scientific collections in Europe","volume":"3","author":"Addink","year":"2019","journal-title":"Biodiversity Information Science and Standards"},{"issue":"1-2","key":"2022042714421804400_ref6","doi-asserted-by":"crossref","first-page":"122","DOI":"10.1162\/dint_a_00034","article-title":"FAIR data and services in biodiversity science and geoscience","volume":"2","author":"Lannom","year":"2020","journal-title":"Data Intelligence"},{"volume-title":"GBIF Science Review 2020","author":"GBIF Secretariat","key":"2022042714421804400_ref7"},{"key":"2022042714421804400_ref8","first-page":"e2018093118","volume-title":"Data integration enables global biodiversity synthesis","author":"Heberling","year":"2021"},{"key":"2022042714421804400_ref9","doi-asserted-by":"crossref","first-page":"165","DOI":"10.12705\/671.9","article-title":"Large-scale digitization of herbarium specimens: Development and usage of an automated, high-throughput conveyor system","volume":"67","author":"Sweeney","year":"2018","journal-title":"Taxon"},{"key":"2022042714421804400_ref10","doi-asserted-by":"crossref","first-page":"e32342","DOI":"10.3897\/BDJ.7.e32342","article-title":"A novel automated mass digitisation workflow for natural history microscope slides","volume":"7","author":"Allan","year":"2019","journal-title":"Biodiversity Data 
Journal"},{"key":"2022042714421804400_ref11","doi-asserted-by":"crossref","first-page":"e37228","DOI":"10.3897\/biss.3.37228","article-title":"LightningBug ONE: An experiment in high-throughput digitization of pinned insects","volume":"3","author":"Hereld","year":"2019","journal-title":"Biodiversity Information Science and Standards"},{"volume-title":"ALICE: Angled label image capture and extraction for high throughput insect specimen digitisation","year":"2018","author":"Price","key":"2022042714421804400_ref12"},{"key":"2022042714421804400_ref13","first-page":"523","volume-title":"Mass digitization of individual pinned insects using conveyor-driven imaging","author":"Tegelberg","year":"2017"},{"issue":"10","key":"2022042714421804400_ref14","doi-asserted-by":"crossref","first-page":"812","DOI":"10.1093\/biosci\/biz094","article-title":"The changing uses of herbarium data in an era of global change: An overview using automated content analysis","volume":"69","author":"Heberling","year":"2019","journal-title":"BioScience"},{"issue":"1763","key":"2022042714421804400_ref15","article-title":"Using insect natural history collections to study global change impacts: challenges and opportunities","volume":"374","author":"Heather","year":"2019","journal-title":"Philosophical Transactions of the Royal Society B"},{"issue":"3","key":"2022042714421804400_ref16","doi-asserted-by":"crossref","first-page":"163","DOI":"10.1093\/biosci\/biy163","article-title":"The evolution of natural history collections: New research tools move specimens, data to center stage","volume":"69","author":"Watanabe","year":"2019","journal-title":"BioScience"},{"key":"2022042714421804400_ref17","doi-asserted-by":"crossref","first-page":"511","DOI":"10.1111\/cobi.13289","article-title":"Harnessing the potential of integrated systematics for conservation of taxonomically complex, megadiverse plant groups","volume":"33","author":"Nic Lughadha","year":"2019","journal-title":"Conservation 
Biology"},{"key":"2022042714421804400_ref18","doi-asserted-by":"crossref","first-page":"e58030","DOI":"10.3897\/rio.6.e58030","article-title":"Towards a scientific workflow featuring natural language processing for the digitisation of natural history collections","volume":"6","author":"Owen","year":"2020","journal-title":"Research Ideas and Outcomes"},{"issue":"6","key":"2022042714421804400_ref19","doi-asserted-by":"crossref","first-page":"e107409","DOI":"10.15252\/embj.2020107409","article-title":"ELIXIR-EXCELERATE: Establishing Europe's data infrastructure for the life science research of the future","volume":"40","author":"Harrow","year":"2021","journal-title":"EMBO Journal"},{"key":"2022042714421804400_ref20","doi-asserted-by":"crossref","first-page":"W537","DOI":"10.1093\/nar\/gky379","article-title":"The galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update","volume":"46","author":"Afgan","year":"2018","journal-title":"Nucleic Acids Research"},{"volume-title":"Methods included: Standardizing computational reuse and portability with the common workflow language","year":"2021","author":"Crusoe","key":"2022042714421804400_ref21"},{"volume-title":"A lightweight approach to research object data packaging","author":"Carrag\u00e1in","key":"2022042714421804400_ref22"},{"volume-title":"Packaging research artefacts with RO-Crate","year":"2021","author":"Soiland-Reyes","key":"2022042714421804400_ref23"},{"volume-title":"Implementing FAIR digital objects in the EOSC-Life workflow collaboratory","author":"Goble","key":"2022042714421804400_ref24"},{"key":"2022042714421804400_ref25","doi-asserted-by":"crossref","DOI":"10.1038\/sdata.2016.18","article-title":"The FAIR guiding principles for scientific data management and stewardship","volume":"3","author":"Wilkinson","year":"2016","journal-title":"Scientific 
Data"},{"issue":"2","key":"2022042714421804400_ref26","doi-asserted-by":"crossref","first-page":"286","DOI":"10.1162\/dint_a_00132","article-title":"Canonical Workflows to Make Data FAIR","volume":"4","author":"Wittenburg","year":"2022","journal-title":"Data Intelligence"},{"volume-title":"Provisional data management plan for DiSSCo infrastructure","author":"Hardisty","key":"2022042714421804400_ref27"},{"issue":"2","key":"2022042714421804400_ref28","doi-asserted-by":"crossref","DOI":"10.3390\/publications8020021","article-title":"FAIR digital objects for science: From data pieces to actionable knowledge units","volume":"8","author":"De Smedt","year":"2020","journal-title":"Publications"},{"key":"2022042714421804400_ref29","doi-asserted-by":"crossref","first-page":"e54280","DOI":"10.3897\/rio.6.e54280","article-title":"Conceptual design blueprint for the DiSSCo digitization infrastructure\u2014DELIVERABLE D8.1","volume":"6","author":"Hardisty","year":"2020","journal-title":"Research Ideas and Outcomes"},{"volume-title":"FDO Coordination Group (2020) FDO Framework","key":"2022042714421804400_ref30"},{"key":"2022042714421804400_ref31","first-page":"523","volume-title":"Objects detection from digitized herbarium specimen based on improved YOLO V3","author":"Triki","year":"2020"},{"volume-title":"Cross-validation of a semantic segmentation network for natural history collection specimens (Accepted)","author":"Nieva de la Hidalga","key":"2022042714421804400_ref32"},{"key":"2022042714421804400_ref33","doi-asserted-by":"crossref","first-page":"e56211","DOI":"10.3897\/rio.6.e56211","article-title":"A cost analysis of transcription systems","volume":"6","author":"Walton","year":"2020","journal-title":"Research Ideas and Outcomes"},{"key":"2022042714421804400_ref34","doi-asserted-by":"crossref","first-page":"baaa072","DOI":"10.1093\/database\/baaa072","article-title":"People are essential to linking biodiversity 
data","volume":"2020","author":"Groom","year":"2020","journal-title":"Database"},{"issue":"2","key":"2022042714421804400_ref35","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1093\/isd\/ixab004","article-title":"Pretrained convolutional neural networks perform well in a challenging test case: Identification of plant bugs (Hemiptera: Miridae) using a small number of training images","volume":"5","author":"Knyshov","year":"2021","journal-title":"Insect Systematics and Diversity"},{"volume-title":"Application of computer vision and machine learning for digitized herbarium specimens: A systematic literature review","year":"2021","author":"Hussein","key":"2022042714421804400_ref36"},{"key":"2022042714421804400_ref37","doi-asserted-by":"crossref","DOI":"10.1186\/s12862-017-1014-z","article-title":"Going deeper in the automated identification of herbarium specimens","volume":"17","author":"Carranza-Rojas","year":"2017","journal-title":"BMC Evolutionary Biology"},{"issue":"6","key":"2022042714421804400_ref38","doi-asserted-by":"crossref","first-page":"e11365","DOI":"10.1002\/aps3.11365","article-title":"An algorithm competition for automatic species identification from herbarium specimens","volume":"8","author":"Little","year":"2020","journal-title":"Applications in Plant Sciences"},{"issue":"6","key":"2022042714421804400_ref39","doi-asserted-by":"crossref","first-page":"e11372","DOI":"10.1002\/aps3.11372","article-title":"Using computer vision on herbarium specimen images to discriminate among closely related horsetails (Equisetum)","volume":"8","author":"Pryer","year":"2020","journal-title":"Applications in Plant Sciences"},{"key":"2022042714421804400_ref40","doi-asserted-by":"crossref","DOI":"10.1186\/s12862-016-0827-5","article-title":"Computer vision applied to herbarium specimens of German trees: Testing the future utility of the millions of herbarium specimen images for automated 
identification","volume":"16","author":"Unger","year":"2016","journal-title":"BMC Evolutionary Biology"},{"key":"2022042714421804400_ref41","doi-asserted-by":"crossref","first-page":"216","DOI":"10.1016\/j.future.2017.05.041","article-title":"Scientific workflows: Past, present and future","volume":"75","author":"Atkinson","year":"2017","journal-title":"Future Generation Computer Systems"},{"volume-title":"Existing workflow systems","author":"Amstutz","key":"2022042714421804400_ref42"},{"key":"2022042714421804400_ref43","doi-asserted-by":"crossref","first-page":"380","DOI":"10.1111\/j.1467-9973.2012.01761.x","article-title":"What is a digital object?","volume":"43","author":"Hui","year":"2012","journal-title":"Metaphilosophy"},{"key":"2022042714421804400_ref44","doi-asserted-by":"crossref","first-page":"357","DOI":"10.25300\/MISQ\/2013\/37.2.02","article-title":"The ambivalent ontology of digital artifacts","volume":"37","author":"Kallinikos","year":"2013","journal-title":"MIS Quarterly"},{"key":"2022042714421804400_ref45","doi-asserted-by":"crossref","first-page":"115","DOI":"10.1007\/s00799-005-0128-x","article-title":"A framework for distributed digital object services","volume":"6","author":"Kahn","year":"2006","journal-title":"International Journal on Digital Libraries"},{"volume-title":"Draft specification for open Digital Specimens (openDS)","author":"openDS","key":"2022042714421804400_ref46"},{"volume-title":"The JavaScript Object Notation (JSON) data interchange format (Request for Comments No","author":"Bray","key":"2022042714421804400_ref47"},{"key":"2022042714421804400_ref48","doi-asserted-by":"crossref","first-page":"599","DOI":"10.1016\/j.future.2011.08.004","article-title":"Why linked data is not enough for scientists","volume":"29","author":"Bechhofer","year":"2013","journal-title":"Future Generation Computer Systems, Special section: Recent advances in e-Science"},{"volume-title":"JSON-LD 1.1 A JSON-based serialization for linked 
data","author":"Kellogg","key":"2022042714421804400_ref49"},{"volume-title":"Schema.org\u2014Schema.org","key":"2022042714421804400_ref50"},{"volume-title":"D5.1 RO model adapted to EOSC","author":"Corcho","key":"2022042714421804400_ref51"},{"volume-title":"Implementing FAIR digital objects in the EOSC-Life workflow collaboratory","author":"Goble","key":"2022042714421804400_ref52"},{"volume-title":"Workflow RO-Crate profile 1.0","author":"Bacall","key":"2022042714421804400_ref53"},{"volume-title":"FAIR signposting profile","author":"Van de Sompel","key":"2022042714421804400_ref54"},{"key":"2022042714421804400_ref55","doi-asserted-by":"crossref","first-page":"e50503","DOI":"10.3897\/BDJ.8.e50503","article-title":"Georeferencing the natural history museum's Chinese type collection of plateaus, pagodas and plants","volume":"8","author":"Lohonya","year":"2020","journal-title":"Biodiversity Data Journal"},{"key":"2022042714421804400_ref56","first-page":"26","volume-title":"Anchors in shifting sand: The primacy of method in the Web of data","author":"De Roure","year":"2010"},{"key":"2022042714421804400_ref57","doi-asserted-by":"crossref","first-page":"49","DOI":"10.1186\/s12898-016-0103-y","article-title":"BioVeL: A virtual laboratory for data analysis and modelling in biodiversity science and ecology","volume":"16","author":"Hardisty","year":"2016","journal-title":"BMC Ecology"},{"key":"2022042714421804400_ref58","doi-asserted-by":"crossref","first-page":"e31817","DOI":"10.3897\/BDJ.7.e31817","article-title":"A benchmark dataset of herbarium specimen images with label data","volume":"7","author":"Dillen","year":"2019","journal-title":"Biodiversity Data Journal"},{"volume-title":"JSONPath: Query expressions for JSON","author":"G\u00f6ssner","key":"2022042714421804400_ref59"},{"volume-title":"Digital object architecture","author":"DONA Foundation","key":"2022042714421804400_ref60"},{"volume-title":"Digital Object Interface Protocol Specification, version 2.0, November 
2018","key":"2022042714421804400_ref61"},{"volume-title":"RFC 3652 Handle System Protocol (ver 2.1) Specification","author":"Sun","key":"2022042714421804400_ref62"},{"issue":"50","key":"2022042714421804400_ref63","first-page":"1","article-title":"Incorporating RDA outputs in the design of a European research infrastructure for natural history collections","volume":"19","author":"Islam","year":"2020","journal-title":"Data Science Journal"},{"volume-title":"Linked data platform 1.0","author":"Speicher","key":"2022042714421804400_ref64"}],"container-title":["Data Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/dint\/article-pdf\/4\/2\/320\/2012442\/dint_a_00134.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/dint\/article-pdf\/4\/2\/320\/2012442\/dint_a_00134.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,14]],"date-time":"2025-03-14T07:42:57Z","timestamp":1741938177000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.sciengine.com\/doi\/10.1162\/dint_a_00134"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022]]},"references-count":64,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2022,4,1]]}},"URL":"https:\/\/doi.org\/10.1162\/dint_a_00134","relation":{},"ISSN":["2641-435X"],"issn-type":[{"type":"electronic","value":"2641-435X"}],"subject":[],"published-other":{"date-parts":[[2022]]},"published":{"date-parts":[[2022]]}}}