{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,1]],"date-time":"2026-06-01T20:35:03Z","timestamp":1780346103949,"version":"3.54.1"},"reference-count":24,"publisher":"China Science Publishing & Media Ltd.","issue":"2","license":[{"start":{"date-parts":[[2022,3,7]],"date-time":"2022-03-07T00:00:00Z","timestamp":1646611200000},"content-version":"vor","delay-in-days":65,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,4,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Since their introduction by James Dixon in 2010, data lakes get more and more attention, driven by the promise of high reusability of the stored data due to the schema-on-read semantics. Building on this idea, several additional requirements were discussed in literature to improve the general usability of the concept, like a central metadata catalog including all provenance information, an overarching data governance, or the integration with (high-performance) processing capabilities. Although the necessity for a logical and a physical organisation of data lakes in order to meet those requirements is widely recognized, no concrete guidelines are yet provided. The most common architecture implementing this conceptual organisation is the zone architecture, where data is assigned to a certain zone depending on the degree of processing. 
This paper discusses how FAIR Digital Objects can be used in a novel approach to organize a data lake based on data types instead of zones, how they can be used to abstract the physical implementation, and how they empower generic and portable processing capabilities based on a provenance-based approach.<\/jats:p>","DOI":"10.1162\/dint_a_00141","type":"journal-article","created":{"date-parts":[[2022,3,7]],"date-time":"2022-03-07T18:07:11Z","timestamp":1646676431000},"page":"426-438","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":3,"title":["Realising Data-Centric Scientific Workflows with Provenance-Capturing on Data Lakes"],"prefix":"10.3724","volume":"4","author":[{"given":"Hendrik","family":"Nolte","sequence":"first","affiliation":[{"name":"Gesellschaft f\u00fcr wissenschaftliche Datenverarbeitung mbH G\u00f6ttingen G\u00f6ttingen, Gottingen 37077, Germany"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Philipp","family":"Wieder","sequence":"additional","affiliation":[{"name":"Gesellschaft f\u00fcr wissenschaftliche Datenverarbeitung mbH G\u00f6ttingen G\u00f6ttingen, Gottingen 37077, Germany"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"2026","published-online":{"date-parts":[[2022,4,1]]},"reference":[{"issue":"2","key":"2022042714423111800_ref1","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1016\/j.jksuci.2011.05.005","article-title":"A proposed model for data warehouse ETL processes","volume":"23","author":"Ali El-Sappagh","year":"2011","journal-title":"Journal of King Saud University\u2014Computer and Information Sciences"},{"key":"2022042714423111800_ref2","doi-asserted-by":"crossref","first-page":"168","DOI":"10.1007\/978-3-030-64148-1_11","article-title":"Data pipeline management in practice: Challenges and opportunities","volume-title":"Product-Focused Software Process 
Improvement","author":"Munappy","year":"2020"},{"key":"2022042714423111800_ref3","volume-title":"Pentaho, Hadoop, and data lakes","author":"Dixon"},{"key":"2022042714423111800_ref4","first-page":"174","volume-title":"The next information architecture evolution: The data lake wave","author":"Madera","year":"2016"},{"key":"2022042714423111800_ref5","first-page":"179","volume-title":"Leveraging the data lake\u2014current state and challenges","author":"Giebler","year":"2019"},{"key":"2022042714423111800_ref6","doi-asserted-by":"crossref","first-page":"03025","DOI":"10.1051\/itmconf\/20181703025","article-title":"Data lake: a new ideology in big data era","volume":"17","author":"Khine","year":"2018","journal-title":"ITM Web Conf."},{"key":"2022042714423111800_ref7","article-title":"Query Rewriting for Heterogeneous Data Lakes","volume":"11019","author":"Hai","year":"2018","journal-title":"Advances in Databases and Information Systems"},{"key":"2022042714423111800_ref8","doi-asserted-by":"crossref","DOI":"10.1007\/s13222-017-0272-7","article-title":"Data Lakes","volume":"17","author":"Mathis","year":"2017","journal-title":"Datenbank Spektrum"},{"key":"2022042714423111800_ref9","first-page":"2097","volume-title":"Constance: An Intelligent Data Lake System","author":"Hai","year":"2016"},{"key":"2022042714423111800_ref10","volume-title":"Metadata Systems for Data Lakes: Models and Features","author":"Sawadogo","year":"2019"},{"key":"2022042714423111800_ref11","doi-asserted-by":"crossref","first-page":"35","DOI":"10.1007\/978-3-319-98398-1_3","article-title":"Query Rewriting for Heterogeneous Data Lakes","volume-title":"Advances in Databases and Information Systems","author":"Hai","year":"2018"},{"key":"2022042714423111800_ref12","doi-asserted-by":"crossref","first-page":"349","DOI":"10.1109\/eScience.2016.7870919","volume-title":"2016 IEEE 12th International Conference on e-Science 
(e-Science)","author":"Suriarachchi","year":"2016"},{"issue":"2","key":"2022042714423111800_ref13","doi-asserted-by":"crossref","DOI":"10.3390\/publications8020021","article-title":"FAIR Digital Objects for Science: From Data Pieces to Actionable Knowledge Units","volume":"8","author":"De Smedt","year":"2020","journal-title":"Publications"},{"issue":"3","key":"2022042714423111800_ref14","first-page":"5","article-title":"Managing Google's data lake: an overview of the Goods system","volume":"39","author":"Halevy","year":"2016","journal-title":"IEEE Data Eng. Bull."},{"key":"2022042714423111800_ref15","volume-title":"arXiv preprint arXiv:1409.0798","author":"Bhardwaj","year":"2014"},{"key":"2022042714423111800_ref16","first-page":"1","volume-title":"Proceedings of the 2nd Workshop on Human-in-the-Loop Data Analytics","author":"Miao","year":"2017"},{"issue":"1","key":"2022042714423111800_ref17","doi-asserted-by":"crossref","first-page":"97","DOI":"10.1007\/s10844-020-00608-7","article-title":"On data lake architectures and metadata management","volume":"56","author":"Sawadogo","year":"2021","journal-title":"Journal of Intelligent Information Systems"},{"key":"2022042714423111800_ref18","first-page":"57","volume-title":"A zone reference model for enterprise-grade data lake management","author":"Giebler","year":"2020"},{"key":"2022042714423111800_ref19","volume-title":"Nature Precedings","author":"Bechhofer","year":"2010"},{"key":"2022042714423111800_ref20","volume-title":"CoRR","author":"Dai Hai Ton That and Gabriel Fils and Zhihao Yuan and Tanu Malik","year":"2017"},{"key":"2022042714423111800_ref21","first-page":"15","volume-title":"INFOCOMP 2021: The Eleventh International Conference on Advanced Communications and Computation","author":"Bingert","year":"2021"},{"key":"2022042714423111800_ref22","volume-title":"Common workflow language, v1","author":"Amstutz"},{"key":"2022042714423111800_ref23","volume-title":"International Semantic Web Conference 
(P&D\/Industry\/BlueSky)","author":"Samuel","year":"2018"},{"key":"2022042714423111800_ref24","doi-asserted-by":"crossref","first-page":"319","DOI":"10.1109\/BigData.2016.7840618","volume-title":"2016 IEEE International Conference on Big Data (Big Data)","author":"Chard","year":"2016"}],"container-title":["Data Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/dint\/article-pdf\/4\/2\/426\/2012419\/dint_a_00141.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/dint\/article-pdf\/4\/2\/426\/2012419\/dint_a_00141.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,14]],"date-time":"2025-03-14T07:41:48Z","timestamp":1741938108000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.sciengine.com\/doi\/10.1162\/dint_a_00141"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022]]},"references-count":24,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2022,4,1]]}},"URL":"https:\/\/doi.org\/10.1162\/dint_a_00141","relation":{},"ISSN":["2641-435X"],"issn-type":[{"value":"2641-435X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022]]},"published":{"date-parts":[[2022]]}}}