{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T18:06:44Z","timestamp":1775066804686,"version":"3.50.1"},"reference-count":49,"publisher":"Association for Computing Machinery (ACM)","issue":"4","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2020,12]]},"abstract":"<jats:p>Data processing pipelines that are designed to clean, transform and alter data in preparation for learning predictive models have an impact on those models' accuracy and performance, as well as on other properties, such as model fairness. It is therefore important to provide developers with the means to gain an in-depth understanding of how the pipeline steps affect the data, from the raw input to training sets ready to be used for learning. While other efforts track creation and changes of pipelines of relational operators, in this work we analyze the typical operations of data preparation within a machine learning process, and provide infrastructure for generating very granular provenance records from it, at the level of individual elements within a dataset. Our contributions include: (i) the formal definition of a core set of preprocessing operators, and the definition of provenance patterns for each of them, and (ii) a prototype implementation of an application-level provenance capture library that works alongside Python. 
We report on provenance processing and storage overhead and scalability experiments, carried out over both real ML benchmark pipelines and over TPC-DI, and show how the resulting provenance can be used to answer a suite of provenance benchmark queries that underpin some of the developers' debugging questions, as expressed on the Data Science Stack Exchange.<\/jats:p>","DOI":"10.14778\/3436905.3436911","type":"journal-article","created":{"date-parts":[[2021,2,22]],"date-time":"2021-02-22T17:23:50Z","timestamp":1614014630000},"page":"507-520","source":"Crossref","is-referenced-by-count":28,"title":["Capturing and querying fine-grained provenance of preprocessing pipelines in data science"],"prefix":"10.14778","volume":"14","author":[{"given":"Adriane","family":"Chapman","sequence":"first","affiliation":[{"name":"University of Southampton"}]},{"given":"Paolo","family":"Missier","sequence":"additional","affiliation":[{"name":"Newcastle University"}]},{"given":"Giulia","family":"Simonelli","sequence":"additional","affiliation":[{"name":"Universit\u00e0 Roma Tre"}]},{"given":"Riccardo","family":"Torlone","sequence":"additional","affiliation":[{"name":"Universit\u00e0 Roma Tre"}]}],"member":"320","published-online":{"date-parts":[[2021,2,22]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3314050"},{"key":"e_1_2_1_2_1","volume-title":"Advances in Neural Information Processing Systems. Curran Associates","author":"Alaa Ahmed M"},{"key":"e_1_2_1_3_1","first-page":"51","article-title":"GProM - A Swiss Army Knife for Your Provenance Needs","volume":"41","author":"Arab Bahareh Sadat","year":"2018","journal-title":"IEEE Data Eng. 
Bull."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.5555\/645504.656274"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/1066157.1066296"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.5555\/2567709.2567736"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.5555\/2567709.2567736"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3351095.3372878"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1186\/s41044-016-0014-0"},{"key":"e_1_2_1_11_1","volume-title":"Proceedings of the 36th International Conference on Machine Learning, ICML 2019","volume":"2251","author":"Ghorbani Amirata","year":"2019"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2009.15"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1265530.1265535"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/WORKS.2018.00009"},{"key":"e_1_2_1_15_1","unstructured":"Trung Dong Huynh. 2018. Prov Python. https:\/\/prov.readthedocs.io\/en\/latest\/index.html"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2012.118"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.14778\/2850583.2850595"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3058730"},{"key":"e_1_2_1_19_1","volume-title":"CoRR abs\/1707.01154","author":"Lakkaraju Himabindu","year":"2017"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2017.105"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389763"},{"key":"e_1_2_1_22_1","doi-asserted-by":"crossref","unstructured":"Timothy McPhillips Tianhong Song Tyler Kolisnik Steve Aulenbach Khalid Belhajjame Kyle Bocinsky Yang Cao Fernando Chirigati Saumen Dey Juliana Freire et al. 2015. 
YesWorkflow: a user-oriented language-independent tool for recovering workflow information from scripts. arXiv preprint arXiv:1502.02403 (2015).","DOI":"10.2218\/ijdc.v10i1.370"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.3390\/genes10020087"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2017.2659745"},{"key":"e_1_2_1_25_1","unstructured":"Luc Moreau James Cheney and Paolo Missier. 2013. Constraints of the PROV data model. http:\/\/www.w3.org\/TR\/2013\/REC-prov-constraints-20130430\/"},{"key":"e_1_2_1_26_1","volume-title":"Prov-dm: The prov data model. W3C Recommendation REC-prov-dm-20130430","author":"Moreau Luc","year":"2013"},{"key":"e_1_2_1_27_1","volume-title":"Explaining machine learning classifiers through diverse counterfactual explanations. arXiv preprint arXiv:1905.07697","author":"Mothilal Ramaravind Kommiya","year":"2019"},{"key":"e_1_2_1_28_1","volume-title":"Vamsa: Tracking Provenance in Data Science Scripts. arXiv:2001.01861 [cs.LG]","author":"Namaki Mohammad Hossein","year":"2020"},{"key":"e_1_2_1_29_1","volume-title":"Proc. Conf. Fairness Accountability Transp.","author":"Narayanan Arvind","year":"2018"},{"key":"e_1_2_1_30_1","volume-title":"Provenance-Aware Query Optimization. 
In 33rd IEEE International Conference on Data Engineering, ICDE 2017","author":"Niu Xing","year":"2017"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.5555\/1953048.2078195"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407807"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-40593-3_13"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-40593-3_21"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3311955"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137765.3137789"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.14778\/2733004.2733009"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.5555\/3199517.3199522"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.5555\/3199517.3199522"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939778"},{"key":"e_1_2_1_41_1","volume-title":"Declarative Metadata Management: A Missing Piece in End-To-End Machine Learning. In SysML Conference.","author":"Schelter Sebastian","year":"2018"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDCS.2019.00161"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319863"},{"key":"e_1_2_1_44_1","volume-title":"Christian Drescher, Alexander Hanuschkin, Ludwig Winkler, Steven Peters, and Klaus-Robert Mueller.","author":"Studer Stefan","year":"2020"},{"key":"e_1_2_1_45_1","volume-title":"SAC: A System for Big Data Lineage Tracking. 
In 35th IEEE International Conference on Data Engineering, ICDE 2019","author":"Tang MingJie","year":"2019"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3196934"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380571"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.5555\/3026947.3026948"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.3897\/tdwgproceedings.1.20380"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2019.00025"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3436905.3436911","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T10:24:47Z","timestamp":1672223087000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3436905.3436911"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,12]]},"references-count":49,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2020,12]]}},"alternative-id":["10.14778\/3436905.3436911"],"URL":"https:\/\/doi.org\/10.14778\/3436905.3436911","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2020,12]]}}}