{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T22:56:26Z","timestamp":1768344986406,"version":"3.49.0"},"reference-count":65,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2024,4,10]],"date-time":"2024-04-10T00:00:00Z","timestamp":1712707200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Database Syst."],"published-print":{"date-parts":[[2024,6,30]]},"abstract":"<jats:p>\n            Successful data-driven science requires complex data engineering pipelines to clean, transform, and alter data in preparation for machine learning, and robust results can only be achieved when each step in the pipeline can be justified, and its effect on the data explained. In this framework, we aim at providing data scientists with facilities to gain an in-depth understanding of how each step in the pipeline affects the data, from the raw input to training sets ready to be used for learning. Starting from an extensible set of data preparation operators commonly used within a data science setting, in this work we present a provenance management infrastructure for generating, storing, and querying very granular accounts of data transformations, at the level of individual elements within datasets whenever possible. Then, from the formal definition of a core set of data science preprocessing operators, we derive a\n            <jats:italic>provenance semantics<\/jats:italic>\n            embodied by a collection of templates expressed in PROV, a standard model for data provenance. Using those templates as a reference, our provenance generation algorithm generalises to any operator with observable input\/output pairs. 
We provide a prototype implementation of an application-level provenance capture library to produce, in a semi-automatic way, complete provenance documents that account for the entire pipeline. We report on the ability of that reference implementation to capture provenance in real ML benchmark pipelines and over TPC-DI synthetic data. We finally show how the collected provenance can be used to answer a suite of provenance benchmark queries that underpin some common pipeline inspection questions, as expressed on the Data Science Stack Exchange.\n          <\/jats:p>","DOI":"10.1145\/3644385","type":"journal-article","created":{"date-parts":[[2024,2,9]],"date-time":"2024-02-09T11:54:38Z","timestamp":1707479678000},"page":"1-42","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["Supporting Better Insights of Data Science Pipelines with Fine-grained Provenance"],"prefix":"10.1145","volume":"49","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3814-2587","authenticated-orcid":false,"given":"Adriane","family":"Chapman","sequence":"first","affiliation":[{"name":"University of Southampton, Southampton, UK"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-0845-9596","authenticated-orcid":false,"given":"Luca","family":"Lauro","sequence":"additional","affiliation":[{"name":"Universit\u00e0 Roma Tre, Roma, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0978-2446","authenticated-orcid":false,"given":"Paolo","family":"Missier","sequence":"additional","affiliation":[{"name":"Newcastle University, Newcastle upon Tyne, UK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1484-3693","authenticated-orcid":false,"given":"Riccardo","family":"Torlone","sequence":"additional","affiliation":[{"name":"Universit\u00e0 Roma Tre, Roma, 
Italy"}]}],"member":"320","published-online":{"date-parts":[[2024,4,10]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3314050"},{"key":"e_1_3_2_3_2","first-page":"11301","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Alaa Ahmed M.","year":"2019","unstructured":"Ahmed M. Alaa and Mihaela van der Schaar. 2019. Demystifying black-box models with symbolic metamodels. In Proceedings of the Advances in Neural Information Processing Systems. Curran Associates, Inc., 11301\u201311311."},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-17819-1_12"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","unstructured":"Yael Amsterdamer Susan B. Davidson Daniel Deutch Tova Milo Julia Stoyanovich and Val Tannen. 2011. Putting lipstick on pig: enabling database-style workflow provenance. Proc. VLDB Endow. 5 4 (dec 2011) 346\u2013357. 10.14778\/2095686.2095693","DOI":"10.14778\/2095686.2095693"},{"issue":"1","key":"e_1_3_2_6_2","first-page":"51","article-title":"GProM - A swiss army knife for your provenance needs","volume":"41","author":"Arab Bahareh Sadat","year":"2018","unstructured":"Bahareh Sadat Arab, Su Feng, Boris Glavic, Seokki Lee, Xing Niu, and Qitian Zeng. 2018. GProM - A swiss army knife for your provenance needs. IEEEDataEngineeringBulletin 41, 1 (2018), 51\u201362.","journal-title":"IEEEDataEngineeringBulletin"},{"key":"e_1_3_2_7_2","volume-title":"Proceedings of the 13th International Workshop on Theory and Practice of Provenance (TaPP 2021)","author":"Blount Tom","year":"2021","unstructured":"Tom Blount, Adriane Chapman, Michael Johnson, and Bertram Ludascher. 2021. Observed vs. possible provenance. 
In Proceedings of the 13th International Workshop on Theory and Practice of Provenance (TaPP 2021)."},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-44503-X_20"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1007\/BFb0100985"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/1559845.1559901"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.14778\/3436905.3436911"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1561\/1900000006"},{"key":"e_1_3_2_13_2","volume-title":"Rethinking the Application-Database Interface","author":"Cheung Alvin","year":"2015","unstructured":"Alvin Cheung. 2015. Rethinking the Application-Database Interface. Ph.D. Dissertation. Massachusetts Institute of Technology."},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/1066157.1066296"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.5555\/1350745.1350752"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.5555\/2567709.2567736"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.5555\/2567709.2567736"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3351095.3372878"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1186\/s41044-016-0014-0"},{"key":"e_1_3_2_20_2","series-title":"Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA.","first-page":"2242","volume":"97","author":"Ghorbani Amirata","year":"2019","unstructured":"Amirata Ghorbani and James Y. Zou. 2019. Data shapley: Equitable valuation of data for machine learning. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA.Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 
97, PMLR, 2242\u20132251."},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2009.15"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2009.15"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/1559845.1559980"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3543873.3587557"},{"key":"e_1_3_2_25_2","first-page":"1","article-title":"Data distribution debugging in machine learning pipelines","author":"Grafberger Stefan","year":"2022","unstructured":"Stefan Grafberger, Paul Groth, Julia Stoyanovich, and Sebastian Schelter. 2022. Data distribution debugging in machine learning pipelines. The VLDB Journal 31 (2022), 1\u201324.","journal-title":"The VLDB Journal"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-021-00726-w"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3452759"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/1265530.1265535"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/WORKS.2018.00009"},{"key":"e_1_3_2_30_2","unstructured":"Trung Dong Huynh. 2018. Prov Python. (2018). Retrieved from https:\/\/prov.readthedocs.io\/en\/latest\/index.html. Accessed 26 February 2024."},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2012.118"},{"key":"e_1_3_2_32_2","first-page":"216","article-title":"Titian: Data provenance support in spark","volume":"9","author":"Interlandi Matteo","year":"2016","unstructured":"Matteo Interlandi, Kshitij Shah, Sai Tetali, Muhammad Gulzar, Seunghyun Yoo, Miryung Kim, Todd Millstein, and Tyson Condie. 2016. Titian: Data provenance support in spark. 
Proceedings of the VLDB Endowment International Conference on Very Large Data Bases 9, 3 (2016), 216\u2013227.","journal-title":"Proceedings of the VLDB Endowment International Conference on Very Large Data Bases"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3058730"},{"key":"e_1_3_2_34_2","unstructured":"Himabindu Lakkaraju Ece Kamar Rich Caruana and Jure Leskovec. 2017. Interpretable & Explorable Approximations of Black Box Models. CoRR abs\/1707.01154 (2017). arXiv:1707.01154. http:\/\/arxiv.org\/abs\/1707.01154"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2017.105"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.32614\/RJ-2023-003"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389763"},{"key":"e_1_3_2_38_2","doi-asserted-by":"crossref","unstructured":"Timothy M. McPhillips Tianhong Song Tyler Kolisnik Steve Aulenbach Khalid Belhajjame Kyle Bocinsky Yang Cao Fernando Chirigati Saumen C. Dey Juliana Freire Deborah N. Huntzinger Christopher Jones David Koop Paolo Missier Mark Schildhauer Christopher R. Schwalm Yaxing Wei James Cheney Mark Bieda and Bertram Lud\u00e4scher. 2015. YesWorkflow: A user-oriented language-independent tool for recovering workflow information from scripts. arXiv:1502.02403. Retrieved from https:\/\/arxiv.org\/abs\/1502.02403","DOI":"10.2218\/ijdc.v10i1.370"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.3390\/genes10020087"},{"key":"e_1_3_2_40_2","unstructured":"Luc Moreau James Cheney and Paolo Missier. 2013. Constraints of the PROV Data Model. (2013). Retrieved from http:\/\/www.w3.org\/TR\/2013\/REC-prov-constraints-20130430\/. 
Accessed 26 February 2024."},{"key":"e_1_3_2_41_2","volume-title":"PROV-DM: The PROV Data Model","author":"Moreau Luc","year":"2012","unstructured":"Luc Moreau, Paolo Missier, Khalid Belhajjame, Reza B\u2019Far, James Cheney, Sam Coppens, Stephen Cresswell, Yolanda Gil, Paul Groth, Graham Klyne, Timothy Lebo, Jim McCusker, Simon Miles, James Myers, Satya Sahoo, and Curt Tilmes. 2012. PROV-DM: The PROV Data Model. Technical Report. World Wide Web Consortium. Retrieved from http:\/\/www.w3.org\/TR\/prov-dm\/"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","unstructured":"Ramaravind Kommiya Mothilal Amit Sharma and Chenhao Tan. 2020. Explaining machine learning classifiers through diverse counterfactual explanations. In FAT*\u201920: Conference on Fairness Accountability and Transparency Barcelona Spain January 27-30 2020 Mireille Hildebrandt Carlos Castillo L. Elisa Celis Salvatore Ruggieri Linnet Taylor and Gabriela Zanfir-Fortuna (Eds.). ACM 607\u2013617. 10.1145\/3351095.3372850","DOI":"10.1145\/3351095.3372850"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403205"},{"key":"e_1_3_2_44_2","volume-title":"Proc. Conf. Fairness Accountability Transp., New York, USA","author":"Narayanan Arvind","year":"2018","unstructured":"Arvind Narayanan. 2018. Translation tutorial: 21 fairness definitions and their politics. In Proc. Conf. 
Fairness Accountability Transp., New York, USA."},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2017.104"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.5555\/1953048.2078195"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407807"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-40593-3_21"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3311955"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.14778\/3137765.3137789"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1145\/3543873.3587561"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.14778\/2733004.2733009"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.14778\/3184470.3184475"},{"issue":"1","key":"e_1_3_2_54_2","first-page":"85","article-title":"A multivariate technique for multiply imputing missing values using a sequence of regression models","volume":"27","year":"2001","unstructured":"Trivellore E. Raghunathan, James M. Lepkowski, John Van Hoewyk, and Peter Solenberger. 2001. A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology 27, 1 (2001), 85\u201396.","journal-title":"Survey Methodology"},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939778"},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415556"},{"key":"e_1_3_2_57_2","volume-title":"Proceedings of the SysML Conference","author":"Schelter Sebastian","year":"2018","unstructured":"Sebastian Schelter, Joos-Hendrik B\u00f6se, Johannes Kirschnick, Thoralf Klein, Stephan Seufert, and Amazon. 2018. Declarative metadata management: A missing piece in end-to-end machine learning. 
In Proceedings of the SysML Conference."},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDCS.2019.00161"},{"key":"e_1_3_2_59_2","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319863"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","unstructured":"Stefan Studer Thanh Binh Bui Christian Drescher Alexander Hanuschkin Ludwig Winkler Steven Peters and Klaus-Robert M\u00fcller. 2021. Towards CRISP-ML(Q): A machine learning process model with quality assurance methodology. Mach. Learn. Knowl. Extr. 3 2 (2021) 392\u2013413. 10.3390\/MAKE3020020","DOI":"10.3390\/MAKE3020020"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2019.00215"},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3196934"},{"key":"e_1_3_2_63_2","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380571"},{"key":"e_1_3_2_64_2","volume-title":"Proceedings of the 8th USENIX Workshop on the Theory and Practice of Provenance, TaPP 2016, Washington, D.C., USA, June 8-9, 2016","author":"Yan Zhepeng","year":"2016","unstructured":"Zhepeng Yan, Val Tannen, and Zachary G. Ives. 2016. Fine-grained provenance for linear algebra operators. In Proceedings of the 8th USENIX Workshop on the Theory and Practice of Provenance, TaPP 2016, Washington, D.C., USA, June 8-9, 2016. 
Sarah Cohen Boulakia (Ed.), USENIX Association."},{"key":"e_1_3_2_65_2","doi-asserted-by":"publisher","DOI":"10.3897\/tdwgproceedings.1.20380"},{"key":"e_1_3_2_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2019.00025"}],"container-title":["ACM Transactions on Database Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3644385","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3644385","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T22:50:48Z","timestamp":1750287048000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3644385"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,4,10]]},"references-count":65,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,6,30]]}},"alternative-id":["10.1145\/3644385"],"URL":"https:\/\/doi.org\/10.1145\/3644385","relation":{},"ISSN":["0362-5915","1557-4644"],"issn-type":[{"value":"0362-5915","type":"print"},{"value":"1557-4644","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,4,10]]},"assertion":[{"value":"2023-01-17","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-01-24","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-04-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}