{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,1]],"date-time":"2026-06-01T20:36:32Z","timestamp":1780346192966,"version":"3.54.1"},"reference-count":86,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2023,8]]},"abstract":"<jats:p>Provenance encodes information that connects datasets, their generation workflows, and associated metadata (e.g., who executed a query and when). As such, it is instrumental for a wide range of critical governance applications (e.g., observability and auditing). Unfortunately, in the context of database systems, extracting coarse-grained provenance is a long-standing problem due to the complexity and sheer volume of database workflows. Provenance extraction from query event logs has recently been proposed as favorable because, in principle, it can result in meaningful provenance graphs for provenance applications. Current approaches, however, (a) add substantial overhead to the database and provenance extraction workflows and (b) extract provenance that is noisy, omits query execution dependencies, and is not rich enough for upstream applications. To address these problems, we introduce OneProvenance: an efficient provenance extraction system from query event logs. OneProvenance addresses the unique challenges of log-based extraction by (a) identifying query execution dependencies through efficient log analysis, (b) extracting provenance through novel event transformations that account for query dependencies, and (c) introducing effective filtering optimizations. Our thorough experimental analysis shows that OneProvenance can improve extraction by up to ~18X compared to state-of-the-art baselines; our optimizations reduce the extraction noise and optimize performance even further. 
OneProvenance is deployed at scale by Microsoft Purview and actively supports customer provenance extraction needs (https:\/\/bit.ly\/3N2JVGF).<\/jats:p>","DOI":"10.14778\/3611540.3611555","type":"journal-article","created":{"date-parts":[[2023,9,15]],"date-time":"2023-09-15T11:32:37Z","timestamp":1694777557000},"page":"3662-3675","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["OneProvenance: Efficient Extraction of Dynamic Coarse-Grained Provenance from Database Query Event Logs"],"prefix":"10.14778","volume":"16","author":[{"given":"Fotis","family":"Psallidas","sequence":"first","affiliation":[{"name":"Microsoft"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ashvin","family":"Agrawal","sequence":"additional","affiliation":[{"name":"Microsoft"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Chandru","family":"Sugunan","sequence":"additional","affiliation":[{"name":"Snowflake"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Khaled","family":"Ibrahim","sequence":"additional","affiliation":[{"name":"Microsoft"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Konstantinos","family":"Karanasos","sequence":"additional","affiliation":[{"name":"Meta"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jes\u00fas","family":"Camacho-Rodr\u00edguez","sequence":"additional","affiliation":[{"name":"Microsoft"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Avrilia","family":"Floratou","sequence":"additional","affiliation":[{"name":"Microsoft"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Carlo","family":"Curino","sequence":"additional","affiliation":[{"name":"Microsoft"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Raghu","family":"Ramakrishnan","sequence":"additional","affiliation":[{"name":"Microsoft"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","publish
ed-online":{"date-parts":[[2023,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/3524284"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-015-0389-y"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2014.6816740"},{"key":"e_1_2_1_4_1","unstructured":"acryldata 2023. Acryl Data. https:\/\/www.acryldata.io\/."},{"key":"e_1_2_1_5_1","unstructured":"adaptive 2022. Adaptive. https:\/\/adaptive.com."},{"key":"e_1_2_1_6_1","unstructured":"alation 2022. Alation. https:\/\/alation.com."},{"key":"e_1_2_1_7_1","unstructured":"alex 2022. Alex Solutions. https:\/\/alexsolutions.com.au."},{"key":"e_1_2_1_8_1","volume-title":"Putting lipstick on pig: Enabling database-style workflow provenance. arXiv preprint arXiv:1201.0231","author":"Amsterdamer Yael","year":"2011","unstructured":"Yael Amsterdamer, Susan B Davidson, Daniel Deutch, Tova Milo, Julia Stoyanovich, and Val Tannen. 2011. Putting lipstick on pig: Enabling database-style workflow provenance. arXiv preprint arXiv:1201.0231 (2011)."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.is.2021.101846"},{"key":"e_1_2_1_10_1","unstructured":"asg 2022. ASG. https:\/\/www.asg.com."},{"key":"e_1_2_1_11_1","unstructured":"atlas 2019. Apache Atlas - Type System. https:\/\/atlas.apache.org\/#\/TypeSystem."},{"key":"e_1_2_1_12_1","unstructured":"awsdatazone 2022. AWS DataZone. https:\/\/aws.amazon.com\/datazone\/."},{"key":"e_1_2_1_13_1","unstructured":"azurestorage 2022. Azure Storage. https:\/\/azure.microsoft.com\/services\/storage."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.1979.234182"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/369275.369289"},{"key":"e_1_2_1_16_1","volume-title":"Wang Chiew Tan, and Gaurav Vijayvargiya","author":"Bhagwat Deepavali","year":"2004","unstructured":"Deepavali Bhagwat, Laura Chiticariu, Wang Chiew Tan, and Gaurav Vijayvargiya. 2004. 
An Annotation Management System for Relational Databases. In VLDB. 900--911."},{"key":"e_1_2_1_17_1","volume-title":"Seventh Biennial Conference on Innovative Data Systems Research, CIDR 2015, Asilomar, CA, USA, January 4--7, 2015, Online Proceedings. www.cidrdb.org. http:\/\/cidrdb.org\/cidr2015\/Papers\/CIDR15_Paper18","author":"Bhardwaj Anant P.","unstructured":"Anant P. Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J. Elmore, Samuel Madden, and Aditya G. Parameswaran. 2015. DataHub: Collaborative Data Science & Dataset Version Management at Scale. In Seventh Biennial Conference on Innovative Data Systems Research, CIDR 2015, Asilomar, CA, USA, January 4--7, 2015, Online Proceedings. www.cidrdb.org. http:\/\/cidrdb.org\/cidr2015\/Papers\/CIDR15_Paper18.pdf"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.14778\/2824032.2824035"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3209900.3209911"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687750"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/1142473.1142574"},{"key":"e_1_2_1_22_1","unstructured":"ccpa 2022. California Consumer Privacy Act (CCPA). https:\/\/oag.ca.gov\/privacy\/ccpa."},{"key":"e_1_2_1_23_1","volume-title":"Provenance in databases: Why, how, and where. Foundations and Trends\u00ae in Databases 1, 4","author":"Cheney James","year":"2009","unstructured":"James Cheney, Laura Chiticariu, and Wang Chiew Tan. 2009. Provenance in databases: Why, how, and where. Foundations and Trends\u00ae in Databases 1, 4 (2009), 379--474."},{"key":"e_1_2_1_24_1","unstructured":"colibra 2022. Colibra. https:\/\/colibra.com."},{"key":"e_1_2_1_25_1","unstructured":"createeventsession 2022. Create Event Session. https:\/\/docs.microsoft.com\/en-us\/sql\/t-sql\/statements\/create-event-session-transact-sql?view=sql-server-ver15."},{"key":"e_1_2_1_26_1","unstructured":"Yingwei Cui. 2001. 
Lineage tracing in data warehouses. Ph.D. Dissertation. Stanford University."},{"key":"e_1_2_1_27_1","volume-title":"ICEIS 2008 - Proceedings of the Tenth International Conference on Enterprise Information Systems","volume":"16","author":"Curino Carlo","year":"2008","unstructured":"Carlo Curino, Hyun Jin Moon, Letizia Tanca, and Carlo Zaniolo. 2008. Schema Evolution in Wikipedia - Toward a Web Information System Benchmark. In ICEIS 2008 - Proceedings of the Tenth International Conference on Enterprise Information Systems, Volume DISI, Barcelona, Spain, June 12--16, 2008, Jos\u00e9 Cordeiro and Joaquim Filipe (Eds.). 323--332."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.14778\/1453856.1453939"},{"key":"e_1_2_1_29_1","unstructured":"dag 2022. Data Advantage Group. https:\/\/www.dag.com."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1376616.1376702"},{"key":"e_1_2_1_31_1","unstructured":"datahub 2023. DataHub. https:\/\/datahubproject.io\/."},{"key":"e_1_2_1_32_1","unstructured":"datakin 2022. Datakin. https:\/\/datakin.com."},{"key":"e_1_2_1_33_1","unstructured":"dataworld 2022. data.world. https:\/\/data.world."},{"key":"e_1_2_1_34_1","volume-title":"Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed K. Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, and Nan Tang.","author":"Deng Dong","year":"2017","unstructured":"Dong Deng, Raul Castro Fernandez, Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed K. Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, and Nan Tang. 2017. The Data Civilizer System. In 8th Biennial Conference on Innovative Data Systems Research, CIDR 2017, Chaminade, CA, USA, January 8--11, 2017, Online Proceedings. www.cidrdb.org. http:\/\/cidrdb.org\/cidr2017\/papers\/p44-deng-cidr17.pdf"},{"key":"e_1_2_1_35_1","unstructured":"egeria-lineage 2022. Egeria - Lineage Management. 
https:\/\/egeria-project.org\/features\/lineage-management\/overview\/#lineage-styles."},{"key":"e_1_2_1_36_1","unstructured":"egeria-lineage 2022. OpenLineage - Lineage Management. https:\/\/openlineage.io\/."},{"key":"e_1_2_1_37_1","unstructured":"erwin 2022. erwin. https:\/\/www.erwin.com."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.tcs.2004.10.033"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2899391"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2018.00094"},{"key":"e_1_2_1_41_1","unstructured":"gartner-governance 2020. Gartner Report on Metadata Management Solutions. https:\/\/www.gartner.com\/en\/documents\/3993025."},{"key":"e_1_2_1_42_1","unstructured":"gdc 2022. Google Data Catalog. https:\/\/cloud.google.com\/data-catalog."},{"key":"e_1_2_1_43_1","unstructured":"gdpr 2022. General Data Protection Regulation (EU GDPR). https:\/\/gdprinfo.eu."},{"key":"e_1_2_1_44_1","volume-title":"Perm: Processing provenance and data on the same data model through query rewriting. In ICDE.","author":"Glavic Boris","year":"2009","unstructured":"Boris Glavic and Gustavo Alonso. 2009. Perm: Processing provenance and data on the same data model through query rewriting. In ICDE."},{"key":"e_1_2_1_45_1","unstructured":"globalids 2022. Global IDs. https:\/\/www.globalids.com."},{"key":"e_1_2_1_46_1","unstructured":"Achim Granzen. 2021. Data Governance Solutions. https:\/\/www.forrester.com\/report\/the-forrester-wave-tm-data-governance-solutions-q3-2021\/RES161533."},{"key":"e_1_2_1_47_1","unstructured":"Todd J. Green Grigoris Karvounarakis Zachary G. Ives and Val Tannen. 2007. Update Exchange with Mappings and Provenance. In VLDB."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.14778\/3457390.3457402"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2903730"},{"key":"e_1_2_1_50_1","unstructured":"hammerdb 2022. HammerDB. 
https:\/\/www.hammerdb.com."},{"key":"e_1_2_1_51_1","unstructured":"hammerdbgithub 2022. HammerDB on GitHub. https:\/\/github.com\/TPC-Council\/HammerDB."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/4229.4233"},{"key":"e_1_2_1_53_1","volume-title":"Ground: A Data Context Service. In 8th Biennial Conference on Innovative Data Systems Research, CIDR","author":"Hellerstein Joseph M.","year":"2017","unstructured":"Joseph M. Hellerstein, Vikram Sreekanti, Joseph E. Gonzalez, James Dalton, Akon Dey, Sreyashi Nag, Krishna Ramachandran, Sudhanshu Arora, Arka Bhattacharyya, Shirshanka Das, Mark Donsky, Gabriel Fierro, Chang She, Carl Steinbach, Venkat Subramanian, and Eric Sun. 2017. Ground: A Data Context Service. In 8th Biennial Conference on Innovative Data Systems Research, CIDR 2017, Chaminade, CA, USA, January 8--11, 2017, Online Proceedings. www.cidrdb.org. http:\/\/cidrdb.org\/cidr2017\/papers\/p111-hellerstein-cidr17.pdf"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-017-0486-1"},{"key":"e_1_2_1_55_1","volume-title":"Taverna: a tool for building and running workflows of services. Nucleic acids research 34, suppl_2","author":"Hull Duncan","year":"2006","unstructured":"Duncan Hull, Katy Wolstencroft, Robert Stevens, Carole Goble, Mathew R Pocock, Peter Li, and Tom Oinn. 2006. Taverna: a tool for building and running workflows of services. Nucleic acids research 34, suppl_2 (2006), W729--W732."},{"key":"e_1_2_1_56_1","unstructured":"Robert Ikeda. 2012. Provenance In Data-Oriented Workflows. Ph.D. Dissertation. Stanford University."},{"key":"e_1_2_1_57_1","unstructured":"informatica 2022. Informatica. https:\/\/www.informatica.com."},{"key":"e_1_2_1_58_1","unstructured":"infosphere 2022. IBM Infosphere. https:\/\/www.ibm.com\/analytics\/information-server."},{"key":"e_1_2_1_59_1","doi-asserted-by":"crossref","unstructured":"Udo Kruschwitz Charlie Hull et al. 2017. Searching the enterprise. Vol. 11. 
Now Publishers.","DOI":"10.1561\/9781680833058"},{"key":"e_1_2_1_60_1","volume-title":"Scientific workflow management and the Kepler system. Concurrency and computation: Practice and experience 18, 10","author":"Lud\u00e4scher Bertram","year":"2006","unstructured":"Bertram Lud\u00e4scher, Ilkay Altintas, Chad Berkley, Dan Higgins, Efrat Jaeger, Matthew Jones, Edward A Lee, Jing Tao, and Yang Zhao. 2006. Scientific workflow management and the Kepler system. Concurrency and computation: Practice and experience 18, 10 (2006), 1039--1065."},{"key":"e_1_2_1_61_1","volume-title":"Boris Asipov, and Philippe Cudr\u00e9-Mauroux.","author":"Mavlyutov Ruslan","year":"2017","unstructured":"Ruslan Mavlyutov, Carlo Curino, Boris Asipov, and Philippe Cudr\u00e9-Mauroux. 2017. Dependency-Driven Analytics: A Compass for Uncharted Data Oceans.. In CIDR."},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/3077257.3077267"},{"key":"e_1_2_1_63_1","first-page":"59","article-title":"Making Open Data Transparent: Data Discovery on Open Data","volume":"41","author":"Miller Ren\u00e9e J.","year":"2018","unstructured":"Ren\u00e9e J. Miller, Fatemeh Nargesian, Erkang Zhu, Christina Christodoulakis, Ken Q. Pu, and Periklis Andritsos. 2018. Making Open Data Transparent: Data Discovery on Open Data. IEEE Data Eng. Bull. 41, 2 (2018), 59--70. http:\/\/sites.computer.org\/debull\/A18june\/p59.pdf","journal-title":"IEEE Data Eng. Bull."},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/2452376.2452478"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352116"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.14778\/3192965.3192973"},{"key":"e_1_2_1_67_1","volume-title":"An Introduction to Duplicate Detection","author":"Naumann Felix","unstructured":"Felix Naumann and Melanie Herschel. 2010. An Introduction to Duplicate Detection. 
Morgan and Claypool Publishers."},{"key":"e_1_2_1_68_1","volume-title":"Vasudha Krishnaswamy, and Venkatesh Radhakrishnan.","author":"Niu Xing","year":"2017","unstructured":"Xing Niu, Raghav Kapoor, Boris Glavic, Dieter Gawlick, Zhen Hua Liu, Vasudha Krishnaswamy, and Venkatesh Radhakrishnan. 2017. Provenance-aware Query Optimization. In ICDE."},{"key":"e_1_2_1_69_1","volume-title":"Ibis: A Provenance Manager for Multi-Layer Systems.. In CIDR. 152--159.","author":"Olston Christopher","year":"2011","unstructured":"Christopher Olston and Anish Das Sarma. 2011. Ibis: A Provenance Manager for Multi-Layer Systems.. In CIDR. 152--159."},{"key":"e_1_2_1_70_1","unstructured":"precisely 2022. Precisely. https:\/\/www.precisely.com."},{"key":"e_1_2_1_71_1","volume-title":"Carlo Curino, and Raghu Ramakrishnan","author":"Psallidas Fotis","year":"2022","unstructured":"Fotis Psallidas, Ashvin Agrawal, Chandru Sugunan, Khaled Ibrahim, Konstantinos Karanasos, Jes\u00fas Camacho-Rodr\u00edguez, Avrilia Floratou, Carlo Curino, and Raghu Ramakrishnan. 2022. OneProvenance: Efficient Extraction of Dynamic Coarse-Grained Provenance from Database Logs [Technical Report]. arXiv preprint arXiv:2210.14047 (2022)."},{"key":"e_1_2_1_72_1","volume-title":"Smoke: Fine-grained lineage at interactive speed. arXiv preprint arXiv:1801.07237","author":"Psallidas Fotis","year":"2018","unstructured":"Fotis Psallidas and Eugene Wu. 2018. Smoke: Fine-grained lineage at interactive speed. arXiv preprint arXiv:1801.07237 (2018)."},{"key":"e_1_2_1_73_1","unstructured":"purview 2022. Microsoft Purview. https:\/\/azure.microsoft.com\/en-us\/services\/purview."},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415556"},{"key":"e_1_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1109\/BigComp.2018.00080"},{"key":"e_1_2_1_76_1","unstructured":"semanticweb 2022. Semantic Web Company. 
https:\/\/semantic-web.com."},{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1109\/VL.1996.545307"},{"key":"e_1_2_1_78_1","unstructured":"smartlogic 2022. Smartlogic. https:\/\/www.smartlogic.com."},{"key":"e_1_2_1_79_1","unstructured":"sqldb 2022. Azure SQL Managed Instance. https:\/\/azure.microsoft.com\/en-us\/products\/azure-sql\/managed-instance."},{"key":"e_1_2_1_80_1","unstructured":"syniti 2022. Syniti. https:\/\/www.syniti.com."},{"key":"e_1_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2019.00215"},{"key":"e_1_2_1_82_1","unstructured":"Jennifer Widom. 2005. Trio: a system for integrated management of data accuracy and lineage. In CIDR."},{"key":"e_1_2_1_83_1","unstructured":"Allison Woodruff and Michael Stonebraker. 1997. Supporting fine-grained data lineage in a database visualization environment. In ICDE."},{"key":"e_1_2_1_84_1","unstructured":"xevents 2019. Extended Events Overview. https:\/\/docs.microsoft.com\/en-us\/sql\/relational-databases\/extended-events\/extended-events."},{"key":"e_1_2_1_85_1","doi-asserted-by":"publisher","DOI":"10.14778\/1952376.1952378"},{"key":"e_1_2_1_86_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994534"}],"container-title":["Proceedings of the VLDB 
Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3611540.3611555","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,10]],"date-time":"2025-09-10T22:32:26Z","timestamp":1757543546000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3611540.3611555"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,8]]},"references-count":86,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2023,8]]}},"alternative-id":["10.14778\/3611540.3611555"],"URL":"https:\/\/doi.org\/10.14778\/3611540.3611555","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2023,8]]},"assertion":[{"value":"2023-08-01","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}