{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,17]],"date-time":"2025-09-17T03:16:37Z","timestamp":1758078997783,"version":"3.44.0"},"reference-count":22,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2025,8]]},"abstract":"<jats:p>\n            Modern data-driven systems often rely on complex pipelines to process and transform data for downstream machine learning (ML) tasks. Extracting these pipelines and understanding their structure is critical for ensuring transparency, performance optimization, and maintainability, especially in large-scale projects. In this work, we introduce a novel system, APEX-DAG (\n            <jats:bold>A<\/jats:bold>\n            utomating\n            <jats:bold>P<\/jats:bold>\n            ipeline\n            <jats:bold>EX<\/jats:bold>\n            traction with\n            <jats:bold>D<\/jats:bold>\n            ataflow, Static Code\n            <jats:bold>A<\/jats:bold>\n            nalysis, and\n            <jats:bold>G<\/jats:bold>\n            raph Attention Networks), which automates the extraction of data pipelines from computational notebooks or scripts. Unlike execution-based methods, APEX-DAG leverages static code analysis to identify the dataflow, transformations, and dependencies within ML workflows without executing the code or the need to alter the code. 
Further, after an initial training phase, our system can identify pipelines that are built with previously unseen libraries.\n          <\/jats:p>","DOI":"10.14778\/3750601.3750675","type":"journal-article","created":{"date-parts":[[2025,9,16]],"date-time":"2025-09-16T13:38:05Z","timestamp":1758029885000},"page":"5375-5378","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["APEX-DAG: Library and Language independent Pipeline EXtraction"],"prefix":"10.14778","volume":"18","author":[{"given":"Sebastian","family":"Eggers","sequence":"first","affiliation":[{"name":"BIFOLD & TU Berlin, Berlin, Germany"}]},{"given":"Nina","family":"\u017bukowska","sequence":"additional","affiliation":[{"name":"BIFOLD & TU Berlin, Berlin, Germany"}]},{"given":"Ziawasch","family":"Abedjan","sequence":"additional","affiliation":[{"name":"BIFOLD & TU Berlin, Berlin, Germany"}]}],"member":"320","published-online":{"date-parts":[[2025,9,16]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2025. Google Vertex AI. https:\/\/cloud.google.com\/vertex-ai Accessed: 2025-16-01."},{"key":"e_1_2_1_2_1","unstructured":"2025. Kaggle. https:\/\/www.kaggle.com\/ Accessed: 2025-16-01."},{"key":"e_1_2_1_3_1","unstructured":"2025. pandas. https:\/\/pandas.pydata.org\/ Accessed: 2025-16-01."},{"key":"e_1_2_1_4_1","unstructured":"2025. Polars. https:\/\/www.pola.rs\/ Accessed: 2025-16-01."},{"key":"e_1_2_1_5_1","doi-asserted-by":"crossref","unstructured":"P. Buneman S. Khanna. and T. Wang-Chiew. 2001. Why and where: A characterization of data provenance. In ICDT.","DOI":"10.1007\/3-540-44503-X_20"},{"key":"e_1_2_1_6_1","unstructured":"JetBrains Datalore. 2020. We Downloaded 10 000 000 Jupyter Notebooks from GitHub \u2014 This Is What We Learned. 
https:\/\/blog.jetbrains.com\/datalore\/2020\/12\/17\/we-downloaded-10-000-000-jupyter-notebooks-from-github-this-is-what-we-learned\/ Accessed: 2025-16-01."},{"key":"e_1_2_1_7_1","doi-asserted-by":"crossref","unstructured":"A. Drozdova E. Trofimova P. Guseva A. Scherbakova and A. Ustyuzhanin. 2023. Code4ML: a large-scale dataset of annotated Machine Learning code. PeerJ Computer Science (2023).","DOI":"10.7717\/peerj-cs.1230"},{"key":"e_1_2_1_8_1","volume-title":"Automating Data Lineage and Pipeline Extraction. VLDB","author":"Eggers S.","year":"2024","unstructured":"S. Eggers. 2024. Automating Data Lineage and Pipeline Extraction. VLDB (2024)."},{"key":"e_1_2_1_9_1","volume-title":"MLINSPECT: A Data Distribution Debugger for Machine Learning Pipelines. In SIGMOD. ACM.","author":"Grafberger S.","year":"2021","unstructured":"S. Grafberger, S. Guha, J. Stoyanovich, and S. Schelter. 2021. MLINSPECT: A Data Distribution Debugger for Machine Learning Pipelines. In SIGMOD. ACM."},{"key":"e_1_2_1_10_1","volume-title":"Retrograde: Contextual Notifications That Highlight Fairness and Bias Issues for Data Scientists. CHI.","author":"Harrison G.","year":"2024","unstructured":"G. Harrison, K. Bryson, A. E. Bamba, L. Dovichi, A. H. Binion, A. Borem, and B. Ur. 2024. JupyterLab in Retrograde: Contextual Notifications That Highlight Fairness and Bias Issues for Data Scientists. CHI."},{"key":"e_1_2_1_11_1","doi-asserted-by":"crossref","unstructured":"M. Helali N. Monjazeb S. Vashisth P. Carrier A. Helal A. Cavalcante K. Ammar K. Hose and E. Mansour. 2024. KGLiDS: A Platform for Semantic Abstraction Linking and Automation of Data Science. In ICDE. IEEE.","DOI":"10.1109\/ICDE60146.2024.00021"},{"key":"e_1_2_1_12_1","volume":"200","author":"Ikeda R.","unstructured":"R. Ikeda and J. Widom. 2009. Data lineage: A survey. Stanford University Publications (2009).","journal-title":"J. Widom."},{"key":"e_1_2_1_13_1","unstructured":"Z. G Ives and Y. Zhang. 2019. 
Dataset relationship management. In CIDR."},{"key":"e_1_2_1_14_1","volume-title":"Vamsa: Automated Provenance Tracking in Data Science Scripts. In SIGKDD. ACM.","author":"Namaki M. H.","year":"2020","unstructured":"M. H. Namaki, A. Floratou, F. Psallidas, S. Krishnan, A. Agrawal, Y. Wu, Y. Zhu, and M. Weimer. 2020. Vamsa: Automated Provenance Tracking in Data Science Scripts. In SIGKDD. ACM."},{"key":"e_1_2_1_15_1","unstructured":"Basel Committee on Banking Supervision. 2013. Principles for Effective Risk Data Aggregation and Risk Reporting (BCBS 239). Technical Report. Bank for International Settlements. https:\/\/www.bis.org\/publ\/bcbs239.htm Accessed: 2025-01-26."},{"key":"e_1_2_1_16_1","unstructured":"J. Peuralinna. 2024. Data Lineage in the financial sector. Master's Thesis. Aalto University."},{"key":"e_1_2_1_17_1","volume-title":"noWorkflow: a Tool for Collecting, Analyzing, and Managing Provenance from Python Scripts. VLDB","author":"Pimentel J. F.","year":"2017","unstructured":"J. F. Pimentel, L. Murta, V. Braganholo, and Juliana Freire. 2017. noWorkflow: a Tool for Collecting, Analyzing, and Managing Provenance from Python Scripts. VLDB (2017)."},{"key":"e_1_2_1_18_1","doi-asserted-by":"crossref","unstructured":"F. Psallidas M. E. Leszczynski M. H. Namaki A. Floratou A. Agrawal K. Karanasos S. Krishnan P. Subotic M. Weimer Y. Wu and Y. Zhu. 2023. Demonstration of Geyser: Provenance Extraction and Applications over Data Science Scripts. In SIGMOD. ACM.","DOI":"10.1145\/3555041.3589717"},{"key":"e_1_2_1_19_1","unstructured":"S. Schelter J. B\u00f6se J. Kirschnick T. Klein and S. Seufert. 2017. Automatically tracking metadata and provenance of machine learning experiments. NeurIPS."},{"key":"e_1_2_1_20_1","doi-asserted-by":"crossref","unstructured":"S. Schelter S. Guha and S. Grafberger. 2024. Automated Provenance-Based Screening of ML Data Preparation Pipelines. 
Datenbank-Spektrum (2024).","DOI":"10.1007\/s13222-024-00483-4"},{"key":"e_1_2_1_21_1","doi-asserted-by":"crossref","unstructured":"D. Xin H. Miao A. G. Parameswaran and N. Polyzotis. 2021. Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities. In SIGMOD. ACM.","DOI":"10.1145\/3448016.3457566"},{"key":"e_1_2_1_22_1","doi-asserted-by":"crossref","unstructured":"Y. Zhang and Z. G Ives. 2020. Finding related tables in data lakes for interactive data science. In SIGMOD.","DOI":"10.1145\/3318464.3389726"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3750601.3750675","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,16]],"date-time":"2025-09-16T13:42:50Z","timestamp":1758030170000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3750601.3750675"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8]]},"references-count":22,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2025,8]]}},"alternative-id":["10.14778\/3750601.3750675"],"URL":"https:\/\/doi.org\/10.14778\/3750601.3750675","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2025,8]]},"assertion":[{"value":"2025-09-16","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}