{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,17]],"date-time":"2025-09-17T03:18:41Z","timestamp":1758079121352,"version":"3.44.0"},"reference-count":36,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2025,8]]},"abstract":"<jats:p>Data discovery has gained significant traction in the database community resulting in various discovery operations, index schemes, and discovery systems. This tutorial explores the architecture and components of data discovery systems, focusing on indexing structures and scalable algorithms for typical operations, such as join and union discovery. While giving insights into individual algorithms, we point out open challenges for holistic systems, data discovery evaluation, and discovery in federated setups.<\/jats:p>","DOI":"10.14778\/3750601.3750694","type":"journal-article","created":{"date-parts":[[2025,9,16]],"date-time":"2025-09-16T13:38:05Z","timestamp":1758029885000},"page":"5455-5459","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Data Discovery in Data Lakes: Operations, Indexes, Systems"],"prefix":"10.14778","volume":"18","author":[{"given":"Ziawasch","family":"Abedjan","sequence":"first","affiliation":[{"name":"BIFOLD &amp; TU Berlin, Berlin, Germany"}]},{"given":"Mahdi","family":"Esmailoghli","sequence":"additional","affiliation":[{"name":"HU Berlin, Berlin, Germany"}]},{"given":"Sainyam","family":"Galhotra","sequence":"additional","affiliation":[{"name":"Cornell University, Ithaca, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,9,16]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-015-0389-y"},{"key":"e_1_2_1_2_1","doi-asserted-by":"crossref","unstructured":"Ziawasch Abedjan John Morcos Ihab F. Ilyas Mourad Ouzzani Paolo Papotti and Michael Stonebraker. 2016. DataXFormer: A robust transformation discovery system. In ICDE. 1134\u20131145.","DOI":"10.1109\/ICDE.2016.7498319"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3459637.3482322"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.14778\/3457390.3457403"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE48307.2020.00067"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.5555\/829502.830043"},{"key":"e_1_2_1_7_1","volume-title":"Proceedings of the Conference on Innovative Data Systems Research (CIDR). www.cidrdb.org. http:\/\/cidrdb.org\/cidr2020\/gongshow2020\/gongshow\/abstracts\/cidr2020_abstract86","author":"Esmailoghli Mahdi","year":"2020","unstructured":"Mahdi Esmailoghli and Ziawasch Abedjan. 2020. CAFE: Constraint-Aware Feature Extraction from Large Databases. In Proceedings of the Conference on Innovative Data Systems Research (CIDR). www.cidrdb.org. http:\/\/cidrdb.org\/cidr2020\/gongshow2020\/gongshow\/abstracts\/cidr2020_abstract86.pdf"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.5441\/002\/edbt.2021.30"},{"key":"e_1_2_1_9_1","doi-asserted-by":"crossref","first-page":"1684","DOI":"10.14778\/3529337.3529353","article-title":"MATE: Multi-Attribute Table Extraction","volume":"15","author":"Esmailoghli Mahdi","year":"2022","unstructured":"Mahdi Esmailoghli, Jorge-Arnulfo Quian\u00e9-Ruiz, and Ziawasch Abedjan. 2022. MATE: Multi-Attribute Table Extraction. Proceedings of the VLDB Endowment (PVLDB) 15, 8 (2022), 1684\u20131696. https:\/\/www.vldb.org\/pvldb\/vol15\/p1684-esmailoghli.pdf","journal-title":"Proceedings of the VLDB Endowment (PVLDB)"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE65448.2025.00061"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.14778\/3587136.3587146"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2018.00094"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2948674.2948675"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2019.00109"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2731084"},{"key":"e_1_2_1_16_1","first-page":"237","article-title":"Data Lake Organization","volume":"35","author":"Nargesian Fatemeh","year":"2023","unstructured":"Fatemeh Nargesian, Ken Q. Pu, Bahar Ghadiri Bashardoost, Erkang Zhu, and Ren\u00e9e J. Miller. 2023. Data Lake Organization. IEEE Trans. Knowl. Data Eng. 35, 1 (2023), 237\u2013250.","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"e_1_2_1_17_1","volume-title":"Bahar Ghadiri Bashardoost, and Ren\u00e9e J. Miller","author":"Nargesian Fatemeh","year":"2018","unstructured":"Fatemeh Nargesian, Ken Q. Pu, Erkang Zhu, Bahar Ghadiri Bashardoost, and Ren\u00e9e J. Miller. 2018. Optimizing Organizations for Navigating Data Lakes. CoRR abs\/1812.07024 (2018). arXiv:1812.07024"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380605"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/3192965.3192973"},{"key":"e_1_2_1_20_1","doi-asserted-by":"crossref","first-page":"2863","DOI":"10.14778\/3476311.3476364","article-title":"RONIN: Data Lake Exploration","volume":"14","author":"Ouellette Paul","year":"2021","unstructured":"Paul Ouellette, Aidan Sciortino, Fatemeh Nargesian, Bahar Ghadiri Bashardoost, Erkang Zhu, Ken Pu, and Ren\u00e9e J. Miller. 2021. RONIN: Data Lake Exploration. Proceedings of the VLDB Endowment (PVLDB) 14, 12 (2021), 2863\u20132866. http:\/\/www.vldb.org\/pvldb\/vol14\/p2863-nargesian.pdf","journal-title":"Proceedings of the VLDB Endowment (PVLDB)"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.48786\/EDBT.2024.87"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3458456"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE53745.2022.00264"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2213836.2213962"},{"key":"e_1_2_1_25_1","volume-title":"Proceedings of the 34th International Conference on Machine Learning, ICML 2017","volume":"70","author":"Shrivastava Anshumali","year":"2017","unstructured":"Anshumali Shrivastava. 2017. Optimal Densification for Fast and Accurate Minwise Hashing. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6\u201311 August 2017 (Proceedings of Machine Learning Research), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. PMLR, 3154\u20133163. http:\/\/proceedings.mlr.press\/v70\/shrivastava17a.html"},{"key":"e_1_2_1_26_1","volume-title":"Visual Bayesian fusion to navigate a data lake","author":"Singh Karamjit","unstructured":"Karamjit Singh, Kaushal Paneri, Aditeya Pandey, Garima Gupta, Geetika Sharma, Puneet Agarwal, and Gautam Shroff. 2016. Visual Bayesian fusion to navigate a data lake. In FUSION. IEEE, 987\u2013994."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.14778\/3494124.3494149"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2247596.2247642"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2009.111"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000824.2000825"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2213836.2213848"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3331184.3331385"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352095"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389726"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3300065"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994534"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3750601.3750694","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,16]],"date-time":"2025-09-16T13:39:27Z","timestamp":1758029967000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3750601.3750694"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8]]},"references-count":36,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2025,8]]}},"alternative-id":["10.14778\/3750601.3750694"],"URL":"https:\/\/doi.org\/10.14778\/3750601.3750694","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2025,8]]},"assertion":[{"value":"2025-09-16","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}